OpenAI has warned that its Codex neural network, a variant of which powers GitHub’s code-completion tool Copilot, is likely to generate source code that looks plausible but is incorrect, and that some of its failings get worse as the model grows in size.
The artificial intelligence lab revealed the shortcomings and limitations of non-production builds of its Codex model in a pre-print paper this week. It should be noted that a distinct production variant of the system powers GitHub Copilot; the preliminary models discussed in the paper are smaller and were trained only on Python, whereas Copilot was trained on more data and supports code completion for a range of programming languages.
Still, GitHub Copilot suffers from problems similar to those of the Codex prototypes. Namely, the generated code is unlikely to be correct and useful for developers on its first attempt, and it tends to come up with responses that at first glance appear sensible but may be wrong. Programmers should carefully check the auto-written code for any mistakes.
The Codex language model in the paper was fine-tuned on 159GB of Python source code scraped from GitHub's 50-million-plus public repositories. Auto-generated source and similar junk was removed from the data set.
To test the model’s AI pair-programming skills, researchers came up with 164 handwritten programming problems that examined Codex’s ability to complete functions, understand simple algorithms, and figure out mathematical queries.
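As a rough illustration of what "solving" one of these problems means, here is a minimal sketch in which the prompt is a function signature plus docstring, the model supplies the body, and the result counts as correct only if it passes unit tests. The problem, the completions, and the helper `passes_tests` are all hypothetical, made up for this example rather than taken from the paper's test set.

```python
# Hypothetical problem: the prompt is a signature plus docstring.
PROMPT = '''def running_max(values):
    """Return a list whose element i is the largest of values[:i+1]."""
'''

# Hypothetical model completions (function bodies only).
GOOD_COMPLETION = '''    result, best = [], None
    for v in values:
        best = v if best is None or v > best else best
        result.append(best)
    return result
'''
PLAUSIBLE_BUT_WRONG = '''    return sorted(values)
'''

# Each test is (positional arguments, expected return value).
TESTS = [(([1, 3, 2, 5, 4],), [1, 3, 3, 5, 5]), (([],), [])]


def passes_tests(prompt, completion, func_name, tests):
    """Return True only if prompt + completion runs and passes every test.

    NOTE: exec() on model output is unsafe outside a sandbox; this is
    purely illustrative.
    """
    namespace = {}
    try:
        exec(prompt + completion, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, undefined names, and crashes all count as failures.
        return False


good_ok = passes_tests(PROMPT, GOOD_COMPLETION, "running_max", TESTS)
wrong_ok = passes_tests(PROMPT, PLAUSIBLE_BUT_WRONG, "running_max", TESTS)
```

The second completion is the kind of answer that looks sensible at a glance yet fails the first test, which is why the checking is done by running code rather than by eyeballing it.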
The most powerful version of the system with 12 billion parameters was able to solve 28.8 per cent of the problems during its first attempt. For comparison, OpenAI’s GPT-3 natural language system was not able to solve any of them.
Codex does perform better, however, when it’s allowed to generate more responses. Given ten attempts per problem, it came up with at least one right answer for 46.81 per cent of the problems, and given 100 attempts, that figure rose to 72.31 per cent.
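For the curious, "right within k attempts" scores of this kind are usually reported as pass@k. The sketch below assumes the standard estimate: generate n samples for a problem, count the c that pass the tests, and compute the chance that a random draw of k of them contains at least one passing sample. The function names and the example counts are ours, chosen for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given that c of the n generated samples passed the tests (k <= n).

    Equals 1 - C(n - c, k) / C(n, k): one minus the chance that all k
    drawn samples come from the n - c failing ones.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems, 20 samples each.
example_results = [(20, 10, ), (20, 0), (20, 20)]
score_k1 = benchmark_pass_at_k(example_results, 1)
score_k10 = benchmark_pass_at_k(example_results, 10)
```

With k set to 1 the estimate is simply the fraction of samples that passed, so a problem the model gets right half the time contributes 0.5; at larger k the same problem contributes close to 1, which is how the headline figure climbs with more attempts.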
In other words, it's up to human programmers or perhaps other tools to do the work of picking out the best suggestion from Codex. “This result suggests that accurate code samples can be selected via heuristic ranking instead of fully evaluating each sample, the latter of which may not be possible or practical in deployment,” the paper stated.
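To give a flavour of what heuristic ranking could look like, here is a minimal sketch that picks one sample out of several without running any of them. It assumes the heuristic is the mean log-probability the model assigned to each sample's tokens; the sample data, field names, and helper functions are invented for this example.

```python
def mean_logprob(token_logprobs):
    """Average per-token log-probability; closer to zero means the model
    was more confident in the sample as a whole."""
    return sum(token_logprobs) / len(token_logprobs)


def pick_best(samples):
    """Return the sample with the highest mean token log-probability.

    No sample is executed or tested here; this is ranking only.
    """
    return max(samples, key=lambda s: mean_logprob(s["token_logprobs"]))


# Hypothetical completions with made-up per-token log-probabilities.
samples = [
    {"code": "return a + b", "token_logprobs": [-0.1, -0.2, -0.1, -0.2]},
    {"code": "return a - b", "token_logprobs": [-0.9, -1.5, -0.7, -1.1]},
    {"code": "return sum([a, b])",
     "token_logprobs": [-0.3, -0.4, -0.6, -0.2, -0.5, -0.4]},
]

best = pick_best(samples)
```

Averaging rather than summing stops the ranking from simply favouring the shortest sample. A confident model can still be confidently wrong, so the chosen suggestion needs checking all the same.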
GitHub Copilot seems to perform slightly better; it can generate correct code 43 per cent of the time on the first try, and 57 per cent of the time when allowed 10 attempts. If you were worried about these code-completion tools replacing human programmers, don’t be: they pose no real threat for now, and are just simple pattern-matching machines, good for generating boilerplate code and the like.
The human touch
The researchers admitted that “a strong student who completes an introductory computer-science course is expected to be able to solve a larger fraction of problems than Codex,” even though it has seen more code than a professional developer will ever see in their lifetime.
Codex tends to replicate common coding samples that it was trained on; if you write something that looks similar, it will fill in the blanks with what it thinks should go next, though the code generated is often not quite right. If you’re writing something that is more specialized for a particular application, or more complex than most scripts, Codex will not be as useful.
“We find that Codex can recommend syntactically incorrect or undefined code, and can invoke functions, variables, and attributes that are undefined or outside the scope of the codebase. Moreover, Codex struggles to parse through increasingly long and higher-level or system-level specifications,” the paper stated.
This problem only gets worse as the models become larger, the paper stated. Codex is also only as good a programmer as you are, unfortunately. If you feed it prompts that contain subtle bugs, it’ll “tend to produce worse code than it is capable of. This persists when the prompt also includes instructions to write correct code. This gap increases with model size,” the researchers wrote.
They also warned that, like other language models, Codex can be coaxed into generating “racist, denigratory, and otherwise harmful outputs” as code comments. Biases in gender or race have also been seen in code structures. To prevent harm in the real world, GitHub Copilot comes with filters that automatically block offensive words so it can’t spit out toxic outputs.
It's not all doom and gloom, don't get us wrong. OpenAI said it wants to focus on the potential positive impacts the tool might have, like whether it makes programmers more productive or if it can encourage them to document their code better for others to read. ®