If you've mastered Python 101, you're probably better at programming than OpenAI's prototype Codex

10 PRINT "ML won't take my job ";: GOTO 10


OpenAI warned that its Codex neural network, like the one that powers GitHub’s code-completion tool Copilot, is likely to generate source that looks plausible but is incorrect, and that some of its failings become more pronounced as the model grows in size.

The artificial intelligence lab revealed the shortcomings and limitations of non-production builds of its Codex model in a pre-print paper this week. It should be noted a distinct production variant of the system powers GitHub Copilot; the preliminary models discussed in the paper are smaller and are only trained on Python whereas Copilot was trained on more data and supports code-completion for a range of programming languages.

Still, GitHub Copilot suffers from problems similar to those of the Codex prototypes. Namely, the code generated is unlikely to be correct and useful for developers on its first attempt, and it tends to come up with responses that at first glance appear sensible but may be wrong. Programmers should carefully check the auto-written code for any mistakes.

The Codex language model in the paper was fine-tuned on 159GB of Python source code scraped from GitHub's 50-million-plus public repositories. Auto-generated source and similar junk were removed from the data set.

To test the model’s AI pair-programming skills, researchers came up with 164 handwritten programming problems that examined Codex’s ability to complete functions, understand simple algorithms, and figure out mathematical queries.
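To give a feel for what a problem of this kind could look like, here is a minimal sketch that is not taken from the paper's test set. A function signature with a docstring serves as the prompt, a candidate body stands in for the model's output, and a few hand-written checks decide whether the attempt counts as correct. The names `PROMPT`, `CANDIDATE`, and `passes_tests` are hypothetical, and running generated code with `exec` like this is only reasonable for trusted toy examples.

```python
# Hypothetical prompt: signature plus docstring, the part a human writes.
PROMPT = '''def running_max(numbers):
    """Return a list where element i is the largest value in numbers[:i+1]."""
'''

# Hypothetical completion standing in for model output.
CANDIDATE = '''    result = []
    best = None
    for n in numbers:
        if best is None or n > best:
            best = n
        result.append(best)
    return result
'''

def passes_tests(prompt: str, completion: str) -> bool:
    """Run prompt + completion and check it against hand-written cases."""
    namespace = {}
    try:
        # Toy example only: untrusted generated code needs a proper sandbox.
        exec(prompt + completion, namespace)
        fn = namespace["running_max"]
        return (
            fn([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
            and fn([]) == []
            and fn([-2, -5]) == [-2, -2]
        )
    except Exception:
        # Syntax errors, undefined names, and crashes all count as failures.
        return False

print(passes_tests(PROMPT, CANDIDATE))
```

A completion that merely looks plausible, such as one that returns the input list unchanged, fails the same checks, which is the kind of mistake a quick glance can miss.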

The most powerful version of the system with 12 billion parameters was able to solve 28.8 per cent of the problems during its first attempt. For comparison, OpenAI’s GPT-3 natural language system was not able to solve any of them.

Codex does perform better, however, when it’s allowed to generate more responses. Given ten attempts, it came up with the right answer 46.81 per cent of the time, and given 100 attempts, that figure rose to 72.31 per cent.
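The paper reports these numbers as pass@k: the chance that at least one of k sampled answers passes a problem's unit tests. As a rough sketch, if n samples are drawn for a problem and c of them pass, that chance can be estimated as 1 - C(n-c, k) / C(n, k). The function name and the example counts below are illustrative assumptions, not figures from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples, drawn
    without replacement from n samples of which c are correct, passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative counts only: 200 samples for one problem, 20 of them correct.
for k in (1, 10, 100):
    print(k, round(pass_at_k(200, 20, k), 3))
```

The estimate climbs quickly with k even when only a small share of individual samples is correct, which is why the headline figure depends so heavily on how many attempts are allowed.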

In other words, it's up to human programmers or perhaps other tools to do the work of picking out the best suggestion from Codex. “This result suggests that accurate code samples can be selected via heuristic ranking instead of fully evaluating each sample, the latter of which may not be possible or practical in deployment,” the paper stated.
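One ranking heuristic of this kind is to prefer the sample whose tokens the model rated as most likely on average. The sketch below assumes each candidate arrives with a list of per-token log-probabilities; that data shape, the function name, and the sample values are all made up for illustration and do not reflect any real API.

```python
from statistics import mean

def pick_by_mean_logprob(candidates):
    """Return the code of the candidate with the highest mean token log-probability.

    `candidates` is a list of (code, token_logprobs) pairs; this shape is an
    assumption for illustration, not a real API response format.
    """
    if not candidates:
        raise ValueError("no candidates to rank")
    best_code, _ = max(candidates, key=lambda pair: mean(pair[1]))
    return best_code

# Made-up samples and log-probabilities.
samples = [
    ("return a + b", [-0.1, -0.2, -0.1, -0.3]),
    ("return a - b", [-0.9, -1.4, -0.8, -1.1]),
    ("return sum([a, b]) if a else b", [-0.2, -0.6, -0.5, -0.4, -0.7, -0.9]),
]
print(pick_by_mean_logprob(samples))
```

Averaging rather than summing keeps longer candidates from being penalised simply for containing more tokens. Either way, a high score only says the model was confident, not that the code is right.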

GitHub Copilot seems to perform slightly better; it can generate correct code 43 per cent of the time on the first try, and 57 per cent of the time when allowed 10 attempts. If you were worried about these code-completion tools replacing human programmers, don’t be: they pose no real threat for now and are just simple pattern-matching machines, good for generating boilerplate code and the like.

The human touch

The researchers admitted that “a strong student who completes an introductory computer-science course is expected to be able to solve a larger fraction of problems than Codex,” even though it has seen more code than a professional developer will ever see in their lifetime.

Codex tends to replicate common coding samples that it was trained on; if you write something that looks similar, it will fill in the blanks with what it thinks should go next, though the code generated is often not quite right. If you’re writing something that is more specialized for a particular application or is more complex than most scripts, Codex will not be as useful.

“We find that Codex can recommend syntactically incorrect or undefined code, and can invoke functions, variables, and attributes that are undefined or outside the scope of the codebase. Moreover, Codex struggles to parse through increasingly long and higher-level or system-level specifications,” the paper stated.

This problem only gets worse as the models become larger, the paper stated. Codex is also only as good a programmer as you are, unfortunately. If you feed it prompts that contain subtle bugs, it’ll “tend to produce worse code than it is capable of. This persists when the prompt also includes instructions to write correct code. This gap increases with model size,” the researchers wrote.

They also warned that, like other language models, Codex can be coaxed into generating “racist, denigratory, and otherwise harmful outputs” as code comments. Biases in gender or race have also been seen in code structures. To prevent harm in the real world, GitHub Copilot comes with filters that automatically block offensive words so it can’t spit out toxic outputs.

It's not all doom and gloom, don't get us wrong. OpenAI said it wants to focus on the potential positive impacts the tool might have, like whether it makes programmers more productive or if it can encourage them to document their code better for others to read. ®

Biting the hand that feeds IT © 1998–2022