
Study finds AI assistants help developers produce code that's more likely to be buggy

At the same time, tools like GitHub Copilot and Facebook InCoder make developers believe their code is sound

Computer scientists from Stanford University have found that programmers who accept help from AI tools like GitHub Copilot produce less secure code than those who fly solo.

In a paper titled "Do Users Write More Insecure Code with AI Assistants?", Stanford boffins Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh answer that question in the affirmative.

Worse still, they found that AI help tends to delude developers about the quality of their output.

"We found that participants with access to an AI assistant often produced more security vulnerabilities than those without access, with particularly significant results for string encryption and SQL injection," the authors state in their paper. "Surprisingly, we also found that participants provided access to an AI assistant were more likely to believe that they wrote secure code than those without access to the AI assistant."

In earlier experiments under different conditions, NYU researchers showed that AI-based programming suggestions are often insecure. The Stanford authors point to an August 2021 research paper titled "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions," which found that given 89 scenarios, about 40 percent of the computer programs made with the help of Copilot had potentially exploitable vulnerabilities.

That study, the Stanford authors say, is limited in scope because it only considers a constrained set of prompts corresponding to 25 vulnerabilities and just three programming languages: Python, C, and Verilog.

The Stanford scholars also cite a follow-up study from some of the same NYU eggheads, "Security Implications of Large Language Model Code Assistants: A User Study," as the only comparable user study they're aware of. They observe, however, that their work differs because it focuses on OpenAI's codex-davinci-002 model rather than OpenAI’s less powerful codex-cushman-001 model, both of which play a role in GitHub Copilot, itself a fine-tuned descendant of a GPT-3 language model.

Also, the Stanford study looks at multiple programming languages (Python, JavaScript, and C) while the "Security Implications…" paper focuses just on functions in the C programming language. The Stanford researchers suggest that the inconclusive findings reported in the "Security Implications" paper may follow from the narrow focus on C, which they said was the only language in their broader study with mixed results.

The Stanford user study involved 47 people with varying levels of experience, including undergraduate students, graduate students, and industry professionals. Participants were asked to write code in response to five prompts using a standalone React-based Electron app monitored by the study administrator. The first prompt was, "Write two functions in Python where one encrypts and the other decrypts a given string using a given symmetric key."

For that particular question, those relying on AI assistance were more likely to write incorrect and insecure code than the control group working without automated help. Only 67 percent of the assisted group gave a correct answer while 79 percent of the control group did so.

And those in the assisted group were "significantly more likely to provide an insecure solution (p < 0.05, using Welch’s unequal variances t-test), and also significantly more likely to use trivial ciphers, such as substitution ciphers (p < 0.01), and not conduct an authenticity check on the final returned value."
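To make the two weaknesses named in that quote concrete, here is a minimal Python sketch using only the standard library. It is not code from the study: the function names (caesar_shift, brute_force_caesar, add_tag, verify_and_strip) are hypothetical, and HMAC-SHA256 is an assumed choice of integrity mechanism. The first half shows why a shift-style substitution cipher is trivial to break, and the second shows roughly what an authenticity check on a returned value can look like.

```python
import hashlib
import hmac
import string

ALPHABET = string.ascii_lowercase
TAG_LEN = 32  # length in bytes of an HMAC-SHA256 digest


def caesar_shift(text, shift):
    """Shift lowercase letters by `shift`; leave other characters unchanged."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


def brute_force_caesar(ciphertext):
    """Return every candidate plaintext: there are only 26 possible keys."""
    return [caesar_shift(ciphertext, -k) for k in range(26)]


def add_tag(key, message):
    """Prepend an HMAC-SHA256 tag so later tampering can be detected."""
    tag = hmac.new(key, message, hashlib.sha256).digest()
    return tag + message


def verify_and_strip(key, blob):
    """Check the tag in constant time; raise ValueError if it does not match."""
    tag, message = blob[:TAG_LEN], blob[TAG_LEN:]
    expected = hmac.new(key, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authenticity check failed")
    return message
```

An attacker who sees the output of caesar_shift only has to scan 26 candidates to find the plaintext. The tag functions, by contrast, provide integrity but no confidentiality on their own; in practice both properties would usually come from an authenticated encryption scheme in a vetted library rather than hand-rolled code.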

The results were similar for questions two through four. Question five, "Write a function in C that takes a signed integer num and returns a string representation of that integer," produced mixed results.

"While the results are inconclusive as to whether the AI assistant helped or harmed participants, we do observe that participants in the [AI assistant] group were significantly more likely to introduce integer overflow mistakes in their solutions (p < 0.02)," the Stanford researchers said.
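The article does not say which overflow mistakes participants made, so the following Python sketch shows just one common way such a bug can arise in an integer-to-string routine. It assumes a 32-bit int and emulates two's complement wraparound with a hypothetical helper, wrap_int32. In real C, negating the most negative int is undefined behavior, and wrapping is merely what often happens on typical hardware.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrap_int32(value):
    """Emulate 32-bit two's complement wraparound (an assumption, see above)."""
    return (value + 2**31) % 2**32 - 2**31


def naive_abs_int32(num):
    """Mimic the C idiom `if (num < 0) num = -num;` on a 32-bit int."""
    return wrap_int32(-num) if num < 0 else num


def int32_to_string(num):
    """Build the decimal string digit by digit without overflowing.

    Python integers are unbounded, so `-num` is safe here; a C version
    would need to hold the magnitude in an unsigned or wider type.
    """
    if not INT32_MIN <= num <= INT32_MAX:
        raise ValueError("value does not fit in a 32-bit signed integer")
    if num == 0:
        return "0"
    negative = num < 0
    magnitude = -num if negative else num
    digits = []
    while magnitude > 0:
        digits.append(chr(ord("0") + magnitude % 10))
        magnitude //= 10
    if negative:
        digits.append("-")
    return "".join(reversed(digits))
```

Under the wraparound assumption, naive_abs_int32(INT32_MIN) comes back still negative, so a digit loop that expects a positive magnitude would misbehave on exactly that one input. Keeping the magnitude in a type wide enough to hold 2147483648 avoids the problem.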

The authors conclude that AI assistants should be viewed with caution because they can mislead inexperienced developers and create security vulnerabilities.

At the same time, they hope their findings will lead to improvements in the way AI assistants are designed because they have the potential to make programmers more productive, to lower barriers to entry, and to make software development more accessible to those who dislike the hostility of internet forums.

As one study participant is said to have remarked about AI assistance, "I hope this gets deployed. It’s like StackOverflow but better because it never tells you that your question was dumb." ®
