Perhaps AI is going to take away coding jobs – of those who trust this tech too much

Llama 2 avoids errors by staying quiet, GPT-4 gives long, if useless, samples

Computer scientists have evaluated how large language models (LLMs) answer Java coding questions from the Q&A site StackOverflow and, like others before them, have found the results wanting.

In a preprint paper titled, "A Study on Robustness and Reliability of Large Language Model Code Generation," doctoral students Li Zhong and Zilong Wang describe how they gathered 1,208 coding questions from StackOverflow involving 24 common Java APIs, and then evaluated answers provided by four different code-capable LLMs based on their API checker called RobustAPI.

RobustAPI is designed to assess code reliability, which the academics define as resistance to failure and unexpected input, as well as tolerance of high workloads. The assumption is that deviating from the API rules can have consequences for code running in production environments.

They argue that code tests, whether written by people or machines, focus only on semantic correctness and may not create a testing environment where unexpected input can be checked. To address this, they relied on static analysis to look at code structure without running tests, which they argue ensures full coverage. The API checker traverses the Abstract Syntax Tree to record the method calls and control structures, determines the call sequence, and then checks that against the API usage rules.
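
The rule-checking step can be illustrated with a toy sketch – this is not the authors' RobustAPI implementation, and the rule list is hypothetical. Assume the AST walk has already produced an ordered list of call events, each tagged with whether it sits inside a try-catch; a checker then flags any call that the API rules say must be guarded:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the rule check. RobustAPI's real checker walks the AST;
// here we assume the walk has already produced a flat call sequence.
public class ApiRuleCheck {
    // One extracted call site: method name plus whether it sits inside a try-catch.
    record CallEvent(String method, boolean insideTryCatch) {}

    // Hypothetical rule: these calls must be guarded by a try-catch block.
    static final List<String> MUST_BE_GUARDED =
            List.of("RandomAccessFile.read", "FileInputStream.read");

    static List<String> findMisuses(List<CallEvent> sequence) {
        List<String> misuses = new ArrayList<>();
        for (CallEvent e : sequence) {
            if (MUST_BE_GUARDED.contains(e.method()) && !e.insideTryCatch()) {
                misuses.add(e.method());
            }
        }
        return misuses;
    }

    public static void main(String[] args) {
        // A call sequence like the article's RandomAccessFile example:
        // read() appears outside any try-catch, so it gets flagged.
        List<CallEvent> seq = List.of(
                new CallEvent("RandomAccessFile.<init>", false),
                new CallEvent("RandomAccessFile.read", false));
        System.out.println(findMisuses(seq));
    }
}
```

Modelling the check as a pass over an extracted call sequence mirrors the paper's description; the real tool derives that sequence from the syntax tree rather than receiving it ready-made.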

As an example, the snippet below would be flagged for failing to enclose the code in a try-catch block to handle failures.

RandomAccessFile raf = new RandomAccessFile("/tmp/file.json", "r"); 
byte[] buffer = new byte[1024 * 1024]; 
int bytesRead =, 0, buffer.length);
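
For comparison, a version that would satisfy the checker guards the I/O with try-catch – here via try-with-resources, which also closes the file. This is a sketch, not code from the paper; the helper method and its -1 error convention are illustrative:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class SafeRead {
    // Returns the number of bytes read, or -1 if the file could not be read.
    static int readGuarded(String path) {
        // try-with-resources guards the calls that can throw IOException
        // and closes the file automatically.
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            byte[] buffer = new byte[1024 * 1024];
            return raf.read(buffer, 0, buffer.length);
        } catch (IOException e) {
            // Handle the failure instead of crashing in production.
            System.err.println("read failed: " + e.getMessage());
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(readGuarded("/tmp/file.json"));
    }
}
```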

The boffins, based at the University of California San Diego, tested GPT-3.5 and GPT-4 from OpenAI, and two more open models, Meta's Llama 2 and Vicuna-1.5 from the Large Model Systems Organization. And they did so with three different tests for their set of questions: zero-shot, in which no example of proper API usage was provided in the input prompt; one-shot-irrelevant, in which the provided example is irrelevant to the question; and one-shot-relevant, in which an example of correct API usage is provided in the prompt.
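
The three settings differ only in what example, if any, precedes the question. A hypothetical sketch of how such prompts might be assembled – the paper's actual templates are not reproduced here:

```java
public class PromptSettings {
    // Hypothetical prompt builder for the three evaluation settings;
    // the researchers' real templates may differ.
    static String buildPrompt(String question, String shotExample) {
        if (shotExample == null) {
            return question; // zero-shot: the question alone
        }
        // one-shot: prepend an example, which may be relevant
        // or irrelevant to the question being asked
        return "Example of API usage:\n" + shotExample + "\n\n" + question;
    }

    public static void main(String[] args) {
        String q = "How do I read bytes from a file with RandomAccessFile?";
        System.out.println(buildPrompt(q, null));
        System.out.println(buildPrompt(q,
                "try { raf.read(buf); } catch (IOException e) { /* handle */ }"));
    }
}
```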

The models exhibited overall API misuse rates for the zero-shot test as follows: GPT-3.5 (49.83 percent); GPT-4 (62.09 percent); Llama 2 (0.66 percent); and Vicuna-1.5 (16.97 percent).

That would seem to suggest Llama 2 aced the test, with a failure rate of less than one percent. But that's a misinterpretation of the results – Llama's lack of failures comes from not suggesting much code.

"From the evaluation results, all the evaluated models suffer from the API misuse problems, even for the state-of-the-art commercial models like GPT-3.5 and GPT-4," observe Zhong and Wang. "In zero-shot settings, Llama has the lowest API misuse rate. However, this is partially due to [the fact that] most of the Llama answers do not include any code."

Counterintuitively, they say, GPT-4 has a higher API misuse rate than GPT-3.5 because, as OpenAI has claimed, it's more capable of responding to prompts with source code. The issue is that GPT-4's responses aren't necessarily correct.

For the one-shot-irrelevant test, the misuse scores were: GPT-3.5 (62.00 percent); GPT-4 (64.34 percent); Llama 2 (49.17 percent); and Vicuna-1.5 (48.51 percent).

"API misuse rate of Llama increases significantly after adding the irrelevant shot because it has more valid answers that contain code snippets," the researchers say. "Overall, adding an irrelevant shot triggers the large language models to generate more valid answers, which enables a better evaluation on the code reliability and robustness."

And when spoon-fed with proper API usage on the one-shot-relevant test, the LLMs performed as follows: GPT-3.5 (31.13 percent); GPT-4 (49.17 percent); Llama 2 (47.02 percent); and Vicuna-1.5 (27.32 percent). That is to say, showing the models correct API usage leads to less API misuse.

But overall, the boffins find there's work to be done because it's not enough to generate code. The code also has to be reliable, something which, despite developer affinity for chatbot code assistants, remains a widely noted problem.

"This indicates that [while] the code generation ability of large language models [is] largely improved nowadays, the reliability and robustness of code in real-world production rises as an unnoticed issue. And the space for improvement is huge for this problem." ®
