AI-generated bug reports are seriously annoying for developers

Hallucinated programming flaws vex curl project

Generative AI models like Google Bard and GitHub Copilot have a user problem: Those who rely on software assistance may not understand or care about the limitations of these machine learning tools.

This has come up in various industries. Lawyers have been sanctioned for citing cases invented by chatbots in their legal filings. Publications have been pilloried for articles attributed to fake authors. And ChatGPT-generated medical content is about 7 percent accurate.

Though AI models have demonstrated utility for software development, they still get many things wrong. Attentive developers can mitigate these shortcomings, but that doesn't always happen – due to ignorance, indifference, or ill intent. And when AI is allowed to make a mess, the cost of cleanup is shifted to someone else.

On Tuesday, Daniel Stenberg, the founder and lead developer of widely used open source projects curl and libcurl, raised this issue in a blog post in which he describes the rubbish problem created by cavalier use of AI for security research.

The curl project offers a bug bounty to security researchers who find and report legitimate vulnerabilities. According to Stenberg, the program has paid out over $70,000 in rewards to date. Of 415 vulnerability reports received, 64 have been confirmed as security flaws and 77 have been deemed informative – bugs without obvious security implications. So about 66 percent of the reports have been invalid.
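The 66 percent figure follows directly from the counts Stenberg cites. Here is a minimal sketch of that arithmetic in Python, assuming every report that was neither confirmed nor deemed informative counts as invalid (the variable names are illustrative, not from Stenberg's post):

```python
# Counts as reported for curl's bug bounty program.
total_reports = 415
confirmed_flaws = 64
informative = 77

# Assumption: anything neither confirmed nor informative is treated as invalid.
invalid = total_reports - confirmed_flaws - informative
invalid_share = invalid / total_reports

print(f"{invalid} of {total_reports} reports invalid ({invalid_share:.0%})")
# prints "274 of 415 reports invalid (66%)"
```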

The issue for Stenberg is that these reports still need to be investigated, and that takes developer time. And while those submitting bug reports have begun using AI tools to accelerate the process of finding supposed bugs and writing up reports, those on the receiving end still have to review each one by hand. The result of this asymmetry is more plausible-sounding reports, because chatbot models can produce detailed, readable text without regard to accuracy.

As Stenberg puts it, AI produces better crap.

"The better the crap, the longer time and the more energy we have to spend on the report until we close it," he wrote. "A crap report does not help the project at all. It instead takes away developer time and energy from something productive. Partly because security work is considered one of the most important areas so it tends to trump almost everything else."

As examples, he cites two reports submitted to HackerOne, a vulnerability reporting community. One claimed to describe curl CVE-2023-38545 prior to its actual disclosure. But Stenberg had to post to the forum to make clear that the bug report was bogus.

He said that the report, produced with the help of Google Bard, "reeks of typical AI style hallucinations: it mixes and matches facts and details from old security issues, creating and making up something new that has no connection with reality."

The other report, submitted last week, claimed to have found a Buffer Overflow Vulnerability in WebSocket Handling. After posting a series of questions to the forum and receiving dubious answers from the bug reporting account, Stenberg concluded no such flaw existed and suspected that he had been conversing with an AI model.

"After repeated questions and numerous hallucinations I realized this was not a genuine problem and on the afternoon that same day I closed the issue as not applicable," he wrote. "There was no buffer overflow."

He added, "I don’t know for sure that this set of replies from the user was generated by an LLM but it has several signs of it."

Stenberg readily acknowledges that AI assistance can be genuinely helpful. But he argues that having a human in the loop makes the use and outcome of AI tools much better. Even so, he expects the ease and utility of these tools, coupled with the financial incentive of bug bounties, will lead to more shoddy LLM-generated security reports, to the detriment of those on the receiving end.

Feross Aboukhadijeh, CEO of security biz Socket, echoed Stenberg's observations.

"There are many positive ways that LLMs are being used to help defenders, but unfortunately LLMs also help attackers in a few key ways," said Aboukhadijeh in an email to The Register. "Already, we’re seeing LLMs be used to help attackers send more convincing spam and even craft targeted spear-phishing attacks at scale. Yet, it's important to note that even Daniel recognizes the enormous positive potential of LLMs, specifically to help find security vulnerabilities."

Aboukhadijeh said Socket has been using LLMs in conjunction with human reviewers to detect vulnerable and malicious open source packages in the JavaScript, Python, and Go ecosystems.

"The human review is absolutely critical to reduce false positives," he said. "Without human review, the system has a 67 percent false positive rate. With humans in the loop, it’s closer to 1 percent. Today, Socket detects around 400 malicious packages per week." ®
