Top LLMs struggle to make accurate legal arguments

AI can't cite cases, fully grok the law, or reason about it effectively, study finds

Interview If you think generative AI has an automatic seat at the table in the world of law, think again.

Top large language models tend to generate inaccurate legal information and should not be relied upon for litigation, fresh research has shown.

Last year, when OpenAI showed GPT-4 was capable of passing the Bar Exam, it was heralded as a breakthrough in AI and led some people to question whether the technology could soon replace lawyers. Some hoped these types of models could empower people who can't afford expensive attorneys to pursue legal justice, making access to legal help more equitable. The reality, however, is that LLMs can't even assist professional lawyers effectively, according to a recent study.

The biggest concern is that AI often fabricates information, a huge problem in an industry that relies on factual evidence. A team of researchers at Yale and Stanford University analyzing the rates of hallucination in popular large language models found that they often do not accurately retrieve or generate relevant legal information, or understand and reason about various laws.

In fact, OpenAI's GPT-3.5, which currently powers the free version of ChatGPT, hallucinates about 69 percent of the time when tested across different tasks. The results were worse for PaLM-2, the system that was previously behind Google's Bard chatbot, and Llama 2, the large language model released by Meta, which generated falsehoods at rates of 72 and 88 percent, respectively.

Unsurprisingly, the models struggle more with complex tasks than with easier ones. Asking AI to compare different cases and see whether they agree on an issue, for example, is challenging, and it is more likely to generate inaccurate information than when faced with an easier task, such as checking which court a case was filed in.

Although LLMs excel at processing large amounts of text, and can be trained on huge amounts of legal documents – more than any human lawyer could read in their lifetime – they don't understand law and can't form sound arguments.

"While we've seen these kinds of models make really great strides in forms of deductive reasoning in coding or math problems, that is not the kind of skill set that characterizes top notch lawyering," Daniel Ho, co-author of the Yale-Stanford paper, tells The Register.

"What lawyers are really good at, and where they excel is often described as a form of analogical reasoning in a common law system, to reason based on precedents," added Ho, who is faculty associate director of the Stanford Institute for Human-Centered Artificial Intelligence.

Machines often fail in simple tasks too. When asked to inspect a name or citation to check whether a case is real, GPT-3.5, PaLM-2, and Llama 2 can make up fake information in responses.

"The model doesn't need to know anything about the law honestly to answer that question correctly. It just needs to know whether or not a case exists or not, and can see that anywhere in the training corpus," Matthew Dahl, a PhD law student at Yale University, says.

It shows that AI cannot even retrieve information accurately, and that there's a fundamental limit to the technology's capabilities. These models are often primed to be agreeable and helpful. They usually won't bother correcting users' assumptions, and will side with them instead. If chatbots are asked to generate a list of cases in support of some legal argument, for example, they are more predisposed to make up lawsuits than to respond with nothing. A pair of attorneys learned this the hard way when they were sanctioned for citing cases that were completely invented by OpenAI's ChatGPT in their court filing.

The researchers also found the three models they tested were more knowledgeable about federal litigation involving the US Supreme Court than about localized legal proceedings in smaller and less powerful courts.

Since GPT-3.5, PaLM-2, and Llama 2 were trained on text scraped from the internet, it makes sense that they would be more familiar with the US Supreme Court's legal opinions, which are published publicly, than with legal documents filed in other types of courts, which are not as easily accessible.

They were also more likely to struggle with tasks that involved recalling information from very old or very recent cases.

"Hallucinations are most common among the Supreme Court's oldest and newest cases, and least common among its post-war Warren Court cases (1953-1969)," according to the paper. "This result suggests another important limitation on LLMs' legal knowledge that users should be aware of: LLMs' peak performance may lag several years behind the current state of the doctrine, and LLMs may fail to internalize case law that is very old but still applicable and relevant law."

Too much AI could create a 'monoculture'

The researchers were also concerned that overreliance on these systems could create a legal "monoculture." Since AI is trained on a limited amount of data, it will tend to refer to more prominent, well-known cases, leading lawyers to ignore other legal interpretations or relevant precedents. They may overlook cases that could help them see different perspectives or arguments, which could prove crucial in litigation.

"The law itself is not monolithic," Dahl says. "A monoculture is particularly dangerous in a legal setting. In the United States, we have a federal common law system where the law develops differently in different states in different jurisdictions. There's sort of different lines or trends of jurisprudence that develop over time."

"It could lead to erroneous outcomes and unwarranted reliance in a way that could actually harm litigants," Ho adds. He explains that a model could generate inaccurate responses for lawyers or people looking to understand something like eviction laws.

"When you seek the help of a large language model, you might be getting the exact wrong answer as to when is your filing due or what is the kind of rule of eviction in this state," he says, citing an example. "Because what it's telling you is the law in New York or the law of California, as opposed to the law that actually matters to your particular circumstances in your jurisdiction."

The researchers conclude that the risks of using these types of popular models for legal tasks are highest for those submitting paperwork in lower courts across smaller states, particularly if they have less expertise and are querying the models based on false assumptions. These people are more likely to be less powerful lawyers from smaller law firms with fewer resources, or people looking to represent themselves.

"In short, we find that the risks are highest for those who would benefit from LLMs most," the paper states. ®
