Boffins find AI stumbles when quizzed on the tough stuff
Must try harder, D+
AI models can manage well enough when prompted with text or images, and may even solve complex problems when not making terrible errors.
OpenAI, for example, has said that its GPT-4 model managed to score 700 out of 800 on the SAT math exam. Not all such claims have held up, however: a paper released in June that said GPT-4 could get a computer science degree at MIT was subsequently withdrawn.
So to better assess how large language models – which interpret text input – and large multimodal models – which interpret text, images and perhaps other forms of input – actually handle problem solving, a group of ten researchers from the University of California, Los Angeles, the University of Washington, and Microsoft Research have devised a testing benchmark called MathVista that focuses on visually oriented challenges.
"The ability of these foundation models to perform mathematical reasoning in visual contexts has not been systematically examined," say the authors – Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao, in a preprint paper [PDF].
It is thus essential, they say, to develop a new benchmark to help the development of mathematical reasoning with a visual component and to evaluate how various models compare at reasoning tasks.
Being able to show that one's AI model can correctly solve visual problems may prove helpful in determining whether it's wise to, say, trust software to drive a car without stopping atop an accident victim.
MathVista incorporates 6,141 examples that were developed from 28 multimodal datasets and from 3 new datasets called IQTest, FunctionQA, and PaperQA. It covers various forms of reasoning (algebraic, arithmetic, geometric, logical, numeric, scientific, and statistical), with a focus on figure question answering, geometry problem solving, math word problems, textbook questions, and visual questions.
The researchers tested a dozen foundation models: three LLMs (ChatGPT, GPT-4, and Claude-2), two proprietary LMMs (GPT-4V and Bard), and seven open-source LMMs. They also considered human answers, provided by Amazon Mechanical Turk workers with at least a high school degree, and random responses.
The good news for AI practitioners is that the LLMs and LMMs all did better than random chance, which isn't all that surprising considering that many of the questions were multiple choice rather than yes or no.
In fact, the top performer, OpenAI's GPT-4V, managed to surpass human performance in specific areas – questions involving algebraic reasoning and complex visual challenges involving tables and function plots.
We note that Microsoft, whose researchers contributed to this project, has a substantial stake in OpenAI.
The less good news is that even GPT-4V only managed to get 49.9 percent of the questions correct. That's adequate if the goal is to best multimodal Bard, which managed an accuracy percentage of 34.8 percent.
But it's still shy of the Amazon Mechanical Turk workers who were put to the test and managed a score of 60.3 percent. As the researchers observe in their paper, "a 10.4 percent gap in overall accuracy remains when compared to the human baseline, leaving plenty of room for model improvement." ®
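For readers who want to check the sums, here is a minimal sketch that recomputes the gaps from the three overall accuracy figures quoted above. The helper names `point_gap` and `relative_gap` are our own and do not come from the paper; the only inputs are the scores reported in this article. Note that the "10.4 percent" in the researchers' quote is a difference in percentage points, not a relative shortfall.

```python
# Overall accuracy (percent correct) as quoted in the article.
scores = {
    "GPT-4V": 49.9,
    "Bard": 34.8,
    "Human (Mechanical Turk)": 60.3,
}


def point_gap(higher, lower):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(higher - lower, 1)


def relative_gap(higher, lower):
    """Shortfall of `lower` expressed as a percentage of `higher`."""
    return round((higher - lower) / higher * 100, 1)


human = scores["Human (Mechanical Turk)"]
gpt4v = scores["GPT-4V"]
bard = scores["Bard"]

human_vs_gpt4v = point_gap(human, gpt4v)         # human lead over GPT-4V
gpt4v_vs_bard = point_gap(gpt4v, bard)           # GPT-4V lead over Bard
human_vs_gpt4v_rel = relative_gap(human, gpt4v)  # same gap, relative to human score

print(f"Human vs GPT-4V: {human_vs_gpt4v} points ({human_vs_gpt4v_rel}% relative)")
print(f"GPT-4V vs Bard:  {gpt4v_vs_bard} points")
```

The rounding to one decimal place is there only to keep floating-point noise out of the printed figures.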