
Even complex AI models are failing 5th grade science

Even the sharpest of models fail to melt ice or build a circuit

Think your AI agents are actually learning to solve problems? A new benchmark sheds light on what is real when it comes to sophisticated AI.

Researchers from the University of Arizona, Microsoft, and the Allen Institute for AI tested several state-of-the-art agents and found them readily able to answer the "what" of a situation, but unable to work out the "how."

The agents were put to the test using a purpose-built benchmark the researchers call ScienceWorld. ScienceWorld will be immediately familiar to anyone who's played an old-school, text-based MUD: it has multiple rooms, objects that can be interacted with, and tasks to be performed. In this case, the goal isn't killing goblins but completing the equivalent of elementary-level science projects.

ScienceWorld has simulation engines for thermodynamics, electrical circuits, matter and chemistry reactions, and biological processes. The researchers got their list of experiments for the agents by turning typical science test questions into experiments, and then testing the agents to see if they could reason to the answer.

The fact that agents can quickly answer "whats," but not "hows," raises "the question of whether current models are simply retrieving answers by way of seeing a large number of similar input examples or if they have learned to reason about concepts in a reusable manner," the researchers said. 

Not to spoil it, but there's no reasoning going on inside those digital brains.

ScienceWorld's builders were looking for digital agents to do one thing in particular – "combine declarative scientific and world knowledge with the procedural knowledge required to correctly complete the experiment."

In one experiment, agents were tested to see if they could identify a fork, find the materials needed to test it for conductivity, and then put it in the correct box.

Another experiment had agents trying to determine if an ice cube would melt on a stove. Again, this requires the agent to identify, pick up, and manipulate several objects inside ScienceWorld. As an added level of challenge in all situations, various properties of objects inside ScienceWorld (location, color, etc.) change each time the simulation is started to prevent agents from simply memorizing a sequence.
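As a rough illustration of that last point, here is a minimal Python sketch of per-episode randomization. It is not ScienceWorld's code: the room and color lists, the `make_episode` function, and the layout fields are all hypothetical, invented for this example.

```python
import random

# Hypothetical toy world; these names are illustrative, not from ScienceWorld.
ROOMS = ["kitchen", "workshop", "hallway"]
COLORS = ["red", "blue", "green"]

def make_episode(seed):
    """Build a toy episode layout whose details depend on the seed."""
    rng = random.Random(seed)  # private generator, so each seed is reproducible
    return {
        "fork_room": rng.choice(ROOMS),   # where the fork starts out
        "box_color": rng.choice(COLORS),  # which box counts as the answer box
    }

if __name__ == "__main__":
    # Print a few layouts to see how they differ from run to run.
    for seed in range(5):
        print(seed, make_episode(seed))
```

An agent that had memorized "go to the kitchen, grab the fork" from one layout would find the fork somewhere else under another seed, so a fixed action sequence stops working.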

Each of the 30 tasks in ScienceWorld is scored on a scale from 0, a total failure, to 1, indicating perfect performance. The highest score for any AI under test was 0.54, and that was on one of the simplest tasks: identifying a non-living thing. For the ice, the best was 0.04.

In fact, a random-action generator stood out, scoring 0.63 for identifying a non-living thing. Results for building circuits were also abysmal, and virtually all the scores were low. This led the academics to conclude:

"Agents for text-based games as well as novel models adapted from transformer-based scientific question-answering solvers perform poorly on tasks (such as melting ice) that 5th grade science students can perform with ease."

Fifth graders in the USA are typically aged 10 to 11.

"Overall, these tasks are challenging for current models, with the best model (DRRN) achieving an average score of 0.18 across all 30 subtasks," the paper said. So, which models performed best? Even that's a tricky question to answer.

The researchers found that models given a valid-action detection aid tended to perform better than those that must first learn to generate valid actions, and that models using large language model components for action selection tended to perform worse.
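To make that distinction concrete, here is a toy Python sketch comparing an agent that picks from a supplied list of valid actions with one that has to compose commands itself. Everything in it is an assumption made for illustration: the verbs, objects, set of valid actions, and function names are hypothetical and do not come from ScienceWorld or the paper.

```python
import random

# Hypothetical toy vocabulary; not taken from ScienceWorld.
VERBS = ["open", "pick up", "move", "activate"]
OBJECTS = ["door", "ice cube", "stove", "fork"]

# In this toy room only a few verb/object pairs are accepted.
VALID_ACTIONS = {"open door", "pick up ice cube", "pick up fork", "activate stove"}

def generate_action(rng):
    """Agent without the aid: composes a command from raw parts."""
    return f"{rng.choice(VERBS)} {rng.choice(OBJECTS)}"

def choose_valid_action(rng):
    """Agent with the aid: picks from the list of valid actions it is handed."""
    return rng.choice(sorted(VALID_ACTIONS))

def valid_fraction(policy, trials=1000, seed=0):
    """Fraction of a policy's commands that the toy room would accept."""
    rng = random.Random(seed)
    hits = sum(policy(rng) in VALID_ACTIONS for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    print("with aid:   ", valid_fraction(choose_valid_action))
    print("without aid:", valid_fraction(generate_action))
```

The agent with the aid never wastes a turn on a command the room rejects, while the untrained generator mostly does, since only 4 of its 16 possible verb/object pairings are valid. A real agent would learn over time, but it starts from that handicap.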

Interactive reinforcement learning models were able to quickly identify and classify objects, but had difficulty picking objects up and putting them in the right box. Open-ended tasks, like those requiring the bot to change the state of an object, were difficult for all the models.

The biggest takeaway from the project comes from another finding – that agents with larger models don't necessarily perform better. The DRRN model had only 1.5 million parameters, four orders of magnitude fewer than the pair of T5 models used in the experiment, yet it performed better.

"Our results also suggest that agents that learn interactively in a grounded environment are more sample and parameter efficient than large language models that learn offline by reading text from static sources," the report concludes. ®
