LLMs can write and answer quizzes – but aren't quite ready to disrupt trivia night

Feed AutoQuizzer a URL and it will use LLaMa-3 to make a decent multiple-choice test

A developer has put large language models (LLMs) to the test, literally, by creating AutoQuizzer – a tool that generates quizzes from the text of web pages.

The application was made by Stefano Fiorucci – whose day job sees him toil as a software engineer for enterprise AI outfit Deepset – and the code is available on GitHub. Fiorucci also hosts a version of AutoQuizzer on Hugging Face.

Using the app is easy: Feed it a URL, click "Generate quiz," and then prepare to test yourself against an LLM's interpretation of the page's content in a multiple-choice quiz created by the model. The system attempts to generate five questions per page.

In our testing, it only took a second or two to create a quiz, which users can complete themselves or hand back to be answered by the AI system. When the app itself takes the quiz, you have the option to force it into a "closed book exam" mode, in which the model relies just on the page topic, the questions, and any information it was trained on to pick an answer. Alternatively, the AI can be allowed to consider the top three Google search results regarding the topic of the web page. In either mode, the AI code needs a handful of seconds to come up with answers.
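
For the curious, the difference between the two modes comes down to what context the model sees alongside each question. A rough, hypothetical sketch in Python – the function and prompt wording here are our own illustration, not AutoQuizzer's actual code:

    def build_answer_prompt(topic, question, options, search_snippets=None):
        """Assemble the prompt for answering one quiz question.

        In "closed book" mode, search_snippets is None, so the model sees only
        the page topic, the question, and the answer options. In the
        web-assisted mode, snippets from the top Google results are appended
        as extra context.
        """
        prompt = f"Topic: {topic}\nQuestion: {question}\nOptions:\n"
        prompt += "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
        if search_snippets:
            prompt += "\n\nWeb search results:\n" + "\n".join(search_snippets)
        prompt += "\n\nReply with the number of the correct option only."
        return prompt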

Fiorucci explained to The Register that creating AutoQuizzer was actually "very simple," since the components to build it were already available. The app uses Deepset's open source framework Haystack to extract text from a specified page and pass it to Meta's LLaMa-3-8B-Instruct LLM via Groq's free inference API. The model is prompted to analyze the text and return a quiz as JSON, which the web app renders for either the user or LLaMa itself to answer.
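
For readers who want a feel for what that looks like in practice, here is a minimal sketch of a comparable Haystack 2.x pipeline, assuming Groq's OpenAI-compatible endpoint and a GROQ_API_KEY environment variable; the prompt template and model identifier are illustrative rather than lifted from AutoQuizzer:

    from haystack import Pipeline
    from haystack.components.fetchers import LinkContentFetcher
    from haystack.components.converters import HTMLToDocument
    from haystack.components.builders import PromptBuilder
    from haystack.components.generators import OpenAIGenerator
    from haystack.utils import Secret

    # Illustrative prompt; AutoQuizzer's actual template will differ.
    template = """Read the following page and write a five-question
    multiple-choice quiz about it. Answer only with JSON of the form
    {"questions": [{"question": "...", "options": ["..."], "answer_index": 0}]}.

    Page text:
    {{ documents[0].content }}"""

    pipeline = Pipeline()
    pipeline.add_component("fetcher", LinkContentFetcher())      # download the page
    pipeline.add_component("converter", HTMLToDocument())        # strip it to text
    pipeline.add_component("prompt", PromptBuilder(template=template))
    pipeline.add_component("llm", OpenAIGenerator(
        api_key=Secret.from_env_var("GROQ_API_KEY"),
        api_base_url="https://api.groq.com/openai/v1",  # Groq speaks the OpenAI API
        model="llama3-8b-8192",                          # LLaMa-3-8B on Groq
    ))
    pipeline.connect("fetcher.streams", "converter.sources")
    pipeline.connect("converter.documents", "prompt.documents")
    pipeline.connect("prompt.prompt", "llm.prompt")

    result = pipeline.run({"fetcher": {"urls": ["https://en.wikipedia.org/wiki/Quiz"]}})
    print(result["llm"]["replies"][0])  # the quiz, as a JSON string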

It's possible to use other, more powerful LLMs, Fiorucci noted, but there are specific reasons why he uses LLaMa-3-8B for AutoQuizzer. Perhaps most importantly, the model – being relatively small and fast – can be used for free via Groq's API, making a free-to-use web-based demo possible.

Other small LLMs didn't pan out. "I tried Phi-3-mini by Microsoft because it has very good performance on benchmarks, despite its small size: it has fewer than 4 billion parameters. Compared to LLaMa-3, it failed to produce valid JSON, and the quiz questions sometimes were too easy or poorly created," Fiorucci said.
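
That "valid JSON" requirement is less trivial than it sounds: the web app has to parse the model's reply before it can render anything. A minimal sketch of the kind of check involved, against an illustrative schema rather than AutoQuizzer's actual one:

    import json

    def parse_quiz(reply: str) -> list[dict]:
        """Parse the LLM's reply; raise if it isn't the JSON we asked for."""
        try:
            data = json.loads(reply)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Model did not return valid JSON: {exc}") from exc
        questions = data.get("questions", [])
        for q in questions:
            # The expected keys here are an assumption, for illustration only
            if not {"question", "options", "answer_index"} <= q.keys():
                raise ValueError(f"Malformed question entry: {q}")
        return questions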

Anyone who wants to build their own version of AutoQuizzer around another LLM, such as a more powerful, larger model, can do so; Fiorucci said LLaMa-3 can be swapped out for, say, a member of the GPT family.
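
Because Haystack's OpenAIGenerator speaks the standard OpenAI API, that swap is, in principle, a small change to a pipeline like the sketch above – again an illustration, not Fiorucci's code:

    from haystack.components.generators import OpenAIGenerator
    from haystack.utils import Secret

    # Point the same pipeline at an OpenAI GPT model instead of LLaMa-3 on Groq
    llm = OpenAIGenerator(
        api_key=Secret.from_env_var("OPENAI_API_KEY"),
        model="gpt-4o-mini",  # any GPT-family chat model name would do here
    )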

Proof of concept – not a trivia night disrupter

To stay within the limits of the free Groq API, AutoQuizzer only sends the first 4,000 characters of a web page to the LLM for analysis. Fiorucci told The Register LLaMa-3-8B copes better with sources like Wikipedia articles than it does with news articles. That said, the character limit is more likely to bite on Wikipedia pages, which is inconvenient: News stories tend to put the most important information at the start, whereas Wikipedia entries, beyond their opening summary, aren't organized that way.
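
The cap itself is as simple as it sounds – something along these lines, with the limit as an assumed constant rather than whatever AutoQuizzer uses internally:

    MAX_CHARS = 4_000  # stay within the free Groq tier

    def truncate_page_text(text: str, limit: int = MAX_CHARS) -> str:
        """Keep only the first `limit` characters of the extracted page text."""
        return text[:limit]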

In The Register's testing, AutoQuizzer usually served up decent questions and suitable answers. Almost every single question had four basic answer choices – with only one question ever offering an "all of the above" option – and all questions were on topic. It can even generate questions in English from a non-English article, though this is not ideal for the LLM and can introduce mistakes.

When we let LLaMa-3-8B answer the quizzes it generated, it usually answered three or four of the five questions correctly when allowed access to Google results – which isn't half bad but is, well, cheating. Also, one might expect an LLM to be able to answer its own questions, given the text-completion nature of these sorts of language models.

We did find some quirks. Some questions had duplicated or very similar answer choices, or answers that didn't completely address the question.

The tool could also miss the point of the content it was asked to consider. This Register article about Microsoft offering relocation opportunities to Chinese employees prompted the AI to ask: "What is the reason for the increased tariffs on electric vehicles in China, according to the article?" The correct answer was: "Due to US President Joe Biden's decision."

Which is kind of correct, but unrelated to the text AutoQuizzer was asked to consider.

For Fiorucci, however, the point of AutoQuizzer isn't to test LLMs in a unique way or to provide some kind of practical use case. "AutoQuizzer is part of an effort to show how easily you can make both demos and production software using Haystack," he explained, referring to his employer's framework, natch. "Haystack is a powerful open source framework for building applications based on large language models."

"In its current form, AutoQuizzer is a hobby project," he conceded, noting "it could be turned into a library or CLI application. It might also serve as inspiration for creating similar, more refined tools in the educational or entertainment fields."

Given the quality of its output, LLaMa-3-8B probably isn't the right tool for such an application in education or academia, though perhaps a more powerful or a future model would be more usable. Indeed, some coders might already be working to refine AutoQuizzer – GitHub reveals six forks. ®
