Search engines don't always help chatbots generate accurate answers

Research shows developers have to find new ways to manipulate information for AI

Access to search engines doesn't tend to improve an AI chatbot's ability to generate accurate and up-to-date answers to queries, which means developers will have to find new techniques to make the interaction more useful, according to research.

Large language models (LLMs) like GPT-3.5 – the basis for ChatGPT – are trained on text scraped from the internet up until September 2021. Companies like Google and Microsoft try to augment LLMs with search engines, giving them access to knowledge in current web pages.

As demonstrated by their respective Bard and Bing chatbots, Google and Microsoft still struggle to produce accurate responses to search queries – even though the correct answer may be on the internet somewhere.

"One might think connecting the search engine and ChatGPT is a perfect solution, but the reality is more challenging because of the limited accuracy of search results," Hongyin Luo, a postdoctoral associate at MIT's Computer Science & Artificial Intelligence Laboratory, told The Register.

Luo explains that search engines are keyword-based retrieval systems that often fail to return direct answers to questions. Also, different web pages might contain unrelated, contradictory, or false information. Bing incorrectly claimed Adolf Hitler was a member of the band Radiohead in one search result, for example.

Netizens speculated whether the error could have been caused by a page on Wikidata that mentioned Radiohead and Adolf Hitler.

If Bard and Bing are to be useful, developers will need to figure out how to make LLMs extract the most useful information from a sea of text that is noisy, confusing and inconsistent. Luo and his colleagues from MIT and the Chinese University of Hong Kong believe that models need to be fine-tuned further so they can better follow instructions on how to generate responses for web search.

The team tweaked Meta's LLaMA, a seven-billion-parameter LLM, fine-tuning it on a database containing 52,000 pairs of text-based instructions and corresponding responses generated by GPT-4. The researchers also constructed a separate dataset containing the top five web pages associated with each instruction, and trained the model to generate the correct response by ranking the sources on how relevant and closely aligned they were with the right response.
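The construction described above can be sketched roughly as follows. This is an illustrative outline only, assuming a simple dictionary format per example; the field names and the relevance-scoring function are stand-ins, not the researchers' actual code or data schema.

```python
# Sketch: pair each instruction with its top search results, ranked by a
# (hypothetical) relevance score against the known-good GPT-4 response.

def build_training_example(instruction, response, search_results, score_relevance):
    """Return one search-augmented training example as a dict."""
    ranked = sorted(
        search_results,
        key=lambda page: score_relevance(page, response),
        reverse=True,
    )
    return {
        "instruction": instruction,
        "search_results": ranked[:5],  # top five pages per instruction
        "response": response,          # GPT-4-generated target answer
    }

# Toy relevance scorer for illustration: bag-of-words overlap.
def word_overlap(page, response):
    return len(set(page.lower().split()) & set(response.lower().split()))

example = build_training_example(
    "Who wrote 'Creep'?",
    "The song Creep was written by Radiohead.",
    ["Radiohead wrote the song Creep in 1992.", "Unrelated page about cars."],
    word_overlap,
)
```

In a real pipeline the scorer would be replaced by whatever alignment signal the training setup uses; the point is only that each instruction carries its ranked web evidence alongside the target response.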

Luo said the fine-tuned model – nicknamed SAIL-7B, where SAIL stands for search-augmented instruction learning and 7B refers to its seven billion parameters – is better at ignoring distracting or untrustworthy search results and generates higher quality answers. The details have been published [PDF] in a paper released on arXiv, and the model's code is on GitHub. You can also play with a demo of the system hosted on Hugging Face.

"Our model learns to find helpful information from noisy search results and generate responses that are as accurate as possible. As a result, our model can better summarize valuable information and generate better answers for various search queries, even when search engines cannot handle them very well," Luo said.

"Our training explicitly includes a step that clarifies if each search result is helpful or not, and the language model follows the selected helpful information. This process filters out most unreliable and unrelated search results and improves the average instruction-following performance."
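That filtering step can be pictured with a short sketch: label each search result as helpful or not, then condition generation only on the results that survive. The helpfulness check here is a crude word-overlap stand-in, not the model's actual learned judgment, and the prompt layout is an assumption for illustration.

```python
# Sketch: drop unhelpful search results before building the generation prompt.

def filter_helpful(results, is_helpful):
    """Keep only the search results judged helpful."""
    return [r for r in results if is_helpful(r)]

def build_prompt(question, results, is_helpful):
    """Assemble a prompt from the question plus the filtered results."""
    kept = filter_helpful(results, is_helpful)
    context = "\n".join(f"- {r}" for r in kept)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy helpfulness check: reject results sharing no words with the question.
def naive_is_helpful_for(question):
    q_words = set(question.lower().split())
    return lambda result: bool(q_words & set(result.lower().split()))

prompt = build_prompt(
    "who formed radiohead",
    ["Radiohead formed in Abingdon in 1985.", "Hitler led Germany."],
    naive_is_helpful_for("who formed radiohead"),
)
```

In SAIL-7B the helpful-or-not decision is made by the fine-tuned model itself as part of instruction following, rather than by a separate heuristic like this one.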

Initial experiments showed that SAIL-7B outperformed GPT-3.5 and other models containing more parameters at a range of tasks. The experiments assessed their abilities to answer common sense and open-ended questions, check facts, and detect hate speech. The models were fed web pages from Wikipedia and search results from DuckDuckGo to help them pick the right answers from a list of candidate responses. GPT-4, however, was still better than SAIL-7B.

"The challenge is that larger models have much stronger knowledge, memorizing and reasoning abilities, so our model is not as good as GPT-4 yet. However, SAIL-7B is a proof of concept with a 'small' model, and our next step is training a larger model with the strategy we have proposed," Luo told us.

Models fine-tuned with the current search-augmented instruction learning technique aren't perfect, however. The researchers noted that they cannot explain why a search result is trustworthy or not. They hope to come up with another strategy to increase accuracy and reliability in the future. ®
