Meta's AI-based Wikipedia successor 'may be the next big break in NLP'
Don't believe everything you read on the internet
Meta has open-sourced a machine-learning resource that could one day supplant Wikipedia as the world's biggest publicly available knowledge-verification database.
Dubbed Sphere, it can be used to perform knowledge-intensive natural language processing, or KI-NLP, we're told. In practical terms, that means it can be used to answer complicated questions using natural language, and find sources for claims.
One example the researchers give is asking Sphere, "Who is Joëlle Sambi Nzeba?" Wikipedia has no entry for her, but Sphere replied that she was "born in Belgium and grew up partly in Kinshasa (Congo). She currently lives in Brussels. She is a writer and slammer, alongside her activism in a feminist movement," and linked to the website where it found that information about her work.
Wikipedia has pretty much served as the corpus of record, Meta's eggheads wrote in a paper discussing the design of Sphere, claiming the volunteer-maintained uber-wiki is "accurate, well-structured, and small enough to use easily in testing environments."
Seeking to build something bigger and better than Wikipedia, though, Meta pulled together content from all over the web – sans wikipedia.org – to form a "universal, uncurated and unstructured knowledge source for multiple KI-NLP tasks at once." The result is Sphere, which is more or less a mountain of processed data that can be queried using a bunch of machine-learning tools.
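To make the retrieve-and-answer idea concrete, here is a minimal sketch of querying a passage corpus for the best match to a question. This is a toy stand-in, not Meta's actual stack: the three passages and the bag-of-words cosine scoring are invented for illustration, whereas Sphere itself uses learned dense retrievers over hundreds of millions of passages.

```python
# Toy sketch of retrieve-then-read: score every passage against the
# query and return the closest one. The passages below are hypothetical
# stand-ins for Sphere's web-scale corpus.
import math
import re
from collections import Counter

passages = [
    "Joelle Sambi Nzeba is a writer and slam poet living in Brussels.",
    "Wikipedia is a volunteer-maintained online encyclopedia.",
    "Sphere is a web-scale corpus of 134 million documents.",
]

def vectorize(text: str) -> Counter:
    """Lowercase bag-of-words term counts, punctuation stripped."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str) -> str:
    """Return the passage most similar to the query."""
    q = vectorize(query)
    return max(passages, key=lambda p: cosine(q, vectorize(p)))

print(retrieve("Who is Joelle Sambi Nzeba?"))
```

A real system would replace the word-count vectors with neural embeddings and an approximate nearest-neighbour index, then feed the retrieved passages to a reader model that composes the final answer.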
The team adds that Sphere "can match and outperform baselines grounded in Wikipedia" on some tasks using the KILT AI benchmark. That is to say, Sphere performs better than AI systems built on Wikipedia's content.
Sphere's primary aim was to measure the impact of replacing Wikipedia as a source on the performance of knowledge-intensive systems. While the team reported some issues, Sphere's performance indicates that, at the very least, it can add value to KI-NLP tasks beyond what Wikipedia-based corpora offer.
The researchers behind Sphere claim their work marks "the first time a general purpose search index improves language models on common sense tasks."
Sphere isn't the only AI platform Meta has released on GitHub: last week it released NLLB-200, the first translation AI to pass the 200-language threshold, or so the Facebook parent claimed. Like Sphere, NLLB-200 has been put to use at Wikipedia: the former for automatically checking citations in edited articles, the latter to improve translation of pages into less commonly spoken languages.
Sphere goes beyond similar web corpora in terms of scale, consisting of 906 million passages and 134 million documents. The next largest in terms of passages/documents is the Internet Augmented Dialog generator, which pulls data from 250 million passages and 109 million documents.
But the internet contains no controls for quality or accuracy, which the researchers admit is a key problem for actually deploying this thing. "Using Wikipedia as the knowledge source allows researchers to assume the high quality of the corpus documents. When transitioning to a web corpus, we no longer have the certainty that any document is good, truthful or unique," the researchers wrote.
Sphere's creators think future iterations should focus on assessing the quality of retrieved data, detecting false claims and contradictions, working out how to prioritize trustworthy sources, and deciding when not to answer a question for lack of information. You know, making it actually useful.
If Meta can successfully turn Sphere into a white-box AI with reliable and trustworthy information, the company said, it "may be the next big break in NLP." ®