A team within Google Brain – the web giant's crack machine-learning research lab – has taught software to generate Wikipedia-style articles by summarizing information on web pages... to varying degrees of success.
As we all know, the internet is a never ending pile of articles, social media posts, memes, joy, hate, and blogs. It’s impossible to read and keep up with everything. Using AI to tell pictures of dogs and cats apart is cute and all, but if such computers could condense information down into useful snippets, that would be really be handy. It's not easy, though.
A paper, out last month and just accepted for this year’s International Conference on Learning Representations (ICLR) in April, describes just how difficult text summarization really is.
A few companies have had a crack at it. Salesforce trained a recurrent neural network with reinforcement learning to take information and retell it in a nutshell, and the results weren’t bad.
However, the computer-generated sentences are simple and short; they lacked the creative flair and rhythm of text written by humans. Google Brain’s latest effort is slightly better: the sentences are longer and seem more natural.
Here’s an example for the topic: Wings over Kansas, an aviation website for pilots and hobbyists. The paragraph on the left is a computer-generated summary of the organization, and the one on the right is taken from the Wikipedia page on the subject.
Left: Automated Wikipedia entry for Wings over Kansas. Right: The actual Wikipedia entry written by humans. Image credit: Liu et al.
The software-scribbled passage is a bit difficult to read without clear capital letters at the start of new sentences, and most sentences have the same rigid structure. Overall, it’s still pretty readable. The text generation seems to work OK, in your humble vulture's opinion, although for this particular example, the summarization aspect is not great, since it's longer than the corresponding entry in Wikipedia.
The model works by taking the top ten web pages of a given subject – excluding the Wikipedia entry – or scraping information from the links in the references section of a Wikipedia article. Most of the selected pages are used for training, and a few are kept back to develop and test the system.
The paragraphs from each page are ranked and the text from all the pages are added to create one long document. The text is encoded and shortened, by splitting it into 32,000 individual words and used as input.
This is then fed into an abstractive model, where the long sentences in the input are cut shorter. It’s a clever trick used to both create and summarize text. The generated sentences are taken from the earlier extraction phase and aren’t built from scratch, which explains why the structure is pretty repetitive and stiff.
Mohammad Saleh, co-author of the paper and a software engineer in Google AI’s team, told The Register: “The extraction phase is a bottleneck that determines which parts of the input will be fed to the abstraction stage. Ideally, we would like to pass all the input from reference documents.
“Designing models and hardware that can support longer input sequences is currently an active area of research that can alleviate these limitations.”
We are still a very long way off from effective text summarization or generation. And while the Google Brain project is rather interesting, it would probably be unwise to use a system like this to automatically generate Wikipedia entries. For now, anyway.
Also, since it relies on the popularity of the first ten websites on the internet for any particular topic, if those sites aren’t particularly credible, the resulting handiwork probably won’t be very accurate either. You can't trust everything you read online, of course. ®