Look out, Wiki-geeks. Now Google trains AI to write Wikipedia articles

Er, well, ish. Text summarization is still pretty tricky for non-humans, though


A team within Google Brain – the web giant's crack machine-learning research lab – has taught software to generate Wikipedia-style articles by summarizing information on web pages... to varying degrees of success.

As we all know, the internet is a never-ending pile of articles, social media posts, memes, joy, hate, and blogs. It’s impossible to read and keep up with everything. Using AI to tell pictures of dogs and cats apart is cute and all, but if such computers could condense information down into useful snippets, that would be really handy. It's not easy, though.

A paper, out last month and just accepted for this year’s International Conference on Learning Representations (ICLR) in April, describes just how difficult text summarization really is.

A few companies have had a crack at it. Salesforce trained a recurrent neural network with reinforcement learning to take information and retell it in a nutshell, and the results weren’t bad.

However, the computer-generated sentences were simple and short; they lacked the creative flair and rhythm of text written by humans. Google Brain’s latest effort is slightly better: the sentences are longer and seem more natural.

Here’s an example for the topic: Wings over Kansas, an aviation website for pilots and hobbyists. The paragraph on the left is a computer-generated summary of the organization, and the one on the right is taken from the Wikipedia page on the subject.

Left: Automated Wikipedia entry for Wings over Kansas. Right: The actual Wikipedia entry written by humans. Image credit: Liu et al.

The software-scribbled passage is a bit difficult to read without clear capital letters at the start of new sentences, and most sentences have the same rigid structure. Overall, it’s still pretty readable. The text generation seems to work OK, in your humble vulture's opinion, although for this particular example, the summarization aspect is not great, since it's longer than the corresponding entry in Wikipedia.

The model works by taking the top ten web pages of a given subject – excluding the Wikipedia entry – or scraping information from the links in the references section of a Wikipedia article. Most of the selected pages are used for training, and a few are kept back to develop and test the system.

The paragraphs from each page are ranked, and the text from all the pages is combined into one long document. The text is then encoded and shortened by splitting it into 32,000 individual words, which are used as input.
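
To make the extraction step more concrete, here is a minimal sketch in Python of ranking paragraphs, joining them into one long document, and cutting the result to a fixed length. The tf-idf-style scoring, the crude word tokenizer, and the names `rank_paragraphs` and `build_input` are assumptions for illustration only; this is not the Google Brain team's code, and the token budget is left as a parameter.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into words (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_paragraphs(topic, pages):
    """Return all paragraphs from all pages, most relevant to the topic first.

    `pages` is a list of pages, each page a list of paragraph strings.
    Relevance is a simple tf-idf-style score (an assumption of this sketch).
    """
    paragraphs = [p for page in pages for p in page]
    token_lists = [tokenize(p) for p in paragraphs]
    n = len(paragraphs)
    # Document frequency: the number of paragraphs each word appears in.
    doc_freq = Counter(w for tokens in token_lists for w in set(tokens))
    query = set(tokenize(topic))

    def score(tokens):
        counts = Counter(tokens)
        # Smoothed idf so words found in every paragraph still count a little.
        return sum(counts[w] * math.log(1 + n / doc_freq[w])
                   for w in query if w in counts)

    # Highest score first; ties keep their original order.
    order = sorted(range(n), key=lambda i: (-score(token_lists[i]), i))
    return [paragraphs[i] for i in order]


def build_input(topic, pages, max_tokens):
    """Concatenate the ranked paragraphs and keep only the first max_tokens words."""
    tokens = []
    for paragraph in rank_paragraphs(topic, pages):
        tokens.extend(tokenize(paragraph))
    return tokens[:max_tokens]
```

Calling `build_input("Wings over Kansas", pages, max_tokens=32000)` on a list of scraped pages would return the word sequence handed to the next stage. Anything ranked too low to fit inside the budget is simply dropped.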

This is then fed into an abstractive model, where the long sentences in the input are cut shorter. It’s a clever trick used to both create and summarize text. The generated sentences are taken from the earlier extraction phase and aren’t built from scratch, which explains why the structure is pretty repetitive and stiff.

Mohammad Saleh, co-author of the paper and a software engineer in Google AI’s team, told The Register: “The extraction phase is a bottleneck that determines which parts of the input will be fed to the abstraction stage. Ideally, we would like to pass all the input from reference documents.

“Designing models and hardware that can support longer input sequences is currently an active area of research that can alleviate these limitations.”

We are still a very long way off from effective text summarization or generation. And while the Google Brain project is rather interesting, it would probably be unwise to use a system like this to automatically generate Wikipedia entries. For now, anyway.

Also, since it relies on the top ten websites for any particular topic, if those sites aren’t particularly credible, the resulting handiwork probably won’t be very accurate either. You can't trust everything you read online, of course. ®
