Boffins build a NAZI AI – wait, let's check that... OK, it's a grammar nazi
How'd you like those 'robots won't steal your job' headlines now, Reg editors? Muahaha
Pedants, imagine how much more relaxed your life would be if artificial intelligence automatically corrected grammar mistake's in online forum and social network posts.
Never again would you explode with frustration and anger over misplaced apostrophe's, commas full stop's and exclamation! marks! The faults could be fixed up by machine-learning software, and your soul would be soothed.
Software, you say? Yes, software of the kind built by Mengyi Shan, a mathematics student at Harvey Mudd College in California, USA. She trained recurrent neural networks to restore missing punctuation in text. At the moment, the software can only deal with commas and full stops, the most common and easiest of English's punctuation marks.
“In natural language processing problems such as automatic speech recognition (ASR), the generated text is normally unpunctuated, which is hard for further recognition or analysis. Thus punctuation restoration is a small but crucial problem that deserves our attention,” she explained last month.
In a project for the Wolfram Summer School held at Bentley University in Waltham, near Boston, Shan trained her recurrent neural networks using three million words that were gathered from 50 novels, plus Wikipedia pages, and transformed into vectors.
The text was filtered so that question marks, exclamation marks, and colons were replaced with full stops. The words were then tagged to indicate whether they were followed by a comma, a full stop, or nothing. This information, in the form of complete sentences, was fed into the system to train the models so that they could identify common patterns where commas and full stops should appear.
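For the curious, the filtering and tagging step might look something like the Python below. This is our own minimal sketch, not Shan's code: the function name tag_words and the label names are made up for illustration, and lowercasing the words (to mimic raw speech-recognition output) is our assumption.

```python
import re

# Hypothetical label names, chosen for this sketch only.
NONE, COMMA, PERIOD = "O", "COMMA", "PERIOD"

def tag_words(text):
    """Return (word, label) pairs; the label records what follows the word."""
    # Collapse question marks, exclamation marks and colons into full stops.
    text = re.sub(r"[?!:]", ".", text)
    pairs = []
    for token in text.split():
        # Split the token into the word and any trailing commas or full stops.
        word = token.rstrip(".,")
        trailing = token[len(word):]
        # Assumption: drop stray symbols (quotes, brackets) and lowercase.
        word = re.sub(r"[^\w']", "", word).lower()
        if not word:
            continue
        if "." in trailing:
            label = PERIOD
        elif "," in trailing:
            label = COMMA
        else:
            label = NONE
        pairs.append((word, label))
    return pairs

print(tag_words("Wait, what? I like apples!"))
# [('wait', 'COMMA'), ('what', 'PERIOD'), ('i', 'O'), ('like', 'O'), ('apples', 'PERIOD')]
```

The stripped words would become the network's input and the labels its training targets.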
The AI thus ought to pick up that the word "but" is more likely to be followed by a comma than a full stop, and that words like "the" normally feature at the start of sentences so they're unlikely to be followed by any punctuation at all.
To demonstrate the software, you feed it blocks of sentences, which are converted into sequences of vectors and passed through the neural network, which outputs the same sentences with full stops and commas added wherever it thinks they are needed.
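The last leg of that pipeline, turning per-word predictions back into readable text, can be sketched in a few lines of Python. Again, this is an illustration under our own assumptions rather than Shan's implementation: the restore function and the "COMMA" and "PERIOD" label strings are hypothetical, and the hand-written labels below stand in for whatever the trained network would actually predict.

```python
def restore(words, labels):
    """Rebuild text by appending the predicted mark, if any, to each word."""
    marks = {"COMMA": ",", "PERIOD": "."}  # any other label adds nothing
    return " ".join(word + marks.get(label, "")
                    for word, label in zip(words, labels))

words = "i like apples but i don't like bananas".split()
# Stand-in for the network's output: one label per word.
labels = ["O", "O", "COMMA", "O", "O", "O", "O", "PERIOD"]

print(restore(words, labels))
# i like apples, but i don't like bananas.
```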
Total accuracy isn’t a good measure of performance for the models in this case, she explained. Instead, an F1 score, the harmonic mean of the system’s precision and recall, is a better benchmark.
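To see why, consider that most words are followed by no punctuation at all, so a lazy model that never adds any can still post a flattering accuracy figure. The Python sketch below uses the standard definitions of precision, recall and F1; the function names, label strings and toy data are ours, not taken from Shan's project.

```python
def accuracy(gold, predicted):
    """Fraction of words whose label was predicted correctly."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def f1_for_label(gold, predicted, label):
    """F1 for one punctuation label: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy data: ten words, one followed by a comma and one by a full stop.
gold = ["O"] * 8 + ["COMMA", "PERIOD"]
lazy = ["O"] * 10  # a "model" that never punctuates

print(accuracy(gold, lazy))               # 0.8
print(f1_for_label(gold, lazy, "COMMA"))  # 0.0
```

The do-nothing model scores 80 per cent on accuracy yet zero on F1 for commas, which is the sort of gap the F1 score is meant to expose.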
The best F1 score hovers around the 70 per cent mark, and that isn’t good enough to be used in real applications yet. A larger training dataset would help boost scores, as would higher-quality material.
Sometimes text from Wikipedia, especially academic citations, contains too many commas, and that can confuse machines and make them inject excessive commas, too. Interestingly, neural networks find it harder to deal with commas than full stops.
“The overall performance on commas is slightly worse than on periods. This also makes sense from a linguistics point of view,” Shan explained.
"There seems to be a concrete linguistics set of rules for the period, but the usage of comma greatly depends on personal writing style. For example, you could say either 'I like apples but I don't like bananas.', or 'I like apples, but I don't like bananas.'
“In this way, it's really hard to build a model for comma prediction with such high accuracy. But fortunately, sometimes adding commas or not doesn't really influence the overall meaning of the sentence. So it's OK to be tolerant to a slightly worse performance on commas.”
Shan also added that restoring punctuation shouldn’t be limited to full stops and commas:
“A more rigorous study of the question mark, exclamation mark, colon, and quotation mark is expected. However, we should note that the choice of most punctuations is not restricted to one possibility. In cases like distinguishing a period with an exclamation mark, we cannot expect a high F1-score. But it's still an interesting topic, may be useful for topics like sentimental analysis.”
You can play with the code right here. ®