Microsoft builds image-to-caption AI so that your visually impaired coworkers can truly comprehend your boss's PowerPoint abominations

Better-than-before code to make Office more accessible


Microsoft has built a machine-learning model that automatically captions images in documents and emails so that the descriptions can be read aloud by software for visually impaired users. It's claimed to be twice as good as the automatic captioning code the Windows giant already uses in its products.

This latest technology is available now from Azure Cognitive Services' Computer Vision package, and is expected to be added to Word, PowerPoint, and Outlook later this year for Windows and Mac plus PowerPoint for the web.

The model uses a technique called visual vocabulary pre-training, or VIVO for short, according to a paper by Microsofties describing their system, distributed late last month on arXiv.

VIVO trains a large transformer neural network to identify common objects and creatures in images and label them appropriately. The model is then fine-tuned using a second dataset of images labelled with full captions so that the software can figure out how to write basic sentences describing the content of a given image. Because the model is pre-trained this way, it has a better chance of generating correct captions by recalling entries in its visual dictionary.

The researchers entered their system into nocaps, a challenge that benchmarks the performance of AI image captioning models, and it is currently ranked first on the leaderboard.

Now, they are deploying it in production in the form of Azure services, and it will soon power features in Microsoft 365 and Microsoft's mobile app Seeing AI. The smartphone application is aimed at people suffering vision loss, and allows them to read short text snippets, recognize everyday objects, and get a verbal description of their surroundings.
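For developers who want to try the Azure route, here is a minimal Python sketch of how a caption request to the Computer Vision service might be put together. It is not taken from Microsoft's announcement: the v3.1 path, the maxCandidates parameter, the Ocp-Apim-Subscription-Key header, and the response shape are assumptions based on the publicly documented Describe Image operation, and should be checked against the current API reference. The function names, endpoint, and key are placeholders of our own.

```python
import json
import urllib.parse
import urllib.request


def build_describe_request(endpoint, key, image_url, max_candidates=1):
    """Build (but do not send) a POST request for the Describe Image operation.

    Assumption: the operation lives at /vision/v3.1/describe; verify the
    version and path against the current Computer Vision documentation.
    """
    query = urllib.parse.urlencode({"maxCandidates": max_candidates})
    url = f"{endpoint.rstrip('/')}/vision/v3.1/describe?{query}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # assumed auth header name
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def best_caption(response):
    """Return the highest-confidence caption text, or None if there is none.

    Assumption: captions arrive under response["description"]["captions"],
    each with "text" and "confidence" fields.
    """
    captions = response.get("description", {}).get("captions", [])
    if not captions:
        return None
    return max(captions, key=lambda c: c.get("confidence", 0.0))["text"]


def describe_image(endpoint, key, image_url):
    """Send the request and return the top caption.

    Needs network access and a valid subscription key.
    """
    request = build_describe_request(endpoint, key, image_url)
    with urllib.request.urlopen(request) as reply:
        return best_caption(json.load(reply))
```

Calling describe_image with your own resource endpoint, key, and a publicly reachable image URL should hand back a one-sentence description, provided the assumptions above hold. Building the request separately from sending it means the URL and payload can be inspected without touching the network.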

Microsoft has provided image-captioning services since 2015, and will update its software to use this newer and better machine-learning model shortly, we're told.

"The auto-generated captions described as 'alt text' will fill in the gaps for images that haven't been captioned," explained Saqib Shaikh, a software engineering manager at Microsoft's AI platform group, last week. "Ideally, everyone would include alt text for all images in documents, on the web, in social media – as this enables people who are blind to access the content and participate in the conversation. But, alas, people don't."

A spokesperson for Microsoft added: "This breakthrough in image description improves the quality of alt-text on images in Microsoft 365, and makes the visual world more accessible to people who are blind." ®


Biting the hand that feeds IT © 1998–2020