Yahoo! couldn't! detect! hackers! in! its! network! but! can! spot! NSFW! smut! in! your! office?
Web giant offers open-source AI-powered X-rated pic hunter
Having laid bare over half a billion usernames and passwords through meager funding and witless indifference, Yahoo! is putting its faith in artificial intelligence to protect people from bare skin.
Yahoo! engineers Jay Mahadeokar and Gerry Pesavento in a blog post on Friday said the company has released an open-source model for detecting images deemed "not safe for work" (NSFW).
"To the best of our knowledge, there is no open source model or algorithm for identifying NSFW images," the pair wrote. "In the spirit of collaboration and with the hope of advancing this endeavor, we are releasing our deep learning model that will allow developers to experiment with a classifier for NSFW detection, and provide feedback to us on ways to improve the classifier."
Censorship has been something of a losing proposition for Yahoo!, from its role in the imprisonment of Chinese journalist Shi Tao over a decade ago to its over-enthusiastic spam filters. Nonetheless, the researchers argue that the prevalence of user-generated content makes filtering NSFW images essential for web and mobile applications.
It may be essential for business models that rely on free labor producing content under the pretense of sharing, but that turns out to describe quite a number of internet companies. Alternatively, this software is going to be great for finding and identifying raunchy material on the web.
Yahoo!'s software is a neural network model for Caffe, a deep-learning framework. There are other frameworks that experts in the field rate more highly, such as Torch. Yahoo! also relies on CaffeOnSpark, a framework for running Caffe on Hadoop and Spark clusters.
The NSFW model is designed to take an image and output a smut probability between zero and one, though Mahadeokar and Pesavento note, "we do not provide guarantees of accuracy of output."
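For developers wondering what to do with a number between zero and one, here is a minimal sketch of one way an application might act on such a score. It is not Yahoo!'s code: the `triage` function, the `score_fn` callable standing in for the model, and both cut-off values are hypothetical, and any real thresholds would need tuning against the application's own images.

```python
from typing import Callable

# Hypothetical cut-offs, chosen purely for illustration.
SAFE_BELOW = 0.2
NSFW_ABOVE = 0.8


def triage(image_bytes: bytes, score_fn: Callable[[bytes], float]) -> str:
    """Map a model's NSFW probability to an action.

    score_fn is a stand-in for whatever wraps the classifier; it is
    assumed to return a probability between 0.0 and 1.0.
    """
    score = score_fn(image_bytes)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"expected a probability in [0, 1], got {score}")
    if score < SAFE_BELOW:
        return "allow"
    if score > NSFW_ABOVE:
        return "block"
    # No accuracy guarantee is offered, so the uncertain middle band
    # is passed to a human rather than decided automatically.
    return "review"
```

Keeping a "review" band in the middle reflects the authors' caveat about accuracy: scores near either end are acted on, and everything else gets a second look.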
Yahoo!'s researchers have declined to release their training images "due to the nature of the data," leaving readers the task of amassing a sufficiently large cache of indiscreet pictures to allow their computers to categorize what they're seeing accurately.
One source might be Google's newly released Open Images library, a dataset of some 9 million URLs pointing at images which may or may not be subject to a Creative Commons Attribution license – Google advises verifying the licensing status of each image. Google intends for its dataset, produced in collaboration with CMU and Cornell universities, to help train neural networks.
We're still waiting to hear from Google whether there are any images in the dataset that would warrant a Yahoo exclamation mark. ®