Facebook's $500k deepfake-detector AI contest drama: Winning team disqualified on buried consent technicality
Oh OK, so NOW the social network cares about getting people's permission before using their data to train computer systems
Special report Five engineers missed out on sharing a top prize of $500,000 in a Facebook-backed AI competition – after they were disqualified for using images scraped from Flickr and YouTube to train their deepfake-detecting system.
The contest, dubbed the Deepfake Detection Challenge, was hosted on Kaggle, a Google-owned platform popular in the data-science community. Teams competing in the contest – devised and launched in December by Facebook along with AWS, Microsoft, and the non-profit Partnership on AI (PAI) – were challenged to build machine-learning models that could accurately determine whether or not videos contained so-called deepfake material.
A team called All Faces Are Real was on the home stretch to win the deepfake-detector competition by producing an AI that outperformed its rivals in terms of accuracy. In April, all of the machine-learning models submitted to the contest were judged by Kaggle and publicly ranked on a leaderboard. All Faces Are Real was in pole position and awaiting its six-figure payday.
Next, the top five teams were each asked to submit their code, along with documentation describing how their systems were trained and other implementation details, so Kaggle could verify all the rules and requirements had been followed. The All Faces Are Real boffins were confident they were on track to bag $100,000 (£79,236, €112,000) each for acing the contest.
However, a few days after submitting their paperwork, the winning team's dream crashed harder than a failed disk drive: its submission was removed from the leaderboard after Kaggle and Facebook said the group had broken the rules in the competition's fine print.
All participants in the contest were allowed to use third-party datasets, as well as one provided by Facebook, to train their deepfake-detecting systems, provided those outside datasets were publicly available to all and could be used for commercial purposes. Each team had to publicly declare which outside datasets it was using before the competition deadline. The aim of the contest, seemingly, was to develop good machine-learning software rather than spark a race to gather the best training materials.
All Faces Are Real felt it had stuck to those rules. It used 50,000 pictures from a public dataset compiled by Nvidia from image-hosting site Flickr. On top of that, the team scraped 16 or so public videos from YouTube that were covered by a Creative Commons license.
“We chose these data sources with the belief that they met the rules on external data, specifically that external data must be 'available to use by all participants of the competition for purposes of the competition at no cost to the other participants', and the additional statements in the external data thread that they must be available for commercial use and not restricted to academics,” the team said in a statement.
Facebook and Kaggle, however, decided to disqualify the team on the grounds it failed to obtain explicit consent from everyone depicted in the Flickr images and YouTube videos used to train their deepfake-detecting model. All participants using external datasets had to have received written permission granting them the right to use an individual’s image in said datasets for the competition, Kaggle and Facebook said.
The pair pointed to a few lines tucked inside the competition’s red tape, which stated: “If any part of the submission documentation depicts, identifies, or includes any person that is not an individual participant or Team member, you must have all permissions and rights from the individual depicted, identified, or included and you agree to provide Competition Sponsor and PAI with written confirmation of those permissions and rights upon request.”
Almost all facial-recognition datasets are problematic
The contest's participants were shocked. They felt this requirement was neither explicitly mentioned nor stressed by the organizers when the teams publicly declared the use of any external datasets. Back in February, Julia Elliott, a manager at Kaggle who helped run the competition, warned everyone not to use training datasets that could not be used commercially or by other teams. Elliott told competitors:
As stated many times previously, if there are any restrictions imposed on the dataset's use (including non-commercial use only or restriction on those who have access to the dataset), that is considered in violation of the requirement that the data be "available to use by all participants of the competition" and therefore prohibited. This should be quite clear at this point.
If your model makes use of any external data that is prohibited by the rules, then you are subject to disqualification by the host upon review of your solution, in particular if you are a prospective winner.
There was no explicit mention in the competition's top-level documentation of having to get written permission from every individual in the extra datasets. And even if the requirement had been stated more clearly, it is a near-impossible one to meet. Computer-vision researchers building state-of-the-art facial-recognition datasets often do not get explicit permission. For example, ImageNet, one of the most widely used academic resources, is made up of pictures of people and things scraped from public websites under license.
Firstly, it's difficult to figure out the identity of people in these images. How does one realistically and efficiently go about contacting folks for permission if they cannot be tracked down? Secondly, it's incredibly tedious to manually contact tens of thousands of people and wait for their replies. And even in cases where the harvested images were correctly obtained and used under a license, such as Creative Commons, that allows them to be downloaded and reused, there's a further sticking point: when people shared their pictures under a liberal Creative Commons license a decade ago, say, they probably didn't expect their snaps to eventually be gathered up to train facial-recognition, object-detection, and other computer-vision systems.
Even if data scientists keep within the permissive licenses of the images they've hoovered up, there is pressure on them to track down everyone pictured in the datasets and ask whether those people really, truly wanted to share their material as liberally as they agreed to years prior. Laws such as the US state of Illinois's Biometric Information Privacy Act, which requires organizations to get people's explicit consent before adding their biometric info to a database, make corporations like Facebook jumpy, as do headlines about material being used without permission. All of which may explain why the consent requirement was buried in the competition's rules.
“Anyone using external datasets would have probably fallen foul of this rule," Mikel Bober-Irizar, a member of All Faces Are Real, told The Register.
After the team was disqualified, the second-placed contestant, a lone engineer named Selim Seferbekov, was bumped up to first place. Seferbekov did not violate any of the strict rules about getting explicit permission simply because he didn't use any external datasets at all, and is set to single-handedly scoop half a million dollars.
Send in the lawyers
All Faces Are Real hired Ed Boal, a British lawyer at Stephenson Law, to try to wrest back the prize money the team believed was rightfully theirs.
“The team acted diligently to ensure that any external datasets used to train their model could be used by all participants without charge and for commercial purposes, as required by the rules,” Boal told El Reg this week.
“What the team didn’t appreciate – nor, it seems, did any of the other competitors based on the Kaggle discussion board – was that if they used an external dataset which included images of individuals, they needed to produce evidence that every individual in that dataset consented to the use of their image for the purposes of the competition.
“Facebook was able to secure these consents for the competition dataset because Facebook generated that dataset itself using its extensive resources. But it’s hard to see how Facebook or Kaggle could have expected any competitors to do the same – even pre-trained models that are derived from image databases, which many competitors used, probably don’t have these consents.
“The team has never disputed the need for Facebook to protect its legal position and its reputation. However, if you invite developers to take part in a competition which allows them to use external data to train their models, without flagging any requirement to obtain individual consent from every individual featured in images within an external dataset, it seems obvious what the outcome will be. And, unfortunately for the All Faces Are Real team, they’ve discovered this the hard way.”
The team and their lawyer sat down with representatives from Facebook and Kaggle in a Zoom call, but failed to convince the contest organizers to let them retain the top spot. Instead, Facebook and Kaggle allowed them to resubmit their machine-learning model without any training from external datasets. This bumped them down to seventh place, and they narrowly missed out on collecting anything from the prize pot.
“There's nothing we can really do,” said Bober-Irizar. “You'd have to sue them in California, and we don't want to spend years suing Facebook in America. We'd rather just keep doing science.”
He pointed out that it wasn't worth going after Kaggle, either, since a small clause in the rules stated that even if the San Francisco-based platform was ordered to pay damages, it would only fork out up to ten dollars, or about eight quid.
“Facebook's interpretation of the rule regarding external data is unrealistic for competitors to follow, and they have failed completely at communicating their requirements during the competition, and in the end penalised participants for their failure," fellow team member Yifan Xie added.
To make matters worse, Facebook flaunted the All Faces Are Real team's work in a presentation during the virtual academic Conference on Computer Vision and Pattern Recognition, held this month. Christian Canton, head of Facebook's AI Red Team, boasted that the lowest error rate achieved in the competition was 0.423 – the score reached by the All Faces Are Real team.
"If our solution is not acceptable, why parade our score in your presentation?" Xie asked. Later, Canton apologized, and blamed the mistake on a "typo." The All Faces Are Real team scored 0.42320 while Seferbekov scored 0.42798, which rounds up to 0.423. The lower the error rare, the better. Canton has since corrected his presentation slide, we're told.
Kaggle defends its decision
Kaggle has faced heavy criticism over the Deepfake Detection Challenge; many developers complained that the written-permission rule should have been made clear from the start, and that the requirement was ridiculous to begin with.
"I’d like to attempt to clarify the underlying issue with All Faces are Real’s disqualified submission," Elliott said this week. "Some of the videos/images used in the disqualified submission were mis-licensed, in that they contained content belonging to other third parties (such as CNN), but were inappropriately offered under open source licenses. This content also clearly depicted third parties and used third party data whose permissions had not been obtained, in violation of the competition rules."
"Retrospectively, we recognize that the "submission documentation" rule’s application to include external data could have been reinforced. It was not our expectation that this was unclear, given the inclusion of external data as part of that documentation. Unfortunately, the specific videos' contents and mislicensing were not anticipated in order to know this would arise as an issue. However, we now acknowledge this was a source of misunderstanding.
"We absolutely could have done better ... Without our community, Kaggle would cease to exist. Our hosts will tell you that we consistently default to standing and siding with our users. We will continue to advocate for our community and commit to not allowing this to happen again."
Facebook and Kaggle declined to comment further. ®