Machine needs more Learning: Google Drive dings single-character files for copyright infringement

If you're unable to share documents, this may be why


Google last month announced plans to prevent customer files stored in Google Drive from being shared when the web giant's automated scanning system finds files that violate its abuse prevention rules.

"When [a file is] restricted, you may see a flag next to the filename, you won't be able to share it, and your file will no longer be publicly accessible, even to people who have the link," Google explained at the time.

That system is now up and running, just not very well: Google Drive's scanning system has been finding copyright violations where they do not exist and flagging innocuous files.

Dr Emily Dolson, assistant professor at Michigan State University, in the departments of Computer Science & Engineering and Ecology, Evolution, & Behavior, had a run-in with the errant scanner recently when she uploaded a file named "output04.txt" that consisted of a single character, the numeral one.

One wonders what exactly upset Google – the digit or the output04.txt filename? Certainly the number 1 does turn up in all manner of copyrighted works. No one let the internet search giant know that Microsoft has its own cloud storage named OneDrive.

"I'm currently teaching a graduate-level algorithms class where students need to write code that solves problems I give them," Dolson told The Register today via email. "I like to make the test cases I use to evaluate the code freely accessible to students to assist them with debugging.

"This issue occurred when I uploaded a large set of files to Drive containing inputs and expected outputs for these test cases. Among the expected output files, there were a few that contained just the character '1'. Shortly after uploading them, I received a string of emails from Google indicating that those files had been flagged for copyright infringement."

Dolson can still access the files, but she cannot share them, which she said was unfortunate because she created them to share with her students.

Others have reported similar experiences. Richard D. Morey, a Reader (UK lingo for professor) in psychology at Cardiff University, responded to Dolson's Twitter post by noting, "I stopped using Google Drive professionally for this reason. It was flagging and pulling down documents I authored myself, and no students could access them!"

And other people responding to Dolson's post claim to have independently replicated the issue by getting small files flagged in Drive.

As has been pointed out by those participating in the Twitter discussion, Europe's GDPR gives people "the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her." US privacy law, however, doesn't really do much for those subject to false algorithmic decisions.

The Register twice asked Google to confirm that Drive's content flagging system is broken but we've not heard back. However, a Google engineering manager responding to Dolson's Twitter thread acknowledged that Google is aware of the issue.

Google's post announcing its Drive abuse notification system depicts a sample message that includes a button labeled "Request a Review," to have a human check the violation scanner's decision.

But Dolson said the automated email notification she received offered no way to push back against the determination of Google's content vetting algorithm – the Request a Review button was not included in the message she received, as can be seen from the screenshot she posted.

Which, you know, is a bit worrying for people concerned about the dead hand of AI being used as arbiter in these matters.

"The e-mail explicitly said 'A review cannot be requested for this restriction,'" she explained. "I do think that it is problematic to automate processes like this without providing any mechanism for a manual override.

"In this case it's a fairly minor inconvenience (I can just tell my students that the answer is 1), but in a different context it might be a much bigger problem. It's totally normal and understandable for software to have bugs, but that's exactly why there needs to be a mechanism for communicating those bugs back to the developers."

Dolson also took issue with allowing social media to drive customer support.

"Relying on viral social media posts as a sort of backdoor communication channel to the developers should not be the only option – that opens up a heap of equity concerns," she said. "Your ability to receive support for software products should not depend on whether you are sufficiently well connected to technology Twitter."

Netizens reported problems with other numbers, including 0, while the wags over on Hacker News pointed to a mildly relevant Onion article, headlined: "Microsoft Patents Ones, Zeroes."

Because there's always an Onion article where automation drives swathes of the IT world beyond satire. ®

Editor's note: Article updated to include quotes from Dr Emily Dolson.


Biting the hand that feeds IT © 1998–2022