Banned: The 1,170 words you can't use with GitHub Copilot

Hash cracking reveals verboten slurs, terms like 'liberal,' 'Palestine,' and 'socialist' ... and Quake's famous Fast InvSqrt


GitHub's Copilot comes with an encoded list of 1,170 words to prevent the AI programming assistant from responding to input, or generating output, containing offensive terms, while also keeping users safe from words like "Israel," "Palestine," "communist," "liberal," and "socialist," according to new research.

Copilot was released as a limited technical preview in July in the hope it can serve as a more sophisticated version of source-code autocomplete, drawing on an OpenAI neural network called Codex to turn text prompts into functioning code and make suggestions based on existing code.

To date, the results have been interesting but not quite compelling – the code produced has been simplistic and insecure, though the project is still being improved.

No one wants to end up as the subject of the next viral thread about AI gone awry

GitHub is aware that its clever software could offend, having perhaps absorbed parent company Microsoft's chagrin at seeing its Tay chatbot manipulated to parrot hate speech.

"The technical preview includes filters to block offensive words and avoid synthesizing suggestions in sensitive contexts," the company explains on its website. "Due to the pre-release nature of the underlying technology, GitHub Copilot may sometimes produce undesired outputs, including biased, discriminatory, abusive, or offensive outputs."

But it doesn't explain how it handles problematic input and output, other than asking users to report when they've been offended.

"There is definitely a growing awareness that abuse is something you need to consider when deploying a new technology," said Brendan Dolan-Gavitt, assistant professor in the Computer Science and Engineering Department at NYU Tandon School of Engineering, in an email to The Register.

"I'm not a lawyer, but I don't think this is being driven by regulation (though perhaps it's motivated by a desire to avoid getting regulated). My sense is that aside from altruistic motives, no one wants to end up as the subject of the next viral thread about AI gone awry."

Hashing the terms of hate

Dolan-Gavitt, who with colleagues identified Copilot's habit of producing vulnerable suggestions, recently found that Copilot incorporates a list of hashes – opaque numeric values produced by passing each word through a hash function.

Copilot's code hashes the contents of the user-provided text prompt fed to the AI model, and the resulting output prior to display, and compares the results against these stored hashes, intervening if there's a match. The software also won't make suggestions if the user's code contains any of the stored slurs.
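A hash-based filter of this sort might look something like the Python sketch below. GitHub hasn't published the extension's actual logic, so the token splitting, the CRC-32 stand-in hash, and the intervention behaviour are all assumptions; the real list also contains multi-word entries, which a word-at-a-time check like this would miss.

```python
# Illustrative only: the real extension is JavaScript, ships roughly 1,170
# hash values, and its matching rules are not public.
import re
import zlib

BLOCKED_HASHES = {2666930069, 1089159747}  # hypothetical stored hash values

def contains_blocked_term(text: str) -> bool:
    # Split the prompt or suggestion into word-like tokens and hash each one.
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(zlib.crc32(token.encode()) in BLOCKED_HASHES for token in tokens)

def filter_suggestion(prompt: str, suggestion: str) -> str | None:
    # Suppress the completion if either the user's code or the model's
    # output contains a listed term.
    if contains_blocked_term(prompt) or contains_blocked_term(suggestion):
        return None
    return suggestion
```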

And at least during the beta period, according to Dolan-Gavitt, it reports intervention metrics back to GitHub while making a separate check to make sure the software doesn't reproduce personal information, such as email addresses or IP addresses, from its trained model. It appears someone is taking notes from OpenAI's experience.

Dolan-Gavitt over the past few days utilized various techniques to crack the hashes, including comparing them to hashes produced from a word dump of 4chan's /pol/ archive, applying the Z3 constraint solver, and creating a plugin for password cracking tool John the Ripper.

"The list is included in the code of the Visual Studio Code extension that hooks up VSCode to the Copilot service and provides the suggestions – I am told they are planning on moving the list server-side, at which point it will be much harder to inspect," he explained. "So I was able to open up the extension – it's coded in JavaScript – and pull the list out from there."

The list, Dolan-Gavitt said, was encoded such that each word was passed through a hash function that turned the word into a number ranging roughly from negative two billion to positive two billion. A 32-bit hash, it would seem.
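The exact function isn't public, but a signed 32-bit hash in that range could come from something as simple as a Java-style polynomial rolling hash – the version below is purely an illustrative guess, not Copilot's actual code:

```python
def hash32(word: str) -> int:
    """A hypothetical Java-style 32-bit string hash; not Copilot's actual function."""
    h = 0
    for ch in word:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF            # multiply-and-add, kept to 32 bits
    return h - 0x100000000 if h >= 0x80000000 else h   # reinterpret as signed 32-bit

print(hash32("example"))  # lands somewhere between about -2.1 and +2.1 billion
```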

"These are generally difficult to reverse, so instead I had to guess different possibilities, compute their hash, and then see if it matched something in the list," he said.

"Over time, I built up increasingly sophisticated ways of guessing words, starting with compiling wordlists from places like 4chan and eventually progressing to heavyweight computer science like GPU-accelerated algorithms and fancy constraint solvers."

The biggest challenge, he said, is not so much finding words that have a particular hash value as determining which of the several candidates sharing that value GitHub actually put on the list.
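Finding some word with a given hash is indeed the easy part. With a constraint solver such as Z3 – and again assuming the illustrative 32-bit rolling hash above rather than Copilot's real one – a preimage of a target value can be conjured on demand:

```python
from z3 import BitVec, BitVecVal, Solver, sat

def find_preimage(target: int, length: int) -> str | None:
    """Ask Z3 for any lowercase string of the given length hashing to target."""
    chars = [BitVec(f"c{i}", 32) for i in range(length)]
    h = BitVecVal(0, 32)
    solver = Solver()
    for c in chars:
        solver.add(c >= ord("a"), c <= ord("z"))  # restrict to lowercase letters
        h = h * 31 + c                            # rolling hash, wraps modulo 2**32
    solver.add(h == BitVecVal(target & 0xFFFFFFFF, 32))
    if solver.check() == sat:
        model = solver.model()
        return "".join(chr(model[c].as_long()) for c in chars)
    return None

# One of the many 11-letter strings colliding on an example hash value:
print(find_preimage(-1223469448, 11))
```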

"There are many words with the same hash value (called 'collisions')," he explained. "For example, the hash value '-1223469448' corresponds to "whartinkala", "yayootvenue", and 'pisswhacker' (along with 800,000 other 11-letter words). So to figure out which ones are the most likely to have been included in the list, I'm using the GPT-2 language model to rank how 'English-like' each word is."

The result was a list of 1,170 disallowed words, 1,168 of which Dolan-Gavitt has decoded and posted to his website with ROT13 encoding – shifting the letters 13 places in the alphabet – to keep hate speech away from search engines and from people who stumble on the page without really wanting to see past the cipher.
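For readers who do want to see past the cipher, undoing ROT13 takes one line of Python's standard library:

```python
import codecs

print(codecs.decode("Cnyrfgvar", "rot_13"))  # -> "Palestine"
```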

Most of the slurs are awful enough that we're not going to reprint them here.

Some of the words, however, are not inherently offensive, even if they could be weaponized in certain contexts. As Dolan-Gavitt demonstrated in a tweet, creating a list of Near East countries in Microsoft's Visual Studio Code with Copilot results in suggestions for "India," "Indonesia," and "Iran," but the software suppressed the obvious next item on the list, "Israel."

Other forbidden words include: palestine, gaza, communist, fascist, socialist, nazi, immigrant, immigration, race, man, men, male, woman, women, female, boy, girl, liberal (but not conservative), blm, black people (but not white people), antifa, hitler, ethnic, gay, lesbian, and transgender, along with various plural forms, to name a few.

Not all bad

"The vast majority of the list is pretty reasonable – I can't say I'm upset that Copilot is prevented from saying the n-word," said Dolan-Gavitt. "Beyond that, there are words that are not offensive, but that GitHub perhaps is concerned could be used in a controversial context: 'transgendered,' 'skin color,' 'israel,' 'palestine,' 'gaza,' 'blm,' and so on. The inclusion of these is more debatable."

Dolan-Gavitt added that some entries on the list look more like an effort to avoid embarrassment than to shield users from offensive text.

"One of the words on the list is 'q rsqrt,' which is the name of a famous function for computing inverse square roots in the code of the game Quake III: Arena. There was a thread that went viral showing that Copilot could reproduce this function, verbatim, as a code suggestion.

"This prompted a lot of concern about whether Copilot would plagiarize code and violate copyright licenses. So by including 'q rsqrt' on the bad word list, they basically broke an embarrassing demo without addressing the real problem."

Dolan-Gavitt gives GitHub's language filter a mixed review.

"It's not a very sophisticated approach – really just a list of bad words," he said. "Solving the problem properly would probably mean going through the training data and eliminating problematic and offensive things there, which is much harder.

"And I believe Copilot is actually a descendant of GPT-3, so it probably has seen not only all the code on GitHub, but also all of GPT-3's training data – which is a significant chunk of the Internet. It's an open question how much of the original GPT-3 remains after being retrained for code, however."

But at least it's something.

"Still, despite the relatively simple approach, it is effective at preventing some of the worst stuff from getting presented to users," he said. "It's a kind of 80 per cent solution that's easy to develop and deploy."

In response to a request for comment, a GitHub spokesperson replied with more or less the message cited above from the Copilot webpage, that Copilot remains a work-in-progress and problematic responses may occur.

"GitHub takes this challenge very seriously and we are committed to addressing it with GitHub Copilot," the spokesperson said. ®
