Watch out for AI models regurgitating misplaced keys that unlock crypto wallets
A side effect of GitHub's OpenAI-powered Copilot memorizing sensitive but public data
Leaving your cryptocurrency wallet's private key out on the public internet is not a good idea: anyone who finds this key can try to use it to drain the wallet of its funds.
And it's not just people on the lookout for these keys: software bots instructed to scan the web for leaked private keys will pick them up soon enough. But there's an added dimension you may not have thought of: the keys being vacuumed up into a dataset to train an AI model that later regurgitates the keys to strangers.
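Such bots typically work by pattern-matching raw hex strings in scraped text. A minimal sketch of that idea (hypothetical code, not any particular bot's implementation) looks like this:

```python
import re

# Candidate pattern: a 64-hex-character string, optionally 0x-prefixed —
# the raw form of a 256-bit private key (Ethereum, Bitcoin, and others).
# Note that SHA-256 hashes and other 256-bit values match this pattern
# too, so hits are only *candidates* that a real scanner would then
# verify, e.g. by deriving the address and checking it on-chain.
CANDIDATE_KEY = re.compile(r'\b(?:0x)?([0-9a-fA-F]{64})\b')

def find_candidate_keys(text):
    """Return the 64-hex-character candidate strings found in text."""
    return CANDIDATE_KEY.findall(text)
```

Run against a leaked config line such as `PRIVATE_KEY = "0x1f1f…"`, the function returns the bare hex candidate for further checking.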
And we may have an example of that: we've heard from programmers who say GitHub's OpenAI-powered code-completion tool Copilot has suggested to them private keys for real cryptocurrency wallets. A wallet's private key should be kept secret, as anyone who holds it can control the wallet and its contents.
The Register spoke to one software developer who said he was using Copilot when it suddenly suggested what looked like a private key to someone's cryptocurrency wallet. He wrote a script to obtain its associated public key and address to check whether it was a valid wallet or not, and was surprised to see it was real and working.
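The core of such a check is deriving the secp256k1 public key from the candidate private key. A minimal sketch of that step follows (hypothetical code, not the developer's actual script; note that a real Ethereum address additionally requires a Keccak-256 hash of the public key, which Python's standard library does not provide):

```python
# secp256k1 curve parameters: y^2 = x^3 + 7 over the prime field F_P
P = 2**256 - 2**32 - 977
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def ec_add(a, b):
    """Add two points on secp256k1 (affine coordinates; None = infinity)."""
    if a is None:
        return b
    if b is None:
        return a
    if a[0] == b[0] and (a[1] + b[1]) % P == 0:
        return None  # a + (-a) = point at infinity
    if a == b:
        lam = (3 * a[0] * a[0]) * pow(2 * a[1], -1, P) % P  # tangent slope
    else:
        lam = (b[1] - a[1]) * pow(b[0] - a[0], -1, P) % P   # chord slope
    x = (lam * lam - a[0] - b[0]) % P
    return (x, (lam * (a[0] - x) - a[1]) % P)

def scalar_mult(k, point=G):
    """Double-and-add: compute k * point."""
    result = None
    while k:
        if k & 1:
            result = ec_add(result, point)
        point = ec_add(point, point)
        k >>= 1
    return result

def public_key(priv_hex):
    """Public key point (x, y) for a hex-encoded private key."""
    k = int(priv_hex, 16)
    assert 0 < k < N, "private key out of range"
    return scalar_mult(k)
```

In practice a library such as eth_keys or web3.py handles both the curve arithmetic and the Keccak-256 address derivation in one call; the sketch above only shows why a leaked 256-bit number is enough to recover everything else about a wallet.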
A private key is like the lock to your treasury, if it's leaked then your assets are at risk of being stolen
"It was quite shocking for me," the developer told The Reg. "A private key is like the lock to your treasury, if it's leaked then your assets are at risk of being stolen."
He shared with us the private key Copilot suggested, and the associated address, which we verified as legitimate. He proved he had access to the wallet by processing two transactions: 0.5 ETH was sent to the address from another wallet, and 0.48 ETH was sent from the address back to the same wallet.
Another developer separately told us he managed to retrieve the same private and public key from Copilot, and shared with us the steps he took to obtain the data.
Copilot is built on OpenAI's Codex model, a GPT-3-based system trained on a massive dataset scraped from public GitHub repositories. Language models like GPT-3 and Codex are known to memorize samples from their training data. It appears Copilot has memorized private keys left in publicly available code, and will resurface them if suitably prompted.
Last year, OpenAI told us it was building a content filter to prevent GPT-3 from regurgitating personally identifiable information, such as real phone numbers and home addresses, that has been ingested into its huge training dataset. It isn't too surprising that Copilot can also recall cryptographic keys if they appeared in public code repositories at some point.
The private key suggested by the AI pair-programming tool does show up in public GitHub repositories, and appears to have been created and used previously for testing purposes. The wallet is active, and has been used to send and receive real tokens.
Other developers have managed to find private keys associated with other cryptocurrency wallets, a few of which held small amounts of money.
In other words, this is another good reason why sensitive info, such as these keys, shouldn't be accidentally left out in the open: it may end up in a training set at some point, and be emitted later on by a machine-learning model. In this case, there are plenty of people out there already scouring the web for misplaced wallet private keys, so by the time they end up in an AI model, it's probably too late for the wallet's owner. However, the principle is worth highlighting: sensitive leaked info of any kind may make its way into a trained model and later be repeated.
- GitHub Copilot auto-coder snags emerge, from seemingly spilled secrets to bad code, but some love it
- Google wins book scan battle. Again. Can post pages online. Again
- GitHub Copilot is AI pair programming where you, the human, still have to do most of the work
- Good news: Google no longer requires publishers to use the AMP format. Bad news: What replaces it might be worse
Ari Herbert-Voss, a former OpenAI research scientist and co-founder of an in-stealth crypto security startup, told The Register that Copilot, by design, can only leak information that has already been publicly posted online.
"People are excited about this being another method to find wallets to drain, but note that all extractable wallets were already publicly available," he said. "Many people trawl GitHub for key leakage, which means that would-be high-value accounts are more likely to already be drained. The real risk is if someone isn't paying attention and keeps using a compromised wallet."
The real risk is if someone isn't paying attention and keeps using a compromised wallet
He warned people not to post their private keys in GitHub repositories. Future versions of Copilot and similar AI code-generation software may memorize further sensitive data.
"Large language models perform better when they have more data," he told us, "so there's also an incentive for companies and labs to find additional sources of code. If that data contains any private keys then future models can leak those keys, too.
"People should be extra careful about checking in private keys to GitHub and be mindful of transmitting sensitive data through unknown and potentially compromisable nodes because the attack surface for these compromised wallets is getting larger."
OpenAI declined to comment. ®