FYI: Data from deleted GitHub repos may not actually be deleted
And the forking Microsoft-owned code warehouse doesn't see this as much of a problem
Researchers at Truffle Security have found, or arguably rediscovered, that data from deleted GitHub repositories (public or private) and from deleted copies (forks) of repositories isn't necessarily deleted.
Joe Leon, a security researcher with the outfit, said in an advisory on Wednesday that being able to access deleted repo data – such as API keys – represents a security risk. And he proposed a new term to describe the alleged vulnerability: Cross Fork Object Reference (CFOR).
"A CFOR vulnerability occurs when one repository fork can access sensitive data from another fork (including data from private and deleted forks)," Leon explained.
For example, the firm showed how one can fork a repository, commit data to it, delete the fork, and then access the supposedly deleted commit data via the original repository.
The researchers also created a repo, forked it, and showed how data not synced with the fork continues to be accessible through the fork after the original repo is deleted.
According to Leon, this scenario came up last week with the submission of a critical vulnerability report to a major technology company involving a private key for an employee GitHub account that had broad access across the organization. The key had been publicly committed to a GitHub repository. Upon learning of the blunder, the tech biz nuked the repo thinking that would take care of the leak.
"They immediately deleted the repository, but since it had been forked, I could still access the commit containing the sensitive data via a fork, despite the fork never syncing with the original 'upstream' repository," Leon explained.
Leon added that after reviewing three widely forked public repos from large AI companies, Truffle Security researchers found 40 valid API keys from deleted forks.
You can fork off
Clearly this is a problem. But it's not so much of a problem that GitHub considers CFOR a legitimate vulnerability. In fact, the Microsoft-owned code-hosting giant considers it a feature, not a bug.
When informed of the situation through its Vulnerability Disclosure Program, GitHub responded: "This is an intentional design decision and is working as expected as noted in our [documentation]."
This, evidently, has been known for years. One individual claims to have notified GitHub of the vulnerability back in 2018 and received a similar response.
In a phone interview with The Register, Dylan Ayrey, co-founder and CEO of Truffle Security, explained that the issue comes down to something called a dangling commit.
"A dangling commit is a git primitive," Ayrey explained. "It's not a GitHub primitive. So a dangling commit can exist in any git platform – Bitbucket, GitLab, GitHub, etc. And a dangling commit is basically within a given code repository, you have a tree and that tree represents the history for that project, so all the old versions of the code that are linked together."
A git commit captures a snapshot of a repository's state at a specific point in time, including changes to both code and data. Each commit is uniquely identified by a cryptographic hash. While deleting a branch, for example, removes the reference to a particular commit chain, the commits themselves are not deleted from the repository's object database.
"Those dangling commits, those are like a fundamental documented part of git itself," said Ayrey, who explained that how git platforms deal with dangling commits is a platform decision rather than a git specification.
Bitbucket, GitLab, and GitHub, said Ayrey, have those commits even when the connection to the code tree is severed. If you have the identifier to directly access them, you can still download the associated data.
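To make the dangling-commit idea concrete, here is a minimal Python sketch. It is a toy model, not git's or any hosting platform's implementation: the `ToyRepo` class and its simplified commit body are assumptions made for illustration (real git commits also record a tree, author, and committer). The one piece taken directly from git is the object ID scheme, a SHA-1 over a `<type> <size>\0` header followed by the content.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Compute a git-style object ID: SHA-1 of '<kind> <size>\\0' plus the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


class ToyRepo:
    """Hypothetical, heavily simplified repository: an object store plus branch refs."""

    def __init__(self):
        self.objects = {}  # object ID -> (commit body, parent ID or None)
        self.refs = {}     # branch name -> ID of the branch tip

    def commit(self, branch: str, message: str) -> str:
        """Store a new commit on top of the branch tip and advance the branch."""
        parent = self.refs.get(branch)
        lines = [f"parent {parent}"] if parent else []
        body = "\n".join(lines + ["", message, ""]).encode()
        oid = git_object_id("commit", body)
        self.objects[oid] = (body, parent)
        self.refs[branch] = oid
        return oid

    def reachable(self) -> set:
        """Walk parent links from every branch tip and collect the commits seen."""
        seen = set()
        for tip in self.refs.values():
            while tip is not None and tip not in seen:
                seen.add(tip)
                tip = self.objects[tip][1]
        return seen

    def dangling(self) -> set:
        """Commits still in the object store that no branch can reach."""
        return set(self.objects) - self.reachable()


if __name__ == "__main__":
    repo = ToyRepo()
    base = repo.commit("main", "initial commit")
    repo.refs["leak"] = base                      # branch off main
    leaked = repo.commit("leak", "add config with API key")
    del repo.refs["leak"]                         # delete the branch, not the object
    print("dangling:", repo.dangling())
    print("still stored:", leaked in repo.objects)
```

Deleting the `leak` branch removes only the reference. The commit stays in `objects`, so anyone holding its ID can still look it up, which is the behavior Ayrey describes.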
Ayrey said this is widely known. But there's an adjacent issue having to do with forks – copied repositories – that's more specific to GitHub. Forks, he explained, are not part of the git spec, so each platform has its own implementation.
Ayrey said that on GitHub, dangling commits can be downloaded via a fork if you have the identifying hash, or some portion of it.
"If you have the identifier you can download them from the repository that they were originally pushed to," he explained. "It turns out you can also download them through any fork of that repository. And it works bi-directionally. So from the parent, you can download that dangling commit from the fork and from the fork you can download that dangling commit from the parent."
"What we found is even if you delete the parent, and the commit was pushed to the parent, that dangling commit not only still lives on, but you can download it through the child even though it was pushed to the parent, it was never pulled into the child, and the parent was deleted, you can now access that dangling commit."
What's more, Ayrey explained, you don't even need the full identifying hash to access the commit. "If you know the first four characters of the identifier, GitHub will almost auto-complete the rest of the identifier for you," he said, noting that with just sixty-five thousand possible combinations for those characters, that's a small enough number to test all the possibilities.
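The arithmetic behind that figure is straightforward: commit identifiers are written in hexadecimal, so four characters allow 16 to the power of 4, or 65,536, combinations. The Python sketch below works through it. The `resolve_prefix` helper is a hypothetical stand-in for whatever lookup a platform performs; it searches a local list only and makes no network requests.

```python
import itertools

HEX_DIGITS = "0123456789abcdef"


def prefix_space(length: int) -> int:
    """Number of distinct hexadecimal prefixes of the given length."""
    return len(HEX_DIGITS) ** length


def all_prefixes(length: int):
    """Yield every hexadecimal prefix of the given length, in order."""
    for combo in itertools.product(HEX_DIGITS, repeat=length):
        yield "".join(combo)


def resolve_prefix(prefix: str, known_ids: list):
    """Hypothetical lookup: return the full ID if exactly one known ID matches."""
    matches = [oid for oid in known_ids if oid.startswith(prefix)]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    print(prefix_space(4))  # 65536
    # Illustrative ID only: the well-known git hash of an empty blob.
    known = ["e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"]
    hits = [p for p in all_prefixes(4) if resolve_prefix(p, known)]
    print(hits)
```

A space of that size can be walked exhaustively in well under a second on a local list, which is why a four-character prefix offers little protection once a lookup service is willing to expand it.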
Asked about the risks this presents, Ayrey said there's a GitHub events archive that records all public GitHub actions. And he said that just as the Sunlight Foundation's archive of tweets could be used to research public social media statements, GitHub's event archive can be used for forensic investigation into what tech companies have been doing.
"If [tech companies] delete code, if they're going out of their way to delete something, it doesn't always mean anything," he explained. "But oftentimes it means something. It could mean a key or password [was exposed]. It could mean they accidentally pushed up a machine learning data set. We've seen that before. Or it could mean – and this is rare – [that] attacker actually backdoored their project and they were a little bit embarrassed about it … so they just deleted the backdoor."
Asked how GitHub should respond, Ayrey mused, "If a platform makes a vulnerability, documents it, and explains that this is something that you should be aware of that's a known risk, does that make it less of a vulnerability?
"What I would probably advocate for, if I worked there, is that this fork pool isn't shared between forks, that the commits that you push to one fork can't be downloaded through another fork. The other thing that I would probably advocate for is a new feature to be built that allows you to actually permanently delete commits and not just leave them dangling."
Truffle Security argues that GitHub should reconsider its position because the average user expects public and private repos to be kept separate in terms of data security, an expectation that doesn't always hold. Users also expect the act of deletion to remove commit data, which again has been shown to not always be the case.
A GitHub spokesperson told The Register, "GitHub is committed to investigating reported security issues. We are aware of this report and have validated that this is expected and documented behavior inherent to how fork networks work. You can read more about how deleting or changing visibility affects repository forks in our documentation." ®