More than half of GitHub is duplicate code, researchers find

Boffins beware: random samples are therefore useless for research

Given that code sharing is a big part of the GitHub mission, it should come as no surprise that the platform stores a lot of duplicated code: 70 per cent, a study has found.

An international team of eight researchers didn't set out to measure GitHub duplication. Their original aim was to define the “granularity” of copying – that is, how much files changed between different clones – but along the way, they turned up a “staggering rate of file-level duplication” that made them change direction.

Presented at this year's OOPSLA (part of the Association for Computing Machinery's late-October SPLASH conference) in Vancouver, the University of California at Irvine-led research found that out of 428 million files on GitHub, only 85 million are unique.

Before readers say “so what?”, the reason for this study was to improve other researchers' work. Anybody studying software using GitHub probably seeks random samples, and the authors of this study argued duplication needs to be taken into account.

As open source watcher Adrian Colyer blogged, “simple random selection is likely to lead to samples including high duplication, which may bias the results of research”, so the paper's resulting public index of code duplication, which they've dubbed “DéjàVu”, helps “understand the similarity relations in samples of projects, or to curate sample to reduce duplicates”.

For example, the study said, if a researcher is studying how many C and C++ programs use assertions, duplication clearly skews their output; similarly, a software quality study needs to take duplication into account.
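The skew is easy to see with a toy sample. The sketch below is illustrative only and is not taken from the study: the helper names, the made-up file contents and the crude substring test for an assertion are all assumptions.

```python
import hashlib


def unique_files(files):
    """Keep the first copy of each byte-identical file (hypothetical helper)."""
    seen = set()
    kept = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def fraction_with_assert(files):
    """Share of files containing an assert call (a crude substring check)."""
    hits = sum(1 for text in files if "assert(" in text)
    return hits / len(files)


# Made-up sample: one popular file copied eight times, plus two originals.
sample = ["assert(x > 0);"] * 8 + ["return 0;", 'puts("hi");']

print(fraction_with_assert(sample))                 # 0.8 with duplicates counted
print(fraction_with_assert(unique_files(sample)))   # one file in three once deduplicated
```

One heavily copied file is enough to move the headline figure from a third to four fifths, which is the kind of bias the authors warn about.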

Figure: Software duplication heat map. X = files, Y = commits, Color = dupes. Source: DéjàVu: A Map of Code Duplicates on GitHub, Lopes et al at ACM

DéjàVu maps file clones in Java, C++, JavaScript and Python.

The researchers assessed code duplication using a variety of hash techniques. Identical code was easy, since identical files produce identical hashes, but it was also necessary to take into account software with small changes (spaces or tabs), or even larger changes.

To draw these other duplicates into their sample, the researchers applied a “token hash” that captured minor changes in spaces, comments, and ordering; and a separate clone-detection package to capture clones with edits too large for the token hash.
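The gap between an exact file hash and a token hash can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' code: the function names, the naive regex tokenizer, the C-style comment handling and the choice of MD5 are all hypothetical stand-ins.

```python
import hashlib
import re
from collections import Counter


def file_hash(source: str) -> str:
    """Hash of the raw text: only byte-identical files match."""
    return hashlib.md5(source.encode("utf-8")).hexdigest()


def token_hash(source: str) -> str:
    """Hash of the bag of tokens, ignoring comments, whitespace and order."""
    # Drop // line comments and /* ... */ block comments.
    # Naive: it does not protect comment markers inside string literals.
    no_comments = re.sub(r"//[^\n]*|/\*.*?\*/", "", source, flags=re.DOTALL)
    # Split into words/numbers and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", no_comments)
    # Sorted token counts make the fingerprint independent of token order.
    bag = sorted(Counter(tokens).items())
    return hashlib.md5(repr(bag).encode("utf-8")).hexdigest()


original = "int add(int a, int b) {\n    return a + b; // sum\n}\n"
reformatted = "int add(int a,int b){\n\treturn a+b;\n}\n"
edited = "int add(int a, int b) {\n    return a - b;\n}\n"

print(file_hash(original) == file_hash(reformatted))    # False: the bytes differ
print(token_hash(original) == token_hash(reformatted))  # True: same tokens
print(token_hash(original) == token_hash(edited))       # False: a token changed
```

A one-token edit already defeats the token hash, which is why larger edits call for a near-duplicate detector that compares token overlap rather than exact fingerprints.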

JavaScript was the most cloned environment of all: a mere six per cent of files spawned the other 94 per cent of JavaScript files on GitHub. Of the C++ ecosystem, 73 per cent of files were duplicates; 71 per cent of Python programs were dupes.

Java developers are the most individualistic among the four environments researched, “but even for Java, 40% of the files are duplicates”.

The other thing that probably won't surprise readers is that duplication is primarily dependency-driven. JavaScript provided a good example: people creating a project would commit NPM libraries into their new repositories as if they were part of the application code.

As Colyer wryly noted: “If ever you have felt like you are downloading the universe when running npm install, here’s the data to prove it: including nested dependencies (nesting up to 47 levels deep was discovered, with median five) the number of unique included projects has median 63, and maximum 1261.”

Similarly, nearly all JavaScript programmers suck jQuery into their projects.

Programmers' habits as Git users also play a part: “there is a lot more duplication of code that happens in GitHub that does not go through the fork mechanism, and instead, goes in via copy and paste of files and even entire libraries”, the study noted. ®
