Original URL: https://www.theregister.com/2013/04/18/github_licensing_study/

Study: Most projects on GitHub not open source licensed

Kids these days, they just don't care

By Neil McAllister in San Francisco

Posted in OSes, 18th April 2013 01:22 GMT

Code-sharing website GitHub has grown so popular that it and open source are practically synonymous for many developers. But new research shows that most of the projects now on GitHub are released under license terms that are unclear, inconsistent, or nonexistent, leaving their legal status as open source software uncertain.

That's according to Aaron Williamson, senior staff counsel at the Software Freedom Law Center, who presented some of his findings on the matter at the Linux Collaboration Summit in San Francisco on Wednesday.

Williamson first became interested in software licensing trends on GitHub after reading a somewhat profane Twitter post from Redmonk's James Governor in 2012, to the effect that today's young developers can't be bothered to deal with the complexities of open source licensing and governance.

At the time, Governor's post inspired much debate on Twitter and beyond. But was what he said true? Are younger developers on GitHub really less likely to specify clear licensing for their projects than earlier generations of coders? Williamson decided to find out.

To that end, he wrote a Python script that continuously polled GitHub, looking for license files. He then ran those files through FOSSology, a tool developed by HP and others that identifies software licenses by the specific language and phrases they contain.

Williamson was quick to point out that his study was by no means scientific, nor was his data set complete. GitHub's API throttles the number of requests you can make per hour, so Williamson was unable to poll the entire archive – in fact, he only made it through the oldest 28 per cent of the repositories. He's also fairly certain that he missed some licenses and that there were some errors and duplications in the data. Still, his results are eye-opening.
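The kind of check such a script performs can be sketched in Python. This is not Williamson's script, which the article does not reproduce: the filename list, the function name `classify_top_level`, and the three result labels are all assumptions made for illustration. The sketch takes the names of a repository's top-level files and reports where a license would most plausibly be declared.

```python
# Hypothetical sketch; not the script used in the study.
# Assumed set of filenames conventionally used to carry license text.
LICENSE_NAMES = {"license", "licence", "copying", "unlicense"}


def classify_top_level(filenames):
    """Return 'license-file', 'readme-candidate', or 'none' for a list
    of top-level file names (an assumed three-way classification)."""
    has_readme = False
    for name in filenames:
        # Compare on the part before the first dot, case-insensitively,
        # so LICENSE, License.txt, and COPYING.md all match.
        stem = name.split(".", 1)[0].lower()
        if stem in LICENSE_NAMES:
            return "license-file"
        if stem == "readme":
            has_readme = True
    # A README may or may not mention a license; it still has to be read.
    return "readme-candidate" if has_readme else "none"


print(classify_top_level(["README.md", "LICENSE", "setup.py"]))
print(classify_top_level(["README", "index.js"]))
print(classify_top_level(["main.c", "Makefile"]))
```

Working out which license a matched file actually contains is a separate step, and the study handed that to FOSSology.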

Licenses? Bah, who needs 'em

According to Williamson, out of the 1,692,135 code repositories he scanned, just 219,326 of them – 14.9 per cent – had a file in their top-level directories that identified any kind of license at all. Of those, 28 per cent only announced their licenses in a README file, as opposed to recommended filenames such as LICENSE or COPYING.
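As a quick arithmetic check on the figures quoted above, the sketch below divides the two repository counts and applies the 28 per cent README share. The variable names are illustrative; only the three numbers come from the article. The straight ratio of the two counts works out to about 13 per cent, which does not match the 14.9 per cent reported, and the article does not explain the difference.

```python
scanned = 1_692_135      # repositories scanned (from the article)
with_license = 219_326   # repositories with a top-level license indication
readme_share = 0.28      # share of those declaring it only in a README

# Share of scanned repositories that identified any license at all.
license_pct = 100 * with_license / scanned

# Approximate number of repositories relying on a README alone.
readme_only = round(with_license * readme_share)

print(f"{license_pct:.1f} per cent of scanned repositories")
print(f"about {readme_only:,} README-only declarations")
```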

Equally interesting, Williamson found that developers with projects on GitHub tend to shun so-called copyleft licenses such as the GNU General Public License (GPL) – which require modified versions of the software to be released under the same license as the original – in favor of more permissive alternatives.

[Chart: license use among GitHub projects]

Most developers on GitHub seem to prefer permissive licenses to the GPL (Source: Aaron Williamson)

Naturally, the GPL was still well represented. Williamson found some 61,000 projects that were licensed under some version of the GPL or Lesser GPL. But his scans turned up fully twice as many projects that were released under either the MIT, BSD, or Apache licenses, none of which are copyleft licenses.

Williamson added that although his data was just a snapshot, and therefore couldn't be used to establish any trends, data gathered by Redmonk does indicate an overall trend toward permissive licensing for projects written in many different languages.

Just why that is wasn't clear. But Luis Villa, deputy general counsel at the Wikimedia Foundation, has suggested that younger developers may be choosing more permissive licenses as a way of pushing back against what they see as a "permission culture." They prefer to let other developers just do whatever they want with their code – and, rightly or wrongly, this might be a reason why many projects are released with no license whatsoever.

Even when GitHub repositories included licenses, however, Williamson found a lot of projects where the licensing was unclear. For example, many projects claimed to be licensed under "the Ruby license," but Ruby's licensing has changed over time, making it difficult to figure out just what the terms are for any specific project if they aren't stated explicitly.

Still other projects offered terms that were inconsistent; for example, a program that claimed to be licensed under the GPL but "for non-commercial use only," which contradicts the GPL's terms.

What's going on with the kids these days?

As for how to explain Williamson's findings, and why today's developers don't seem to be following open source licensing best practices, speculation ran rampant. It's a question that's hard to answer.

One theory proffered was that the acceptance of licensing best practices among open source developers is changing simply because the community itself is changing. It's already dramatically different from how it was in its early days, when the open source community and the Linux community were practically synonymous.

"People looked up to Linus and the Free Software Foundation and were following their example in how they were licensing their software," Williamson said. "But now the open source community is a lot broader than the Linux community, and the relevant platform for a lot of new development and that younger developers are familiar with is the web, rather than GNU/Linux."

[Chart: language use on GitHub]

What? You thought C would win? (Source: A. Williamson)

Indeed, Williamson's data bears that out. Of the 1.7 million GitHub projects he analyzed, the largest portion – 21 per cent – were written in JavaScript. Coming in second with 12 per cent was Ruby, a language most often associated with the Rails web framework.

JavaScript developers, in particular, are prone to ignore software licensing concerns, because large-scale infrastructure projects written in JavaScript are something relatively new. Historical JavaScript practice hardly resembled the rigorous development process of something like the Linux kernel.

"For a long time, JavaScript was sort of the necessary evil of the web," Williamson observed. "There were pieces here and there to make things happen on web pages, but there weren't a lot of cohesive projects. There was a lot of code being shared on websites and blogs. And so I think the historical patterns of JavaScript re-use didn't necessarily correspond to the way that other free software projects are distributed and use licensing."

And what can be done about it?

But not everyone in the audience agreed that the shift to web languages was to blame for poor licensing practice, or even that there was a clear trend. In fact, Richard Fontana, open source licensing and patent counsel for Red Hat, questioned whether developers releasing code without a clear license was even a new phenomenon.

"I remember before GitHub existed; it wasn't so long ago," Fontana said. "There were a lot of repositories on SourceForge that had no licensing information. If you go back even further, to the earliest days of free software, it was very common to share source code on Usenet without any licensing information."

One thing most audience members seemed to agree on, however, was that in today's legal climate, releasing code that is poorly licensed is often no better than not releasing code at all. As one developer observed, "I see so many companies where, when an unlicensed thing comes in, it gets deleted. And I just think that's a waste of effort from the open source community."

Just what to do about it, however, remains an open question. Several audience members suggested pressuring GitHub to prompt developers to specify a license whenever they start new projects, but GitHub's position is to default to an "all rights reserved" license, because it doesn't want its users giving away rights they don't understand.

The best solution the group came up with, in the end, was for people who like a certain project but are uncertain about its licensing to simply ask the developer to specify a license. They could file a bug report on the project's GitHub site, for example.

"People leave stuff out," Williamson explained. "They're busy writing code, and they just don't think about it. It's annoying – it's a slight irritant to put a license file with headers in your code, and people don't do it. But when asked, they often do." ®