Boffins debunk study claiming certain languages (cough, C, PHP, JS...) lead to more buggy code than others
Hard evidence that some coding lingo encourages flaws remains elusive
Tempting though it may be to believe that certain programming languages promote errors, recent research finds little if any evidence of that.
A scholarly paper, "A Large Scale Study of Programming Languages and Code Quality in Github," presented at the 2014 Foundations of Software Engineering (FSE) conference, made the claim that some computer languages show higher levels of buggy code, setting off a firestorm of developer comment.
So, computer science boffins from the University of Massachusetts Amherst, Northeastern University, and Czech Technical University in Prague tried to replicate the study.
In a paper distributed via ArXiv this week, titled "On the Impact of Programming Languages on Code Quality," Emery Berger, Celeste Hollenbeck, Petr Maj, Olga Vitek and Jan Vitek revisit the four major findings of the 2014 paper to evaluate the assumption that programming language design matters.
They found no evidence of that. Their attempt to reproduce the 2014 research mostly failed. Their analysis indicates that flaws show up in C++ code just a bit more often than they should, but even that, they say, is statistically insignificant.
Correlation and causation
In a phone interview with The Register, Emery Berger, computer science professor at the University of Massachusetts Amherst, said it's important to draw a distinction between what a failure to reproduce means and what the actual state of affairs is.
The original study purported to establish a correlation between programming languages and errors, one that people misinterpreted as a causal relationship, he said.
"This doesn't mean it's not true," said Berger. "It just means that many of their claims failed to hold up. There's a joke among data scientists that if you torture the data long enough it will eventually speak. Just because you have data, it doesn't mean it's the right data to establish particular claims. GitHub repo data is a great resource, but not all facts can be ascertained by analyzing it."
The 2014 findings, based on analysis of code published to GitHub, include:
“Some languages have a greater association with defects than others, although the effect is small.”
“There is a small but significant relationship between language class and defects. Functional languages have a smaller relationship to defects than either procedural or scripting languages.”
“There is no general relationship between domain and language defect proneness.”
"Defect types are strongly associated with languages."
But when the researchers attempted to replicate the earlier study, mostly they could not. For the first proposition, they found small differences in the numbers of bugs associated with particular programming languages, but not enough to matter.
And they were unable to repeat the findings that led to the remaining three propositions, with the last two attempts to recreate results hobbled by missing data.
It's a science
"Unfortunately, our work has identified numerous problems in the FSE study that invalidated its key result," state Berger, Hollenbeck, Maj, Vitek and Vitek in their paper. "Our intent is not to blame, performing statistical analysis of programming languages based on large-scale code repositories is hard."
The failure to reproduce the results does not mean the opposite: that it doesn't matter whether you write in a programming language that's functional, procedural, or object-oriented; that it doesn't matter whether the language is statically or dynamically typed; or that it doesn't matter whether it's strongly typed or weakly typed. The data doesn't show that either.
Berger said that, looking beyond this specific study, the broader question is whether programming languages make a difference.
"I believe they do in my heart of hearts," he said. "But it's kind of an impossible experiment to run."
He said it's probably true that programmers writing in Haskell, for example, are more academically trained than the average Python user. "Let's pretend that Haskell programs have fewer bugs," he said. "It could just be the case that more programmers using Haskell have PhDs."
In other words, there's a lot of context that doesn't get incorporated into an analysis of GitHub data.
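To make Berger's hypothetical concrete, here is a minimal Python sketch of how such a confounder could play out. Every number in it is made up for illustration and none comes from either paper; the `HYPOTHETICAL` table and the `rate_by` helper are assumptions of this sketch, not anything the researchers used. In this toy data the bug rate depends only on whether the programmer has a PhD, yet the language with more PhD holders still looks cleaner in aggregate.

```python
from collections import defaultdict

# Entirely made-up counts, for illustration only:
# (language, has_phd) -> (total commits, bug-fix commits)
HYPOTHETICAL = {
    ("Haskell", True): (800, 40),   # 5% bug rate
    ("Haskell", False): (200, 20),  # 10% bug rate
    ("Python", True): (200, 10),    # 5% bug rate
    ("Python", False): (800, 80),   # 10% bug rate
}


def rate_by(keyfunc, data):
    """Return bug-fix commits / total commits for each group picked by keyfunc."""
    totals = defaultdict(lambda: [0, 0])
    for key, (commits, bugs) in data.items():
        group = keyfunc(key)
        totals[group][0] += commits
        totals[group][1] += bugs
    return {group: bugs / commits for group, (commits, bugs) in totals.items()}


# Aggregate view: group by language only, ignoring training.
by_language = rate_by(lambda key: key[0], HYPOTHETICAL)

# Stratified view: group by language and training together.
by_stratum = rate_by(lambda key: key, HYPOTHETICAL)

for language, rate in sorted(by_language.items()):
    print(f"{language}: {rate:.1%} of commits are bug fixes")
for (language, has_phd), rate in sorted(by_stratum.items()):
    label = "PhD" if has_phd else "no PhD"
    print(f"{language} ({label}): {rate:.1%}")
```

Grouped by language alone, the made-up Haskell commits show a lower bug rate than the Python ones. Split by training as well, the two languages are identical within each group, so the aggregate gap here comes entirely from who is writing the code.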
This contrary result is how science is supposed to work: experimental results should be tested for reproducibility, yet all too often that never happens. It should: A 2016 report published in Nature found that half of 1,576 scientists surveyed could not even reproduce their own work.
To their credit, the original researchers anticipated the possibility of flaws in their work, noting several potential threats to the validity of their conclusions.
For the boffins from UMass, Northeastern and CTU, their findings offer a warning about data science pitfalls and underscore the need for studies that can be automated for easier reproduction.
"While statistical analysis combined with large data corpora is a powerful tool that may answer even the hardest research questions, the work involved in such studies – and therefore the possibility of errors – is enormous," they conclude. ®