There's no such thing as an anonymous programmer: your coding style can unmask you, according to research led by Drexel University Comp. Sci. PhD student Aylin Caliskan-Islam.
In work that has serious implications for anyone believing their open source project contributions are anonymous, the researchers find that as many as 95 per cent of contributors to a decent-sized code base can be identified.
That's not a trivial issue: the researchers cite the case of Iranian coder Saeed Malekpour, who was condemned to death for developing a photo upload tool (the sentence was later commuted to life in prison).
In her paper (with co-authors from the US Army Research Laboratory, the University of Maryland, Princeton and Germany's University of Goettingen), Caliskan-Islam applied machine learning to identify 250 C++ authors using their coding style on Google Code Jam projects.
From the paper: “Perhaps a programmer has a preference for spaces over tabs, or while loops over for loops, or, more subtly, modular rather than monolithic code. Can elements of coding style be extracted computationally, and if so, what features are most informative?”
Scraping the work of successful contributors to the Google Code Jam competition, the researchers found that a mere eight training files with 70 lines of code each were enough to identify authors based on their syntactic, lexical, and layout habits.
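To make "layout and lexical habits" concrete, here is a minimal Python sketch that pulls a handful of toy features out of C++ source text. It is not the paper's feature set: the function name `style_features` and the features it computes are assumptions chosen for illustration, based only on the examples the paper quotes (tabs versus spaces, `for` versus `while`).

```python
import re


def style_features(source: str) -> dict:
    """Compute a few toy layout and lexical features from C++ source text.

    Hypothetical helper for illustration only; a real system would use
    many more features and a proper parser.
    """
    lines = source.splitlines()
    nonblank = [ln for ln in lines if ln.strip()]

    # Layout: of the lines that are indented, how many start with a tab?
    indented = [ln for ln in nonblank if ln[0] in " \t"]
    tab_indented = sum(1 for ln in indented if ln[0] == "\t")

    # Lexical: keyword counts. Naive on purpose: matches inside comments
    # and string literals are counted too.
    for_count = len(re.findall(r"\bfor\b", source))
    while_count = len(re.findall(r"\bwhile\b", source))

    # Layout: opening braces that sit alone on their own line.
    brace_own_line = sum(1 for ln in nonblank if ln.strip() == "{")

    total_chars = sum(len(ln) for ln in nonblank)
    return {
        "tab_indent_ratio": tab_indented / len(indented) if indented else 0.0,
        "for_count": for_count,
        "while_count": while_count,
        "brace_own_line": brace_own_line,
        "avg_line_length": total_chars / len(nonblank) if nonblank else 0.0,
    }
```

Feeding vectors like this, one per file, into any standard classifier would be one way to attempt attribution; the article does not say which model the researchers actually used.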
Moreover, code obfuscation doesn't help hide you: “our syntactic feature set is impervious to off-the-shelf code obfuscators, which only change layout and some lexical features”.
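A rough way to see why this might be so, assuming "syntactic" features are derived from a program's parse tree: renaming identifiers and reflowing whitespace leaves the tree's shape untouched. The sketch below uses Python's standard `ast` module as a stand-in for a C++ parser; the helper `node_type_counts` and both sample programs are hypothetical, and this is not the researchers' tooling.

```python
import ast
from collections import Counter


def node_type_counts(source: str) -> Counter:
    """Count parse-tree node types, ignoring names and layout entirely."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


original = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

# The same program after an imagined layout/lexical obfuscation:
# identifiers renamed, indentation changed, loop body pulled onto one line.
scrambled = """
def a(b):
  c = 0
  for d in b: c += d
  return c
"""

# A genuinely different structure: a while loop instead of a for loop.
rewritten = """
def total(values):
    result = 0
    i = 0
    while i < len(values):
        result += values[i]
        i += 1
    return result
"""

same = node_type_counts(original) == node_type_counts(scrambled)
different = node_type_counts(original) != node_type_counts(rewritten)
```

Here `same` and `different` are both true: the cosmetic rewrite is invisible at the tree level, while a change in how the loop is written is not.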
The individual programmer's skill is also important to the analysis:
“Programmers who are more advanced and are able to solve more difficult tasks have more distinct coding styles than programmers that are not as advanced. Programmers with a larger skill set can be identified more easily and with higher confidence. Programmers reveal more individual coding style in more challenging programming tasks.”
As well as software forensics (including tracking authors of malicious software), the paper notes that its techniques could be used to identify plagiarism among computer science students.
Interestingly, the research also suggests that a piece of conventional wisdom – that concise code is good code – may not hold true. It turned out that Google Code Jam contestants who successfully completed the most tasks in the competition wrote longer code than those completing fewer tasks.
The study says: “programmers with better skills tend to write longer code to solve Google Code Jam problems. The mainstream idea is that better programmers write shorter and cleaner code which contradicts with line of code statistics”. ®