Software

This article is more than 1 year old

Your anonymous code contributions probably aren't: boffins

Coders are creatures of easily-identifiable habit

Thu 22 Jan 2015 // 08:02 UTC

There's no such thing as an anonymous programmer: your coding style can unmask you, according to research led by Drexel University Comp. Sci. PhD student Aylin Caliskan-Islam.

In work that has serious implications for anyone believing their open source project contributions are anonymous, the researchers find that as many as 95 per cent of contributors to a decent-sized code base can be identified.

That's not a trivial issue, since the researchers cite the case of Iranian coder Saeed Malekpour, condemned to death for developing a photo upload tool (the sentence was later commuted to life in prison).

In her paper (with co-authors from the US Army Research Laboratory, the University of Maryland, Princeton and Germany's University of Goettingen), Caliskan-Islam applied machine learning to identify 250 C++ authors using their coding style on Google Code Jam projects.

From the paper: “Perhaps a programmer has a preference for spaces over tabs, or while loops over for loops, or, more subtly, modular rather than monolithic code. Can elements of coding style be extracted computationally, and if so, what features are most informative?”

Scraping the work of successful contributors to the Google Code Jam competition, the researchers found that a mere eight training files with 70 lines of code each were enough to identify authors based in their syntactic, lexical, and layout habits.

Moreover, code obfuscation doesn't help hide you: “our syntactic feature set is impervious to off-the-shelf code obfuscators, which only change layout and some lexical features”.

The individual programmer's skill is also important to the analysis:

“Programmers who are more advanced and are able to solve more difﬁcult tasks have more distinct coding styles than programmers that are not as advanced. Programmers with a larger skill set can be identiﬁed more easily and with higher conﬁdence. Programmers reveal more individual coding style in more challenging programming tasks.”

As well as software forensics (including tracking authors of malicious software), the paper notes that its techniques could be used to identify plagiarism among computer science students.

Interestingly, the research also reveals that a conventional wisdom – that concise code is good code – may not hold true. It turned out that Google Code Jam contestants that successfully completed the most tasks in the competition wrote longer code than those completing fewer tasks.

The study says: “programmers with better skills tend to write longer code to solve Google Code Jam problems. The mainstream idea is that better programmers write shorter and cleaner code which contradicts with line of code statistics”. ®

Topics

Special Features

Vendor Voice

Resources

Software

Your anonymous code contributions probably aren't: boffins

Coders are creatures of easily-identifiable habit

More about

More about

Narrower topics

More about

More about

More about

Narrower topics

TIP US OFF

Other stories you might like

In-app browsers are still a privacy, security, and choice problem

Microsoft slammed for lax security that led to China's cyber-raid on Exchange Online

Rust developers at Google are twice as productive as C++ teams

Reducing the cloud security overhead

US legislators propose American Privacy Rights Act - and it looks quite good

Majority of Americans now use ad blockers

Sleuths who cracked Zodiac Killer's cipher thank the crowd

Google will delete data collected from 'private' browsing

Meet clickjacking's slicker cousin, 'gesture jacking,' aka 'cross window forgery'

Academics probe Apple's privacy settings and get lost and confused

Microsoft squashes SmartScreen security bypass bug exploited in the wild

US government excoriates Microsoft for 'avoidable errors' but keeps paying for its products

About Us

Our Websites

Your Privacy