Forget anonymity, we can remember you wholesale with machine intel, hackers warned

Resistance coders, malware writers, and copyright infringers take note


32c3 Anonymous programmers, from malware writers to copyright infringers and those baiting governments with censorship-foiling software, may all be unveiled using stylistic programming traits which survive into the compiled binaries – regardless of common obfuscation methods.


The work, titled De-anonymizing Programmers: Large Scale Authorship Attribution from Executable Binaries of Compiled Code and Source Code, was presented by Aylin Caliskan-Islam to the 32nd annual Chaos Communication Congress on Tuesday.

It was accompanied by the publication of a paper on arXiv [PDF] titled When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries, written by researchers based at Princeton University in the US, one of whom is notably part of the Army Research Laboratory.

The researchers began by trying to identify malicious programmers, noting that there is "no technical difference" between the security-enhancing and privacy-infringing uses of this kind of stylistic mapping. In other words, writing style betrays the writer.

Many of the distinguishing features (such as variable names) in the C/C++ source code analysed by the researchers are removed when that code is compiled, and compiler optimisation may alter the structural qualities of programs still further, obscuring authorship even more.

However, in examining the authorship of executable binaries "from the standpoint of machine learning, using a novel set of features that includes ones obtained by decompiling the executable binary to source code," the researchers were able to show "that many syntactical features present in source code do in fact survive compilation and can be recovered from [the] decompiled executable binary."

The researchers used "state-of-the-art reverse engineering methods" to "extract a large variety of features from each executable binary", representing each programmer's stylistic quirks as feature vectors.

Practically, this meant querying the Netwide disassembler and then the Radare2 disassembler, before using both the "state-of-the-art" Hex-Rays decompiler and the open source Snowman decompiler to extract 426 stylometrically significant feature vectors from the binaries for comparison.
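
The paper's full 426-feature pipeline is not reproduced in the article, but the disassembly side of it can be loosely sketched. The Python below is a simplified, hypothetical stand-in, assuming radare2 and its r2pipe bindings are installed: it pulls instruction mnemonics out of a binary and turns their frequencies into a crude stylometric feature vector. The researchers' real feature set also draws on control-flow and decompiled-syntax-tree properties.

    # Simplified sketch, not the researchers' pipeline: mnemonic frequencies
    # as a crude per-binary style fingerprint. Assumes radare2 + r2pipe.
    from collections import Counter

    import r2pipe

    def mnemonic_histogram(binary_path):
        """Return normalised instruction-mnemonic frequencies for one binary."""
        r2 = r2pipe.open(binary_path)
        r2.cmd("aaa")                     # let radare2 analyse functions
        counts = Counter()
        for fn in r2.cmdj("aflj") or []:  # every recovered function
            ops = r2.cmdj(f"pdfj @ {fn['offset']}") or {}
            for op in ops.get("ops", []):
                mnemonic = op.get("opcode", "").split(" ")[0]  # e.g. "mov", "jne"
                if mnemonic:
                    counts[mnemonic] += 1
        r2.quit()
        total = sum(counts.values()) or 1
        return {m: c / total for m, c in counts.items()}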

A random forest classifier was then trained on eight executable binaries per programmer, to generate accurate author models of coding style. It was thus capable of attributing authorship "to the vectorial representations of previously unseen executable binaries."
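
The classification step itself is standard supervised learning. A minimal sketch with scikit-learn's RandomForestClassifier, reusing the hypothetical mnemonic_histogram() helper above, might look like the following; the file names and author labels are placeholders, not data from the study.

    # Hedged sketch of training and attribution with scikit-learn.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction import DictVectorizer

    # Hypothetical labelled corpus: (binary path, programmer) pairs,
    # with eight binaries per programmer as in the paper's setup.
    corpus = [("alice_01.bin", "alice"), ("alice_02.bin", "alice"),
              ("bob_01.bin", "bob"), ("bob_02.bin", "bob")]  # and so on

    vec = DictVectorizer(sparse=False)
    X_train = vec.fit_transform([mnemonic_histogram(path) for path, _ in corpus])
    authors = [author for _, author in corpus]

    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_train, authors)

    # Attribute a previously unseen binary to its most likely author
    X_new = vec.transform([mnemonic_histogram("unknown.bin")])
    print(clf.predict(X_new)[0])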

The researchers noted: "While we can de-anonymize 100 programmers from unoptimized executable binaries with 78 per cent accuracy, we can de-anonymize them from optimized executable binaries with 64 per cent accuracy."

"We also show that stripping and removing symbol information from the executable binaries reduces the accuracy to 66 per cent, which is a surprisingly small drop. This suggests that coding style survives complicated transformations."

In their future work, the researchers plan to investigate whether stylistic properties may be "completely stripped from binaries to render them anonymous" and also to look at real-world authorship attribution cases, "such as identifying authors of malware, which go through a mixture of sophisticated obfuscation methods by combining polymorphism and encryption." ®

