How AI can help reverse-engineer malware: Predicting function names of code

Or: What kind of research Google's getting in its Mandiant takeover


GTC Disassembling and analyzing malware to see how it works, what it's designed to do and how to protect against it, is mostly a long, manual task that requires a strong understanding of assembly code and programming, techniques and exploits used by miscreants, and other skills that are hard to come by.

What with the rise of deep learning and other AI research, infosec folks are investigating ways machine learning can be used to bring greater speed, efficiency, and automation to this process. These automated systems must cope with devilishly obfuscated malicious code that's designed to evade detection. One key aim is to have AI systems take on more routine work, freeing up reverse engineers to focus on more important tasks.

Mandiant is one of those companies seeing where neural networks and related technology can change how malware is broken down and analyzed. This week at Nvidia's GTC 2022 event, Sunil Vasisht, staff data scientist at the infosec firm, presented one of those initiatives: a neural machine translation (NMT) model that can annotate functions.

This prediction model, from what we understand, can take decompiled code – machine-language instructions turned back into corresponding high-level language code – and use this to suggest appropriate, descriptive names for each of the function blocks. This is for when function or symbol names have been stripped from a binary or obfuscated, and is an alternative to signature-based tools, such as IDA FLIRT.

If you're a reverse engineer, you can skip the functions that, for instance, get the OS to handle a printf() call, and go right to the functions identified as performing encryption or raising privileges. You can ignore a block that's labeled by the model as tolower(), and go after the inject_into_process() one. You can avoid wasting time on dead-ends or inconsequential functions.

Specifically, the model works by predicting function name keywords (eg, 'get', 'registry', 'value') from abstract syntax tree (AST) tokens from decompiled executable files. It was shown that the model was able to label one function as 'des', 'encrypt', 'openssl', 'i386', 'libeay32', whereas an analyst involved in the experiment was only able to suggest encode(). Mandiant also built a second NMT that made predictions from control flow graphs and API calls of code.
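
To make the keyword idea concrete, here is a minimal sketch of how a function name might be broken into the sub-token labels a model like this learns to predict. The helper name and the splitting rules are our own assumptions for illustration; Mandiant did not describe its exact tokenization.

```python
import re

def name_to_subtokens(name: str) -> list[str]:
    """Split an identifier into lowercase keyword sub-tokens (illustrative only)."""
    subtokens = []
    # First break on underscores and any other non-alphanumeric separators.
    for part in re.split(r"[^A-Za-z0-9]+", name):
        # Then split camelCase/PascalCase runs, keeping acronyms and digits whole.
        subtokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
    return [token.lower() for token in subtokens]

print(name_to_subtokens("get_registry_value"))
print(name_to_subtokens("DESEncrypt"))
```

Predicting a short sequence of keywords rather than one exact name means a partly right answer, say 'des' and 'encrypt' without the library name, still tells the analyst something useful.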

Vasisht outlined the typical methods that are used to reverse engineer malware and the myriad challenges that come with that, including the techniques malware creators use to build their code to make it more difficult for threat hunters to find and disassemble it. It makes for what is becoming an untenable situation.

"Reversing is an extremely difficult job and throwing more analyst hours at the problem is not sustainable," he said during his presentation.

By automating function annotations, Mandiant is aiming to address the broad challenges most reverse engineers encounter when analyzing modern malware. The vendor, bought by Google for $5.4bn, wants to scale up reporting of malware functionality and capabilities, reduce the challenges its analysts face, and make reversing more efficient. In other words, make it easier to pinpoint the heart of tricky malware code. We imagine this could also be useful for comparing malware strains.

"We hope to tackle the easy cases so that the analysts can spend their precious time on more important cases," Vasisht said. "At Mandiant, these are the challenges that we set out to tackle with a unified machine learning approach. Our problem statement is: how can we increase function name coverage within binary disassembly in order to accelerate malware triage?"

Malware analysts use a number of techniques that fall under static and dynamic analysis; the former involves studying the executable code, the latter involves running it and observing its operation. Tools such as IDA Pro, Binary Ninja, and Ghidra, plus debuggers, emulators, and hypervisors, help with this. Even so, decompiled and disassembled functions can be hard to follow, forcing reversers to spend hours before they understand what a section of code is doing, and many samples are far too large for a complete analysis. Code can also be encrypted, making static analysis a pain.

In addition, malware can be written to self-terminate or act innocuous if it detects it's running under dynamic analysis. "Malware can detect when they are running in a virtual machine and hide its true behavior. They can maybe check the OS or even check the CPU temperature and determine whether to execute or just hide," he said.

Vasisht detailed two ways to transform binary code into inputs for a predictive NMT model. One uses code2seq, which breaks source code, and decompiled code, down into an AST of representative tokens. The other is Nero, which describes the control flow graph (CFG) of code.
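
As a rough illustration of the AST approach, the sketch below pairs up the terminal tokens in a snippet and records the chain of node types that connects each pair through the tree, which is the kind of path context code2seq-style models consume. It uses Python's own ast module purely as a stand-in: Mandiant's pipeline works on decompiled malware, and the function names here are hypothetical.

```python
import ast
from itertools import combinations

def terminals(node, trail=()):
    """Yield (token, nodes from root to leaf) for each terminal in the tree."""
    trail = trail + (node,)
    if isinstance(node, ast.Name):
        yield node.id, trail
    elif isinstance(node, ast.arg):
        yield node.arg, trail
    elif isinstance(node, ast.Constant):
        yield repr(node.value), trail
    else:
        for child in ast.iter_child_nodes(node):
            yield from terminals(child, trail)

def path_contexts(source):
    """Return (token, node-type path, token) triples for every pair of terminals."""
    leaves = list(terminals(ast.parse(source)))
    contexts = []
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(leaves, 2):
        # Count shared ancestors; the last shared node is the lowest common ancestor.
        shared = 0
        while (shared < min(len(trail_a), len(trail_b))
               and trail_a[shared] is trail_b[shared]):
            shared += 1
        up = [type(n).__name__ for n in reversed(trail_a[shared:])]
        top = type(trail_a[shared - 1]).__name__
        down = [type(n).__name__ for n in trail_b[shared:]]
        contexts.append((tok_a, up + [top] + down, tok_b))
    return contexts

for context in path_contexts("def f(x):\n    return x + 1"):
    print(context)
```

A real system would cap the number of contexts per function and feed them to a neural encoder; this sketch stops at extracting the paths.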

Mandiant engineers looked to both initiatives in creating their function-naming model, he said. As described above, one model focused on ASTs, and the other on CFGs.

"Using code2seq- and Nero-like architectures as an inspiration, we set out to see if we could apply these techniques to malware disassembly by using AST and CFG representations to predict meaningful function [names] and in the process, hopefully reduce the effort surrounding a tedious reverse engineering workflow," Vasisht said.

The engineers used a Linux server with 48 CPU cores, 500GB of system RAM, and eight Nvidia Tesla M40 GPUs with 24GB of memory. The platform was used to run multiple hyper-parameter searches simultaneously – from max AST contexts to output label max sub-tokens – and for training the final model, he said. They used an input dataset of more than 360,000 disassembled functions and annotations taken from 4,000 malicious Windows PE files, some auto-generated from IDA's FLIRT and others from a decade's worth of hand-written reverser annotations from Mandiant.

Mandiant's automated and scalable analysis pipeline showed improvements over the code2seq and Nero models, he said. Now the company needs to consider how it will deploy the model.

"These include using these model predictions with IDA Pro and [the NSA's open-source] Ghidra plug-ins," Vasisht said. "We also envision deploying this model within the malware analyst pipeline. Also, this will enable us to collect feedback about the predictions, also collect some newer annotations so we can iterate and improve on this model in the future."

Future work includes improving the labeling and data quality; using a combined AST and CFG model; and using different mixes of binaries for training the model, he said. ®

Biting the hand that feeds IT © 1998–2022