LLM-driven C-to-Rust: not just a good idea, a genie eager to escape

Automatic for the people? Don’t mind if we do

Opinion Rust changes worlds. The iron ore we mine to feed the industrial age started out as iron atoms dissolved in oceans two billion years ago. Then photosynthesis happened, pouring out oxygen that rusted that iron out of the water into the solid minerals we've found so useful today. Much the same is happening with Rust the programming language, as it becomes the mechanism of choice for turning prehistoric C code into secure, performant material fit for the future.

One of the modern entities playing the role of ancient bubbling slime is DARPA, the Defense Advanced Research Projects Agency, the American agency that worries about the future of warrior tech. It knows as well as anyone how fallible software can harsh the martial mellow. It very much wants to clean up C code. To that end, it has proposed using machine learning to analyze the stuff and ladle it out as buckets of Rust.
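To make that concrete, here is the flavor of rewrite such a tool would be chasing. What follows is a hand-rolled, minimal sketch rather than the output of any real translator, and the C function is invented for illustration: a classic fixed-buffer overflow goes in, a bounds-safe equivalent comes out.

```rust
// Hypothetical C input, shown for comparison:
//
//   void greet(const char *name) {
//       char buf[16];
//       strcpy(buf, name);          /* overflows when name >= 16 bytes */
//       printf("hello, %s\n", buf);
//   }
//
// A plausible Rust rendering: the fixed buffer and the unchecked copy
// simply vanish, so there is nothing left to overflow.
fn greet(name: &str) {
    println!("hello, {name}");
}

fn main() {
    greet("world");
    greet(&"x".repeat(1024)); // would smash the stack in the C original
}
```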

The thinking is sound. General-purpose large language model (LLM) tools like ChatGPT and Gemini do a surprisingly good job as they stand, so a specialized tool trained and tuned for this one task is an attractive area to investigate. There's still no real understanding of LLMs' tendency to hallucinate, but that failing is hardly unknown in human developers, and everyone copes. As the old saying goes: Berkeley gave us Unix and acid, and that's no coincidence.

More soberly, assuming the technology works, there is one class of problem it won't be able to deal with: what if the source code isn't available? You can't dream that up on a silicon trip. The good news is, there's no need to. Decompilation is the process of taking an executable binary and reconstructing a version of the source code that can be examined, edited, and recompiled. It's an intensive forensic business; compiled code is usually stripped of human-readable labels, names, and comments, and it takes a lot of experience and time to reverse-engineer those back into the raw decompiled output. That's much less of a problem for an analytic tool, which doesn't much care what things are called, only what patterns they fall into.
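By way of illustration, and hedged accordingly (real decompilers emit C-like pseudocode, not Rust, and these names are invented), here is what a function looks like once every human-chosen label has gone. Only the structure survives, and structure is precisely what a pattern-matching model eats.

```rust
// An invented example in the spirit of raw decompiler output:
// address-style names (fun_00401a2c, param_1, local_8), no comments,
// but a shape any trained eye, human or model, will recognize.
fn fun_00401a2c(param_1: &[u8]) -> usize {
    let mut local_8: usize = 0;
    for local_10 in param_1 {
        if *local_10 == 0 {
            break;
        }
        local_8 += 1;
    }
    local_8 // a strlen-shaped pattern, no labels required
}

fn main() {
    assert_eq!(fun_00401a2c(b"hello\0world"), 5);
    println!("names gone, pattern intact");
}
```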

Things are made easier by the way compilers produce compiled code. They build their output from standard blocks in standard ways, meat and drink to a model trained on large amounts of data with those things in common. It is at the very least intriguing to think of a C-to-Rust tool with a decompilation front end. It is more fun still when you think that the same idea will work for code written in any language, with the right training. Turing machine equivalence isn't just a good idea, it's the law.

Let's not stop there. Let's add another mature, widespread technology – Just-in-Time compilation, or JIT. It's what turns the JavaScript your browser consumes into the executable binary your processor runs, and it's similarly part of emulation and instruction set translation layers. Normally, developers run the compilation process on their own computers and distribute the executable; JIT moves that work to your machine. Adding this to a decompiling Rustifier creates a security amplifier that doesn't rely on anyone else deciding to do the work. It doesn't matter how proprietary, old, or obscure the code is: this will open it up, rebuild it more safely, and let you get on with things.
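JIT sounds exotic, but the core move is tiny: write machine code into memory at runtime, then jump to it. Below is a deliberately minimal sketch, assuming x86-64 Linux, the System V calling convention, and the third-party libc crate (libc = "0.2" in Cargo.toml), none of which the piece above specifies. It "compiles" a two-instruction add function on the spot and runs it; hardened systems that forbid writable-and-executable pages will, quite reasonably, refuse.

```rust
fn main() {
    // x86-64 System V ABI: arguments arrive in edi and esi,
    // the result leaves in eax.
    //   mov eax, edi ; add eax, esi ; ret
    let code: [u8; 5] = [0x89, 0xf8, 0x01, 0xf0, 0xc3];

    unsafe {
        // Ask the kernel for one page we can both write to and execute.
        let page = libc::mmap(
            std::ptr::null_mut(),
            4096,
            libc::PROT_READ | libc::PROT_WRITE | libc::PROT_EXEC,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        assert_ne!(page, libc::MAP_FAILED, "mmap refused a W+X page");

        // "Compile": copy the hand-assembled bytes into the page.
        std::ptr::copy_nonoverlapping(code.as_ptr(), page as *mut u8, code.len());

        // Jump to it: reinterpret the page as a function pointer and call.
        let add: extern "C" fn(i32, i32) -> i32 = std::mem::transmute(page);
        println!("2 + 3 = {}", add(2, 3)); // prints 5

        libc::munmap(page, 4096);
    }
}
```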

There are reasons to think this will never be practicable, reasons to think it could be, and two good unanswered questions if it does work. The obvious arguments against attempting it are reliability and resources: can an LLM be trusted with security-critical code when we don't know how it works and, in this use case, won't understand the results whether they're good or bad? Causes for optimism here are the restricted scope of the problem and the specificity of the training data.

Resources are tricky. Decompilation and recompilation can run even powerful systems into the dust. There are many, many architectural and implementation techniques to speed this up: JIT has gone from unusable treacle to invisibly swift. Also, if there's one thing the world is not lacking, it's AI accelerator engines. Nobody knows how well LLM-driven decompilers will work. It's no surprise, given the importance of decompilation to threat analysis, that people are starting to work that one out.

Which leaves just two really good questions: is it legal, and where does it end? The legal issue resembles the ongoing and as-yet-undecided matter of whether the IP in training data extends to an LLM's output. Big Tech says no. But this is far spicier, in that it is in part a machine for turning closed source into open source. Big Tech will not like that. Big Tech may not be able to do anything about it, however. Hey, bro, we hear you like disruption.

The last and best question is where does this lead? Automating and democratizing the creation and application of security patches is cool enough in itself. What the underlying technology is doing, however, is simultaneously turning everything into open source while removing the one huge barrier to open source's true potential. FOSS grants everyone the power to change software to behave as one wants and needs, unbeholden to decisions other people make. That only works if you're a skilled programmer who understands tool sets. There aren't that many of those.

This as-yet imaginary tool, built out of very real components, changes all that. A robot that can wrap an LLM around code to unpick it, rewrite it, and rebuild it can make many changes at the prompting of an unskilled user: getting rid of unwanted options, changing behavior. That could be as simple as making menus look and feel the way you like, or as interesting as removing a package's ability to send data back to a third party. Or… it's difficult to foresee the consequences of such a huge grant of power to ordinary people.

As a thought experiment it's a doozie. As a target achievable through existing and in-reach technologies, it's a game-changer that rewrites the relationship between people and machines – and the companies that seek to control both. When we say disruption, bro, we mean it. ®
