Going with the flow makes AI better at solving coding problems

Careful prompting can beat training a model from scratch

Interview A commercial large language model's ability to solve competitive programming problems can be significantly boosted by carefully guiding its reasoning through clever prompt engineering.

To demonstrate this, Codium AI, based in Israel, built AlphaCodium and released the software on GitHub this month. AlphaCodium is not a large language model per se. Instead it's a method that improves the problem-solving abilities of generative AI tools like GPT-4 by using what CEO Itamar Friedman calls "flow engineering."

First, a programming question is fed to the underlying large language model, which is asked to describe and summarize the problem. That reflection then guides how it should begin to solve the problem. AlphaCodium has the model define things like what the inputs and outputs should be before it comes up with a solution. All of this is specified in natural language.
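This pre-processing step can be sketched in a few lines of Python. The `llm` function below is a stand-in for a real model call (such as GPT-4 via an API); the prompt wording and canned reply are our assumptions, not Codium's actual implementation:

```python
# Minimal sketch of AlphaCodium's pre-processing step: before any code is
# written, the model restates the problem as bullet points in natural language.

def llm(prompt: str) -> str:
    # Stand-in for a real model call (e.g. GPT-4). Returns canned text here
    # so the sketch runs without an API key.
    return "- Goal: sum two integers\n- Input: two ints a, b\n- Output: a + b"

def reflect_on_problem(problem: str) -> str:
    """Ask the model to redefine the problem in bullet points, not code."""
    prompt = (
        "Do not write code yet. Redefine this problem in bullet points, "
        "naming the inputs and expected outputs:\n\n" + problem
    )
    return llm(prompt)

# The bullet-point spec then seeds the later code-generation prompts.
spec = reflect_on_problem("Given two integers, print their sum.")
```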

The model then begins to generate code that aligns with the specifications it just described. Programming competitions asking contenders to code to spec typically provide tests showing what a script should output for a given input. AlphaCodium generates more of these test cases, and then it runs possible solutions through them to check whether the code is working as expected.

If a candidate fails to produce the expected output for any of the tests, the model generates different solutions until one passes them all or it gives up. Failures arise when the code doesn't compile or simply produces wrong answers.
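The iterate-until-pass loop described above might look like the following sketch. The `candidates` iterator stands in for repeated LLM calls (the real system also feeds the failing test back into the next prompt, which is omitted here); the test cases and helper names are illustrative assumptions:

```python
# Hedged sketch of the code-iteration stage: run each candidate solution
# against (stdin, expected stdout) test pairs, regenerating until all pass
# or a retry budget runs out.

tests = [("1 2", "3"), ("5 7", "12")]  # public + AI-generated tests

def run_candidate(solution, stdin: str) -> str:
    try:
        return str(solution(stdin))
    except Exception:
        return ""  # a crash counts as a failure, like code that won't compile

def passes_all(solution, tests) -> bool:
    return all(run_candidate(solution, i) == out for i, out in tests)

# Stand-in for successive LLM generations: first candidate is buggy,
# the second is correct.
candidates = iter([
    lambda s: sum(map(int, s.split())) - 1,  # off-by-one bug
    lambda s: sum(map(int, s.split())),      # correct
])

def solve(tests, max_attempts=2):
    for _ in range(max_attempts):
        candidate = next(candidates)  # "regenerate" a fresh solution
        if passes_all(candidate, tests):
            return candidate
    return None  # gave up: every attempt failed the tests

best = solve(tests)
```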

You can see the different steps in the flow engineering process in the diagram below. It is largely split into a pre-processing phase, where the system analyzes the problem in natural language, and a code iteration stage, where it runs possible solutions against public and AI-generated tests.


All the broad steps that guide AlphaCodium into generating code to solve problems

"We don't take the problem and go to the model and tell it, 'Hey, please generate the final solution,'" Friedman told The Register. "We ask the model to please redefine this problem in bullet points." Simplifying it and breaking things up into chunks makes it easier for the model to later generate code for different parts of an algorithm.

Essentially, flow engineering is a procedure that guides the model's problem-solving process by splitting it into well-defined steps. Prompting it to "divide the generated code into small sub-functions, with meaningful names and functionality," we're told, leads to fewer bugs and makes the code easier to test and fix.
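The quoted instruction is a fixed guideline folded into the code-generation prompt. As a sketch (the surrounding scaffolding and function names are our assumptions; only the quoted wording comes from the paper):

```python
# The "small sub-functions" rule is appended as a standing guideline to
# every code-generation request, rather than rewritten per problem.

STYLE_RULE = (
    "Divide the generated code into small sub-functions, "
    "with meaningful names and functionality."
)

def build_codegen_prompt(spec: str) -> str:
    """Combine the bullet-point problem spec with the fixed guideline."""
    return f"{spec}\n\nGuidelines:\n- {STYLE_RULE}"

prompt = build_codegen_prompt("- Read two ints\n- Print their sum")
```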

"We basically spent 95 percent of our time on flow engineering, and only 5 percent on prompt engineering and we did not change the prompts for each [step]," Friedman added.

Engineers from Codium tested their system's performance on hundreds of problems from the validation and test splits of the CodeForces data set compiled by Google DeepMind two years ago. They claim that AlphaCodium was better at solving coding problems than Google DeepMind's AlphaCode and AlphaCode2 models.

In results reported in an arXiv paper [PDF], AlphaCodium was able to correctly answer 44 percent of the questions compared to AlphaCode's 24 percent, while generating only five solutions compared to AlphaCode's ten chosen solutions for 107 validation problems. Interestingly, the gap narrowed when it came to 165 test problems with AlphaCodium solving 29 percent compared to AlphaCode's 28 percent.

AlphaCode selects the ten most promising solutions out of tens of thousands, or hundreds of thousands, of possible scripts it generates – making it computationally intensive to run.

"We focused much more on the entire flow of testing," Friedman said. "For [Google], they did so much work on the generation. They try to generate hundreds of other options and we generate very few solutions, but test them really well to guide the improvement of the code."

AlphaCodium is a tiny bit better than Google DeepMind's latest AlphaCode2 model, which is reportedly 10,000 times more efficient than its predecessor AlphaCode, he added.


How AlphaCodium compares to other state-of-the-art models in terms of accuracy and efficiency

Friedman said he was confident that AlphaCodium's performance isn't due to data leakage, where the underlying model has been trained and tested on the same problems. The GPT-4 version powering AlphaCodium was trained on text scraped from the internet up until September 2021, whereas the problems used to test the system were taken from the aforementioned CodeForces data set, which was released much later.

A better apples-to-apples comparison that assesses the flow engineering process itself, however, is to look at GPT-4's ability to solve those same questions with and without AlphaCodium. Plain old GPT-4 could correctly answer only 19 and 12 percent of problems in the validation and test sets respectively, compared to the AlphaCodium-powered variant's 44 and 29 percent.

In short, it appears that implementing a careful pipeline that generates additional data to guide how code is generated and improve the testing process can be more effective than trying to train a large language model from scratch.

Codium recently released a new tool to support Python developers, who can now call AlphaCodium to directly solve a coding problem in their IDE. You can play with it here. ®
