Prompt engineering is a task best left to AI models

Machine-learning boffins find open source neural nets can optimize their own queries

Large language models have given rise to the dark art of prompt engineering – a process for composing system instructions that elicit better chatbot responses.

As noted in a recent research paper, "The Unreasonable Effectiveness of Eccentric Automatic Prompts" by Rick Battle and Teja Gollapudi from Broadcom's VMware, seemingly trivial variations in the wording of prompts have a significant effect on model performance.

The absence of a coherent methodology to improve model performance via prompt optimization has led machine learning practitioners to incorporate so-called "positive thinking" into system prompts.

The system prompt instructs the model on how to behave and precedes the user's query. Thus, when asking an AI model to solve a math problem, a system prompt like "You're a professor of mathematics" probably – though not always – produces better results than omitting that statement.

Rick Battle, staff machine learning engineer at VMware, told The Register in a phone interview that he's specifically advising against that. "The overarching point of the paper is that trial and error is the wrong way to do things," he explained.

The positive thinking path – where you just insert snippets into the system message like "This will be fun!" – can enhance model performance, he noted. "But to test them scientifically is computationally intractable because you change one thing, and you've got to go rerun your entire test set."

A better approach, Battle suggested, is automatic prompt optimization – enlisting an LLM to refine prompts for improved performance on benchmark tests.
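The idea can be sketched in a few lines of plain Python. This is a toy illustration of automatic prompt optimization, not the paper's actual DSPy pipeline: an optimizer scores candidate system-prompt snippets against a held-out test set and keeps the best one. The candidate snippets and the scoring stub here are invented for illustration; a real scorer would query the model on benchmark questions (e.g. GSM8K) and count correct answers.

```python
# Toy sketch of automatic prompt optimization (illustrative only).
# An optimizer proposes candidate system-prompt snippets and keeps
# whichever scores best on a held-out test set.

CANDIDATE_SNIPPETS = [
    "You're a professor of mathematics.",
    "This will be fun!",
    "Take a deep breath and work step by step.",
    "",  # baseline: no snippet at all
]

def score_prompt(snippet: str, test_set: list[tuple[str, str]]) -> float:
    """Stub scorer. In practice this would run the LLM once per test
    question with `snippet` as the system prompt, then return the
    fraction of correct answers. Here we just pretend that prompts
    mentioning step-by-step reasoning help."""
    bonus = 0.1 if "step" in snippet else 0.0
    return 0.5 + bonus

def optimize(candidates: list[str], test_set: list[tuple[str, str]]):
    # Exhaustively score every candidate and return the best (score, prompt).
    # Real optimizers (e.g. those shipped with DSPy) instead have an LLM
    # propose new candidates iteratively rather than enumerating a fixed list.
    scored = [(score_prompt(c, test_set), c) for c in candidates]
    return max(scored)

best_score, best_prompt = optimize(CANDIDATE_SNIPPETS, test_set=[("2+2?", "4")])
print(best_score, repr(best_prompt))
```

The point of the sketch is the structure, not the scorer: once scoring is automated, the search over prompts becomes a plain optimization loop instead of manual trial and error.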

Prior research has shown that this works with commercial LLMs. The downside is cost: running the researchers' experiment – roughly 12,000 requests per model – through GPT-3.5/4, Gemini, or Claude would have run to several thousand dollars, according to the researchers.

"The point of the research was to discover if smaller, open source models can also be used as optimizers," explained Battle, "And the answer turned out to be yes."

Battle and Gollapudi (who has since left Broadcom) tested 60 combinations of system message snippets, with and without Chain-of-Thought prompting, across three open source models ranging from seven to 70 billion parameters – Mistral-7B, Llama2-13B, and Llama2-70B – on the GSM8K grade-school math dataset.

"If you're running an open source model, even all the way down to a 7B which we were using Mistral for," said Battle, "if you have as few as 100 test samples and 100 optimization samples, you can get better performance using the automatic optimizers which are included out of the box in DSPy, which is the library that we use to do it."

Beyond being more effective, LLM-derived prompt optimizations exhibit strategies that probably wouldn't have occurred to human prompt-tuners.

"Surprisingly, it appears that [Llama2-70B's] proficiency in mathematical reasoning can be enhanced by the expression of an affinity for Star Trek," the authors observe in their paper.

The full system prompt reads as follows:

System Message:

«Command, we need you to plot a course through this turbulence and locate the source of the anomaly. Use all available data and your expertise to guide us through this challenging situation.»

Answer Prefix:

«Captain's Log, Stardate [insert date here]: We have successfully plotted a course through the turbulence and are now approaching the source of the anomaly.»

"I have no good explanation as to why the automatic prompts are as weird as they are," Battle told us. "And I certainly would never have come up with anything like that by hand." ®
