A group of operating systems specialists has said that sloppy benchmarking is harming security efforts by making it hard to assess the likely performance impact of security countermeasures.
The researchers, from the Netherlands and Australia, decided to take a look at the accuracy of security researchers' system benchmarks. As they explain in this paper at arXiv, security papers are littered with so-called “benchmarking crimes”.
The Register spoke to Gernot Heiser, a long-time researcher in trustworthy systems at Australian research center Data61, a professor at the University of New South Wales and also co-founder of OK Labs, which developed one of the world's first “provably secure” microkernels.
Heiser became interested in benchmarking because “I got annoyed by common deficiencies” in how security researchers validate system performance. His Dutch colleagues shared his irritation.
As Heiser explained to The Register, bad benchmarks are more than an irritant because for any security solution “you need to show two things – that the mechanism is effective; that it prevents certain classes of attacks; and that it's useable, because it doesn't impose an undue overhead.”
That makes “benchmarking crimes” (a colourful rather than literal term) important, because they can hide the fact that a promising fix is unusable in the real world.
In their analysis of 50 papers published between 2010 and 2015 (in Usenix Security, as well as IEEE's Security & Privacy, the ACM's CCS, and papers accepted by the NDSS symposium), the researchers say they identified 22 categories of “benchmarking crimes”. These range from ignoring performance impacts altogether, through “creative overhead accounting” and the use of misleading benchmarks, all the way to presenting only relative numbers in a benchmark.
Most often, Heiser said, the crime is that “evaluation data is not complete enough … you look at the 'cost' of the mechanism in a scenario, without doing a thorough evaluation of the performance effects in a representative set of scenarios”.
Take, for example, a researcher who runs the SPEC suite on systems with and without their security solution. “The suite is designed to represent a broad class of use-cases,” he said, but “SPEC only makes sense if you make all the individual programs to come up with the score”.
Cherry-picking SPEC results makes them less representative: “you might pick predominantly CPU-intensive processes and ignore memory-intensive processes,” he said.
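To see why the subset matters, here is a minimal Python sketch. It is not taken from the paper: the program names and runtimes are invented for illustration, and the geometric mean of per-program slowdowns is used here only as one common way of summarising a suite. The comparison is between the overhead computed over every program and the overhead computed over the CPU-bound programs alone.

```python
from statistics import geometric_mean

# Hypothetical runtimes in seconds: (baseline, with security mechanism).
# Program names and numbers are invented for illustration only.
RUNTIMES = {
    "cpu_bound_a": (100.0, 102.0),
    "cpu_bound_b": (200.0, 206.0),
    "mem_bound_a": (150.0, 195.0),
    "mem_bound_b": (80.0, 112.0),
}


def mean_slowdown(results, names=None):
    """Geometric mean of per-program slowdown ratios (protected / baseline)."""
    selected = names if names is not None else list(results)
    ratios = [results[name][1] / results[name][0] for name in selected]
    return geometric_mean(ratios)


# Overhead over the whole (hypothetical) suite.
full = mean_slowdown(RUNTIMES)
# Overhead when only the CPU-bound programs are reported.
cherry = mean_slowdown(RUNTIMES, ["cpu_bound_a", "cpu_bound_b"])

print(f"full suite overhead:    {(full - 1) * 100:.1f}%")
print(f"cherry-picked overhead: {(cherry - 1) * 100:.1f}%")
```

With these made-up figures, the subset reports a much smaller overhead than the full suite, because the memory-bound programs that the mechanism slows down most have been left out.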
Heiser said the prevalence of benchmarking crimes is partly a symptom of the complexity of modern systems: authors might be sloppy or careless, but equally, they might have trouble understanding the implications of their own work.
“That takes a fair degree of expertise,” he said, such that even the people peer-reviewing papers don't notice the problem.
“The upshot is that you get too optimistic a picture of what you can do against a particular attack.” ®