CERN swaps out databases to feed its petabyte-a-day habit
Run 3 reboot provoked challenges for Europe's particle-smashing project
Europe's particle accelerator at CERN spews out around a petabyte of data daily, which means monitoring the computing infrastructure that processes the data is crucial.
CERN's main activities are based on the Large Hadron Collider (LHC), which propels sub-atomic particles around a 27km circuit, 100 meters underground, then smashes them into each other as part of eight distinct experiments. Among them is the CMS experiment, which aims to spot the particles responsible for dark matter, among other things.
Like other experiments, CMS shut down for a period of upgrades from 2018 to 2022, and restarted in July last year for the three-year Run 3 period in which scientists will increase the beam energy and sample physics data at a higher rate.
In preparation, four big LHC experiments performed major upgrades to their data readout and selection systems, with new detector systems and computing infrastructure. The changes will allow them to collect significantly larger data samples of higher quality than previous runs.
But Brij Kishor Jashal, a scientist in the CMS collaboration, told The Register that his team were currently aggregating 30 terabytes over a 30-day period to monitor their computing infrastructure performance.
"Entering the new era for our Run 3 operation, we will see more and more scaling of the storage as well as the data. One of our main jobs is to ensure that we are able to meet all this demand and cater to the requirements of users and manage the storage," he said.
"After the pandemic, we have started our Run 3 operations, which creates higher luminosity, which generates much more data. But in addition to that, the four experiments have had a major upgrade to their detectors."
The back-end system monitoring the infrastructure that supports the physics data had been based on the time series database InfluxDB and the monitoring database Prometheus.
Cornell University's Valentin Kuznetsov, a member of the CMS team, said in a statement: "We were searching for alternative solutions following performance issues with Prometheus and InfluxDB."
Jashal said the system had problems with scalability and reliability.
"As we were increasing the detail on our data points we started to experience some reliability issues as well as the performance issue, in terms of how much resources of the virtual machines, and the services being used."
In search of an alternative, the CMS monitoring team came across VictoriaMetrics, a San Francisco startup built around an open source wide column time series database, via a Medium post by CTO and co-founder Aliaksandr Valialkin.
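Speaking to The Register, Roman Khavronenko, co-founder of VictoriaMetrics, said the previous system had experienced problems with high cardinality, which refers to the number of unique time series a database has to track (one for every distinct combination of metric name and label values), and with high-churn data, where applications are redeployed repeatedly on new instances, so that old series go stale while new ones keep appearing.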
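To make the cardinality and churn problem concrete, here is a minimal Python sketch. It is an illustration only, not CMS's or VictoriaMetrics' code: the metric name `job_count` and the labels `site`, `status` and `pod` are hypothetical. It counts distinct series as unique pairs of metric name and label set, then shows how redeploying onto new instances doubles that count even though the workload has not changed.

```python
from itertools import product


def series_cardinality(samples):
    """Count distinct time series: unique (metric name, label set) pairs."""
    return len({(name, frozenset(labels.items())) for name, labels in samples})


# Hypothetical label values, chosen only for illustration.
SITES = ["site-a", "site-b", "site-c"]
STATUSES = ["ok", "failed"]


def scrape(pod_ids):
    """Emit one sample per (site, status, pod) combination."""
    return [
        ("job_count", {"site": site, "status": status, "pod": pod})
        for site, status, pod in product(SITES, STATUSES, pod_ids)
    ]


# Two long-lived instances: 3 sites x 2 statuses x 2 pods.
stable = scrape(["pod-1", "pod-2"])

# Churn: a redeploy replaces both pods, so every series gets a new
# "pod" label value. The old series stay in storage alongside the new.
churned = stable + scrape(["pod-3", "pod-4"])

stable_cardinality = series_cardinality(stable)
churned_cardinality = series_cardinality(churned)

print(stable_cardinality, churned_cardinality)
```

Each extra label multiplies the number of series, and each redeploy adds a fresh set on top, which is why finer-grained data points can strain a time series back end faster than raw data volume would suggest.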
After first implementing VictoriaMetrics as back-end storage for Prometheus, the CMS monitoring team progressed to using it as front-end storage in place of both InfluxDB and Prometheus, which helped remove the cardinality issues, the company said in a statement.
Jashal told The Register: "We are quite happy with how our deployment clusters and services are performing. We have not yet hit any limits in terms of scalability. We now run the services in high availability mode in our Kubernetes clusters, adding another reliability in the services."
The system runs in CERN's own datacenter, an OpenStack service run on clusters of x86 machines.
InfluxDB said in March this year it had solved the cardinality issue with a new IOx storage engine. "For a long time, cardinality was the proverbial 'rock-in-the-shoe' for InfluxDB. Sure, it still ran, but not as comfortably as it could. With the InfluxDB IOx engine, performance is front and center, and with cardinality no longer the problem it once was, InfluxDB can ingest and analyze large workloads in real time," it said. ®