Cloudflare broke its logging-as-a-service service, causing customer data loss
Software snafu took five minutes to roll back. The mess it made took hours to clean up
Cloudflare has admitted that it broke its own logging-as-a-service service with a bad software update, and that customer data was lost as a result.
The network-taming firm revealed in a Tuesday post that, for roughly 3.5 hours on November 14, its Cloudflare Logs service didn't send the data it collected to customers – and about 55 percent of the logs were lost.
Cloudflare Logs gathers logs generated by the company's cloud services and sends them to customers who want to analyze them. Cloudflare suggests the logs may prove helpful "for debugging, identifying configuration adjustments, and creating analytics, especially when combined with logs from other sources, such as your application server."
Cloudflare customers often want logs from multiple servers and, as logfiles can be verbose and voluminous, the provider worries that consuming them all could prove overwhelming.
"Imagine the postal service ringing your doorbell once for each letter instead of once for each packet of letters," the post suggests. "With thousands or millions of letters each second, the number of separate transactions that would entail becomes prohibitive."
Cloudflare therefore uses a tool called Logpush to batch logs into bundles of predictable size, then push them to customers at a sensible cadence.
Logs that Cloudflare provides to customers are prepared by other tools called Logfwdr and Logreceiver.
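For readers curious about the general pattern, here is a minimal Go sketch of size-and-time batching of the sort a tool like Logpush performs. The names and thresholds (batcher, maxBytes, flushEvery) are illustrative assumptions, not Cloudflare's actual code.

```go
// Minimal sketch of size-and-time log batching; names and thresholds are
// illustrative, not Cloudflare's implementation.
package main

import (
	"fmt"
	"time"
)

type batcher struct {
	maxBytes   int           // flush once the bundle reaches this many bytes
	flushEvery time.Duration // ...or once this much time has passed
	buf        []byte
	lastFlush  time.Time
	push       func([]byte) // delivers one bundle to the customer's endpoint
}

func (b *batcher) add(line []byte) {
	b.buf = append(append(b.buf, line...), '\n')
	if len(b.buf) >= b.maxBytes || time.Since(b.lastFlush) >= b.flushEvery {
		b.flush()
	}
}

func (b *batcher) flush() {
	if len(b.buf) == 0 {
		return
	}
	b.push(b.buf)
	b.buf, b.lastFlush = nil, time.Now()
}

func main() {
	b := &batcher{
		maxBytes:   1 << 20, // roughly 1 MB bundles
		flushEvery: 30 * time.Second,
		lastFlush:  time.Now(),
		push:       func(bundle []byte) { fmt.Printf("pushed %d bytes\n", len(bundle)) },
	}
	for i := 0; i < 100000; i++ {
		b.add([]byte(`{"event":"request","status":200}`))
	}
	b.flush() // deliver whatever is still buffered
}
```

The point, as the postal analogy suggests, is that many small deliveries become a few larger ones.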
On November 14, Cloudflare made a change to Logpush, designed to support an additional dataset.
It was a buggy change – it "essentially informed Logfwdr that no customers had logs configured to be pushed."
Cloudflare staff noticed the problem and reverted the change in under five minutes.
But the bad change triggered a second, pre-existing bug in Logfwdr: under conditions like the Logpush mess, it pushed log events for every customer into the system, rather than only for customers that had configured a Logpush job.
The resulting flood of info is what caused the outage, and the loss of some logfiles.
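In effect, an empty configuration was treated as "no filter" rather than "nothing to do". The hypothetical Go sketch below illustrates that fail-open versus fail-closed distinction; it is an illustration of the failure mode Cloudflare describes, not the company's code.

```go
// Hypothetical sketch of a fail-open filter bug: an empty customer list is
// read as "forward everything" instead of "forward nothing".
package main

import "fmt"

type event struct{ customer string }

// buggyForward fails open: with no configured customers it forwards everything.
func buggyForward(configured map[string]bool, events []event) []event {
	if len(configured) == 0 {
		return events // BUG: empty config interpreted as "all customers"
	}
	var out []event
	for _, e := range events {
		if configured[e.customer] {
			out = append(out, e)
		}
	}
	return out
}

// safeForward fails closed: an empty config means nothing is forwarded.
func safeForward(configured map[string]bool, events []event) []event {
	var out []event
	for _, e := range events {
		if configured[e.customer] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	events := []event{{"alice"}, {"bob"}, {"carol"}}
	empty := map[string]bool{} // what the bad update effectively produced

	fmt.Println("buggy:", len(buggyForward(empty, events)), "events forwarded")
	fmt.Println("safe: ", len(safeForward(empty, events)), "events forwarded")
}
```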
Cloudflare has admonished itself for the incident. It conceded it did most of the work to prevent this sort of thing – but didn't quite finish the job. Its post likens the situation to failing to fasten a car seatbelt – the safety systems are built in and work, but they're useless if not employed.
The networking giant will try to avoid this sort of mess in future with automated alerts that mean misconfigurations "will be impossible to miss" – brave words. It also plans extra testing to prepare itself for the impact of datacenter and/or network outages and system overloads. ®