Discord details how it dodged latency with a super-disk made in the cloud
For when a GCP Local SSD is just not quite reliable enough
Chat platform Discord delivered a playful slap to Google yesterday with a post describing how the company dealt with "reliability issues" to achieve some impressively low latency.
Discord handles 4 billion messages sent through the platform per day by its millions of users. The company runs a set of NoSQL database clusters (powered by ScyllaDB), and the platform's real-time nature means that those databases need to respond to queries as quickly as possible.
"The biggest impact on our database performance is the latency of individual disk operations, how long it takes to read or write data from the physical hardware," said Glen Oakley, a senior software engineer at Discord.
Below a certain query rate, all is good. "Our databases do a great job of handling requests in parallel," said Oakley.
However, at some point you will hit blocking issues, where the database has to wait for an outstanding disk operation to complete before starting another. Things slow down, and users notice. The queries might time out before reaching the top of the queue.
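To make the blocking effect concrete, here is a minimal sketch of a FIFO queue in front of a disk that can run only a fixed number of operations at once. It is a toy model, not Discord's or ScyllaDB's code; the function name, parameters, and figures are all assumptions chosen for illustration.

```python
import heapq


def simulate(arrival_interval_ms, disk_latency_ms, parallelism, n_requests, timeout_ms):
    """Return (max_wait_ms, timed_out) for evenly spaced requests hitting a disk
    that can run `parallelism` operations concurrently (hypothetical model)."""
    # Each heap entry is the time at which one disk slot next becomes free.
    free_at = [0.0] * parallelism
    heapq.heapify(free_at)
    max_wait = 0.0
    timed_out = 0
    for i in range(n_requests):
        arrival = i * arrival_interval_ms
        slot_free = heapq.heappop(free_at)
        start = max(arrival, slot_free)
        wait = start - arrival
        if wait > timeout_ms:
            # The query gave up before reaching the head of the queue,
            # so the slot is handed back unused.
            timed_out += 1
            heapq.heappush(free_at, slot_free)
            continue
        max_wait = max(max_wait, wait)
        heapq.heappush(free_at, start + disk_latency_ms)
    return max_wait, timed_out


# Same query rate and parallelism; only the per-operation disk latency differs.
fast = simulate(arrival_interval_ms=1.0, disk_latency_ms=2.0,
                parallelism=4, n_requests=1000, timeout_ms=50.0)
slow = simulate(arrival_interval_ms=1.0, disk_latency_ms=8.0,
                parallelism=4, n_requests=1000, timeout_ms=50.0)
print("fast disk:", fast)
print("slow disk:", slow)
```

In this model the disk keeps up only while parallelism divided by per-operation latency exceeds the arrival rate. Past that point waits keep growing until queries start timing out, which is the behaviour described above.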
One might have thought that slinging the Local SSDs on offer from GCP would deal with the problem. Oakley noted that the NVMe-based storage offered incredibly low latency, but "in our testing, we ran into enough reliability issues that we didn't feel comfortable with depending on this solution for our critical data storage."
Another option was persistent disks: storage that can be attached or detached when needed, is replicated, and is connected via the network. That network hop means latency is nowhere near as low as with a directly attached disk.
So what to do? The team wanted to stick with GCP and prioritize low-latency disk reads, but did not want to sacrifice existing uptime guarantees. They also needed to be able to survive a bad sector on an SSD. The solution was to use GCP's Local SSDs for low-latency reads while still writing to the Persistent Disks to take advantage of snapshotting and redundancy via replication.
After faffing around with various caching options in software (Discord runs Ubuntu on its database servers), the team settled on md, the Linux software RAID driver, and a tricked-out RAID configuration. RAID0 (which just splits raw data over disks: lose one, lose 'em all) was selected for the Local SSDs, with a RAID1 (basically a mirror) set up between the Persistent Disk and the RAID0 array.
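The layering can be sketched as a toy model: a RAID0 stripe over the fast disks, mirrored by RAID1 against one durable disk. This is an illustration of the idea only, not md's implementation or Discord's configuration. The class names are hypothetical, and since the article does not say how reads are steered to the SSDs, the rule of reading from the fast side first is an assumption.

```python
class Disk:
    """A block device with a fixed number of blocks (hypothetical model)."""

    def __init__(self, name, n_blocks):
        self.name = name
        self.blocks = [None] * n_blocks
        self.failed = False
        self.reads = 0

    def read(self, index):
        if self.failed:
            raise OSError(f"{self.name} has failed")
        self.reads += 1
        return self.blocks[index]

    def write(self, index, data):
        if self.failed:
            raise OSError(f"{self.name} has failed")
        self.blocks[index] = data


class Raid0:
    """Stripes blocks across members; losing any member loses the array."""

    def __init__(self, members):
        self.members = members

    def _locate(self, index):
        if any(m.failed for m in self.members):
            raise OSError("RAID0 array lost a member")
        n = len(self.members)
        return self.members[index % n], index // n

    def read(self, index):
        disk, offset = self._locate(index)
        return disk.read(offset)

    def write(self, index, data):
        disk, offset = self._locate(index)
        disk.write(offset, data)


class Raid1:
    """Mirrors every write to both sides; reads try the fast side first (assumption)."""

    def __init__(self, fast, durable):
        self.fast = fast
        self.durable = durable

    def write(self, index, data):
        written = 0
        for side in (self.fast, self.durable):
            try:
                side.write(index, data)
                written += 1
            except OSError:
                pass  # a failed side is skipped; the other still holds the data
        if written == 0:
            raise OSError("no mirror side accepted the write")

    def read(self, index):
        try:
            return self.fast.read(index)
        except OSError:
            return self.durable.read(index)


ssds = [Disk("ssd0", 4), Disk("ssd1", 4)]  # stand-ins for Local SSDs
pd = Disk("pd0", 8)                        # stand-in for the Persistent Disk
array = Raid1(Raid0(ssds), pd)

for i in range(8):
    array.write(i, f"block-{i}")

healthy = [array.read(i) for i in range(8)]
pd_reads_while_healthy = pd.reads

ssds[0].failed = True                      # one SSD dies, taking the stripe with it
degraded = [array.read(i) for i in range(8)]
print(pd_reads_while_healthy, pd.reads)
```

While the stripe is healthy, every read is served by the SSD side and the durable disk only absorbs writes. Once an SSD fails, reads fall back to the durable copy, so nothing is lost, though the model leaves out the resynchronisation a real array would need afterwards.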
The result was, more or less, the super-disk success hoped for, although Oakley noted there were some specific edge cases encountered in the cloud environment. "In retrospect," he said, "disk latency should have been an obvious concern early on in our database deployments.
"The world of cloud computing causes so many systems to behave in ways that are nothing like their physical datacenter counterparts."
Something to keep in mind during your company's charge to the cloud. ®