Sometimes fast just isn’t fast enough, and in the fast-moving world of NoSQL databases, what was considered blindingly fast yesterday can look slow today. Cassandra, for instance, has long been regarded as a fast solution for ingesting data into a database cluster, but upstart systems such as Aerospike and Scylla are now wiping the floor with it in benchmarks.
Aerospike claims to be 10 times faster than Cassandra, while Scylla reckons one of its three-node clusters can do the work of a 30-node Cassandra setup. Scylla also claims to be a drop-in replacement for Cassandra, using the same data model, storage format and query language.
Both of these databases take common NoSQL models and rework the implementation to take advantage of modern advancements. In the case of Aerospike, it’s the cheapness of memory and solid-state drives; for Scylla, it’s asynchronous programming models. Both also gain an advantage by forgoing the Java programming language in favour of lower-level languages (C and C++), doing away with Java’s two main speed downsides: heap management and garbage-collection pauses. However, these raw figures shouldn’t be taken at face value; we need to know how benchmark figures are obtained.
Why choose NoSQL?
There are two main reasons for picking a NoSQL database over a traditional relational database. The first is the more flexible model of NoSQL (or, in the case of graph databases, a native model that is simply hard to implement relationally); the second is the need for speed.
Relational databases just can’t keep up with scalable NoSQL systems without some very specific tuning or code rewrites. In the NoSQL world, if you need more speed you simply keep adding low-cost nodes until you can keep up with the incoming data stream. This can be very appealing: generally speaking, the servers you pick aren’t expensive, highly tuned machines but off-the-shelf kit (albeit from quite a high shelf). Want more power? Add more machines. But things aren’t quite as clear-cut as that. In the high-speed database world, two measurements determine how fast a solution is.
Throughput and latency
Throughput is defined as the number of operations that a system can do in a given time. This might be the number of data items that can be stored, updated or the number of database reads that can be done. Generally the throughput for a given system will be different depending on the mix of these three and it’s for this reason that benchmarking for a system should be done for the given load type you are expecting.
This is why you should be doing your own testing, simulating the load you expect using something like the Yahoo! Cloud Serving Benchmark (YCSB), and not just accepting manufacturers’ marketing examples. Manufacturers naturally tend to emphasise the speed their solution gives over others, and you need to read the methodology carefully if you are relying on these figures to choose a database. Look carefully at the type of server used, particularly at the amount of memory available. If there’s a lot, there’s a good chance that most of the data in the test never touches the disk, making the throughput seem higher than it will be in the real world.
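To make that concrete, here is a minimal sketch of the kind of measurement YCSB automates: preload a keyspace, drive a mixed read/update workload, and report completed operations per second. The plain Python dict below is only a stand-in for a real database client, so the number it prints reflects interpreter overhead and nothing else; a genuine test would point YCSB’s Cassandra, Scylla or Aerospike bindings at a production-like cluster.

```python
import random
import time

def run_mixed_workload(store, n_ops=100_000, read_fraction=0.95):
    """Drive a YCSB-style mix (default 95% reads, 5% updates) against
    `store` and return the measured throughput in operations per second."""
    keys = [f"user{i}" for i in range(1_000)]
    for key in keys:                     # preload, like YCSB's load phase
        store[key] = "payload"
    start = time.perf_counter()
    for _ in range(n_ops):               # run phase
        key = random.choice(keys)
        if random.random() < read_fraction:
            _ = store[key]               # read
        else:
            store[key] = "new-payload"   # update
    return n_ops / (time.perf_counter() - start)

# A dict stands in for the database client here -- the method is the
# point, not the number it prints.
throughput = run_mixed_workload({})
print(f"{throughput:,.0f} ops/sec")
```

Note how the read/update mix is a parameter: change it and the throughput changes too, which is exactly why a benchmark run on someone else’s workload mix tells you little about yours.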
The second important measure is latency: the time it takes for a database to return a result for a given operation. Generally it doesn’t matter how many servers you throw at the cluster, the latency won’t drop, and in fact it might increase, making the database appear slower. It goes without saying that if speed of response matters to your problem, you will need a low-latency solution. If throughput is what matters, then that’s the benchmark you will be interested in. Of course, latency and throughput interact in complicated ways in a real deployed cluster, but the raw figures will at least give you an idea of what to expect.
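One way to see why latency deserves its own benchmark is to record per-operation timings rather than a single total: the mean can look healthy while the 99th-percentile tail, which is what users actually feel, is dominated by occasional stalls. A rough sketch with synthetic timings (the 50 ms pauses here are an assumption, standing in for something like a garbage-collection pause on a JVM-based node):

```python
import math
import random

def percentile(samples, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic latencies: 95% of ops take ~1 ms, 5% stall for ~50 ms.
random.seed(42)
latencies_ms = [random.gauss(1.0, 0.1) if random.random() < 0.95
                else random.gauss(50.0, 5.0)
                for _ in range(100_000)]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean: {mean_ms:.2f} ms")                       # ~3.5 ms, looks fine
print(f"p50:  {percentile(latencies_ms, 50):.2f} ms")  # ~1 ms
print(f"p99:  {percentile(latencies_ms, 99):.2f} ms")  # ~50 ms, the real story
```

Adding nodes raises the throughput numbers but leaves those 50 ms stalls at 50 ms, which is why per-operation latency figures matter independently of throughput.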
What to look for in a benchmark
As you might expect, Scylla and Aerospike publish their own benchmarks (as do the various flavours of Cassandra) and it’s no surprise that both come out on top for one or both of these important figures. Both companies should be commended for using the industry-standard YCSB suite for the tests and for being upfront about both the technology used and the workloads tested, although it has to be said that Scylla didn’t use the latest version of Cassandra, using instead a version from late 2016.
However, there is an additional word of warning: neither set of tests has an obvious publication date, making it hard to be sure the results are current, and it’s not clear whether the tests are being kept up to date as Cassandra, Scylla and Aerospike evolve.
The bottom line for the speed tests is that Scylla can be 4.6 times faster than Cassandra while having a four to 10 times latency advantage. Aerospike claims to be 10 times faster although that result is only valid for workloads that run in-memory only.
As we all know, speed isn’t everything; maturity and reliability still have to count for something, and Cassandra is the older of the products. Although Scylla is billed as a drop-in replacement for Cassandra, some of Cassandra’s advanced features, such as materialised views, hinted handoff (actually an old Cassandra feature) and lightweight transactions, are still on the roadmap or in experimental versions of Scylla.
Aerospike also continues to add new features, with version 4 promised for early 2018, but appears to be forging its own path, adding the features its customers need.
In the end, though, for many of us, the need for speed may be something of a moot point: how many of us really need to ingest data at the rate some of these modern databases manage?
The waters are muddied even further by cloud-based systems from the likes of Amazon (DynamoDB), Microsoft (Cosmos) and Google (Spanner): the future may not be in self-managed clusters, after all.
Whatever system you decide to look at, just remember: test it yourself on the workloads you expect to be running. Keep it real, kids. ®