This article is more than 1 year old
Using SQL techniques in NoSQL is OK, right? WRONG
Don't try and cut corners, or you'll end up in a twisty, tangly mess
Now that we have CQL for Cassandra and N1QL for Couchbase, it seems the power of SQL is being opened for NoSQL database developers.
There is, however, a real danger that developers who see these SQL-like languages as an implementation of SQL are in for a world of pain as they try to leverage their SQL skills on a NoSQL database.
Unless the developer has a correct understanding of the underlying data model, then almost-SQL-like statements will at best be slow, and will at worst produce incorrect results. Worse still, there are things in SQL you just can't do in CQL, and a lot in CQL that makes no sense in SQL. How can you avoid these problems?
It's no surprise to find that the first thing NoSQL database programmers did in the search for speed and flexibility was throw away the SQL language. This gave them what they were looking for, but the price to be paid was a lack of a standard interface, which made it complicated and difficult to learn.
These interfaces did allow a lot of flexibility, but often the interface library was from a third-party developer, which could lead to a lack of support.
The introduction of SQL-like languages helped alleviate the problem, allowing the SQL programmer to transfer knowledge straight to NoSQL databases. Further, the language is developed by the database programmers, giving some sort of support from official sources. So how like SQL are these languages? Using Cassandra's CQL language you can select data with:
Select * from table where key="Name";
And insert data with:
insert into Sensors (name,insertion_time,reading) values (5f388896-71c4-402c-b7c6-8610e305422f,'2015-01-01',35);
Which looks fabulous: you can replace your expensive fibre-channel servers with cheap commodity clusters, transfer the data and enjoy blazingly fast query responses and data inserts. Except, of course, you can't.
Although the open source developers are doing a fantastic job of making their data look like regular relational tables, the data structure underneath will be far different to anything you'll find in SQL server or Oracle – or anything else, really.
There are good reasons NoSQL is so fast (and distributable), mainly that these engines have done away with table joins. As developers, we are used to working with normalised tables and joining them on the fly with primary and secondary keys.
The engine takes care of the joins and a correctly indexed table will be reasonably performant. However, in a NoSQL cluster, the data is likely to be distributed across a number of machines, so joining tables will take time. As a consequence, the databases won't let you do it. Sure you can add a system on top of the data (Hadoop or Spark), or you can roll your own in the application, but you will have to take that performance hit.
All of which leads to one inescapable fact: with a NoSQL system, your design team can't start off with an entity relationship diagram, and modelling data on a graph store, a column store or a key value store is always going to be very different.
Sure, the SQL-like language may make the data look like it's stored tables, but actually it's going to be very different. Besides the little matter of not being able to join tables, this can effect how we can select data.
Most developers will be accustomed to using the SQL "where" statement to select rows that match a certain criteria. CQL, for instance, does implement a where statement, but you can only use it to select rows on the primary and some indexed keys. You can add indexes to columns, but unlike a SQL index, these do not speed up queries, but rather slow them down. They do allow more where queries to be performed, but your performance will suffer.
Worse, the order of conditionals (and things such as statements) is important, and you can only effectively use them on the clustering column. You can get round this by using a keyword (Allow filtering), but this puts a huge burden on the servers – the entire column is read into memory and then filtered. Bang goes your blistering performance.