The Great Graph Database Debate: Relational can't do everything
DBs aren't just about theory – they are complex systems designed to perform demanding tasks
Welcome back to the latest Register Debate, in which writers discuss technology topics and you, the reader, choose the winning argument.
The format is simple: we propose a motion, the arguments for the motion run on Monday and Wednesday, and the arguments against run today and Thursday. During the week you can cast your vote on which side you support using the poll embedded below, choosing whether you're in favor or against. The final score will be announced on Friday, revealing which argument was most popular.
It's up to our writers to convince you to vote for their side.
This week's motion is:
Graph databases – in which relationships are stored natively alongside the data elements – do not provide a significant advantage over well-architected relational databases for most of the same use cases.
It has been roughly 20 years since the first production deployment of Neo4j, one of the leading protagonists in the graph database story.
Strong market growth and investor interest suggest the graph model might be catching up with the rows and columns of RDBMSes, owing to its ability to analyze data according to the networked relationships we see all around us: in business, media, society, medicine and science.
But detractors still have their doubts, suggesting that the benefits graph systems seem to offer can be created in relational systems, which have a longer history – and are arguably more mature and easier to manage – than their graph counterparts.
Jim Webber, our first contributor arguing AGAINST the motion, is Neo4j's chief scientist and a visiting professor of computer science at Newcastle University.
Relational cannot deal with every use case
We agree with many of the house's assertions about the usefulness of graph APIs. The crux of our disagreement is simply with the claim that some future "well-architected" relational database engine could render the use of today's useful, existing, in-production graph databases unnecessary. The adjective "well-architected" does a lot of heavy lifting here to make that speculation as barely credible as it is.
We should also point out that the idea that relational can deal with any use case, and is a universal database option, is spurious. The notion of "one size fits all" has been dismissed by one of the most influential theorists of the relational school, Michael Stonebraker. Stonebraker argues that distinct workloads, even within the purview of relational databases, require different data engines to work well. Within Neo4j we are happy to embrace this specialization, as the graph database has a different engine from the graph analytics platform. In fact, there is a rich tradition of doing so in document, time-series and, yes, relational databases.
We cannot accept that relational can do anything. Neither can we accept the claim that "[graph] systems ignore many of the hard-learned lessons… from the last 50 years", as if graphs exist outside of day-to-day operational data management. On the contrary, graph databases have consistently tackled transactions, query planning/execution, indexing, consensus, replication and concurrency control using a mixture of tried and tested techniques.
We're glad that the house sees "benefits from... graph APIs" and graph-oriented query languages. This is why we built Cypher, a declarative language with formally described semantics. We are working with academia and industry to define GQL, an ISO-standardized graph counterpart to SQL based on Cypher, through the same ISO committee that standardized SQL. GQL will allow huge numbers of new users to succinctly and humanely express graph queries that are cumbersome and error-prone in SQL.
We reject the assertion that graph databases cannot properly support views and migrations, and therefore lack (a narrow definition of) data independence. Migrations are frequently performed in practice, while GQL includes native syntax for defining graph views. In fact, the schema-optional nature of many graph databases and the fuzzy pattern-matching abilities of their query languages mean they are better off than others with respect to data independence.
But databases aren't just about theory. They are complex systems designed to perform some of the most demanding tasks in computing. It is not sufficient to offer vague notions of "well-architected", implying that previous databases have been poorly architected. The teams behind the many thousands of graph systems in production today certainly haven't chosen graphs out of ignorance, but because the graph model and its performance best met their specific requirements.
As to the claim that you can just write SQL that does graph work, heterogeneous graphs have relationships between nodes that are many, varied, and bi-directional. They are sometimes regular, sometimes not; they are sometimes sparse and sometimes dense. The house's notion that graphs should be modelled as a collection of tables isn't practical. We know this only too well because this is how Neo4j began: by using a graph API atop a relational database, we ended up going against the grain with exploding complexity and dwindling performance. It is that precise technological gap that forced us to build an engine that could process graphs natively, not a perverse will to build our own database!
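To make the tables-versus-native contrast concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how Neo4j or any relational engine works internally. The toy data, the `knows` table and both function names are hypothetical. The first function stores relationships as rows in a single edge table and walks them with a recursive SQL query (using SQLite from the standard library); the second keeps each node's neighbours directly alongside the node and follows them hop by hop. Both return the same answer on this tiny, homogeneous graph; the sketch says nothing about performance, and the heterogeneous, irregular graphs described above would need many more tables and joins on the relational side.

```python
import sqlite3
from collections import deque

# Hypothetical toy data: directed "knows" relationships between people.
EDGES = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "erin"),
    ("erin", "carol"),
]


def reachable_sql(start, max_hops):
    """Relational modelling: one edge table, traversed with a recursive CTE."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO knows VALUES (?, ?)", EDGES)
    rows = conn.execute(
        """
        WITH RECURSIVE walk(person, hops) AS (
            SELECT dst, 1 FROM knows WHERE src = ?
            UNION
            SELECT k.dst, w.hops + 1
            FROM walk AS w JOIN knows AS k ON k.src = w.person
            WHERE w.hops < ?
        )
        SELECT DISTINCT person FROM walk
        """,
        (start, max_hops),
    ).fetchall()
    conn.close()
    return {person for (person,) in rows}


def reachable_adjacency(start, max_hops):
    """Graph-style modelling: each node keeps a list of its direct neighbours."""
    neighbours = {}
    for src, dst in EDGES:
        neighbours.setdefault(src, []).append(dst)
    seen = set()
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop budget spent; do not expand this node further
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return seen


if __name__ == "__main__":
    print(sorted(reachable_sql("alice", 2)))
    print(sorted(reachable_adjacency("alice", 2)))
```

Even in this small case, the relational version has to re-join the edge table once per hop, whereas the adjacency version simply follows the references already held by each node.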
The fact is that for graph use cases common in the wild, native storage engines often have very significant run-time benefits, which include increased data locality, better concurrency support and reduced space usage to name just a few. Some cherry-picked studies from 5+ years ago which largely focus on analytical workloads over homogeneous graphs do not change this reality.
We have had 50 years to understand relational databases. We have come to respect their utility and understand their limitations. Accordingly, we argue that the "well-architected" processing engine proposed by the house is already here. And it's called a graph database. ®
Cast your vote below. We'll close the poll on Thursday night and publish the final result on Friday.