Great Graph Database Debate: Abandoning the relational model is 'reinventing the wheel'
Let's talk about what well-architected looks like
Register Debate Welcome back to the latest Register Debate in which writers discuss technology topics, and you the reader choose the winning argument.
The format is simple: we propose a motion, the arguments for the motion ran on Monday and today, and the arguments against on Tuesday and tomorrow. During the week you can cast your vote on which side you support using the poll embedded below, choosing whether you're in favor or against. The final score will be announced on Friday, revealing which argument was most popular.
It's up to our writers to convince you to vote for their side.
This week's motion is:
Graph databases – in which relationships are stored natively alongside the data elements – do not provide a significant advantage over well-architected relational databases for most of the same use cases.
Arguing FOR the motion once again, with a counterpoint to yesterday's assertions from Jim Webber, is Andy Pavlo, associate professor of databaseology at Carnegie Mellon University.
It's all fun and games until things get ... Relational
My esteemed colleague is being coy in claiming that he does not know what a "well-architected" DBMS means. Nevertheless, allow me to remind him of the key characteristics of such a system. I will focus on analytical queries over graphs, as this is what graph DBMSs claim to be better at than relational DBMSs. My list below is also inspired by this CIDR 2023 paper [PDF] by researchers at CWI.
- Fast Scans on Typed Data – The NoSQL movement misguided developers in thinking that a schemaless (i.e., schema-optional) database is a good idea. It is necessary for some scenarios, and most relational DBMSs now support JSON data types. But for ripping through data quickly, if the DBMS is aware of the schema and the data's types, it removes indirection and improves scan performance.
- Columnar Storage – Storing data in a column-oriented manner has several benefits, including reducing disk I/O and memory overhead due to skipping unnecessary columns for a query. Columnar data also has lower entropy than row-oriented data, making it more amenable to compression schemes that do not require the DBMS to decompress it first.
- Vectorized Query Execution – Using a vectorized processing model [PDF] (as opposed to a tuple-at-a-time model) improves DBMS performance for analytical queries by 10-100x. Processing data in batches also allows the DBMS to leverage vectorized CPU instructions (SIMD) to improve performance further.
- Explicit Control of Memory – The DBMS needs complete control over its memory allocations. This means not using a "managed" memory runtime (eg, JVM, Erlang) where fragmentation and garbage collection will cause performance problems, nor should it allow the OS to determine what data to evict from the cache (eg, MMAP). The DBMS must also have fine-grained data placement to ensure that related information is located close to each other to improve CPU efficiency.
Andy Pavlo: "The mistake they made was to ignore database history and try to reinvent the wheel by abandoning the relational model"
There are some additional optimizations that are specific to graphs that a relational DBMS needs to incorporate:
- Graph-centric Query APIs – The SQL:2023 standard introduces Property Graph Queries (SQL/PGQ) for defining and traversing graphs in a relational DBMS. SQL/PGQ is a subset of the emerging GQL standard. Thus, SQL/PGQ further narrows the functionality difference between relational DBMSs and native graph DBMSs.
- Query Execution Enhancements – A well-architected relational DBMS will want to include enhancements specifically for optimizing graph queries, including multi-way worst-case optimal joins (WCOJ), compact ephemeral data structures (e.g., compressed sparse rows), and factorized query processing. Although adding these requires non-trivial engineering, such enhancements fit nicely with existing relational DBMS execution engines and do not require a new engine to be written from scratch.
As evidence for how a well-architected relational DBMS performs against a graph DBMS, I refer to the CIDR 2023 paper from CWI [PDF]. They extended the DuckDB embedded analytical relational DBMS to add support for SQL/PGQ and the above enhancements. Then, they compared it against a leading graph DBMS on an industry-standard graph benchmark. Their results show that DuckDB, with the above extensions, outperforms the native graph DBMS by up to 10x. These are state-of-the-art results from January 2023 and not from five years ago.
And although the CWI team includes some of the best database systems researchers in the world, they have not raised hundreds of millions of dollars to achieve these results.
Regarding my colleague's reference to Stonebraker's seminal 2006 paper that refutes "one size fits all" DBMS architectures, their anecdote that Neo4j arose from their attempt to use a relational DBMS in the 2000s for graph-centric workloads is evidence of Stonebraker's argument. But the mistake they made was to ignore database history and try to reinvent the wheel by abandoning the relational model. I encourage interested readers also to read Stonebraker's 2006 treatise [PDF] on the failure of alternative data models to supplant the relational data model since its invention in 1969. Graph databases are just another category of non-relational DBMSs that will eventually decline in popularity as relational DBMSs adopt the best parts of them (as they have with others in the past).
Lastly, I stand by my 2021 public wager about the future of the graph database market. I will replace my official CMU photo with one of me wearing a shirt that says "Graph Databases Are #1". I will use that photo until I retire, get fired, or a former student stabs me. ®
Cast your vote below. We'll close the poll on Thursday night and publish the final result on Friday. You can track the debate's progress here.