The Great Graph Debate: Revolutionary concept in databases or niche curiosity?
Two experts go head-to-head – then you decide
Register Debate
Welcome to the latest in our series of Register Debates, in which writers discuss technology topics, and you the reader choose the winning argument. The format is simple: we propose a motion, the arguments for the motion will run this Monday and Wednesday, and the arguments against on Tuesday and Thursday. During the week you can cast your vote on which side you support using the poll embedded below, choosing whether you're in favor or against the motion. The final score will be announced on Friday, revealing whether the for or against argument was more popular.
This week's motion is:
Graph databases – in which relationships are stored natively alongside the data elements – do not provide a significant advantage over well-architected relational databases for most of the same use cases.
It has been roughly 20 years since the first production deployment of Neo4j, one of the leading protagonists in the graph database story.
Strong market growth and interest from investors suggest graph technology might be catching up with the rows and columns of RDBMSes, owing to its ability to analyze data according to the networked relationships we see all around us: in business, media, society, medicine and science.
But detractors still have their doubts, suggesting that the benefits graph systems seem to offer can be created in relational systems, which have a longer history – and are arguably more mature and easier to manage – than their graph counterparts.
Neo4j was founded by Swedish computer scientist Emil Eifrem in 2000 and put its first system into production in 2003. Version 1.0 was released commercially in 2010.
Among its users, Neo4j counts NASA, which has used a graph system to help understand the people, roles and skills it would need to overcome its various scientific and engineering challenges.
Money has flocked to the concept. In July 2021, Neo4j secured $325 million in a funding round that valued the company at $2 billion, adding to five earlier funding rounds.
In November last year, Neo4j's fifth iteration was released, promising query language improvements and up to 1,000x faster query performance. Outside the enterprise version, community edition Neo4j remains open source.
Meanwhile, market rival TigerGraph has been staking its claim. In February 2021, it secured $105 million in funding to add to the $65 million stash already raised. It counts automotive manufacturer Jaguar Land Rover among its customers and added cloud management and ML workbench features last year.
The potential for continuing growth is there. During an industry keynote in 2021, Gartner analyst Rita Sallam forecast that 80 percent of data and analytics innovations will be made using graph technology by 2025. Philip Carnelley, AVP of software research at IDC Europe, has said usage of and investment in graph would grow rapidly among European companies.
Neo4j and TigerGraph have been joined by a growing roster of vendors in the market. Ontotext has GraphDB, and there is also the open source graph database Memgraph.
But vendor claims of graph database domination come with a health warning. As a whole, the segment might be worth $651 million, or 1.4 percent of the $46 billion total database market value.
Nonetheless, doubts have remained that graph databases will, in the long term, offer advantages over RDBMSes. In 2015, a group from the University of Wisconsin argued that a syntactic layer for querying graph relationships in an RDBMS is "competitive to these specialized engines."
"Given that RDBMSes are ubiquitous in enterprise settings, and have a robust and mature technology that has been hardened over decades, and are part of existing administrative methods in place, we argue that it is time to reconsider if specialized graph engines have a role to play in most enterprises," the authors said [PDF].
Stalwarts of the database sector have not stood idle. For example, Oracle Spatial and Graph is an option for Oracle Enterprise Edition and includes Oracle Network Data Model (NDM) graphs, which are built on the Oracle RDBMS for graph-like queries. Apache AGE offers a graph extension to the popular and growing open source RDBMS PostgreSQL. AWS has its own graph database service dubbed Neptune.
Kicking off the debate, arguing FOR the motion, is Andy Pavlo, an associate professor of databaseology at Carnegie Mellon University and co-founder of OtterTune.
'Graph DBMSs garner more attention and mindshare than is warranted'
Recently there has been a lot of academic and industry interest in graph databases and their ilk, such as the Resource Description Framework (a semantic web standard) and triplestores. This is because many developers use knowledge graphs for modeling relationships in their applications. For example, social media applications inherently contain graph-oriented relationships (e.g., "likes", "friend-of"). Given this, we have seen the advent of graph-oriented DBMSs in the last two decades. These systems either target operational workloads (Neo4j, Dgraph) or analytical workloads (TigerGraph, JanusGraph).
But we contend that this interest is misguided: graph DBMSs garner more attention and mindshare than is warranted. These systems ignore many of the hard-learned lessons on data management from the last 50 years. As we now discuss, the graph DBMSs are fundamentally flawed and, for most applications, inferior to relational DBMSs.
We first note that the idea of natively storing databases in a graph-oriented manner is not new. CODASYL was a network (graph) data model proposed in the 1970s for querying and updating a database. Modern graph DBMSs inherit almost the same problems as their CODASYL predecessors. For example, they provide a low-level access language that lacks data independence. This design approach makes schema changes difficult, as it requires the application to maintain multiple versions of records in the database manually. It also makes virtual graphs (i.e., views) more challenging because the graph's structure (i.e., its contents) is unknown before executing a query. In summary, data independence is more difficult to support in graphs than in relations, and all graph DBMSs suffer from this problem.
This limitation alone should be a dealbreaker for any sensible practitioner. But developers continue to think that graph DBMSs are better for graph-data problems than relational DBMSs. This likely is because, in addition to graph-native storage, these DBMSs also support graph-oriented query languages (e.g., Gremlin, SPARQL, Cypher).
But it is straightforward to model a graph (using SQL) as a collection of tables:
Node (node_id, node_data)
Edge (node_id_1, node_id_2, edge_data)
A relational DBMS traverses edges in a graph through joins. A translation layer on top of relations can support graph-oriented APIs that reduce the number of client-server roundtrips for traversal operations. For example, Apache AGE is a graph translation layer for PostgreSQL, and Amazon Neptune is a graph-oriented veneer on top of Amazon's Aurora MySQL offering. Some relational DBMSs, including Microsoft SQL Server and Oracle, provide built-in SQL extensions that simplify storing and querying graph data. With these systems, applications benefit from improved query execution through graph APIs while retaining support for SQL and its extensive ecosystem.
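To make the two-table model concrete, here is a minimal sketch, assuming SQLite through Python's built-in sqlite3 module. The sample rows and the function names friends_of_friends and reachable are hypothetical illustrations, not taken from any of the systems or studies mentioned. A fixed-length traversal costs one self-join of the edge table per hop, and a variable-length traversal can be written as a recursive common table expression.

```python
import sqlite3

# Hypothetical in-memory database. Table and column names follow the
# Node/Edge sketch in the text; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (node_id INTEGER PRIMARY KEY, node_data TEXT);
    CREATE TABLE edge (
        node_id_1 INTEGER REFERENCES node(node_id),
        node_id_2 INTEGER REFERENCES node(node_id),
        edge_data TEXT
    );
""")
conn.executemany(
    "INSERT INTO node VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")],
)
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [(1, 2, "friend-of"), (2, 3, "friend-of"), (3, 4, "friend-of")],
)


def friends_of_friends(conn, start_id):
    """Two-hop traversal: one self-join of edge per hop."""
    rows = conn.execute(
        """
        SELECT n.node_data
        FROM edge AS e1
        JOIN edge AS e2 ON e2.node_id_1 = e1.node_id_2
        JOIN node AS n ON n.node_id = e2.node_id_2
        WHERE e1.node_id_1 = ?
        ORDER BY n.node_id
        """,
        (start_id,),
    ).fetchall()
    return [row[0] for row in rows]


def reachable(conn, start_id):
    """Variable-length traversal via a recursive common table expression.

    UNION (rather than UNION ALL) drops duplicate rows, so the
    recursion also terminates on graphs that contain cycles.
    """
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node_id) AS (
            SELECT node_id_2 FROM edge WHERE node_id_1 = ?
            UNION
            SELECT e.node_id_2
            FROM edge AS e
            JOIN reach AS r ON e.node_id_1 = r.node_id
        )
        SELECT n.node_data
        FROM reach
        JOIN node AS n ON n.node_id = reach.node_id
        ORDER BY n.node_id
        """,
        (start_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(friends_of_friends(conn, 1))  # two hops from alice
print(reachable(conn, 1))           # everything reachable from alice
```

With the sample data above, the first call returns carol and the second returns bob, carol and dave. This sketch says nothing about performance; it only shows that the traversal itself is expressible in plain SQL, which is the premise of the translation-layer approach described above.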
Thus, the question for graph DBMS vendors is whether they can make their graph storage fast enough to overcome the disadvantages noted above. But over the last decade, there have been several performance studies of native graph databases versus a graph simulation on relational DBMSs [1, 2, 3, 4, 5]. In all cases, the relational DBMS solution was preferable. ®
Cast your vote below. We'll close the poll on Thursday night and publish the final result on Friday. You can track the debate's progress here.