Interview The latest version of Neo4j's graph database – 4.0 – touts new scaling features and better security. The Reg talks to self-confessed graph fanboy Dr Jim Webber about how the graph-wrangler is, at last, able to scale to accommodate large databases, and about its biggest enemy: the inertia of developers who stick with SQL no matter what.
A graph database stores data as nodes, properties and relationships, which makes it well suited to queries that explore relationships and that would be harder to construct in a relational database. Neo4j is the most popular graph database, according to the DB-Engines ranking, but still languishes in 22nd place overall – the same position it occupied a year ago.
Dr Webber has been chief scientist at Neo4j since 2010; before that he was a director at software consultancy ThoughtWorks. We ask him why graph databases have not yet taken off to the same extent as other NoSQL approaches.
"I don't know," he confesses. "When I look back at the time when I was working with Martin Fowler and the ThoughtWorks folk, there were so many projects where we shoehorned data into a relational database and made it do all kinds of unnatural things. In retrospect, it would have been so much easier to store those as graphs. The technology wasn't available then. I think it's going to end up that everyone sees it as obvious, but it feels like we had to do the best part of a decade to get that momentum going."
For the uninitiated, what kinds of problems are graph databases suited to handle?
"Graphs are a general-purpose data model, in the same way that relational was a general-purpose model a generation ago," says Webber. "My first use case for graphs was retail recommendations. What stuff is compatible with that stuff you bought? What have people in your surrounding network bought? Anything social tends to work, but I've found dozens of things that turn out to be a graph problem. Power networks, anything with networking in it, routing traffic. Healthcare is a series of events and prescriptions and interventions. A supply chain is a graph. Knowledge is a graph. It's good for identity and access management. Graphs are very applicable in a wide range of use cases."
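To make the recommendation use case concrete, here is a minimal sketch in plain Python rather than Neo4j or Cypher. The names FRIENDS, PURCHASES and recommend, and all of the sample data, are invented for illustration. It answers the question "what have people in your surrounding network bought?" by walking from a person to their direct friends and on to the things those friends bought.

```python
from collections import Counter

# Hypothetical toy data: who is connected to whom, and who bought what.
FRIENDS = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}
PURCHASES = {
    "alice": {"tent"},
    "bob": {"tent", "stove"},
    "carol": {"stove", "lantern"},
    "dave": {"kayak"},
}


def recommend(person, friends=FRIENDS, purchases=PURCHASES):
    """Rank items bought by direct friends that `person` does not own yet."""
    owned = purchases.get(person, set())
    counts = Counter()
    for friend in friends.get(person, set()):
        # Follow person -> friend -> item, skipping items already owned.
        for item in purchases.get(friend, set()) - owned:
            counts[item] += 1
    # Most frequently bought first; ties broken alphabetically for stable output.
    return sorted(counts, key=lambda item: (-counts[item], item))


print(recommend("alice"))
```

In a graph database the two hops would be expressed as a single query over stored relationships rather than as nested loops over dictionaries; the sketch only shows the shape of the traversal.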
Despite all that, graph databases are still niche compared to relational or even document databases. What holds the technology back?
"I think we started off as a niche," says Webber. "I have to bristle at your term 'niche'. Obviously I'm a graph fanboy, but you're right, we started from a smaller place. One of the problems graphs had is that although it's a simpler model compared to relational, you have to learn some stuff. There [are] nodes with labels and properties and relationships with names and direction of properties. That takes getting used to. The barrier to entry for graphs is slightly higher. If you are going to use Neo4j, you have to learn Cypher [query language]. It's not hard for SQL developers, but it's new."
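The model Webber describes can be sketched in a few lines of plain Python. This is an illustration of the concepts only, not Neo4j's implementation or API: the Node and Relationship classes, the outgoing helper and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                                      # e.g. {"Person"}
    properties: dict = field(default_factory=dict)   # e.g. {"name": "Alice"}


@dataclass
class Relationship:
    start: Node                                      # direction runs start -> end
    type: str                                        # the relationship's name
    end: Node
    properties: dict = field(default_factory=dict)


# Hypothetical sample data. In Cypher the same shape would be written roughly as
# (:Person {name: 'Alice'})-[:BOUGHT {year: 2020}]->(:Product {name: 'Tent'})
alice = Node({"Person"}, {"name": "Alice"})
tent = Node({"Product"}, {"name": "Tent"})
bought = Relationship(alice, "BOUGHT", tent, {"year": 2020})


def outgoing(node, relationships, rel_type):
    """Return the nodes reached from `node` along relationships of `rel_type`."""
    return [r.end for r in relationships if r.start is node and r.type == rel_type]


print([n.properties["name"] for n in outgoing(alice, [bought], "BOUGHT")])
```

Because relationships have a direction, asking for what Alice bought finds the tent, while asking what the tent bought finds nothing.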
Neo4j 4.0 has four key new features. The first is scalability, with sharding (horizontal partitioning) to remove size limitations. Second, new granular security controls are said to give more control over security and privacy. Third, multiple databases can now run in a single Neo4j cluster. Finally, a new "reactive architecture" gives developers more control over how data is retrieved.
Why so much attention to scalability? "Prior to 4.0 we were a single image replicated database," says Webber. "If you were an architect evaluating your project, you might worry that the replicated nature of Neo4j wasn't going to give you enough room for growth. Even though Neo4j is really quite powerful on a minimum hardware footprint, that perception has become important for us. Neo4j 4.0 helps with that, it's our step into splitting data across multiple spindles across multiple servers and being able to scale it horizontally. If we are going to be able to capture clickstream data from a large retailer, for example, we have to be able to scale horizontally because those are huge."
What is meant by reactive architecture? "Historically in Neo4j, you would fire off a query and the database would return results. Neither the client nor the database had any control over the rate at which that query was produced or consumed. The reactive API just meant that a developer gets to control the rate at which results come back. You're not building up buffers everywhere and wasting resources."
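The idea of the consumer setting the pace can be illustrated with a Python generator. This is an analogy only, not the Neo4j reactive API: run_query, consume and the log are hypothetical names, and the "query" is just a range of numbers. Each row is produced only when the consumer asks for it, so nothing piles up in a buffer.

```python
def run_query(rows, log):
    """Stand-in for a query: yield rows one at a time, noting when each is produced."""
    for row in rows:
        log.append(f"produced {row}")
        yield row


def consume(results, limit, log):
    """Pull rows on demand and stop asking once `limit` rows have been taken."""
    taken = []
    for row in results:
        log.append(f"consumed {row}")
        taken.append(row)
        if len(taken) == limit:
            break  # no further rows are requested, so none are produced
    return taken


log = []
first_two = consume(run_query(range(1000), log), limit=2, log=log)
print(first_two)
print(log)
```

Although the stand-in query could yield 1,000 rows, the log shows only two were ever produced, because production is driven entirely by the consumer's requests.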
As we finish, Webber returns to the theme of what holds back graph databases, and Neo4j in particular. "That's inertia," he says. "Who's Neo4j's competitor? Not Amazon Neptune or Microsoft CosmosDB. It's Oracle, it's SQL Server, it's that MySQL box that's just good enough. Being able to show the developer community that there's a better, easier way is our biggest inhibitor, that's a big job that's ongoing for us. Our challenge is category education and getting the market to understand the value."