How graph databases took over relationship mapping
It was a thing long before big tech firms used it to work out who knew who, explains AWS
Sponsored Feature Yesterday's web applications were largely built on relational databases. Those familiar table and row-based systems drove everything from ecommerce to online forums and everything in between. But in today's web, it is not just the data in those rows and columns that is important; it is the relationships between them.
To understand this, just look at the likes of Facebook and LinkedIn, which made a business of understanding who knows whom and who works for whom. The entire social media industry thrives on the links between people.
There are other less obvious use cases for relationship-based data analysis, such as online advertising. The device you use, your personal interests, the websites you visit and what you do on them are all data points. The relationships between them hold implicit knowledge, giving advertisers almost clairvoyant insight into you.
Mining these relationships takes a new kind of database that draws on a very old theory: a graph database.
Graph theory was a thing long before big tech firms used it to work out who knew who. It dates back to the 18th century, when Leonhard Euler analyzed the Seven Bridges of Königsberg problem. The puzzle asked whether a person could walk through the Prussian city crossing each of its seven bridges exactly once. The problem turned out to be unsolvable, but to prove that, Euler abstracted the bridges and the land connecting them into nodes (the land masses that made up the city) and edges (the connections between them). This was the first documented graph.
Since then, we have created graphs to represent more kinds of relationships. The explosion of data on the web led to new work on graph-based standards to exchange metadata. For example, the World Wide Web Consortium's (W3C) Resource Description Framework (RDF) standard gave us a way to describe relationships between different types of metadata on the web using a concept called triples.
A triple names a subject (say, 'Patsy'), an object ('Brian'), and a predicate: the relationship between the subject and the object. The predicate 'wife of' gives us the relationship 'Patsy is the wife of Brian'. Patsy and Brian might have other relationships with other subjects and objects. Putting lots of these relationships together creates a graph, and you can infer things about entities in the graph by tracking indirect relationships between them. If Patsy is the sister of Colin, then we can infer that Colin is the brother-in-law of Brian.
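The triple model and the brother-in-law inference above can be sketched in a few lines of plain Python. This is a hand-written illustration, not a real RDF reasoner; the names come from the article's example.

```python
# RDF-style triples as plain Python tuples: (subject, predicate, object).
triples = {
    ("Patsy", "wife_of", "Brian"),
    ("Patsy", "sister_of", "Colin"),
}

def brothers_in_law(triples):
    """Infer (x, brother_in_law_of, y) whenever y's wife is x's sister."""
    inferred = set()
    for wife, p1, husband in triples:
        if p1 != "wife_of":
            continue
        for sister, p2, brother in triples:
            if p2 == "sister_of" and sister == wife:
                inferred.add((brother, "brother_in_law_of", husband))
    return inferred

print(brothers_in_law(triples))
# prints {('Colin', 'brother_in_law_of', 'Brian')}
```

A real triple store generalizes this idea: inference rules like the one hard-coded here are applied across millions of triples.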
RDF is one of two main graph models; the other is the labeled property graph. Whereas RDF describes data as triples, in a property graph entities are called nodes or vertices and the relationships between them are called links or edges. Popular ways to build property graph applications include the openCypher query language and Apache's open-source TinkerPop project.
Finding the value in graphs
This was all very dry academic stuff at first, but it quickly became practically useful. As digitization increased, data volumes exploded, and the relationships between those data elements became more important.
"Looking at the relationships between your data is where you can create a lot of value," explains Brad Bebee, General Manager for Amazon Neptune. Neptune is Amazon Web Services' (AWS) graph database designed from the ground up to store graph data natively, supporting both the RDF and Property Graph data models.
While you can map relationships between entities in relational databases, it is cumbersome, computationally expensive and can even fail. You must retrieve information about the entities from tables and then unify that information using JOIN statements. The more relationships you want to explore, the more JOINs you need. It becomes unwieldy, especially as the number of entities grows. AWS designed Neptune to handle those queries at scale extending to billions of nodes and edges.
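The contrast between JOIN-chaining and graph traversal can be seen in miniature below. Each extra hop in a relational query means another JOIN, while a graph store simply follows edges from node to node. The tiny "knows" network is invented for illustration.

```python
from collections import deque

# A toy adjacency-list graph: who "knows" whom.
edges = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": [],
}

def hops(graph, start, target):
    """Breadth-first search: number of edges between two nodes, or None."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None

print(hops(edges, "alice", "dave"))  # prints 3
```

The SQL equivalent of this three-hop query would need three self-JOINs on the relationship table, and the JOIN count grows with every additional hop; a graph engine's cost grows with the edges actually traversed instead.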
Graph types in Neptune
The graphs that AWS customers typically build with Neptune fall into three main types. The first is the identity graph, designed to hold information about individuals and organizations; social networks are built on these kinds of graphs.
Another kind of graph holds knowledge by storing the relationships between pieces of information. An example might be a catalogue of TV shows and movies together with viewer preferences; this kind of knowledge graph can help find specific shows based on what a viewer likes. Finally, there are fraud graphs. "Fraud is a big problem and has many different techniques," Bebee says, but adds that fraud graphs provide a new angle of detection: the fraudulent activity shows up in relationships. "We see people using fraud graphs to look for those patterns of relationships and connections in their data."
In India, online gaming company Games24x7 used Neptune to spot fraud in its online rummy game, which involves real cash payments. It analyzes the relationships between the six players in a game to see if there is any collusion.
Another fraud graph example was shared at the recent AWS re:MARS conference, where the AWS Fraud and Abuse team described a graph with 100 trillion edges, performing class predictions for every node in the graph. Using graph neural networks, the team was able to detect ten times as many malicious accounts as its previous rule-based detection method could.
Some graphs span different types. For example, Marinus Analytics, an AI company that focuses on eliminating exploitation, used Neptune to create a graph database tracking human trafficking through thousands of online advertisements placed on classified internet sites every day. That is an example of a combined fraud and identity database, says Bebee.
AWS has optimized Neptune for interactive, transactional workloads, he explains. "We focus on high throughput, low latency queries and provide strong transaction semantics," he says. One customer using Neptune is NBC Universal. The company drives social engagement for its online and TV platforms, which requires some heavy relationship-based lifting at speed.
The database supports OLAP use cases too, though. "We also see workloads where people want to do more reporting and analytics," Bebee adds. "Those tend to be more sporadic jobs, which is why graphs cover both OLTP and OLAP workloads."
Neptune under the hood
The cloud made graph databases possible at scale, explains Bebee. "Graph applications can be memory and CPU intensive," he explains, adding that public cloud services have reduced the cost of storage and computing. "Cloud computing's scalability means that you don't have to use specialized hardware to run it. You can run it on any hardware that you can consume on a pay per usage basis."
Neptune uses the same storage layer that the company uses for other databases in the cloud, providing benefits including up to 15 read replicas to reduce read latency for time-critical applications. It also features data encryption at rest, and can support data restoration points extending back 35 days.
As graph queries have random access patterns, the main bottleneck with graph databases is the bandwidth available to pull the huge data volumes into memory for processing, Bebee says. The company offers several EC2 instance types adapted for databases, including instances powered by the AWS Graviton2 processor. AWS says that Graviton2-based instances deliver up to 40 percent better price performance than comparable current-generation x86-based instances for a variety of workloads.
How developers can get involved
Developers transitioning from relational databases to this type of data architecture must rethink their approach. Neptune does not store its data in relational tables. Instead, it stores nodes and edges as sets of four positions called quads.
A quad carries the subject, the predicate, the object, and a fourth value called the graph, which stores either a graph identifier or an edge ID depending on which graph model you are using. Neptune indexes data differently to a relational or document system, maintaining three index types by default with an optional fourth based on customer need.
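The quad layout described above can be modelled as a simple four-field record. This is only a sketch of the logical shape; the identifiers are invented and Neptune's physical storage and indexing are not represented here.

```python
from collections import namedtuple

# The four positions of a quad, as described in the article.
Quad = namedtuple("Quad", ["subject", "predicate", "object", "graph"])

# An edge linking a customer node to an address node (invented IDs).
q = Quad(subject="customer:42",
         predicate="lives_at",
         object="address:7",
         graph="identity-graph")  # a graph identifier or edge ID, per model

print(q.subject, q.predicate, q.object)
```

Every node and edge in the database reduces to records of this shape, which is what makes uniform indexing over subjects, predicates, and objects possible.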
Neptune would store entities such as customers and postal addresses as nodes, and use edges to associate them. A query might then show customers that live at that address. Other nodes could include device types, web browsing behaviors, and interest areas. The more relationships that connect customers to these other entities, the more powerful the graph database becomes.
Developers working with Neptune must become familiar with different ways to query the graph. SQL will not cut it, because working with a graph database goes beyond creating joins. Instead, Neptune supports both the property graph and RDF graph data models, each with its own query language for traversing the graph. RDF graphs can use SPARQL, while property graphs can use openCypher and Apache TinkerPop's Gremlin. openCypher is not SQL, but its syntax is inspired by it, which makes writing graph traversals easier for developers coming across from the relational side.
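To make the customer-and-address example above concrete, here is roughly what the same question ("which customers live at this address?") looks like in each query language. The labels, property names, and IRIs are invented for illustration; they are held as strings here rather than executed against a live endpoint.

```python
# SPARQL, for the RDF model (IRIs are illustrative placeholders).
sparql = """
SELECT ?customer WHERE {
  ?customer <urn:ex:livesAt> <urn:ex:address7> .
}
"""

# openCypher, for the property graph model (labels/properties invented).
opencypher = """
MATCH (c:Customer)-[:LIVES_AT]->(a:Address {addressId: 7})
RETURN c
"""

# Gremlin (Apache TinkerPop): start at the address, walk incoming edges.
gremlin = "g.V().has('Address', 'addressId', 7).in('LIVES_AT')"
```

All three express the same traversal; which one you use depends on whether your data is modelled as RDF triples or as a labeled property graph.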
AWS continues to add integration opportunities for Neptune. April 2022 saw the announcement of the general availability of openCypher support, and in July 2022 AWS announced the availability of Amazon Neptune Global Database to support graph use cases that require multi-region capabilities, such as business continuity or low-latency access for geographically distributed customers.
Going into more detail, Neptune Global Database is aimed at customers who want to deploy a primary Neptune database in one region and replicate their data to up to five secondary read-only database clusters in different AWS regions. This allows customers to fail over their applications to a secondary cluster, minimizing application downtime. A Neptune cluster can recover in less than one minute even in the event of a complete regional outage, effectively providing the customer with a Recovery Time Objective (RTO) of under a minute.
Integration with other databases and AI
The graph database might store information differently, but that does not stop it from integrating with Amazon's other managed databases. It does this using a streams capability, which captures changes to a database and makes them available to others, allowing Neptune to consume streams from other databases and vice versa. It also integrates with OpenSearch for use cases that combine full-text search and graph queries.
AWS has also enhanced Neptune with support for machine learning, as it has done with other databases. This features extensions in Gremlin and SPARQL to support predictions. It makes the predictions using graph neural networks (GNN), which is machine learning that is purpose-built for graphs, to build models based on a graph's structure.
"With Neptune ML you could issue a Gremlin or a SPARQL query for attributes that don't actually exist in your graph," Bebee says. "They will have been inferred by your graph neural network model". Possible use cases include predicting whether a customer will buy a product. You can also use SageMaker, Amazon's product for building and training models in the cloud. Developers can use this tool to pick which graph data they want to train with.
"You should think about the questions you're trying to answer," concludes Bebee. "Are they best solved by traversing those relationships? Those are the workloads where you'll get the benefit of using a graph database." It will give your data processing an edge - or several billion of them.
Sponsored by AWS.