Analysis Much has changed at Microsoft since Steve Ballmer described Linux as "a cancer" in reaction to the open-source flag-flyer's threat to Redmond's money-spinning Windows business.
Three years after Redmond's researchers published their whitepaper on distributed graph engine Trinity, Microsoft has announced that it has released the technology – now named Graph Engine – on an open-source basis under the MIT licence.
Why? Graph Engine lands in an odd, open-sourcey kind of place. According to DB-Engines, the most popular graph DBMS by a wide margin is Neo4j, followed by OrientDB and Aurelius's TitanDB graph databases, each of which has roughly 15 per cent as many users.
OrientDB and TitanDB have both been dogged by complaints that they are not production-ready, yet they are both in use. Alongside Neo4j, they are essentially open-source NoSQL databases, which Graph Engine arguably is too, although it has not been designed to do the things that graph databases are known for doing.
Known for doing?
Graph databases are databases (duh) specialised to perform complex queries on highly interconnected data. In theory, they cater to queries multiple levels deep for which multi-way JOINs in relational databases would be prohibitively computationally expensive – although performance in terms of how many levels may actually be traversed varies between graph database offerings.
All graph databases are, by definition, made up of the lines, or edges, connecting the data items within them. Those data items would be considered nodes by graph theory proponents (which almost all graph database users are) and are roughly equivalent to rows in relational databases. Both the edges and nodes contain properties, which function like key-value pairs for the purposes of querying data.
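To make the nodes, edges and properties idea concrete, here is a minimal sketch in Python. It is not taken from any of the products named here; the `Node` and `Edge` classes, the `find_edges` helper and the sample data are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # A data item, roughly equivalent to a row in a relational table.
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    # A labelled, directed line connecting two nodes.
    source: int
    target: int
    label: str
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: three people and who knows whom.
nodes = {
    1: Node(1, {"name": "Alice"}),
    2: Node(2, {"name": "Bob"}),
    3: Node(3, {"name": "Carol"}),
}
edges = [
    Edge(1, 2, "KNOWS", {"since": 2010}),
    Edge(2, 3, "KNOWS", {"since": 2014}),
]


def find_edges(edges, label, **wanted):
    """Return edges with the given label whose properties match every key-value pair."""
    return [
        e for e in edges
        if e.label == label
        and all(e.properties.get(k) == v for k, v in wanted.items())
    ]


matches = find_edges(edges, "KNOWS", since=2014)
names = [
    (nodes[e.source].properties["name"], nodes[e.target].properties["name"])
    for e in matches
]
print(names)
```

The query filters on an edge property and then reads node properties, which is the key-value style of lookup described above.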
But while all graph databases are built on the relationships between data items, the graph databases currently available on the market are quite different in the workloads they're optimised for. Some, such as Neo4j, employ a single data model that's been optimised, while others like OrientDB support several data models.
This is where Graph Engine offers something different.
Late, but different
Microsoft's Graph Engine has not been tuned to query or store data. Nor is it ACID transaction compliant – like most other graph databases. Rather, Graph Engine lets you crunch analytical workloads, including online transaction processing, using the memory of a distributed system.
As one researcher from the Graph Engine team described it to The Register, Graph Engine is a "distributed in-memory data processing engine" with "a strongly typed distributed key-value store" as the storage backend.
In other words, the computation takes place across a cluster of machines – a cloud – where the storage infrastructure holds data in-memory.
As the data is being held in-memory, it cannot be located through a physical address to a location on a networked hard disk, but instead is tracked using hashes and replicated index tables across the cloud.
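As a rough illustration of finding data by hash rather than by disk address, the sketch below spreads key-value pairs across several in-memory "machines". It does not reproduce Graph Engine's actual addressing scheme, and it does not model the replicated index tables; the `Cluster` class and `machine_for` function are hypothetical names, and plain Python dictionaries stand in for each machine's memory.

```python
import hashlib


def machine_for(key: int, machine_count: int) -> int:
    """Map a key to a machine index using a stable hash of the key."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % machine_count


class Cluster:
    """A toy in-memory key-value store partitioned across several machines."""

    def __init__(self, machine_count: int):
        # One dictionary per machine stands in for that machine's RAM.
        self.machines = [dict() for _ in range(machine_count)]

    def put(self, key: int, value) -> None:
        self.machines[machine_for(key, len(self.machines))][key] = value

    def get(self, key: int):
        # The same hash that placed the value tells us where to look for it.
        return self.machines[machine_for(key, len(self.machines))].get(key)


cluster = Cluster(machine_count=4)
cluster.put(1001, {"name": "Alice"})
cluster.put(1002, {"name": "Bob"})
print(cluster.get(1001))
```

Because the hash is deterministic, any machine can work out where a key lives without consulting a central directory, which is the general appeal of this kind of layout.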
This lets you perform those classic graph queries that delve multiple levels deep, only they are sped up through the use of a memory-based storage infrastructure. It also means that offline graph analytics can be improved with the parallel processing provided by the distributed architecture.
A key difference between graph databases is the query language each uses, with most having their own. Neo4j's language is called Cypher, a declarative, SQL-inspired language for describing patterns in graphs visually using an ASCII-art syntax.
OrientDB's query language commits to being SQL-like, while TitanDB uses the Gremlin query language, which is vying to become the standard. Similar to Cypher, it chains traversal operators together to form path-like expressions of how the query should be executed throughout the graph.
The real difference in Graph Engine is visible here, with its use of Language Integrated Knowledge Query (LIKQ). According to Microsoft this lets users express their query logic using lambda expressions. "It combines the capability of fast graph exploration and the flexibility of lambda expression: server-side computations can be expressed in lambda expressions, embedded in LIKQ, and executed on the server side during graph traversal," Microsoft said.
Translated: lambda expressions can be embedded in LIKQ queries, for analysis of the data within the graph and more complicated analytics.
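The general idea of steering a traversal with lambda expressions can be sketched in a few lines of Python. This is emphatically not LIKQ syntax and not Microsoft's API: the `traverse` function, its `follow` and `collect` parameters and the sample graph are all made up for illustration, and everything runs in a single process rather than on a server.

```python
from collections import deque

# Hypothetical sample graph: who links to whom, plus one property per node.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}
ages = {"alice": 34, "bob": 27, "carol": 41, "dave": 19, "erin": 52}


def traverse(graph, start, max_hops, follow, collect):
    """Breadth-first traversal up to max_hops.

    follow(node) decides whether to expand a node's neighbours;
    collect(node) decides whether the node belongs in the results.
    """
    seen = {start}
    queue = deque([(start, 0)])
    results = []
    while queue:
        node, hops = queue.popleft()
        if collect(node):
            results.append(node)
        if hops == max_hops or not follow(node):
            continue
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return results


# Both pieces of query logic are supplied by the caller as lambdas.
result = traverse(
    graph,
    "alice",
    max_hops=2,
    follow=lambda n: ages[n] >= 30,
    collect=lambda n: ages[n] > 40,
)
print(result)
```

The traversal code stays generic while the caller supplies the logic evaluated at each node, which is the flexibility Microsoft is claiming for its approach.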
Other differences? Graph Engine is Windows-only but will "soon" arrive on non-Windows platforms, according to this post on Hacker News.
Also, unlike the open-source crowd, there's no paid support, so use Graph Engine and you're on your own if things go wrong. Evidently, Microsoft wants to encourage folk to play with Graph Engine, suggesting users ping it for "design consulting".
These are early days and "free" is the bait used in open source to encourage adoption. Payment comes later or in parallel for the "enterprise" edition.
If and when Graph Engine receives wide use, paid support will no doubt be forthcoming.
With Graph Engine being the first processing-focused technology of its type, Microsoft will await uptake before commercialising it with paid support, as it has SQL Server or Azure SQL. ®