Microsoft arrived on the graph-database scene last month. Three years in the making, Microsoft released Trinity under a typical-by-now-of-Microsoft-boring-trade name of Graph Engine.
Already on that scene are Neo4J, MarkLogic, Oracle, SAP and Teradata - among others.
Driving Microsoft, like those before, is the desire to connect - to establish connections between things and derive some kind of gain.
Those “things” could be people, “likes”, online sales – tech firms are almost literally trying connecting the dots or as they like them to be called “nodes.”
But that’s so last year. The new thing is Artificial Intelligence and the Machine Learning that gets us there. And, slowly, tech firms are pushing graph engines as an enabler of ML.
Neo4J, for example, offers Graphify an unmanaged extension for language text classification. Microsoft’s Concept Graph, here, is build on Probase that uses Trinity for natural language and machine-language understanding.
All that ML and Artificial Intelligence (AI) jazz is directly founded on mathematics and while you’ll no doubt have heard much hype about ML and AI and talk of tools and frameworks, you are missing the point – or, rather, getting a little ahead of the game.
In fact, nothing demonstrates the need to familiarise your self with mathematics better than graph databases and nothing demonstrates that better than the City of Kaliningrad on the Baltic in Russia.
But before we get there, the first thing you have to realise is that “maths” is not “arithmetic” as many might believe. I was never taught this at school (or I was away that day) so I confused the two for years. Now I know that arithmetic is doing sums and maths is about understanding and using the patterns that occur naturally in numbers. It can and often does involve equations but they are simply a shorthand way of expressing ideas. So, maths is about using the patterns and maths research is about finding those patterns in the first place. And some of that research has revolutionised our lives.
Some maths research starts with questions that appear to have no connection to numbers at all. For example, consider the city of Kaliningrad. A river runs through it and forms two islands and the city is built both on the banks of the river and on the two islands.
Now, back to Kaliningrad, or rather, Königsberg as it was named in the 18th century, which is where we must travel. The city is cut in half by a river, the Pregoyla, and in the Pregoyla are two islands. Some in the city had become fascinated by what would later became maths-based question: was it possible to cross each of the seven bridges in the city just once in one journey? Were talking four land masses with four bridges connecting one island to the mainland - two North and South), two bridges connecting the second island (one North and one South), and one bridge connecting the two islands – East and West.
You can of course try to draw a path and fail.
But failing doesn’t prove that it can’t be done. Succeeding proves that it can be done but no one could succeed. Swiss mathematician, physicist, astronomer, logician and engineer Leonhard Euler became interested in the problem and he did what mathematicians do: he reduced an apparently complex problem to a set of basic components and proceeded to derive the rules by which they operated.
This approach meant that he (and now we) could not only solve the Königsberg problem but solve it for any number of bridges in any city. Better still this led to the development of what we now call graph theory and graph databases - which are so generalised that they can be used for space research.
Euler realised that the shape of the land was immaterial to the problem, so he reduced the land masses to a set of points or what we would now call (in a graph database) nodes. The bridges are simply links between the nodes and are now called edges.
Once simplified, Euler reasoned as follows. If one node has two edges and you start your journey on the node then you have to cross one edge to get off it (1).
You can then wander around as much as you like, visiting other nodes and crossing other edges but ultimately you have to cross the other edge (2) and you are now stuck on the node with no way to get off.
To put that more simply, if you start on a node with two edges you have to finish on it. (To translate back to the Königsberg bridge problem, if you start on one land mass with two bridges, you have to finish on that land mass.) This turns out to be true for all nodes with an even number of edges. The sequence: off, on, off, on, off, on always has to end with on.
And by the same token, if you don’t start on a node with two edges you can cross onto it but you have to cross the other edge on your way off, so if you start off on a node that has two edges, you have to also finish off the node. (In Königsberg speak, if you start off on a land mass with only two bridges, you have to finish off that land mass.) And this is also true for all nodes with an even number of edges.
And, as you have already guessed, the rules are reversed if a node has an odd number of edges. If you start on a node with an odd number of edges you have to finish off it; if you start off it you have to finish on it.
So we have a set of rules that logic (or Euler) tells us is irrefutable. The table shows us where you must finish for a given set of starting conditions:
|Even number of edges||Odd number of edges|
|Start on node||On||Off|
|Start off node||Off||On|
Now, if you have two nodes connected by two edges, we can complete the task.
We can start on and finish on node A, which agrees with the rules. We can start off and finish off node B, which also agrees with the rules.
So, in order to solve the Königsberg problem, a really important general question to ask is: “How many nodes have an even number of edges and how many have an odd number?” The answer is that there are four nodes and they all have an odd number of edges.
We know that for each of these four nodes: if you start on the node you must finish off the node and if you start off the node, you must finish on the node.
We have to choose one node on which to start, so there are three nodes from which we don’t start. If we don’t start on a given node we have to finish on that node. That means, to solve the problem, we have to finish on three different nodes. That is impossible, so the Königsberg bridge problem is unsolvable.
Euler solved the problem and it is important to note that there is not an equation or even a sum in sight. Mathematics is about logic and the equations that we often associate with it are merely formal ways of expressing that logic. So I am certainly not saying that you cannot use formulae to represent the Königsberg bridge problem but I am saying the underlying logic is more important.
In 1735 Euler presented his work to the St. Petersburg Academy of Science, crested by western-reforming Russian Czar Alexander the Great that is today the Russian Academy of Sciences.
The fact that Euler formally presented it to such an august body tells you that, of course, he recognised that this was no longer about bridges but was an entirely new branch of mathematics. Subsequently other mathematicians became interested in the field and so the bridge problem gave birth to the entire field which is now called graph theory.
So how do we get from graph theory to graph databases?
Underpinning every type of database engine lies a data model; the one underpinning relational database engines is based on (to no one’s great surprise) relations. (A relation is approximately maths-speak for a table).
Now the underlying data structure is what ultimately defines the strengths and weaknesses of the database engine. Relational database engines are very good at joining tables together and then sub-setting the result by column and row. They are very bad at solving problems like the travelling salesman problem which is a classically hard computational problem to solve. In short, suppose we have 20 towns and we know the distances between every pair. A salesman has to visit each town and can do so in any order. What is the shortest route he can take?)
Graph databases are underpinned by a data model that consists of edges and nodes. You are, no doubt, well ahead of me already. In order to solve the travelling salesman problem, you can use a graph database and store the towns as nodes and the distances as edges. Graph databases are much better at solving this class of question. And they can do so much more: think about Linkedin or Facebook where people are the nodes and the “likes” are the edges. Then there’s Amazon: you are a “customer” node and products are “item” nodes. When you buy an item, a “bought” edge is created between your node and the product node.
Now the graph database engine can track out from the product node, find all the other people who bought it, see what they also bought and lo, the recommendation engine is born.
Mathematics isn’t about equations that few people can understand, it is about logic and finding patterns. It underpins all the work we do in analytics. And it’s fundamental to graph engines and graph databases. Building these systems is not merely about arithmetic: it’s about thinking conceptually and reasoning things out – often before the numbers even show up. ®