File system trees are inefficient and slow when locating files in a filespace occupied by billions of files and folders. Storing the data as objects in a flat storage space is becoming a recommended alternative. But, as soon as you go for object storage to defeat this file system tree traverse problem, you face a fresh problem: how do you locate your objects?
Either you have a central object map or database or you have a distributed one. French startup Scality has gone for the distributed approach with its Ring technology.
The idea is to have virtual unlimited scalability, both of I/O and storage capacity, by using clustered commodity X86 servers organised as peer-to-peer nodes – conceptually occupying a ring – with front end Accessor software nodes receiving requests from users and applications on servers.
Scality CEO Jerome Lecat says an Accessor node can access any Ring node, note the "any", and find the right node storing a requested object in one network hop with 10 nodes, two hops with 100 nodes, and three hops with 1,000 nodes.
Holy Trinity in Scality's Ring technology: Accessors to the left, the Ring in the middle and secondary storage to the right.
A variety of Accessor node technologies are supported: native REST HTTP, NFS, BRS2 and Zimbra.
With each 10X increase in the Ring node count, the hop count goes up by one because of Scality's patented technology and its algorithm. We might call this a quite peculiar Ring cycle.
Lecat said: "There are really two 'tricks' here. [First] an algorithm delivering a maximum of Log(n) complexity – which basically gives one a 100-node network. Each node needs to know seven nodes, and a request may take seven hops. The minimum requirement from a mathematical standpoint is for each node to know a few other nodes. The number of nodes increases as Log(number of nodes), which means that when the number of nodes is x10, you need to add 1 to the number of nodes to be known, or number of hops.
"[Secondly] in practice, we allow nodes to know many more nodes, but this acts as a 'non authoritative cache', and it allows for a request to 'usually' converge in two hops, while keeping all the mathematical properties of the model (Log complexity, limited number of hops, good behaviour when a node is lost or added)."
Each node can handle 10 to 50TB of storage, with 1,000 nodes supporting up to 50PB of capacity, and accessing the right object in that 50PB with three hops on a gigabit LAN takes 20ms or less.
Distributed hash table
How does that work? Scality documentation says that the Ring nodes are organised into segments. Objects are stored with a Distributed Hash Table (DHT) algorithm, which produces a value for the object and its associated key. Key and value pairs are stored in the DHT and nodes retrieve the value associated with a particular key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes. Keys embed information about class of service, and each node is autonomous and responsible for consistency checking and rebuilding replicas automatically for its keys.
We can think in terms of Scality's Ring nodes crossing a key space. This is organised into a hierarchy such that a 10-node ring requires one node-to-node hop to find the target node, a 100-node ring needs two hops and a 1,000-node monster needs three hops.
Lecat says: "The key space is distributed among all the nodes. The key space is very large (20 bytes), and distributed nearly evenly, but never exactly evenly. The underlying algorithm is a distributed hash table. The 'segments' do not have a constant size (as everything has to be dynamic in the system to allow real elasticity).
"Two key properties of the key space are that keys have an order, and they are organised into a circle (which gives trigonometic properties)."
Let's take a 10-node Ring as an example. An Accessor sends in a object retrieval request to node 1, which doesn't have it. We're told the object can be retrieved with one hop, a jump from node 1 to the right node. Node 1 has enough information to send the request on to the right node, the one that holds the object, and so does every other node in the 10-node ring: that's how a distributed hash table works.
Scality doesn't say in detail how this works. I think it is a variation on this concept: each node has an ID and nodes are organised in a ring, a double-linked list, with each node having a reference to the previous node on the ring, its address, and the next ring node, and its address. Nodes going round the ring have successively greater node IDs until you return to the starting node.
Okay? Keep that in mind and let's move on to the request receiving node, which gets the key from the Accessor request and hashes it to generate a key of exactly the same number of bits as the node reference. The system uses this as a node ID and goes round the ring node by node, looking for a node ID that is the closest possible to the key hash while still being larger. That node should store the desired object.
A reversal of this is used to store incoming objects on the Ring and ensure they are locatable.
Lecat said: "If a node is lost, the ring rebalances itself without human intervention. [It's the] same if a ring node is added (human intervention needed to decide to add a node), the new node is automatically placed well in the key space, and rebalances only occur when necessary and automatically."
To understand any more than this requires a computer science skill set and access to the Scality Ring designers.