BIG DATA wizards: LEARN from CERN, not the F500
Silos are for mudbloods
Big data has a problem: it is being abused. One of the biggest misconceptions is that big data is about archiving everything forever: buying the biggest, cheapest storage pool and building a giant proverbial barn full of hay in the hope of finding needles.
Vendor marketing has abused this. Consider marketing that advises you to use the same NAS for Exchange and big data! Big data people have different storage management problems from those of the humble enterprise storage administrator.
Your typical storage administrator may struggle with detailed compliance and disaster recovery issues. A big data admin will struggle to decide whether check-pointing 4PB of in-memory data is worth the hassle, or whether it would be cheaper to recreate the data.
Big data requires that we rethink how we store data, where we store it, and what we store.
One of the best examples of big data management is the Large Hadron Collider (LHC) at CERN, whose 150 million sensors generate 500EB of data per day.
Of the 600 million collisions per second the team monitors, only around 100 per second are of interest for review. CERN has a few behaviours that set it apart from most shops in data management.
CERN filters the data as soon as possible. Filtering in the storage or archiving system is expensive. It requires not only significant storage capacity, but also network resources that become more expensive as they scale.
Essentially, 99.99 per cent of the sensor stream data produced is discarded. While the team may not "know what it's looking for", it knows what it has already seen.
As someone who's stared at one too many debug-level syslogs, I can tell you it's not about looking for the magical smoking gun; it's about filtering out all of the noise. When looking for a needle in a haystack, it's easier to just burn all the hay.
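The filter-at-ingest idea can be sketched in a few lines. This is a hypothetical illustration, not CERN's actual trigger code: a generator stands in for the raw sensor stream, and anything below an arbitrary "interest" threshold is dropped before it ever touches storage.

```python
import random

def sensor_stream(n_events, seed=42):
    """Simulate a raw sensor stream (a stand-in for real detector data)."""
    rng = random.Random(seed)
    for _ in range(n_events):
        yield {"energy": rng.random()}

def filter_at_ingest(stream, threshold=0.9999):
    """Keep only 'interesting' events; everything else is discarded inline,
    before it consumes storage capacity or network bandwidth."""
    for event in stream:
        if event["energy"] >= threshold:
            yield event

# Only the tiny fraction of events that clear the bar survive to be stored.
kept = list(filter_at_ingest(sensor_stream(1_000_000)))
print(f"kept {len(kept)} of 1,000,000 events")
```

With a threshold of 0.9999 on a uniform stream, roughly one event in ten thousand survives; the rest never reach the storage tier at all.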
If you're talking about big data, in-memory processing is king. If data must be processed from disk, that processing happens over high-throughput, low-latency connections. Dumber, faster, cheaper is the name of the game.
A 99.99 per cent available storage array that supports five protocols and does all kinds of rich data services is the LAST place big data belongs. Data management, protection, and replication are either done inline, or done afterwards with the refined gold that is left over.
Traditional performance-cheating tools such as large DRAM caches and array-based auto-tiering are no match for steady-state writes that never slow down. Traditional storage arrays are not the most cost-effective location for these workloads.
Rather than trying to consolidate the data into large traditional arrays, armies of parallel processing nodes powered by technologies such as Hadoop divide and conquer the data. Nodes can be refreshed individually, rather than forcing massive monolithic storage array migrations.
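The divide-and-conquer pattern can be sketched with nothing more than the standard library. This is a toy stand-in for a real Hadoop cluster, not an actual MapReduce job: each worker independently scans its own shard of the data (the map step), and the per-shard results are combined at the end (the reduce step).

```python
from concurrent.futures import ThreadPoolExecutor

def count_interesting(shard):
    """Map step: each worker scans only its own shard, with no shared state.
    'Interesting' here is an arbitrary toy predicate (divisible by 997)."""
    return sum(1 for x in shard if x % 997 == 0)

# Split one big dataset into independent shards, one per worker "node".
shards = [range(i, 1_000_000, 4) for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(count_interesting, shards))

# Reduce step: combine the per-shard results into the final answer.
total = sum(partial)
print(total)
```

Because no shard depends on any other, a failed or ageing worker can be replaced and its shard reprocessed in isolation; nothing resembling a monolithic array migration is ever needed.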
Perhaps the biggest barrier to big data is that it requires teams to cooperate, sharing and pooling resources. A typical enterprise data centre will include dozens of individualised arrays in silos. In a modern enterprise it is not uncommon to see a pair of matching brand-new arrays at 20 per cent capacity that are effectively "his and hers", as internal departments refuse to share.
In an enterprise, this sort of siloed wastage is often done to limit failure domains and enforce SLAs for latency. At CERN, large globally shared pools are the norm.
They are trading some of the "control" of failure domains and noisy neighbours that traditional enterprise storage environments operate under for one big, fast, available pool.
Big data storage lends itself to storage zealots who are concerned about the efficiency of the whole, rather than the "needs" of an individual application.
In the end, if you are looking to become a big data wizard, it may be time to throw out that F500 rule book on storage administration. ®