For all of its advances, the IT sector’s first five decades could be characterised as the electronic storing of systems of record.
The move to the electronic era saw paper ledgers, tax returns and bank statements stored and archived safely and legally on machines instead of in paper files.
Rows of data were worked on in spreadsheets and stored in SQL relational databases. Since then, data has been everywhere: in data warehouses, in data lakes, up mountains where it has been mined, and in pools. It is now so voluminous that it can even be measured in something called a brontobyte, though it is generally accepted that we are in the zettabyte era.
And more recently this digital era of data and big data saw enterprises embark on the quest to extract value from the information in order to make accurate forecasts.
The Holy Grail of data value began within a subset of mathematics with a discipline called probability and statistical analysis.
Specialists interrogated the data for patterns in order to detect fraud, measure marketing campaign effectiveness or grade insurance claim assessments.
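A classic example of this kind of pattern interrogation is flagging statistical outliers. The sketch below uses a simple z-score test on a hypothetical list of transaction amounts; the figures and the 2.5-standard-deviation threshold are illustrative assumptions, not a production fraud model.

```python
import statistics

def zscore_outliers(values, threshold=2.5):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical daily transaction amounts; the 9,800 entry is the kind
# of spike a first-pass fraud check would surface for review.
amounts = [102, 98, 110, 95, 105, 99, 9800, 101, 97, 103]
print(zscore_outliers(amounts))
```

Real fraud detection layers many such signals, but the principle is the same: a statistical model of "normal" plus a rule for what counts as suspiciously far from it.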
Probability and statistical analysis (a tough and not a very popular career choice) became Business Intelligence that in turn evolved into Data Science (highly sought after and well paid).
What a data scientist is and does has been described in many ways. The role requires a deep understanding of probability and statistics, domain expertise in a field such as finance or health, and a high level of expertise in machine learning and the workings of big data frameworks, whether commercial such as SAP HANA or open source like Hadoop, along with their associated platforms, languages and methods.
Analytics now spans five categories: descriptive, diagnostic, predictive, prescriptive, and cognitive, with each building on the last.
The current effort of gaining value from business data is called advanced analytics.
Advanced analytics means using methods to pull meaning from data that will allow for accurate forecasts and predictions. It can also mean managing the added complexity of analysing combinations of structured and unstructured data and doing so in real time.
So, if a company such as a telco or mobile network operator is collecting a petabyte of data each day, then, it is argued, the use of advanced analytics will help it better serve its customers by knowing what they will want or do next. This could help reduce churn and allow companies to up-sell and cross-sell features and services.
The big data storage challenges
The implications of advanced analytics for the IT professional with responsibility for the storage, security and accessibility of this vast data pool are huge. Simply managing the volumes of data pouring into the organisation is proving to be a challenge. For example, even powering and cooling enough HDD RAID arrays to store an exabyte of raw data would break the budget of most companies.
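The scale of that claim is easy to check with a back-of-envelope calculation. The figures below (16TB drives, 8W per drive, a PUE of 1.5 and electricity at $0.12/kWh) are illustrative assumptions, not vendor data, and exclude RAID overhead, replicas, controllers and cooling plant:

```python
# Back-of-envelope estimate: drives and power for 1 EB of raw HDD capacity.
# All constants are illustrative assumptions, not vendor figures.

EXABYTE_TB = 1_000_000     # 1 EB = 1,000,000 TB (decimal units)
DRIVE_TB = 16              # assumed capacity per HDD
WATTS_PER_DRIVE = 8        # assumed active power draw per HDD
PUE = 1.5                  # assumed data-centre power usage effectiveness
PRICE_PER_KWH = 0.12       # assumed electricity price in USD

drives = EXABYTE_TB // DRIVE_TB
kw = drives * WATTS_PER_DRIVE * PUE / 1000
annual_cost = kw * 24 * 365 * PRICE_PER_KWH

print(f"{drives:,} drives, {kw:,.0f} kW, ~${annual_cost:,.0f}/year in power")
```

Even under these generous assumptions the answer is tens of thousands of drives and a power bill well into six figures per year, before a single byte of RAID parity or replication is added.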
Increasingly, it will be software-defined storage and flash that are deployed for big data as advanced analytics promises more insight for direct business benefit. This will be thanks to their improved speed, density, performance and reliability relative to disk, and it could fundamentally change the storage infrastructure strategies of enterprises and organisations.
And, also increasingly, it's Apache Hadoop, Spark, or Spark on top of Hadoop that serve the software side of big-data analytics. Whether your big-data cluster is built exclusively on these open-source frameworks or on some other commercial big-data framework will affect your storage decisions.
Hadoop is the open-source framework for processing big data and led to the emergence of server clusters specifically for the storage and analysis of large data sets.
It has been around since 2005, and Facebook is considered to run one of the largest Hadoop clusters on the planet, consisting of thousands of nodes.
But for those of us mere mortals starting out exploring advanced analytics on big data sets, the initial node count will number in the low tens or hundreds.
Where and how you proceed once you’ve bought into Hadoop is less clear. Opinions differ on whether standard infrastructure approaches for Hadoop clusters exist. Do Hadoop servers and storage need to be isolated from general-purpose rack-based servers and storage, and if so, why?
With the large data volumes at play, it is commonly advised that dedicated processing, storage and networking equipment be deployed in separate racks to avoid potential performance and latency issues. Running Hadoop in a virtual environment is said to be best avoided for the same reasons.
Because Hadoop is built on a distributed file system, the cost implications of running it over a SAN are also prohibitive. Storage for Hadoop is almost always local disk or attached RAID, and SAN-attached Hadoop clusters will often run into performance issues.
To help find a way through these issues the Apache Software Foundation, the overseer of Hadoop and Spark, has come up with the following advice:
- Hadoop, including HDFS (Hadoop Distributed File System), is suited for distributed storage and distributed processing using commodity hardware. It is fault tolerant, scalable, and extremely simple to expand.
- HDFS, which provides large-scale clustering of Hadoop systems and stores files in blocks on datanodes, can be widely configured. However, configurations only need tuning for very large clusters. An HDFS cluster can recognise the topology of the racks in which each node is placed, so it is important to configure this topology to optimise data capacity and usage.
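In practice, Hadoop learns rack placement from a topology script named in the `net.topology.script.file.name` property: it invokes the script with one or more node IPs or hostnames and expects one rack path per argument on stdout. The sketch below shows the shape of such a script; the subnet-to-rack mapping is hypothetical and would have to match your own network plan.

```python
#!/usr/bin/env python3
"""Minimal HDFS rack-topology script (a sketch).

Hadoop calls the script configured in net.topology.script.file.name
with node IPs/hostnames as arguments and reads one rack path per
argument from stdout. Unknown nodes fall back to /default-rack.
"""
import sys

# Hypothetical mapping: the first three octets of a node's IP
# identify the rack it sits in.
SUBNET_TO_RACK = {
    "10.1.1": "/dc1/rack1",
    "10.1.2": "/dc1/rack2",
    "10.1.3": "/dc1/rack3",
}

def rack_for(node: str) -> str:
    subnet = ".".join(node.split(".")[:3])
    return SUBNET_TO_RACK.get(subnet, "/default-rack")

if __name__ == "__main__":
    for node in sys.argv[1:]:
        print(rack_for(node))
```

With this in place, HDFS can keep block replicas on separate racks, which is precisely why getting the topology configuration right matters for both capacity and resilience.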
Also worth noting is the fact that raw storage capacity for Hadoop can be directly attached.
That’s the architectural thought, then, but no matter what your chosen approach to applying advanced analytics to big-data sets it cannot live in isolation from investing in suitable and robust storage at the right scale. Think, for example, systems capable of scaling to 60PB of data under a single file system and that enable processing, storing and analysis in one system.
Another key, and potentially hidden, aspect of this is that data storage costs must be kept low, so data compression features are likely to be used.
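The reason compression pays off so well here is that machine-generated big data, such as logs and telemetry, tends to be highly repetitive. The illustration below compresses a hypothetical telemetry record repeated many times; the figures are illustrative, not a benchmark of any particular storage product.

```python
import zlib

# Illustrative only: machine-generated log/telemetry data is highly
# repetitive, so lossless compression often shrinks it dramatically.
record = b'{"sensor":"cell-0042","status":"OK","rssi":-71}\n'
raw = record * 10_000            # roughly 480 KB of repetitive records

compressed = zlib.compress(raw, level=6)
ratio = len(raw) / len(compressed)
print(f"{len(raw):,} B -> {len(compressed):,} B ({ratio:.0f}x smaller)")
```

Real-world ratios are far more modest than this repeated-record extreme, but on petabyte-scale stores even a 2-3x saving translates directly into racks, power and budget.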
Times are a-changing and that means analytics are changing, too. Big-data analytics for many applications are now likely to be real time and not the batch processes that totally dominated traditional analytics. This has obvious implications for the type of storage infrastructure and the use of flash for performance.
What should that infrastructure look like? Your goal should be to scale up to real-time analytics or scale out to include the big data sets in your analytics environment, or – depending on the workload in question – both.
Underpinning this is a need for a high-performance, scalable infrastructure that’s able to acquire, store, and protect data. It should also be capable of running both commercial and open-source analytics applications and of drawing on data from a range of structured and unstructured repositories.
This, of course, has implications for your data centre: can it sustain and deliver all of this? To that end, you’ll need to assess the condition of your data centre and plan possible upgrades to computing, storage and networking.
Big data means business analytics has gone beyond being a purely desktop consideration - it’s become part of the storage infrastructure, and that’s changing the storage technologies, the architectures and the media you’ll need.