Real talk: You're gonna have to get real about real-time analytics if you wanna make IoT work
A gentle intro to design considerations for a large-scale internet-connected web of sensors
Backgrounder Many have started down the road of rolling out non-trivial Internet-of-Things platforms, and you may have, too, to some degree.
So, what happens next? Here are some things to think about – particularly in terms of handling all that data, noise and all, coming back live from your IoT network of sensors and gizmos. In short, as you'll see, you need some way to process and condense that real-time stuff into useful analytics, and somewhere to store that info.
Real-time analysis is necessary because, while you can compare performance of systems and processes over weeks and months and quarters, you probably need to know a few things immediately – such as fleet vehicles breaking down, factory lines stopping, parts running out, and so on.
Back in the day these were called alerts or heuristic-based alarms. Just as network-connected embedded electronics were rebranded as IoT, remote monitoring and analysis is now called real-time analytics. And with that in mind, let's take a closer look at what is a fairly fiddly subject.
Size of the problem
Quantifying the state of the global IoT roll-out, Ovum earlier this year reckoned IoT projects that it classified as “small” – 500 or fewer devices or connections – account for half of European deployments. Two-thirds of enterprises have future plans for “bigger” projects, Ovum said.
Meanwhile, Vodafone in its 2018 IoT Barometer last year reckoned the number of large scale IoT projects with more than 50,000 connected devices had doubled – from a titchy base to a slightly less titchy base, from three to six per cent of those rolling out IoT.
So there’s room to grow.
IDC thinks that while the majority of IoT spending today is on hardware – modules and sensors – more than 55 per cent of spending will go on software and services by 2021.
Some of that budget will go towards security, compliance, and lifecycle management, but IDC's prognosticators expect applications and analytics to be in the lead.
IDC explained that's because the IoT efforts most likely to be considered a "success" will be those that merge streaming analytics with machine learning trained on data lakes, data marts and content stores, with performance boosted using dedicated accelerators or processors with built-in neural-network acceleration. The idea is that AI algorithms automatically sort the necessary info from the noise as it comes in.
And non-trivial, large-scale networks will need some level of machine intelligence to extract the needles from the haystacks.
Yes, it's probably going to involve some form of AI
Maria Muller, AI evangelist and intelligent industry lead at integrator Avanade, supported this view on the importance of live analytics. “No longer are analytics teams thinking about their daily, weekly, or quarterly reports. The demand for data, and understanding of it, needs to happen in real time,” she said.
“Analysts are managing much larger volumes of data than before IoT, which means they must also use tools designed for working with streaming and near-real time data.”
Well, how much data? In a Seagate paper, IDC reckons by 2025, more than a quarter of the 163 zettabytes (a zettabyte is a trillion gigabytes) of data created will be real-time in nature, and IoT data will make up more than 95 per cent of this real-time information.
Tools and techniques
That demands new approaches to analytics, and new tools and algorithms, for jobs such as understanding customer behaviour, delivering new services, and improving existing ones. Traditional analytics systems built for batch or time-sharing operations are inadequate for up-to-the-second data, so real-time analytics needs its own workflows and tooling.
To that end, DataStax solution architect Patrick Callaghan divides the flow of real-time analytics data into three stages: exhaust, real-time processing, and data for context. The exhaust is the collection point for the streams of data produced by IoT devices. “It’s hundreds, thousands or millions of devices sending information through some protocol, that end up on some message queue,” he said.
Instead of sending batches of exhaust data items in files, as non-real-time analytics systems might do, real-time analytics systems listen to IoT devices, consuming each data point as it emerges – perhaps just a handful of bytes in size. As streams flood in, the message queue can partition them into distinct groups and pass them on to the next part of the chain.
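To make the partitioning idea concrete, here is a minimal Python sketch. It assumes each reading carries a device ID and that the queue picks a partition by hashing that ID. The names `partition_for` and `route`, the sample device IDs, and the choice of CRC32 are all illustrative assumptions, not how any particular message queue does it.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count, for illustration only


def partition_for(device_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a device ID to a partition using a stable hash (CRC32 here)."""
    return zlib.crc32(device_id.encode("utf-8")) % num_partitions


def route(readings, num_partitions: int = NUM_PARTITIONS):
    """Group (device_id, value) readings by partition, preserving arrival order."""
    partitions = defaultdict(list)
    for device_id, value in readings:
        partitions[partition_for(device_id, num_partitions)].append((device_id, value))
    return dict(partitions)


# Hypothetical readings: a handful of bytes each, arriving one at a time.
readings = [("truck-17", 81.5), ("pump-03", 2.1), ("truck-17", 82.0), ("pump-03", 2.2)]
routed = route(readings)
```

Because the hash is stable, every reading from a given device lands in the same partition and stays in arrival order there, which is what lets the next stage in the chain reason about one device's stream at a time.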
This doesn't have to be Kafkaesque... or does it?
There are many options for ingesting, queuing and storing this streaming, exhaust-type data. One that’s proving most popular is Apache’s Kafka, the scalable messaging system that began at LinkedIn in 2010.
Kafka lets you publish and subscribe to message streams, and it is distributed, meaning you can ingest and queue streams of records – each one a key-value pair – from devices at scale by distributing them across clusters of nodes. Apache also has Flume, another ingestion system for data streams often used for log files, which automatically pushes streamed data to a variety of destinations.
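The publish-and-subscribe model is easier to picture with a toy version. The sketch below is an in-memory stand-in written for this article, not Kafka's client API: `TinyBroker`, `publish`, `poll`, and the topic and group names are all hypothetical. It shows the two ideas that matter here: records are key-value pairs appended to a named stream, and each group of subscribers tracks its own position in that stream.

```python
from collections import defaultdict


class TinyBroker:
    """Toy in-memory publish/subscribe log. Illustrative only, not a real broker."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> append-only list of (key, value)
        self._offsets = defaultdict(int)   # (group, topic) -> index of next record to read

    def publish(self, topic, key, value):
        """Append a key-value record to a topic and return its offset."""
        self._topics[topic].append((key, value))
        return len(self._topics[topic]) - 1

    def poll(self, group, topic, max_records=10):
        """Return the next unread records for this subscriber group and advance its offset."""
        start = self._offsets[(group, topic)]
        records = self._topics[topic][start:start + max_records]
        self._offsets[(group, topic)] = start + len(records)
        return records


broker = TinyBroker()
broker.publish("telemetry", "truck-17", 81.5)
broker.publish("telemetry", "pump-03", 2.1)
broker.publish("telemetry", "truck-17", 82.0)

first = broker.poll("alerts", "telemetry", max_records=2)   # first two records
rest = broker.poll("alerts", "telemetry")                    # picks up where it left off
archive = broker.poll("archiver", "telemetry")               # separate group, reads from the start
```

A real distributed system adds what this leaves out: persistence, replication, and spreading each topic across a cluster of nodes.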
The major cloud providers offer streaming data queue options as part of their IoT and analytics services. For example, AWS has Kinesis, which replaces Kafka’s clustered nodes with scalable shards that you pay for as you need them. Microsoft’s Azure cloud has Event Hubs.
As the data passes through the streaming message queue, your system will need to buffer it before it’s processed by a streaming service that applies real-time transformation rules to the data.
In the cloud, Microsoft has Azure Stream Analytics for this. Outside the cloud, Apache Spark Streaming processes real-time data in tiny batches, called microbatches, using the Apache Spark engine, which does in-memory computation and processing optimisation.
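As a rough illustration of microbatching in general, and not of Spark's own API, the Python sketch below slices a time-ordered stream into fixed intervals and hands each slice on as one small batch. The function name `micro_batches`, the one-second interval, and the sample events are assumptions made for the example.

```python
from itertools import groupby


def micro_batches(events, interval_s=1.0):
    """Yield (batch_start, [values]) for time-ordered (timestamp, value) events.

    Events whose timestamps fall in the same interval form one batch.
    Intervals with no events produce no batch.
    """
    for bucket, group in groupby(events, key=lambda e: int(e[0] // interval_s)):
        yield bucket * interval_s, [value for _, value in group]


# Hypothetical sensor events: (seconds since start, temperature reading)
events = [(0.1, 20.0), (0.7, 21.0), (1.2, 35.0), (2.9, 19.0)]
batches = list(micro_batches(events))

# A transformation rule applied per batch: the mean reading in each interval.
means = [(start, sum(values) / len(values)) for start, values in batches]
```

The trade-off is latency: nothing in a batch is acted on until its interval closes, so the interval length sets a floor on how quickly the system can react.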
Next there's real-time processing, which is where the heavy-lifting is done to derive real-time insights from this streamed data. The Spark Engine can handle high-performance processing on microbatches. For better performance still, consider Apache Flink, which offers true, native continuous flow processing – no microbatches needed.
It may be that only a small window of bytes is needed per sensor – the latest five, 10 or 20 readings – before a decision can be made or alert sent out or reading updated, etc, so pick the best approach wisely for your design: microbatches or continuous flow.
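The "latest few readings" approach can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the class name `WindowAlert`, the window size, the threshold, and the rule itself (alert when the mean of a full window exceeds the threshold) are all invented for illustration.

```python
from collections import defaultdict, deque


class WindowAlert:
    """Keep the last `size` readings per sensor and flag when their mean is too high."""

    def __init__(self, size=5, threshold=90.0):
        self.size = size
        self.threshold = threshold
        # One bounded deque per sensor; old readings fall off automatically.
        self._windows = defaultdict(lambda: deque(maxlen=size))

    def update(self, sensor_id, reading):
        """Record a reading; return True if the sensor's full window breaches the threshold."""
        window = self._windows[sensor_id]
        window.append(reading)
        if len(window) < self.size:
            return False  # not enough readings yet to decide
        return sum(window) / len(window) > self.threshold


monitor = WindowAlert(size=3, threshold=90.0)
results = [monitor.update("engine-1", r) for r in (80.0, 95.0, 100.0, 60.0)]
```

Each reading is handled as it arrives, in the continuous-flow style, and memory per sensor stays fixed at the window size however long the stream runs.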