Real talk: You're gonna have to get real about real-time analytics if you wanna make IoT work

A gentle intro to design considerations for a large-scale internet-connected web of sensors


Backgrounder Many organisations have started down the road of rolling out non-trivial Internet-of-Things platforms, and you may well have, too.

So, what happens next? Here are some things to think about – particularly in terms of handling all that data, noise and all, coming back live from your IoT network of sensors and gizmos. In short, as you'll see, you need some way to process and condense that real-time stuff into useful analytics, as well as somewhere to store it.

Real-time analysis is necessary because, while you can compare performance of systems and processes over weeks and months and quarters, you probably need to know a few things immediately – such as fleet vehicles breaking down, factory lines stopping, parts running out, and so on.

Back in the day these were called alerts or heuristic-based alarms. Just as network-connected embedded electronics were rebranded to IoT, remote monitoring and analysis is now called real-time analytics. And with that in mind, let's take a closer look at what is a fairly fiddly subject.

Size of the problem

Quantifying the state of the global IoT roll-out, Ovum earlier this year reckoned IoT projects it classified as “small” – 500 or fewer devices or connections – accounted for half of European deployments. Two-thirds of enterprises have plans for “bigger” projects in future, Ovum said.

Meanwhile, Vodafone in its 2018 IoT Barometer last year reckoned the number of large scale IoT projects with more than 50,000 connected devices had doubled – from a titchy base to a slightly less titchy base, from three to six per cent of those rolling out IoT.

So there’s room to grow.

IDC thinks that while the majority of IoT spending today is on hardware – modules and sensors – more than 55 per cent of spending will go on software and services by 2021.

Some of that budget will go towards security, compliance, and lifecycle management, but IDC's prognosticators expect applications and analytics to be in the lead.

IDC explained that's because the IoT efforts most likely to be considered a "success" will be those that merge streaming analytics with machine-learning models trained on data lakes, data marts and content stores, with performance boosted by dedicated accelerators or processors with built-in neural-network acceleration. The idea is that AI algorithms automatically sort the necessary info from the noise as it comes in.

And non-trivial, large-scale networks will need some level of machine intelligence to extract the needles from the haystacks.

Yes, it's probably going to involve some form of AI

Maria Muller, AI evangelist and intelligent industry lead at integrator Avanade, supported this view on the importance of live analytics. “No longer are analytics teams thinking about their daily, weekly, or quarterly reports. The demand for data, and understanding of it, needs to happen in real time,” she said.

“Analysts are managing much larger volumes of data than before IoT, which means they must also use tools designed for working with streaming and near-real time data.”

Well, how much data? In an IDC paper sponsored by Seagate, the analyst house reckons that by 2025 more than a quarter of the 163 zettabytes (a zettabyte being a trillion gigabytes) of data created will be real-time in nature, and IoT data will make up more than 95 per cent of this real-time information.

Tools and techniques

That demands new approaches to analytics, and new tools and algorithms, for activities such as understanding customer behaviour and delivering new or improved services. Traditional analytics systems built for batch or time-sharing operations are inadequate for up-to-the-second data; real-time analytics demands new workflows and tooling.

To that end, DataStax solution architect Patrick Callaghan divides the flow of real-time analytics data into three stages: exhaust, real-time processing, and data for context. The exhaust is the collection point for the streams of data produced by IoT devices. “It’s hundreds, thousands or millions of devices sending information through some protocol, that end up on some message queue,” he said.

Instead of sending batches of exhaust data items in files, as non-real-time analytics systems might do, real-time analytics systems listen to IoT devices, consuming each data point as it emerges – perhaps just a handful of bytes in size. As streams flood in, the message queue can partition them into distinct groups and pass them on to the next part of the chain.
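
To make that concrete, here is a minimal sketch – the field names, device ID and payload format are made up for illustration, not anything prescribed above – of how one of those tiny data points might be packed up and keyed by device so the queue can partition related readings together:

```python
import json
import time

def make_reading(device_id: str, temperature_c: float) -> tuple[bytes, bytes]:
    """Build one compact data point: a partition key plus a small payload."""
    payload = {
        "ts": int(time.time()),        # seconds since epoch
        "temp": round(temperature_c, 1)
    }
    # Keying on the device ID lets the message queue route all readings
    # from one sensor to the same partition, keeping them in order.
    return device_id.encode(), json.dumps(payload).encode()

key, value = make_reading("pump-07", 78.4)
print(len(value), "bytes:", value)     # a few dozen bytes per reading
```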

This doesn't have to be Kafkaesque... or does it?

There are many options for ingesting, queuing and storing this streaming, exhaust-type data. One of the most popular is Apache Kafka, the scalable messaging system that began life at LinkedIn in 2010.

Kafka lets you publish and subscribe to message streams, and it is distributed, meaning you can ingest and queue streams of records – each one a key-value pair – from devices at scale by distributing them across clusters of nodes. Apache also has Flume, another ingestion system for data streams often used for log files, which automatically pushes streamed data to a variety of destinations.
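
As a rough illustration of that publish-and-subscribe model, here is a sketch using the kafka-python client – the broker address, topic name and payload are assumptions made for the example:

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish keyed readings to a topic; Kafka spreads records across
# partitions (and therefore across broker nodes) by key.
producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("sensor-exhaust", key=b"pump-07",
              value=b'{"ts": 1718000000, "temp": 78.4}')
producer.flush()

# Elsewhere, a consumer subscribes to the same stream and handles each
# key-value record as it arrives.
consumer = KafkaConsumer("sensor-exhaust",
                         bootstrap_servers="broker:9092",
                         group_id="realtime-analytics")
for record in consumer:
    print(record.key, record.value)
```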

The major cloud providers offer streaming data queues as part of their IoT and analytics services. For example, AWS has Kinesis, which replaces Kafka's clustered nodes with scalable shards that you pay for as you need them. Microsoft's Azure cloud has Event Hubs.
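
On the cloud side the ingest call looks much the same; this hypothetical boto3 sketch drops a single reading onto a Kinesis stream, with the stream name and partition key chosen purely for illustration:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

# PartitionKey plays the same role as a Kafka record key: readings that
# share a key are routed to the same shard.
kinesis.put_record(
    StreamName="sensor-exhaust",
    Data=b'{"ts": 1718000000, "temp": 78.4}',
    PartitionKey="pump-07",
)
```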

As the data passes through the streaming message queue, your system will need to buffer it before it’s processed by a streaming service that applies real-time transformation rules to the data.

In the cloud, Microsoft has Azure Stream Analytics for this; outside the cloud, Apache Spark Streaming will process real-time data in tiny batches, called microbatches, using the Apache Spark engine, which does in-memory computation and processing optimisation.
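
To show what that microbatch model looks like in practice, here is a hedged sketch using Spark Structured Streaming – the Kafka topic, broker address and five-second trigger are illustrative assumptions, and the Kafka connector package must be available at runtime:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iot-stream").getOrCreate()

# Read the exhaust stream from Kafka (requires the spark-sql-kafka
# connector package); Spark buffers incoming records and processes
# them as a small batch on each trigger.
readings = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "sensor-exhaust")
            .load())

query = (readings.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .trigger(processingTime="5 seconds")   # a microbatch every five seconds
         .start())
query.awaitTermination()
```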

Next there's real-time processing, which is where the heavy lifting is done to derive real-time insights from this streamed data. The Spark engine can handle high-performance processing of microbatches. For lower latency still, consider Apache Flink, which offers true, native continuous-flow processing – no microbatches needed.
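
For contrast, a minimal PyFlink sketch of that record-at-a-time, continuous-flow style – the sample readings and the 100-degree alert threshold are invented for the example:

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A stand-in for the live exhaust stream: (device_id, temperature) tuples.
readings = env.from_collection([
    ("pump-07", 78.4),
    ("pump-07", 104.9),
    ("line-03", 65.0),
])

# Each record flows through the pipeline as it arrives, rather than
# waiting to be grouped into a batch.
alerts = (readings
          .filter(lambda r: r[1] > 100.0)
          .map(lambda r: f"ALERT: {r[0]} overheating at {r[1]}C"))

alerts.print()
env.execute("overheat-alerts")
```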

It may be that only a small window of readings is needed per sensor – the latest five, 10 or 20 values – before a decision can be made, an alert sent out, or a reading updated, so pick the approach that fits your design: microbatches or continuous flow.
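
That windowing idea is simple enough to sketch without any framework at all; this hypothetical helper keeps the latest N readings per device and flags when their average drifts over a threshold:

```python
from collections import defaultdict, deque

WINDOW = 10          # keep the latest 10 readings per sensor
THRESHOLD = 100.0    # alert when the windowed average exceeds this

windows: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))

def on_reading(device_id: str, value: float) -> bool:
    """Record one reading; return True if this device's window looks unhealthy."""
    w = windows[device_id]
    w.append(value)
    return len(w) == WINDOW and sum(w) / WINDOW > THRESHOLD

# Called by the stream processor, once per incoming data point:
if on_reading("pump-07", 104.9):
    print("send alert for pump-07")
```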
