Big Data is now TOO BIG - and we're drowning in toxic information

Just why are we hoarding every last binary bit?

Open ... and Shut

Unless you have found a clever way of avoiding the internet completely, you no doubt have been warned that THERE IS A BIG DATA EXPLOSION! By many accounts, we are currently drowning in information - from log files to stock charts to customer profiles - and face a host of new products cropping up to help us manage the onslaught. Unfortunately, our fixation on hoarding and storing data may actually be making the problem worse, not better.

This is the message of Nassim Taleb's forthcoming book, Antifragile. Taleb made his name with the influential book The Black Swan, and his theory is bound to ruffle some feathers. As he explains in an excerpt from the book:

In business and economic decision-making, data causes severe side effects - data is now plentiful thanks to connectivity; and the share of spuriousness in the data increases as one gets more immersed into it. A not well-discussed property of data: it is toxic in large quantities - even in moderate quantities.

How can this be? We're told at every turn that more data equals better decisions. Yes, we need to parse all that binary to derive "actionable insights", which is the buzzphrase currently making the rounds of every Big Data startup's VC pitch deck. But once you do, ka-BOOM! Your business will immediately have superhero powers.

Except, of course, that it won't.

According to a Gartner survey [PDF], while the volume of corporate data is growing by upwards of 60 per cent each year, the vast majority of respondents (73 per cent) feel their competitors make better use of data than they do. And a mere 17 per cent reveal that they use more than 75 per cent of their data, which suggests most companies collect lots of data and have no clue what to do with them all.

But imagine what will happen when everyone uses data efficiently and to maximum potency: by definition, any competitive advantage will dissipate as all companies (and competitors) become Big Data maestros together. Of course, this will happen at different speeds for different companies, making the race to make sense of corporate data worthwhile.

But it still doesn't tackle Taleb's larger point: the more data we analyse, the more likely our insights from the data will be wrong. Quoting Taleb at length to ensure his point is not lost:

The more frequently you look at data, the more noise you are disproportionally likely to get (rather than the valuable part called the signal); hence the higher the noise to signal ratio. And there is a confusion, that is not psychological at all, but inherent in the data itself.

Say you look at information on a yearly basis, for stock prices or the fertilizer sales of your father-in-law’s factory, or inflation numbers in Vladivostock. Assume further that for what you are observing, at the yearly frequency the ratio of signal to noise is about one to one (say half noise, half signal) — it means that about half of changes are real improvements or degradations, the other half comes from randomness. This ratio is what you get from yearly observations.

But if you look at the very same data on a daily basis, the composition would change to 95 per cent noise, 5 per cent signal. And if you observe data on an hourly basis, as people immersed in the news and markets price variations do, the split becomes 99.5 per cent noise to .5 per cent signal. That is two hundred times more noise than signal — which is why anyone who listens to news (except when very, very significant events take place) is one step below sucker. ...

Now let’s add the psychological to this: we are not made to understand the point, so we overreact emotionally to noise. The best solution is to only look at very large changes in data or conditions, never small ones.
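Taleb's excerpt gives the ratios without showing where they come from. One way to see why the arithmetic is plausible is the rough sketch below, which rests on an assumption of ours rather than anything stated in the excerpt: the real change (signal) accumulates in proportion to the length of the observation interval, while the random part (noise) grows only with its square root. The function name `signal_share` and its parameters are illustrative. Under that assumption, starting from Taleb's one-to-one split at yearly frequency, daily observation comes out at about 5 per cent signal, and hourly observation at roughly 1 per cent, the same order of magnitude as his figure though not identical to it.

```python
import math


def signal_share(periods_per_year: float, yearly_ratio: float = 1.0) -> float:
    """Fraction of an observed change that is signal rather than noise.

    Assumed model (not from the excerpt): signal scales linearly with the
    observation interval, noise scales with its square root.
    yearly_ratio is the signal-to-noise ratio at one observation per year.
    """
    dt = 1.0 / periods_per_year       # interval length, in years
    signal = yearly_ratio * dt        # systematic change over the interval
    noise = math.sqrt(dt)             # typical size of the random change
    return signal / (signal + noise)


# Compare the three observation frequencies mentioned in the excerpt.
for label, n in [("yearly", 1), ("daily", 365), ("hourly", 365 * 24)]:
    share = signal_share(n)
    print(f"{label:>6}: {share:.1%} signal, {1 - share:.1%} noise")
```

The point of the sketch is the direction of travel, not the decimals: shorten the interval and the signal shrinks faster than the noise does, so looking more often mostly buys you more randomness.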

None of which is to suggest that there's no value in Big Data. One of the brand-name companies in Big Data, Cloudera, showcases a range of customer stories that describe ways real companies have derived real value from their data. (Disclosure: Cloudera's CEO is on the board of directors of my company, Nodeable.)

But let's not miss the trees for the forest. As Nick Carr writes, commenting on Taleb's findings: "Because we humans seem to be natural-born signal hunters, we're terrible at regulating our intake of information. We'll consume a ton of noise if we sense we may discover an added ounce of signal. So our instinct is at war with our capacity for making sense."

In other words, the problem isn't the data: it's our ability to know when we have enough data.

We're in the midst of a gold rush, when there's such a fever to collect data that we may be overextending ourselves. One former senior IT executive with one of Silicon Valley's largest web companies acknowledged that his company stores every log file - and does absolutely nothing with them. Never has, and likely never will. Some people suggest the answer is to start deleting this data to keep it manageable and to avoid security breaches. Maybe.

But perhaps a better solution would be to carefully consider which data are likely to be of use, and focus on these data. Yes, this runs the risk of overlooking data that could be useful but may not be immediately recognised as such. Splunk, after all, went public on the premise that machine-generated log files are a gold mine for insight into one's business and IT operations, one that many had previously overlooked.

But we're not currently struggling to collect data. The industry's big need right now is to parse data, and a big part of that surely must be paring down the amount of data we collect in the first place. ®

Matt Asay is senior vice president of business development at Nodeable, offering systems management for managing and analysing cloud-based data. He was formerly SVP of biz dev at HTML5 start-up Strobe and chief operating officer of Ubuntu commercial operation Canonical. With more than a decade spent in open source, Asay served as Alfresco's general manager for the Americas and vice president of business development, and he helped put Novell on its open source track. Asay is an emeritus board member of the Open Source Initiative (OSI). His column, Open...and Shut, appears three times a week on The Register.

Biting the hand that feeds IT © 1998–2022