You're doing Hadoop and Spark wrong and they will probably fail

Developers want the new shiny, users forget integration and then along come the vendors ...

Your attempt at putting Hadoop or Spark to work probably won't work, and you'll be partly to blame for thinking they are magic.

That's the gist of a talk delivered by Gartner research director Nick Heudecker at the firm's Sydney Data & Analytics Summit 2017.

Heudecker opened with the grim prediction that 70 per cent of Hadoop deployments made this year will fail to deliver either the expected cost savings or hoped-for new revenue.

The lack of trained and experienced people will be to blame for those misses and will also mean some face-palm moments once Hadoop is up and running: the analyst said the first question he often hears from new users is how they can actually get data into and out of their shiny new Hadoop cluster. He also felt the need to advise attendees to sort out their data quality and security plans before starting an implementation, as retro-fitting them is common and ill-advised.

According to Heudecker, organisations get into Hadoop and Spark with inflated expectations about what they can do. Neither tool, he said, is a replacement for databases or existing analytics tools.

“One client calls me every seven months and says they are replacing their data warehouse with Hadoop, and I say I hope they have their CV ready,” Heudecker half-joked.

To succeed with either tool, learn what each is good at and give it a job your current analytics tools don't do well. But be stern with developers too, as they are “always chasing the new shiny thing” with little regard for wider concerns. The bottom line is you may not need either Hadoop or Spark.

Hadoop, for example, is very good at doing extract, transform and load (ETL) operations at speed, but its SQL-handling features are less than stellar. It also chokes on machine learning and other advanced analytics tasks because it is storage-centric. That quality means it is expensive to implement on-premises, where you'll need to acquire memory, compute and storage together. In the cloud, by contrast, you can buy compute and storage separately and save some cash.
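To see why buying compute and storage together can hurt, here is a rough back-of-the-envelope sketch. Every figure in it (500 TB of data, 64 cores of demand, nodes with 10 TB and 16 cores each) is a hypothetical chosen for illustration, not a number from Heudecker's talk, and the function names are invented for this example.

```python
import math


def coupled_nodes(storage_tb, compute_cores, node_tb, node_cores):
    """Nodes needed when every node bundles storage and compute.

    The cluster must be sized for whichever resource runs out first,
    so the node count is the larger of the two requirements.
    """
    by_storage = math.ceil(storage_tb / node_tb)
    by_compute = math.ceil(compute_cores / node_cores)
    return max(by_storage, by_compute)


def idle_cores(storage_tb, compute_cores, node_tb, node_cores):
    """Cores paid for but not needed under the coupled model."""
    nodes = coupled_nodes(storage_tb, compute_cores, node_tb, node_cores)
    return nodes * node_cores - compute_cores


# Hypothetical workload: lots of data, modest processing demand.
nodes = coupled_nodes(storage_tb=500, compute_cores=64, node_tb=10, node_cores=16)
spare = idle_cores(storage_tb=500, compute_cores=64, node_tb=10, node_cores=16)
print(nodes, spare)
```

With these made-up inputs, storage alone dictates the node count, and most of the cores that come bundled with those nodes sit unused. When the two resources are bought separately, as in the cloud case described above, only the 64 cores actually required would need to be paid for.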

Heudecker therefore believes the cloud is the natural place to run Hadoop, adding that AWS is probably the world's largest user of the tool by revenue and scale.

The same goes for Spark, which is designed for in-memory processing and therefore makes for pricey hardware. But it's also excellent for machine learning, a workload other analytics tools just weren't designed to handle.

Another factoid to consider is that Spark evolves quickly, with point releases arriving in as little as five weeks. Adopting it can therefore also mean performing frequent upgrades in order to stay secure. Hold your ground and update on your schedule, not your vendor's, Heudecker advises.

One trap for young players that he identified is letting vendors sell you the complete Hadoop or Spark stacks, which comprise multiple packages, not all of which are necessary for basic operations. Paying for just the bits you need is so advisable that leading distributions of both tools now include pared-back bundles.

There's another risk there, he said, because Red Hat remains the only pure-play open source business to have cracked the billion-dollar revenue mark. Volatility is therefore to be expected in the Hadoop and Spark caper.

But once you train your own people, find a worthy project, get on top of cloud vs. on-prem costs, master security and data quality, persuade your developers to be sensible and work out a sound relationship with a stable vendor, you have a decent chance of succeeding.

Who likes those odds? Anyone? ®


Biting the hand that feeds IT © 1998–2022