Spark man Zaharia on 2.0 and why it's 'not that important' to upstage MapReduce

Matei tells us about his brainchild

Spark version 2.0

“The first major thing you'll see is Spark releasing v2.0,” Zaharia told The Register, with a release most likely due in April. Spark 2.0 won't be “a complete change” and will “mostly be backwards compatible,” the inventor added.

“There are a few very small legacy changes,” we were told, as Spark is “dropping support for really old versions of Hadoop, but we're bringing in a new set of APIs which are very efficient.”

Spark's original DataFrame API was introduced in v1.3, while the Dataset API was released as a preview in v1.6 – the most recent release, which arrived in January of this year.

Developers should be comforted to hear that Spark v2.0 will be producing tidy versions of these previously experimental features: “The DataFrame and Dataset APIs will be way more efficient,” Zaharia told us, explaining that “they'll tell the engine more about the structure of the data, and store and parse, in a more efficient way.”

The APIs are intended to be easier to use than they have been before and will be based on Python and R single-machine data science tools. “A lot of Spark libraries can now take data that comes in as data frames,” said Zaharia.
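To make the point about telling the engine “more about the structure of the data” concrete, here is a minimal plain-Python sketch. It does not use Spark or reproduce its internals; the two-field schema, the sample rows and the function names are assumptions made up for illustration. It shows only the general idea that a known schema allows compact fixed-width storage and lets a reader jump straight to one field.

```python
import struct

# Hypothetical schema: (user_id: 32-bit int, watch_seconds: 64-bit float).
# "<" means little-endian with no padding, so each record is 12 bytes.
SCHEMA = struct.Struct("<id")

rows = [(1, 12.5), (2, 3.0), (3, 40.25)]


def encode(rows):
    """Pack every row into one contiguous fixed-width buffer."""
    buf = bytearray(SCHEMA.size * len(rows))
    for i, row in enumerate(rows):
        SCHEMA.pack_into(buf, i * SCHEMA.size, *row)
    return bytes(buf)


def total_watch_seconds(buf):
    """Sum one column by reading only that field from each record."""
    float_field = struct.Struct("<d")
    offset_in_row = 4  # the float follows the 4-byte int
    n = len(buf) // SCHEMA.size
    return sum(
        float_field.unpack_from(buf, i * SCHEMA.size + offset_in_row)[0]
        for i in range(n)
    )


packed = encode(rows)
print(len(packed), total_watch_seconds(packed))
```

Without a schema, each record would have to be kept as a generic object and inspected in full before any one field could be read.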

Streaming data has also been a priority for the team at Apache: Spark's current streaming module, Spark Streaming, was another “big focus,” according to Zaharia.

“The thing that's most unique about Spark and data streaming is that the same machine can do batch computation. Other projects are streaming only, or either/or, but not both. You can be receiving a stream of data and at the same time cross-compare that stream with stored static data. Or as you're receiving it, you have run new queries about it, like 'What's the video doing in this region of the world?', and with Spark Streaming you bring in this query.”
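The pattern Zaharia describes, comparing a live stream against stored static data while keeping results available to query, can be sketched in plain Python. This is an illustration of the idea only, not Spark Streaming code; the user-to-region table, the event tuples and the function names are all hypothetical.

```python
from collections import Counter

# Static, stored data: hypothetical mapping of user id to region.
USER_REGION = {"u1": "EU", "u2": "US", "u3": "EU"}


def event_stream():
    """Stand-in for a live feed of (user_id, video_id) view events."""
    yield from [
        ("u1", "v9"),
        ("u2", "v9"),
        ("u3", "v7"),
        ("u1", "v7"),
        ("u9", "v7"),  # user missing from the static table
    ]


def views_by_region(events, user_region):
    """Enrich each arriving event with static data and keep running counts."""
    counts = Counter()
    for user_id, video_id in events:
        region = user_region.get(user_id, "unknown")
        counts[(region, video_id)] += 1
        # Emit a snapshot that an ad hoc query could read at any moment.
        yield dict(counts)


snapshots = list(views_by_region(event_stream(), USER_REGION))
latest = snapshots[-1]
print(latest[("EU", "v7")])
```

Each snapshot answers a question such as “how is this video doing in this region?” as of the latest event, without waiting for the stream to end.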

The Apache team is trying to expand these abilities, according to Zaharia: “The higher-level application interface that's under development for streaming is currently called Streaming DataFrames. The core focus though is on combining streaming with other types of computation, for building complete applications.”

“This is a thing that we're working on pretty heavily. We're investing in streaming, not just in the engine, but on making all the other libraries in spark work in a streaming fashion – as not all of them do so at the moment. We have a whole bunch of machine learning algorithms which you can just call, 20-30 of them, and we want to make it possible to call them on a stream of data as well as on static data,” said Zaharia. “The updates to the streaming data will make that easier to happen.”

Project Tungsten

Project Tungsten is an attempt to “make Spark use modern hardware more efficiently,” said Zaharia. “When you write computation in Spark, especially in high-level APIs like SQL, we will convert it to low-level machine code. This is more efficient than running Python out of the box, and in Spark 2.0 this is likely to become quite a bit more efficient. According to our early benchmarks it can easily become five or 10 times faster, especially with memory data.”
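The general technique behind that claim is code generation: rather than walking a query expression again for every row, the engine emits one specialised function for the whole expression. The toy sketch below shows the idea in plain Python. It is not Spark's implementation; generated Python source stands in for the low-level code Zaharia mentions, and the expression format and function names are assumptions made up for the example.

```python
# Expression tree for: (price * qty) + fee
EXPR = ("add", ("mul", ("col", "price"), ("col", "qty")), ("col", "fee"))

OPS = {"add": "+", "mul": "*"}


def interpret(expr, row):
    """Walk the tree again for every row (the slow, generic path)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    return left + right if kind == "add" else left * right


def to_source(expr):
    """Turn the tree into a Python expression string."""
    if expr[0] == "col":
        return f"row[{expr[1]!r}]"
    return f"({to_source(expr[1])} {OPS[expr[0]]} {to_source(expr[2])})"


def compile_expr(expr):
    """Generate one specialised function for the whole expression."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


rows = [
    {"price": 2.0, "qty": 3, "fee": 0.5},
    {"price": 10.0, "qty": 1, "fee": 1.0},
]
fast = compile_expr(EXPR)
print(to_source(EXPR))
print([fast(r) for r in rows])
```

Both paths return the same values; the generated function simply avoids re-inspecting the tree on every row, which is where the saving comes from.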

The second aspect of Project Tungsten is “where you store the data,” added Zaharia. “It will be possible with our changes to use solid state disks and other non-volatile memory. That's not yet started, though the other stuff is being actively worked on. We're currently designing the changes in Spark to make use of that hardware.”

Zaharia acknowledged that “something that isn't part of the design of Spark” at the moment was “transaction processing,” adding that users were confused about this “because they see 'real-time' or 'big data' as terms. What I mean by transaction processing is remembering small changes to a data centre over time. Something like that will be out of scope for a while.”

“Something else is adding a storage layer to Spark,” he told The Register. At the moment Spark uses existing storage layers, and is able to cache data outside of it – but there is no time-frame for providing Spark with its own storage layer in 2016.

“We want to expand support for the R programming language and to some extent SQL. We still need to provide a lot of the libraries for R,” said Zaharia.

Spark's Big Blue moment

In June of last year, IBM announced it was putting 3,500 of its researchers and developers behind Spark – which Zaharia described as “a really great moment,” before adding: “It's created a lot more interest in the project, and it shows users and enterprises that Spark is going to have backing for a long time.”

“There were quite a few large companies already using Spark,” said Zaharia, “including IBM itself, Intel, Yahoo!, companies like that, but on the commercial side of 'How can I get a company to help with it?' it was mostly young companies and start-ups.”

“I met with people from IBM, and they're doing a variety of things. They're changing a lot of IBM's projects to use Spark as their scale-out layer, which is great for our project as they'll use Spark with their own projects in the future, they're really trying to use it in various places. Microsoft, and AWS, and Oracle, all have products based on Spark and have said it's an integral part of their strategy.”

While Zaharia doesn't expect there will be a lot of impact on the day-to-day development on Spark (“That's done by a bunch of other companies and individual contributors, and it has its own kind of procedures based on a 'let's have a fast and stable release cycle' thing.”) IBM's involvement “will make it much easier for companies to adopt Spark, seeing a wide variety of support around it.”

The Spark Summit opens tomorrow in New York City, and runs until Thursday. ®

