Spark man Zaharia on 2.0 and why it's 'not that important' to upstage MapReduce

Matei tells us about his brainchild

Spark version 2.0

“The first major thing you'll see is Spark releasing v2.0,” Zaharia told The Register, with a release most likely due in April. Spark 2.0 won't be “a complete change” and will “mostly be backwards compatible,” the inventor added.

“There are a few very small legacy changes,” we were told, as Spark is “dropping support for really old versions of Hadoop, but we're bringing in a new set of APIs which are very efficient.”

Spark's original DataFrame API was introduced in v1.3, while the Dataset API was released as a preview in v1.6 – the most recent release, which arrived in January of this year.

Developers should be comforted to hear that Spark v2.0 will be producing tidy versions of these previously experimental features: “The DataFrame and Dataset APIs will be way more efficient,” Zaharia told us, explaining that “they'll tell the engine more about the structure of the data, and store and parse, in a more efficient way.”

The APIs are intended to be easier to use than they have been before and will be modelled on single-machine data science tools for Python and R. “A lot of Spark libraries can now take data that comes in as data frames,” said Zaharia.
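To illustrate why telling the engine about the structure of the data can pay off, here is a minimal sketch in plain Python. It is not Spark's API: the sample rows, the three-column layout and the function names are all assumptions made up for illustration. Without a schema, every query re-parses opaque text; with one, the data is parsed once into typed columns and a query reads only the column it needs.

```python
from array import array

# Hypothetical input: rows arrive as opaque text, with no declared structure.
RAW_ROWS = ["alice,34,72.5", "bob,29,80.0", "carol,41,65.2"]


def average_age_without_schema(rows):
    """Re-split and re-parse every row each time the query runs."""
    total = 0
    for row in rows:
        _name, age, _weight = row.split(",")
        total += int(age)
    return total / len(rows)


def load_columns(rows):
    """Parse once against an assumed (name, age, weight) schema.

    Numeric columns go into compact typed arrays rather than generic lists.
    """
    names, ages, weights = [], array("q"), array("d")
    for row in rows:
        name, age, weight = row.split(",")
        names.append(name)
        ages.append(int(age))
        weights.append(float(weight))
    return {"name": names, "age": ages, "weight": weights}


def average_age_with_schema(columns):
    """Read only the 'age' column; no parsing happens at query time."""
    ages = columns["age"]
    return sum(ages) / len(ages)


columns = load_columns(RAW_ROWS)
print(average_age_without_schema(RAW_ROWS) == average_age_with_schema(columns))
```

Both functions return the same answer; the difference is how much work each query repeats, which is the kind of saving a schema-aware engine can exploit.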

Streaming data has also been a “big focus” for the team at Apache, according to Zaharia, particularly Spark's current streaming module, Spark Streaming.

“The thing that's most unique about Spark and data streaming is that the same machine can do batch computation. Other projects are streaming only, or either/or, but not both. You can be receiving a stream of data and at the same time cross-compare that stream with stored static data. Or as you're receiving it, you have run new queries about it, like 'What's the video doing in this region of the world?', and with Spark Streaming you bring in this query.”
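The idea of cross-comparing a live stream with stored static data can be sketched in a few lines of plain Python. This is a toy, not Spark Streaming: the video IDs, the titles table, the batch size and the function names are hypothetical, and a list stands in for the incoming stream.

```python
# Static reference data: a hypothetical mapping from video ID to title.
VIDEO_TITLES = {"v1": "Cats", "v2": "Dogs"}

# Stand-in for a live stream of (video_id, region) view events.
EVENTS = [("v1", "EU"), ("v2", "US"), ("v1", "EU"), ("v2", "EU"), ("v3", "EU")]


def micro_batches(events, batch_size):
    """Yield the incoming events in small consecutive batches."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def views_by_region(events, region, batch_size=2):
    """Count views per title in one region.

    Each batch of streamed events is joined against the static titles table
    as it arrives, and the running counts are updated incrementally.
    """
    counts = {}
    for batch in micro_batches(events, batch_size):
        for video_id, event_region in batch:
            if event_region != region:
                continue
            title = VIDEO_TITLES.get(video_id, "unknown")
            counts[title] = counts.get(title, 0) + 1
    return counts


print(views_by_region(EVENTS, "EU"))
```

A question such as “what's the video doing in this region?” then becomes another call with a different `region` argument over the same stream and the same static table.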

The Apache team is trying to expand these abilities, according to Zaharia: “The higher-level application interface that's under development for streaming is currently called Streaming DataFrames. The core focus though is on combining streaming with other types of computation, for building complete applications.”

“This is a thing that we're working on pretty heavily. We're investing in streaming, not just in the engine, but on making all the other libraries in Spark work in a streaming fashion – as not all of them do so at the moment. We have a whole bunch of machine learning algorithms which you can just call, 20-30 of them, and we want to make it possible to call them on a stream of data as well as on static data,” said Zaharia. “The updates to the streaming data will make that easier to happen.”

Project Tungsten

Project Tungsten is an attempt to “make Spark use modern hardware more efficiently,” said Zaharia. “When you write computation in Spark, especially in high-level APIs like SQL, we will convert it to low-level machine code, this is more efficient than running Python out of the box, and in Spark 2.0 this is likely to become quite a bit more efficient. According to our early benchmarks it can easily become five or 10 times faster, especially with memory data.”
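The general technique Zaharia describes, turning a high-level expression into code once rather than interpreting it for every row, can be shown with a toy in plain Python. This is only a sketch of the idea of query compilation, not Spark's implementation: the tuple-based expression format and the function names are invented for illustration, and the “generated code” here is Python source rather than anything low-level.

```python
# A hypothetical expression tree for: price + qty * 2
EXPR = ("add", ("col", "price"), ("mul", ("col", "qty"), ("lit", 2)))


def interpret(expr, row):
    """Walk the expression tree again for every single row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    return left + right if kind == "add" else left * right


def to_source(expr):
    """Translate the tree into a Python source expression over 'row'."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    op = "+" if kind == "add" else "*"
    return f"({to_source(expr[1])} {op} {to_source(expr[2])})"


def compile_expr(expr):
    """Generate source for the whole expression once and compile it."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


compiled = compile_expr(EXPR)
row = {"price": 10, "qty": 3}
print(interpret(EXPR, row) == compiled(row))
```

The tree walk in `interpret` is paid on every row, whereas `compile_expr` pays the translation cost once and then runs a single flat function, which is where the savings in this style of engine come from.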

The second aspect of Project Tungsten is “where you store the data,” added Zaharia. “It will be possible with our changes to use solid state disks and other non-volatile memory. That's not yet started, though the other stuff is being actively worked on. We're currently designing the changes in Spark to make use of that hardware.”

Zaharia acknowledged that “something that isn't part of the design of Spark” at the moment was “transaction processing,” adding that users were confused about this “because they see 'real-time' or 'big data' as terms. What I mean by transaction processing is remembering small changes to a data centre over time. Something like that will be out of scope for a while.”

“Something else is adding a storage layer to Spark,” he told The Register. At the moment Spark uses existing storage layers, and is able to cache data outside of it – but there is no time-frame for providing Spark with its own storage layer in 2016.

“We want to expand support for the R programming language and to some extent SQL. We still need to provide a lot of the libraries for R,” said Zaharia.

Spark's Big Blue moment

In June of last year, IBM announced it was putting 3,500 of its researchers and developers behind Spark – which Zaharia described as “a really great moment,” before adding: “It's created a lot more interest in the project, and it shows users and enterprises that Spark is going to have backing for a long time.”

“There were quite a few large companies already using Spark,” said Zaharia, “including IBM itself, Intel, Yahoo!, companies like that, but on the commercial side of 'How can I get a company to help with it?' it was mostly young companies and start-ups.”

“I met with people from IBM, and they're doing a variety of things. They're changing a lot of IBM's projects to use Spark as their scale-out layer, which is great for our project as they'll use Spark with their own projects in the future, they're really trying to use it in various places. Microsoft, and AWS, and Oracle, all have products based on Spark and have said it's an integral part of their strategy.”

While Zaharia doesn't expect there will be a lot of impact on the day-to-day development of Spark (“That's done by a bunch of other companies and individual contributors, and it has its own kind of procedures based on a 'let's have a fast and stable release cycle' thing.”), IBM's involvement “will make it much easier for companies to adopt Spark, seeing a wide variety of support around it.”

The Spark Summit opens tomorrow in New York City, and runs until Thursday. ®
