How Apache Spark lit up the tech world and outshone its big data brethren
El Reg queries author Matei Zaharia on a decade of the project
Interview Big data is no longer hailed as the "new oil." It has gone out of fashion, both in terms of hype and because its foundational technology – Apache Hadoop – was surpassed by cloud-based blob storage such as AWS S3. However, a sister project born in the big data era has become more influential in the modern world of LLMs and internet-scale data systems.
It's been roughly 10 years since Apache Spark 1.0 was released, and The Register caught up with the original author, Matei Zaharia. He wrote the code as part of his UC Berkeley PhD thesis before the project was donated to the Apache Foundation.
"I didn't imagine it would be this popular and widely used back then," he said. "The use of it is still increasing today, from everything we can see: Developers, downloads, and meetup groups and so on."
Spark started as an academic project in 2010 when Zaharia – a Romanian-Canadian national – saw the need to improve on MapReduce, the Java-based programming framework then used to process big data across clusters built on the Hadoop Distributed File System.
"There were all these data-intensive computing things happening in the web companies at the time – basically Google, Yahoo, and Microsoft," he said. "There were also these distributed computing frameworks like MapReduce, and I was really interested in learning more about those and bringing that kind of parallel computing to a lot more users."
Although MapReduce was a way of cracking big data problems on commodity hardware, it was really targeted at software engineers, Zaharia explained.
"Their approach is very different from someone doing just interactive analysis, for example. Imagine if you want to use something like Microsoft Excel, you can very quickly get some results for what you want. You're not building something that's gonna run your business, but these early frameworks started for the engineers and they didn't do the distribution, fault recovery, and scheduling for you. You had to write it all by hand."
Zaharia took inspiration from the researchers using big data for machine learning or discovering new viruses, for example. "These are really interesting use cases where they won't sit down and learn Java and spend many weeks building an application. We wanted to make it as easy as possible for them to do their stuff," he said.
Part of the plan to broaden the appeal was to support more programming languages. As well as Java, users can work in Scala, the statistical language R, C#, and Python, the high-level general-purpose language that has become a mainstay of machine learning. Support for SQL, the de facto standard database query language, arrived with Spark SQL in 2014.
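The appeal is easier to see in practice. Below is a minimal PySpark sketch – not from the interview, and with an illustrative input file and column name – showing the same aggregation written once against the DataFrame API and once in Spark SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("polyglot-demo").getOrCreate()

# Hypothetical input: a JSON file of events with a "country" field
events = spark.read.json("events.json")

# DataFrame API
events.groupBy("country").count().show()

# The same aggregation expressed in Spark SQL
events.createOrReplaceTempView("events")
spark.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country").show()
```

Either route compiles down to the same distributed execution plan, which is what lets analysts and engineers share one engine.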
Zaharia's plan worked. Spark became open source in 2010 under a BSD license, was later donated to the Apache Software Foundation, entering its incubator in 2013, and graduated to a top-level project the following year.
In the meantime, Zaharia went on to co-found Databricks, which was initially based around providing Apache Spark as a service in the cloud. In the ten years since it was founded, the company has broadened into a so-called data lakehouse platform, combining the flexibility of exploratory data lakes with the structured SQL querying of a data warehouse. Earlier this year, it launched its own large language model, DBRX, an open source data catalog (Unity Catalog), and Mosaic AI, a set of tools to build, deploy, and monitor AI and ML solutions.
While Spark remains part of Databricks' core offering, its strength lies in its independence from any single vendor. All the main cloud vendors – AWS, Google, Azure, Oracle, and IBM – offer Spark as a service, as do independent vendors such as Qubole, which provides an open source data lake.
- Lakehouse dam breaks after departure of long-time Teradata CTO
- Databricks claims its open source foundational LLM outsmarts GPT-3.5
- Postgres pioneer Michael Stonebraker promises to upend the database once more
- Microsoft, Databricks double act tries to sew up the data platform market
Prasad Pore, Gartner senior director and analyst for data and analytics, said Spark was widely used for data processing and preparation, as well as analytics. "When it comes to processing large amounts of data, Apache Spark is a very proven, robust and fault-tolerant technology. Because of that, Spark has pretty good adoption in the market, either via vendors, as a managed offering, or via an open source implementation."
The secret to its success lies in its ability to process data in-memory – where MapReduce wrote intermediate results to disk – and to ensure that the distributed processing does not fall over.
"In-memory processing provided a tremendous performance improvement over batch jobs," Pore told The Register. "That was the main value proposition. Fault tolerance is also a very critical element when it comes to large amounts of data processing. Imagine if you are processing 10 TB of data and somehow the batch fails. You need to know that there is a fault-tolerant mechanism. If a node fails, it should be able to recover automatically. That is the robust architecture Spark has."
While Zaharia no longer contributes code directly to the project, he helps manage and advise the Databricks team that works on Spark.
He said that making the project open source from its early years encouraged "a big ecosystem of libraries" that has helped the platform "get better" for everyone. "Making it easy to extend libraries was good," he said.
But one thing he wished he'd introduced from the start is a form of backward compatibility between applications and Spark. He said the community was working on Spark Connect, which lets client applications written against Spark become independent of the version of the server and cluster behind them.
"We're working on that now. It's kind of cool, but I wish we'd done it at the beginning if we had thought about it," he said.
He promised that after Spark 4.0, expected to be released in June, no one would have to update their apps to a new version of Spark. "Of course, they can write new apps to take advantage of the new features," he said.
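Spark Connect achieves that decoupling by having the application talk to a remote Spark server over a narrow gRPC protocol rather than embedding a full driver. A minimal sketch of a client session, with a hypothetical endpoint (15002 is Spark Connect's default port):

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server; host is a placeholder
spark = SparkSession.builder.remote("sc://spark.example.com:15002").getOrCreate()

# The query is planned and executed on the server, not in the client process
spark.range(10).selectExpr("id * 2 AS doubled").show()
```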
While Spark may have been born in the era of big data and Hadoop, it is vital to the latest trend in computing. Zaharia said the largest LLMs in the world all prepared their data using Spark. "It's one of the use cases we care about. It's been really interesting to see new things people are doing with it." ®