Google must be Beaming as Apache announces its new top-level projects
eBay's Eagle monitoring software also soaring with open-source foundation
The Apache Software Foundation has today announced two new top-level projects, Apache Beam and Apache Eagle.
Apache Beam is yet another technology birthed by Google's work on data processing, and its roots can be traced back to Google's initial MapReduce system, which revolutionised the science of distributed data processing when it was published in 2004.
In 2015, Google published the "must-read" Dataflow paper by Tyler Akidau and colleagues. Although Dataflow shares a name with a Hortonworks product, the technology the paper describes in fact maps better to Apache Spark, according to Cloudera's director of data science, Sean Owen.
Unlike Spark, Dataflow is "more like a streaming-first framework that can also do batch, with special emphasis on handling problems of streaming semantics like out-of-order events and complex windowing," Owen told The Register, describing Apache Beam as like a Dataflow SDK, or software development kit.
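To give a rough sense of what "out-of-order events" and windowing mean in practice, here is a minimal sketch in plain Python. It does not use the Beam or Dataflow APIs; the function names, the 60-second window size and the sample events are all hypothetical, chosen only for illustration. The idea is that each event is assigned to a window by the time it happened, not the time it arrived, so a late arrival still lands in the right bucket.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size, for illustration only


def window_start(event_time, size=WINDOW_SECONDS):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def count_per_window(events, size=WINDOW_SECONDS):
    """Count events per event-time window, regardless of arrival order."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        counts[window_start(event_time, size)] += 1
    return dict(sorted(counts.items()))


# Events as (event_time_in_seconds, payload), listed in ARRIVAL order.
# "c" and "e" arrive after "b" even though they happened earlier.
arrivals = [(5, "a"), (65, "b"), (50, "c"), (130, "d"), (59, "e")]

result = count_per_window(arrivals)
print(result)  # {0: 3, 60: 1, 120: 1}
```

A real streaming system also has to decide how long to wait for stragglers before emitting a window's result, which is the harder part of the problem the Dataflow paper addresses; this sketch sidesteps that by processing a finite list.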
Beam is "the developer API for programs that run on Dataflow. It is well thought-out, created by a strong team, created especially for streaming – and I understand it's used within Google, which says a lot about its effectiveness," Owen continued.
"The disappointment is that, of course, Dataflow is a proprietary Google service," said Owen, who noted that this made its availability "a great move for Google's cloud offering".
As of Beam's latest release, 0.4.0, it does support runners for open-source Flink, Spark and Apex, but Owen said that, as a user, he would remain concerned about stacking these complicated distributed tools.
"The complexity of stacking Beam on top of Spark or Flink" was the issue. "Both are complex in their own right. There is a bit of a lowest-common-denominator problem here, in that some things in Beam don't totally match Spark and vice versa. Still, it's an option."
"I'm not surprised that there are more and more Spark-like projects emerging," Owen added. "If a problem is worth solving, then you can be sure it will be solved ten times over in the open-source world. After all, we still see new databases and SQL engines every year."
"Eventually, one will supersede Spark, but I don't think we're there yet. Flink and Beam's selling points are largely around streaming, though Spark 2's structured streaming adds some of this same functionality to Spark, and that's already generally available."
From Cloudera's perspective, "Spark still has orders of magnitude more usage, so will be the distributed computation framework in CDH for the foreseeable future," said Owen. "Although there's no formal support for anything else, we dabble in other new projects promiscuously to keep tabs on what's emerging. For example, Cloudera's Marton Balassi is a Flink committer, and Cloudera contributed the initial Spark-Beam integration."
Davor Bonaci, a software engineer at Google and the project management committee chair for Apache Beam, said: "Going forward, we will continue to extend the core abstractions to distill additional complex data processing patterns into intuitive APIs, and, at the same time, enhance the ability to interconnect additional storage/messaging systems and execution engines. Together, we are excited to push forward the state of the art in distributed data processing."
Meanwhile, monitoring software Eagle, originally developed by eBay – which claims to operate one of the world's largest Hadoop and Spark platforms – has been announced as a top-level project (TLP) after its donation to the Apache Software Foundation in 2015.
Like all of Apache's software, Eagle is open source, and is an analytics tool for identifying security and performance issues "instantly" on big data platforms, including Hadoop, Spark, and NoSQL.
Eagle analyses data activities, YARN applications, JMX metrics, and daemon logs, feeding them to an alert engine that flags security breaches, performance issues and similar problems.
"Apache Eagle is a great monitoring and alerting solution designed for large-scale distributed environment," said Chad Chun, director of analytics data infrastructure at eBay. "It was originally intended for security monitoring and quickly became a generic solution for allowing domain experts to create their own monitoring applications on top of Eagle." ®