The Apache Software Foundation has today announced Apache Arrow, a new project that aims to provide a cross-system data layer for columnar in-memory analytics.
While Apache projects normally go through incubation periods, Arrow has been immediately announced as a Top-Level Project, and its code – seeded from the Apache Drill project – is being released today.
Apache Arrow is intended to establish a "de-facto standard for columnar in-memory processing and interchange," although its first formal release is a few months away.
Jacques Nadeau, veep of both Arrow and Drill, modestly said: "We anticipate the majority of the world's data will be processed through Arrow within the next few years."
Talking to The Register, Nadeau said that "key guys" from other Apache Big Data projects – comprising Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating), Parquet, Phoenix, Spark and Storm – "as well as established and emerging Open Source projects such as Pandas and Ibis" are involved.
In addition to traditional relational data, Arrow supports complex data with dynamic schemas. For example, Arrow can handle JSON data, which is commonly used in IoT workloads, modern applications and log files. Implementations are also available (or under way) for a number of programming languages including Java, C++ and Python to allow greater interoperability.
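To give a rough idea of what fitting JSON with a shifting schema into columns involves, here is a minimal sketch in plain Python. It is not Arrow's API: the function `records_to_columns`, the sample events and the "values plus valid flags" layout are all assumptions made for illustration. Each field becomes one column, and a per-column list of flags records which rows actually carried a value.

```python
import json

def records_to_columns(json_lines):
    """Pivot JSON records with differing keys into columns plus validity flags.

    Hypothetical helper for illustration only; not part of Arrow.
    """
    records = [json.loads(line) for line in json_lines]

    # Collect the union of all field names, in first-seen order.
    names = []
    for rec in records:
        for key in rec:
            if key not in names:
                names.append(key)

    columns = {}
    for name in names:
        # Missing fields become None, and the flag marks them as absent.
        values = [rec.get(name) for rec in records]
        valid = [rec.get(name) is not None for rec in records]
        columns[name] = {"values": values, "valid": valid}
    return columns

# Made-up IoT-style events whose fields vary from record to record.
events = [
    '{"device": "sensor-1", "temp": 21.5}',
    '{"device": "sensor-2", "temp": 19.0, "battery": 87}',
    '{"device": "sensor-3"}',
]

columns = records_to_columns(events)
print(columns["temp"])     # {'values': [21.5, 19.0, None], 'valid': [True, True, False]}
print(columns["battery"])  # {'values': [None, 87, None], 'valid': [False, True, False]}
```

The point of the sketch is only that records with different shapes can still end up as aligned columns of equal length, which is the precondition for columnar processing.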
Todd Lipcon, founder of the Apache Kudu project and member of Arrow's Project Management Committee, said: "Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing."
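What Lipcon is getting at can be sketched with nothing but the Python standard library. The example below is an illustration under stated assumptions, not Arrow code: the `rows` data is made up, and Python's `array` module stands in for a columnar buffer. Plain Python will not issue SIMD instructions here; the sketch only shows the layout difference that lets a vectorized engine do so.

```python
from array import array

# Row-oriented: every record is a separate object, so the "price"
# values are scattered around the heap.
rows = [{"id": i, "price": i * 0.5} for i in range(1000)]

# Column-oriented: all "price" values sit side by side in one
# contiguous buffer of 8-byte floats.
price_column = array("d", (row["price"] for row in rows))

# Same answer either way; the difference is how the data is laid out.
row_total = sum(row["price"] for row in rows)
column_total = sum(price_column)

print(row_total == column_total)                  # True
print(len(price_column) * price_column.itemsize)  # 8000 bytes, back to back
```

Because the column is a single run of same-sized values, an engine written in C++ or Java can walk it with wide vector instructions instead of chasing a pointer per record.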
Apache claimed that, for many workloads, 70-80 per cent of CPU cycles were spent serializing and deserializing data. Arrow is intended to solve this problem by "enabling data to be shared between systems and processes with no serialization, deserialization or memory copies."
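The difference between copying through a serialized form and sharing memory directly can be shown in miniature. The sketch below stays inside one Python process and uses the standard library's `memoryview`; Arrow's aim is to do the equivalent across systems and languages, and nothing here is Arrow's actual API. The variable names are illustrative assumptions.

```python
import json
from array import array

# The producer's column: four signed 64-bit integers in one buffer.
column = array("q", [10, 20, 30, 40])

# Serializing path: encode, "send", decode. The consumer ends up with
# an independent copy, and CPU time went into both conversions.
wire = json.dumps(column.tolist())
copied = json.loads(wire)

# Zero-copy path: the consumer gets a view onto the producer's buffer.
shared = memoryview(column)

# The producer updates a value in place.
column[0] = 99

print(copied[0])  # 10, the copy is stale
print(shared[0])  # 99, the view sees the same memory
```

The view costs nothing to create regardless of the column's size, whereas the serialized round trip grows with the data, which is the overhead the 70-80 per cent figure refers to.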
Letting multiple systems work together without the overhead of moving data between them is what Arrow will do, Nadeau told us.
"You have to move data around between different nodes, and potentially move it between Java and Python, for instance," Nadeau added, "so any time doing this between two different programming environments, or engines, all of those transfers benefit from the lack of serialization/deserialization."
"An industry-standard columnar in-memory data layer enables users to combine multiple systems, applications and programming languages in a single workload without the usual overhead," said Ted Dunning, Vice President of the Apache Incubator and member of the Arrow PMC. ®