American startup Databricks, established by the original authors of the Apache Spark framework, has launched an open source project designed to solve the reliability issues plaguing data swamps – those huge (cess)pools of raw corporate data that are supposed to deliver value from analytics.
The Delta Lake project is deployed on top of the existing data lake, requiring no change to the underlying architecture. It is compatible with batch and streaming data, can check data quality and schema, and doesn’t allow broken datasets to mess with the algorithms.
Delta Lake also adds quarantine functionality for data that doesn’t make the grade, and there’s a function called Time Travel that helps developers access earlier versions of their data for audits, rollbacks or reproducing machine learning experiments.
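To make the ideas of schema checking, quarantine and Time Travel concrete, here is a toy sketch in plain Python. It is not Delta Lake's implementation or API: `ToyTable`, its fields and its methods are hypothetical names invented for illustration, and storing a full snapshot per version is a deliberate simplification.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Illustrative only: a tiny in-memory table with versioned snapshots."""
    schema: dict  # column name -> expected Python type
    # versions[n] holds the full table contents as of version n; version 0 is empty
    versions: list = field(default_factory=lambda: [[]])
    quarantine: list = field(default_factory=list)

    def _valid(self, row):
        # A row passes only if it has exactly the expected columns and types
        return (set(row) == set(self.schema)
                and all(isinstance(row[col], typ)
                        for col, typ in self.schema.items()))

    def append(self, rows):
        """Commit valid rows as a new version; set invalid rows aside."""
        good = [r for r in rows if self._valid(r)]
        bad = [r for r in rows if not self._valid(r)]
        self.quarantine.extend(bad)
        self.versions.append(self.versions[-1] + good)
        return len(self.versions) - 1  # the new version number

    def read(self, version=None):
        """Return the latest snapshot, or an earlier one if a version is given."""
        snapshot = self.versions[-1] if version is None else self.versions[version]
        return list(snapshot)


table = ToyTable(schema={"id": int, "name": str})
table.append([{"id": 1, "name": "a"}, {"id": "oops", "name": "b"}])
table.append([{"id": 2, "name": "c"}])
```

After these two appends, `table.read()` returns both valid rows, `table.read(version=1)` returns only the first, and the malformed row sits in `table.quarantine` rather than in the table itself.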
And finally, Delta Lake can help hunt down every copy of a particular dataset in a short amount of time, since it uses distributed processing power to handle all of its metadata – something that could be extremely useful for GDPR compliance, among other things.
The platform can be plugged into any Apache Spark job as a data source. It is 100 per cent compatible with Spark APIs, and its creators say it is “like a sibling” to Spark.
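In practice, plugging Delta Lake into a Spark job might look something like the sketch below. It assumes a `SparkSession` that already has the Delta Lake library available; the helper names `read_delta` and `write_delta` are hypothetical, and the `versionAsOf` reader option is how the Time Travel feature is typically requested.

```python
def write_delta(df, path, mode="append"):
    """Write a Spark DataFrame to `path` using the Delta Lake data source."""
    df.write.format("delta").mode(mode).save(path)


def read_delta(spark, path, version=None):
    """Read a Delta table; pass `version` to load an earlier snapshot."""
    reader = spark.read.format("delta")
    if version is not None:
        # Time Travel: ask for the table as it was at a given version number
        reader = reader.option("versionAsOf", version)
    return reader.load(path)
```

With a live session, `read_delta(spark, "/data/events")` would return the current table as a DataFrame and `read_delta(spark, "/data/events", version=0)` its first committed version; the path here is a made-up example.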
The data in a Delta Lake is stored in the tried-and-true Apache Parquet columnar storage format, which will be familiar to anyone who has ever dealt with Hadoop.
The software has been open-sourced under the Apache 2.0 license. For customers who are less inclined to dig into the source code, Delta Lake is available as a managed service from Databricks, hosted with either AWS or Azure.
Databricks was co-founded in 2013 by a team of academics who met at Berkeley, including computer scientist Matei Zaharia, who developed Spark as a PhD thesis in 2009 and later co-created the Apache Mesos cluster manager. Today, the company employs around 700 people and has 2,000 customers.
“We’ve believed right from the onset that innovation happens in collaboration - not isolation. This belief led to the creation of the Spark project and MLflow. Delta Lake will foster a thriving community of developers collaborating to improve data lake reliability and accelerate machine learning initiatives,” said Ali Ghodsi, co-founder and CEO of the company, and adjunct professor at Berkeley.
Databricks says the platform has already been deployed in production by businesses like Viacom, Comcast, Edmunds, Riot Games, Zeiss, Condé Nast and McGraw Hill. ®