Storage


Databricks launches open-source project to drain all your data swamps into info lakes

From the creators of Apache Spark comes a new tale of friendship and imagination

Wed 24 Apr 2019 // 23:42 UTC

American startup Databricks, established by the original authors of the Apache Spark framework, has launched an open source project designed to solve the reliability issues plaguing data swamps – those huge (cess)pools of raw corporate data that are supposed to deliver value from analytics.

The Delta Lake project is deployed on top of an existing data lake and requires no change to the underlying architecture. It is compatible with batch and streaming data, can check data quality and schema, and doesn’t allow broken datasets to mess with the algorithms.

Delta Lake also adds quarantine functionality for data that doesn’t make the grade, and there’s a function called Time Travel that helps developers access earlier versions of their data for audits, rollbacks or reproducing machine learning experiments.
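The two behaviours described here, setting aside records that fail a schema check and reading a table as it stood at an earlier version, can be illustrated with a toy in-memory model. This is a conceptual sketch only: the `VersionedTable` class, its method names and the per-row quarantine rule are invented for illustration, and say nothing about how Delta Lake itself is implemented.

```python
from typing import Any, Optional

Row = dict[str, Any]


class VersionedTable:
    """Toy table (hypothetical): schema-checked appends, every commit kept as a version."""

    def __init__(self, schema: dict[str, type]) -> None:
        self.schema = schema
        # Version 0 is the empty table; each append adds one new snapshot.
        self._versions: list[list[Row]] = [[]]
        # Rows that failed the schema check are parked here, not written.
        self.quarantine: list[Row] = []

    def _valid(self, row: Row) -> bool:
        # A row is valid if it has exactly the schema's columns
        # and each value has the declared Python type.
        return row.keys() == self.schema.keys() and all(
            isinstance(row[name], kind) for name, kind in self.schema.items()
        )

    def append(self, rows: list[Row]) -> int:
        """Commit the valid rows as a new version and return its number."""
        good: list[Row] = []
        for row in rows:
            (good if self._valid(row) else self.quarantine).append(row)
        self._versions.append(self._versions[-1] + good)
        return len(self._versions) - 1

    def read(self, version: Optional[int] = None) -> list[Row]:
        """Return the latest snapshot, or the one at `version` if given."""
        index = len(self._versions) - 1 if version is None else version
        return list(self._versions[index])
```

Reading with an explicit version number is what makes audits and rollbacks possible in this model: older snapshots are never overwritten, so `read(1)` keeps returning the same rows however many appends follow.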

And finally, Delta Lake can help hunt down every copy of a particular dataset quickly, since it uses distributed processing to handle all of its metadata – something that could be extremely useful for GDPR compliance, among other things.

The platform can be plugged into any Apache Spark job as a data source. It is 100 per cent compatible with Spark APIs, and its creators say it is “like a sibling” to Spark.
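In practice, using Delta Lake as a Spark data source looks roughly like the sketch below. It assumes a Spark session with the Delta Lake package already on its classpath; the helper names `read_delta` and `append_delta` are hypothetical, while `format("delta")` and the `versionAsOf` read option are the ones described in the Delta Lake documentation.

```python
def read_delta(spark, path, version=None):
    """Load a Delta table from `path` through the standard Spark reader.

    `spark` is expected to be a pyspark.sql.SparkSession (assumption).
    If `version` is given, the table is read as of that commit (Time Travel).
    """
    reader = spark.read.format("delta")
    if version is not None:
        reader = reader.option("versionAsOf", version)
    return reader.load(path)


def append_delta(df, path):
    """Append a Spark DataFrame `df` to the Delta table stored at `path`."""
    df.write.format("delta").mode("append").save(path)
```

With a session named `spark`, a call such as `read_delta(spark, "/data/events", version=0)` would return the table as of its first commit; the path is a placeholder.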


The data in a Delta Lake is stored in the tried-and-true Apache Parquet columnar format, which will be familiar to anyone who has ever dealt with Hadoop.

The software has been open-sourced under the Apache 2.0 license. For customers who are less inclined to dig into the source code, Delta Lake is available as a managed service from Databricks, hosted on either AWS or Azure.

Databricks was co-founded in 2013 by a team of academics who met at Berkeley, including computer scientist Matei Zaharia, who developed Spark as a PhD thesis in 2009 and later co-created the Apache Mesos cluster manager. Today, the company employs around 700 people and has 2,000 customers.

“We’ve believed right from the onset that innovation happens in collaboration - not isolation. This belief led to the creation of the Spark project and MLflow. Delta Lake will foster a thriving community of developers collaborating to improve data lake reliability and accelerate machine learning initiatives,” said Ali Ghodsi, co-founder and CEO of the company, and adjunct professor at Berkeley.

Databricks says the platform has already been deployed in production by businesses like Viacom, Comcast, Edmunds, Riot Games, Zeiss, Conde Nast and McGraw Hill. ®
