Databricks, the company behind the popular open-source big data tool Apache Spark, has released an ingest technology aimed at getting data into data lakes more quickly and easily.
Auto Loader is a file source that helps load data from cloud storage continuously and "efficiently" as new data arrives, which the company claims lowers costs and latency. The new "cloudFiles" source automatically sets up file notification services that subscribe to file events from the input directory, and processes new files as they arrive, landing them in the Apache Spark-wrangling biz's Delta Lake, a storage layer for data lakes.
Bharath Gowda, Databricks veep for product marketing, said: "There's a lot of data already sitting in the cloud in the form of object storage. Companies still continue to push data into data lakes, but they are building in a lot of complex scheduling because they have to keep track of when data arrives and duplication. We've built Auto Loader which makes it really easy for data engineers to point to any kind of object storage bucket, like AWS S3. It automatically takes care of the duplication and it runs the job for them, converting the data into Delta Lake format."
Databricks said Auto Loader spares users from managing file state themselves by incrementally processing new files as they land in cloud storage. It scales by leaning on cloud notification services and RocksDB, an embedded database for key-value data, so it does not have to list all the files in a directory to find the new ones.
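For readers who want to picture the mechanism, here is a rough sketch of the general idea: file-arrival events are checked against a small embedded key-value store so that each file is handled only once, with no directory listing involved. This is not Databricks' code. SQLite stands in for RocksDB, and the names `SeenFiles`, `handle_events` and the `s3://bucket/...` paths are hypothetical.

```python
import sqlite3


class SeenFiles:
    """Hypothetical store of already-processed file paths.

    SQLite is used here as a stand-in for an embedded key-value
    database such as RocksDB; this is an assumption for illustration.
    """

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (path TEXT PRIMARY KEY)"
        )

    def mark_if_new(self, path):
        """Record the path; return True only the first time it is seen."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (path) VALUES (?)", (path,)
        )
        self.conn.commit()
        return cur.rowcount == 1


def handle_events(events, seen, process):
    """Run `process` once per new file; skip paths already recorded."""
    processed = []
    for path in events:
        if seen.mark_if_new(path):
            process(path)
            processed.append(path)
    return processed


# Two batches of (hypothetical) file-arrival events, with one repeat.
seen = SeenFiles()
loaded = []
first = handle_events(["s3://bucket/a.json", "s3://bucket/b.json"], seen, loaded.append)
second = handle_events(["s3://bucket/b.json", "s3://bucket/c.json"], seen, loaded.append)
```

In this sketch the repeated event for `b.json` in the second batch is ignored, so only `c.json` is processed. A path is recorded before it is processed, which keeps the example short but means a failed load would not be retried; a production system would need to handle that case.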
At the same time the company has inked deals to work with data technology firms Fivetran, Qlik, Infoworks, StreamSets and Syncsort to provide built-in integrations to Databricks Ingest for automated data loading. Azure customers can already use the Azure Data Factory while Databricks is lining up agreements with Informatica, Segment and Talend to do the same kind of thing.
Lakehouse? Say what?
The new technology arrives with what the firm described as a new "data management paradigm", which it claimed will combine the best of the data warehouse and data lake approaches. In a flash of inspiration, it dubbed the paradigm, er, "lakehouse". Regardless of whether the moniker catches on, Databricks is steering its business towards bringing stronger governance and performance to data lakes as users seek business intelligence (BI) from these repositories, rather than from data warehouses only.
"There is more and more BI being done on data lakes," Gowda said. "That is a big change because people don't think about data lakes and BI together. We have solved some of the traditional challenges of performance as well as reliability."
Data warehouse vendors would, of course, disagree and say data lakes should be for ingest and experimentation only. Production machine learning and analytics should act on data in the warehouse.
Ultimately, as enterprise data has become more diverse with the arrival of unstructured and high-velocity data, more vendors are fighting over a bigger pie. Investors will watch who gets the larger piece while users will just hope they can get it all to work together given the constraints of their current environment and budget. ®