
Tiny Uber offshoot tries to do for data lakes what Snowflake did for data warehousing

Onehouse's Apache Hudi managed service makes sense but has a long way to go, analysts say

Tiny Californian startup Onehouse has won $8m in seed funding from which it hopes to grow a business worthy of taking on the giants of data engineering.

In a move parallel to the one pulled by multibillion-dollar cloud data warehouse slinger Snowflake, the firm founded from an Uber engineering project is offering data lake technologies as a managed service in the cloud – something of a first. The aim is to make data lake projects faster, cheaper and easier than before.

Analysts agree the minnow has a good chance, but it will face stiff competition and the challenge of cementing the "lakehouse" concept in users' minds.

The Onehouse service is based on Apache Hudi, which stands for Hadoop Upserts Deletes Incrementals. It is an open-source framework developed by Uber in 2016 to manage the storage of large datasets on distributed file systems. Onehouse founder and CEO Vinoth Chandar worked on that project.

Speaking to The Register, Chandar explained: "In 2015 and 2016, Uber was growing very fast and launching in new cities almost every day. We were running a data warehouse [based on HPE Vertica, now part of Micro Focus], but we were starting to hit the limits of the amount of data that we can store in the warehouse and serve out in a cost-effective way.

"We also had a lot of data-driven products or machine learning-based products: we wanted to predict ETAs, we wanted to score the trips for safety, new use cases were coming all the time."

The team knew about data lake technologies for storing and managing data for machine learning, but it also wanted to work on transactional data.

"We wanted to be able to replicate all the transactional data, which is highly structured actually, and get them into the data lake very quickly. And when we looked at the warehouse, they had scaling challenges but it had all the transactional capabilities that you need: you can take trip records from an upstream database and just replicate the changes in the data warehouse. But there was no technology like that to do that on the data lakes," Chandar said.

Hudi (pronounced hoodie) is now used by large corporations and internet firms including Walmart, GE Aviation, and HSBC. The point of the open-source project is to be able to replicate large amounts of transactional or event stream data into a data lake without hassle, Chandar said.

"We're trying to do data warehousing and data science on the same system. By bringing transactions, we just made the traditional BI and warehousing analytics workloads run much better in a more cost-effective way even on the data lake."

The approach allows users to store data in open cloud formats, and pick their own query engines. Onehouse offers Hudi-as-a-service, slashing engineering time in getting projects up and running, Chandar claimed. "We routinely see that it still takes six months to one year for companies to hire an engineering team, train them on all these new technologies, build a lot of these pipelines, and bring the lake to life. That process is pretty time-consuming right now."

Stalwarts in the data warehousing space might want to point out that they support machine learning on their already transactional-data-friendly systems. Teradata, Oracle, and Snowflake all have their own stories to tell here.

But Chandar argued that it is much more efficient in terms of underlying storage to bring the transactional data to the data lake. While warehousing firms were supporting programming APIs and some data frameworks used in data science, requirements for more ambitious projects went well beyond that.

"It could be good for a segment of users, but Apache Spark or Flink access cloud storage directly," he said. "They don't need any intermediate servers and the architecture itself is a lot more scalable. If you look at warehouses, typically you deploy tens of servers. If you look at a deep learning data science pipeline, then you'll see that they routinely run across hundreds of nodes. There is an intrinsic cost/performance/scale barrier here that prevents people from running these more serious data science workloads on data warehouses."

Meanwhile, the query problems were also fundamentally different. Data warehousing optimisation is broadly about reducing the amount of data the system looks at – with 30 years of work from companies including Oracle and Teradata going into the problem. Data science is more about looking for patterns across all data, he argued.

The lakehouse concept is not new, though. The idea of bringing data warehousing and BI-type problems to data lakes has been promoted by Databricks, which itself was first built around Apache Spark, for more than two years. It's now supporting BI lingua franca SQL to boot.

Chandar said: "I think it's similar but there are some key differences. First, we are not trying to build an enterprise free version of Hudi. The problem that we are trying to solve is to bring in a new model: managed data lakes. We're trying to bring in that product category. Secondly, we want to keep Hudi more open, not just the format, but all the services that are on top of it."

Analysts' reaction to Onehouse's approach was largely positive, with some caveats.

Andy Thurai, vice president and principal analyst at Constellation Research, said the firm had a chance of replicating Snowflake's success in building a managed-service business "if Apache Hudi picks up steam."

"There is a precedent set with many unicorns establishing this model, from Red Hat to Snowflake to Confluent. However, if the adoption picks up then so will the competition. There is nothing stopping other companies from trying to offer an Apache Hudi managed service, especially the big boys including AWS and Google."

Hyoun Park, CEO and chief analyst at Amalgam Insights, said there was some truth to the claim that the lakehouse could handle storage more efficiently than a warehouse, but it would depend on what the user was trying to do.

"Data warehouses typically have processed data that is cleansed while lakehouses consist of larger and less refined pools of data. From a practical perspective, data lakes are more advantageous because they can use data from the original source more easily without as much effort on cleansing, formatting, and prioritizing data.

"The flipside is that the warehouse will have more performant data and will likely be smaller data. Whether one is ultimately more efficient than the other from a storage perspective is dependent on the type of data involved, data retention, nature of machine learning efforts, and other data, analytic, and model management considerations that could swing the pure total cost of ownership one way or the other."

Park added that there were parallels to Snowflake's approach in creating a new managed service category, but Onehouse would face greater obstacles.

"Snowflake benefitted both from supporting an existing enterprise paradigm and having outstanding sales execution. Onehouse faces a slightly more difficult challenge in that the data lake is not as prevalent as the data warehouse was in the enterprise so there is still some general education needed and some agreement as to what a 'standard' data lakehouse should look like.

"However, Onehouse benefits from the fact that enterprises generally see the value of having a data lakehouse for a long-term data repository, but generally lack the expertise to set up the lakehouse. By providing a managed service, Onehouse will quickly be able to bring in analytics and data resellers who can help enterprises with a data lake approach.

"Being first to market with a managed data lakehouse approach and with a deep background in Apache Hudi, Onehouse has an offering that could quickly rise to billion-dollar unicorn status by ramping up businesses to support large lakes of data and to take over as Hudi project deployments become unmanageably large."

Onehouse is a tiny business in tech terms, with $8m in new funding and 15 employees. Databricks' IPO, expected this quarter, may help the firm cement the lakehouse concept on which it hopes to build its managed service business. ®
