Trino and dbt open source data tools snuggle closer with integrated SaaS

Managed offerings now work in tandem to crunch data where it resides

Two SaaS products built on open source data management and analytics technologies have joined forces in a move their makers hope will attract users who want to model and manage data for crunching.

The partnership between dbt and Starburst aims to serve a large market by helping prepare data for analytics without moving it, one analyst told The Register.

Starburst is the company built around open source Trino (formerly Presto), the analytics and data lake project originating in Facebook's Hadoop environment, which counts AWS, Salesforce and Pinterest among its community. dbt, on the other hand, is the company built around the open source tool of the same name that helps organizations model, manage and predict the data transformations necessary for complex "internet scale" analytics. The stock market Nasdaq, engineering company Vestas and martech giant Hubspot are among its customers.

Starburst co-founder Matt Fuller said dbt allows analytics engineers to model data in a higher-level language but exports SQL to manipulate data in a database or data lake such as Starburst.

"It's a really complementary technology," he told The Register.

Starburst also allows users to analyze data outside its data lake using SQL, including systems such as MySQL or PostgreSQL, as well as non-relational systems like MongoDB, Kafka and Elastic.

With customers already using Trino and dbt together, it made sense to integrate them in the companies' SaaS products – dbt Cloud for dbt and Starburst Galaxy.

"People previously were using [dbt Core] with Galaxy, but it's a little cumbersome because Galaxy is a fully managed offering and dbt Core is open source, so you have to manage it yourself. With this announcement, you can now use both products that are managed offerings together and that wasn't possible before," he said.

Analyst Kevin Petrie, Eckerson Group research vice president, said the combined service was targeting a large market.

"Enterprise environments are more distributed than ever, with data residing on-premises and in two or more clouds. This makes it tricky to move and prepare data for analytics projects. By using Starburst's federated query engine alongside dbt's transformation engine, data teams can prepare data for analytics without needing to move it. So they can analyze a wider array of data, wherever it sits, for a given analytics project.

"They can use Starburst to query the distributed data, and dbt to clean, model and document it, with no need to ingest it across platforms."

A string of data warehouse and analytics vendors have become interested in letting users bring analytics to the data, without moving it into a data warehouse or data lake. Teradata worked with Starburst to adapt Trino for this purpose in its product QueryGrid in 2020.

More recently, Google BigQuery, Snowflake and Cloudera announced their adoption of Apache Iceberg, the open source data table format from Netflix.

Starburst also has an Iceberg connector, but Fuller argued its approach was more open than that of the data warehouse vendors when applied to a data lakehouse – the recently coined concept of combining data lakes and data warehouses.

"I'm glad they're finally catching up to understanding the value of Iceberg, but I don't think they quite get it right," Fuller said. "Iceberg and Trino are completely independent open source projects. Combined, they create a truly open data lakehouse. If you do want to use them both together as a commercial offering, there is Starburst Galaxy and Tabular, which is the company behind Iceberg. The difference with Snowflake and the other approaches is they have limitations for it. In some cases, the catalog for Iceberg tables isn't accessible to other tools, for example. There's always like a slight lock-in angle."

Petrie told us: "Enterprises want to consolidate as much data as they can onto cloud platforms such as Snowflake, BigQuery or Databricks. But data gravity and migration complexity prevent them from moving everything to just one platform. So I think many environments will use both consolidated platforms such as Snowflake and query engines such as Starburst or Dremio to support their analytics projects." ®
