Interview Hortonworks believes it's solved duplication issues in its Hadoop spin that menaced users when incrementally merging data.
And CTO Scott Gnau reckoned customers are heaving sighs of relief after Hortonworks has realised what must have seemed impossible.
Gnau was speaking to The Register at the DataWorks Summit in Munich this week.
The company has acknowledged the difficulties of many workloads in the Hadoop environment when they require "frequent or unpredictable updates".
The biggest problem? Developers are responsible not only for the update logic but also the rollback logic they must detect and resolve write conflicts and find some way to isolate downstream consumers from in-progress updates. Hadoop has limited facilities for solving these problems and people who attempted it usually ended up limiting updates to a single writer and disabling all readers while updates are in progress.
Gnau reckons they've tackled this with version 2.6 of Hortonworks Data Platform — the company's enterprise Hadoop distribution.
Another objective of 2.6 was to avoid the possibility that HDP became an environment simply for copying data and running complex analytics analytics with "interference" from the "real work" taking place in the enterprise data warehouse.
Hortonworks has therefore developed ACID Merge in Apache Hive, rather than introduce it via yet another open-source project.
Initially developed by Facebook, but now part of the Apache Software Foundation, Hive provides a SQL-like interface for querying data stored in Hadoop. "Others might have taken the approach of creating yet another project to go and solve this problem," Gnau told The Reg.
"[But] we really like to focus on making the core platform as good as it can be, and that makes it easier for our customers to consume it because they don't have to rewrite their applications or do anything other than load the new software, implement the performance enhancement capability, and the application just runs."
Gnau reckoned there were "many instances" where the same file loaded twice or the same record was duplicated.
"Being able to do ACID-compliant merge of that data as an incremental data maintenance operation is really important especially as we look at the kind of applications that are going to be driven by the technical query performance, dashboarding and other transaction analytics where you want to ensure that you don't get dirty reads, where you want to ensure that you don't double-count."
ACID compliance is historically a big issue in data processing. "Being able to do data merge and incremental data loads inside of the core platform was possible, but it required a lot of heavy lifting," Gnau said. "Now it's just completely automatic and because we do it inside of the engine we can guarantee that it's ACID compliant."
The response from customers has been verging on emotional, according to Hortonworks' CTO. During a meeting with Hortonworks customers he reckoned some had "just sighed with great relief because it's a problem which they'd been grappling with, trying to solve, and it was taking more work than they wanted it to do.
"I didn't have a lot of customers asking for the ACID Merge," Gnau reckoned, "but it's obviously a really big deal for them – it might have been they didn't ask because they thought it was impossible to do. Certainly it was not without challenge from an implementation perspective." ®