Like the idea of chewing on terabytes of data using Google’s MapReduce, but think it’s too slow, too hardware-hungry and too complicated?
A fledgling big-data analytics venture reckons it has the answer: a Hadoop programming framework, built in Java, that it claims is 20 times faster than ordinary Hadoop and uses less data-centre hardware. It’s easier to program, too, the firm claims.
Heard it all before? So have we, only this time it’s not some startup backed by VCs looking to cash in and bail out on the big-data wave.
It’s ET International – a 12-year-old venture you probably haven’t heard of but which was founded with research backing from the US Department of Defense, its first customer.
Customers since then have spanned the Pacific Northwest National Laboratory and organisations in oil and gas.
ETI is the brainchild of MIT parallel computing and data flow brainbox Guang Gao – now a professor at the University of Delaware with several awards to his name.
His company says it is applying what it learned working on big data for those early customers, and will soon spin out the big-data business as a separate company.
ETI claims its product, called HAMR, which hit beta last month, can run the same job as Hadoop using fewer servers – just one-tenth of the nodes. It also runs entirely in memory.
“HAMR is an evolution of MapReduce,” the company’s chief architect Brian Heilig told The Reg recently.
“It’s a complete replacement of the MapReduce engine,” he said, but added that it can still read from and write to the Hadoop Distributed File System (HDFS), and there’s also a Yarn plug-in to run on Hadoop 2.0. Hadoop is the open-source implementation of Google’s framework, created by Doug Cutting – now Cloudera’s chief architect – and developed under the auspices of the Apache Software Foundation (ASF).
“We took these MapReduce concepts, split them up and created a runtime framework called HAMR,” Heilig said.
The key to HAMR is something ETI calls Flowlets, a patent-pending API set.
Flowlets are nodes in an execution graph on a network, each holding data split across partitions; the data is re-assembled using key-value pairs. A key is routed to a partition by taking the hash code of the key modulo the number of partitions.
HAMR uses a proprietary networking layer to deliver key-value pairs to the appropriate partitions.
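The routing scheme described above – hash the key, take it modulo the partition count – can be sketched in a few lines of Java. This is an illustrative stand-in, not HAMR’s actual API; the `Partitioner` class and `route` method names are assumptions for the example:

```java
// Minimal sketch of key-to-partition routing: a key lands in partition
// hash(key) mod numPartitions. Not HAMR code -- names are illustrative.
public class Partitioner {
    private final int numPartitions;

    public Partitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Math.floorMod keeps the result non-negative even when
    // hashCode() returns a negative value.
    public int route(Object key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        Partitioner p = new Partitioner(8);
        for (String key : new String[]{"alice", "bob", "carol"}) {
            System.out.println(key + " -> partition " + p.route(key));
        }
    }
}
```

Because the mapping is deterministic, every node in the cluster can independently compute where any key lives without consulting a central directory.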
It is built on Apache’s ZooKeeper for centralised service management and configuration; the Log4j Java logging facility is also bundled in, along with Apache Curator. Users need to install the RabbitMQ enterprise messaging system, which is based on the Advanced Message Queuing Protocol (AMQP).
The goal is to use MapReduce for more than simple batch-slinging for web-page indexing and log-file crunching, and to employ it more easily in more evolved tasks such as machine learning and graph algorithms.
These rely more on iterating over data held in memory, rather than the simple, massive batch scheduling that is the hallmark of Google’s MapReduce.
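The point about iteration can be made concrete with a toy machine-learning job. The sketch below – plain Java, nothing to do with HAMR’s API – runs one-dimensional k-means with two clusters: the data set is loaded once and then re-scanned on every pass, exactly the access pattern that suits an in-memory engine but punishes a disk-bound, one-pass-per-job MapReduce:

```java
// Toy iterative workload: 1-D k-means with k = 2. The points array stays
// in memory and is re-scanned every iteration, where classic MapReduce
// would re-read its input from disk on each pass.
public class KMeansSketch {
    // Assign each point to the nearer centre, then recompute each
    // centre as the mean of its cluster; repeat.
    static double[] cluster(double[] points, int iterations) {
        double c0 = points[0], c1 = points[points.length - 1];
        for (int iter = 0; iter < iterations; iter++) {
            double sum0 = 0, sum1 = 0;
            int n0 = 0, n1 = 0;
            for (double p : points) {
                if (Math.abs(p - c0) <= Math.abs(p - c1)) { sum0 += p; n0++; }
                else                                      { sum1 += p; n1++; }
            }
            if (n0 > 0) c0 = sum0 / n0;
            if (n1 > 0) c1 = sum1 / n1;
        }
        return new double[]{c0, c1};
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 10.0, 11.0, 12.5};
        double[] centres = cluster(points, 20);
        System.out.println("centres: " + centres[0] + ", " + centres[1]);
    }
}
```

Twenty passes over a terabyte-scale data set is cheap if the data sits in RAM across iterations, and ruinous if each pass is a fresh disk-bound MapReduce job.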
Heilig reckons MapReduce will live on, confined to the simple web-page pounding on thousands of servers shackled together – the role for which it was built and for which it is used at Google. Its open-source offspring Hadoop, however, has raised expectations and opened up new potential uses, he says.
It is within this area that Heilig reckons HAMR should play, offering big data analytics and machine learning in memory running on an ever smaller number of machines. ®