Silicon Valley startup MapR has unveiled what it calls the "next generation" of Hadoop, revealing that this revamp of the open source distributed number-crunching platform drives the Hadoop appliance recently announced by EMC.
According to MapR CEO and cofounder John Schroeder, the company has rebuilt Hadoop's storage system, job tracker, and name space, providing more rapid access to data and improving the platform's ability to scale. MapR's Hadoop is not entirely open source, Schroeder says. It is now in use by "quite a few" customers, including EMC, whose Greenplum HD Enterprise Edition of Hadoop is essentially MapR technology.
Based on Google's back-end infrastructure and named after a yellow stuffed elephant, Hadoop offers a distributed file system (HDFS) and a distributed number-crunching platform (Hadoop MapReduce) as well as various other tools. It's typically used for offline data processing, though a sister project, Hbase, offers a real-time distributed database.
The existing Hadoop storage system, Schroeder says, is too limiting. "It's kind of like writing to a CD-ROM. You can write data to a file, but you can't do random reads and writes, and you can't do multiple concurrent readers and writers. It forces Hadoop to become more of a batch mode platform," Schroeder tells The Register. Over the past two years, Schroeder and his team have revamped the storage system, working to ease access to data not only from Hadoop itself but from other platforms as well.
This involved setting up a network file system (NFS) mount. "What we've done is rearchitect the storage services so that they provide random read and random write for multiple readers and writers, and then we expose that as an NFS mount, so in addition to being able to use that data from Hadoop APIs, you can use all your standard Linux tools and Unix tools and applications," he says. "You can create real-time data streaming out of Hadoop. You can make Hadoop look like a big C: drive on your Windows desktop."
MapR has also revamped Hadoop's Job Tracker, which distributes jobs across a Hadoop cluster and then manages their execution. Typically, Job Tracker is a single process on a single machine, but MapR has introduced a Job Tracker that's distributed across several machines. "Job Tracker failures are fairly common. They have a tendency to crash clusters and leave applications in an unknown state, so we've introduced a high-availability Job Tracker," Schroeder says.
Similarly, the company has rebuilt another single point of failure: NameNode, Hadoop's global namespace. MapR offers a distributed NameNode that scales to more files. "If you pull a node out of the cluster and it's running Job Tracker or NameNode, everything will continue to run without any impact."
Ben Werther – vice president of products at DataStax, a company that has put Hadoop MapReduce atop the open source Cassandra database and file system with a platform called Brisk – told us that the NameNode single point of failure is a "major, major problem" with Hadoop HDFS. Brisk avoids the problem because it does not use HDFS.
Facebook is already running a version of HDFS with no single point of failure for its new Messages platform, which makes use of HBase. And according to Amr Awadallah, the CTO of Hadoop startup Cloudera, the open source version of Hadoop will eliminate the single point of failure in "a few months".
Separately, MapR has introduced a new "data protection layer" with its Hadoop, letting you take a snapshot of your data every so often for recovery purposes. "Hadoop offer replication. You can offer multiple copies of data. But that doesn't guard you against user error. If I come in and delete a directory, it's gone," Schroeder says. "With MapR, if you do have a user or application error, you can go to a snapshot directory and recover."
Cloudera's Awadallah takes issue with the MapR platform because it's not open source. "Cloudera firmly believes in the superiority and the many short-term/long-term advantages of open source over proprietary implementations of Apache Hadoop. Open source Hadoop benefits from a lot of minds continually working to improve and test the code," he tells us.
"Most importantly, the reason why CIOs/CTOs love Apache Hadoop is eliminating the nontrivial risks of vendor lock in, i.e., nobody is holding a shotgun to their head asking them to pay more fees or else the [repository] hosting all their data gets nuked. Cloudera only makes money from our customers if we are delivering value to them, but if we cease delivering value then they can continue hosting their data in HDFS and use a different vendor for the support and management applications."
Cloudera offers its own completely open source Hadoop distro, but it also offers a for-pay "enterprise" version that puts various proprietary tools atop that platform.
Schroeder claims that MapR's platform is "at least two times faster" than existing Hadoop distros, citing standard benchmarks such as Terasort. This, he says, lets you run Hadoop on fewer servers. But this benchmarks cited by Schroeder may be optimized for certain workloads and not others. Schroeder was noncommittal about plans to open source the company's work or to offer its platform outside of OEMs such as EMC. ®