With so much of its future sales and growth staked on smart infrastructure and the data analytics that enables it, it comes as no surprise that IBM has taken a shine to the open source Hadoop big data crunching software that has found a loving home at the Apache Foundation. Today, IBM announced it has created a commercial version of Hadoop as well as some add-ons and - you guessed it - implementation services to make Hadoop more consumable for the Global 2000.
Not everyone is a Google, where the MapReduce distributed data cruncher and its related file system were created, or even a Yahoo, where Hadoop was nurtured to do what Google does - but in an open source, community-driven fashion. Hadoop is used at Yahoo! and Facebook and Twitter, and it helps drive a portion of Microsoft's Bing search engine. But it is not widely understood in the corporations where IBM does its business.
Bernie Spang, director of product strategy for database software and systems at IBM, says that the company needs Hadoop to complete its data analytics hat trick. IBM has traditional data warehousing and predictive analytics in its InfoSphere, Cognos, and now SPSS products, which can extract data from transactional systems to help companies make better decisions. And it has the "System S" InfoSphere Streams system, which debuted as a prototype a year ago to mash up streaming data from text, video, and audio streams and mix it with databases to create something that is a bit more real-time than a data warehouse, helping governments and companies wade through mountains of data to make decisions (like trade options a hell of a lot faster than most systems can, as the prototype did).
Spang says that IBM needs to offer a product that does the "big data" crunching that the Googles of the world do because its own customers have loads of structured and unstructured data that can be sucked into a Hadoop file system and chewed on using MapReduce for a wider, finer-grained, and more long-term analysis than can be done with a data warehouse or stream system.
And that is why IBM is creating its own distro of Hadoop, which is called InfoSphere BigInsights. Spang called BigInsights an enterprise-ready version of the Apache Hadoop code that IBM will package up and install for customers who want to build their own Hadoop grids. IBM has done about a dozen Hadoop installations to build up experience setting up the code and systems, and now feels it has enough experience to offer commercial support and various services. These cover not only the Hadoop software itself but also expertise in how Hadoop can be used for risk management and analysis at financial firms or for all kinds of cross-linking in social networking and online entertainment applications. IBM will plan your Hadoop installation for you, set it up, and even monitor it for you. Just get out that checkbook.
IBM could have just done the easy thing and partnered with Cloudera, which back in March 2009 launched a commercialized version of the Hadoop Distributed File System, the MapReduce parallelization and data-crunching algorithm to chew on Webby data, and the Hive client library associated with Hadoop. But Big Data is important enough that IBM feels compelled to offer its own distro.
While IBM is now a competitor to Cloudera, Big Blue says it will participate with the members of the Apache Hadoop community, singling out Cloudera and Karmasphere, which has created a graphical tool for debugging Hadoop apps, by name.
Cloudera welcomes IBM's arrival. "I am excited to see more organizations like IBM get behind the Apache Hadoop project," said Cloudera's Doug Cutting, the man who founded Hadoop. "IBM has been working for some time on Hadoop-related projects for its internal use such as BigSheets and I am looking forward to their investment in the core open source platform development as well.
"At Cloudera we've seen incredible Hadoop uptake in mainstream enterprises which has been reflected in the growth of our own business. I see no end to the number of applications of this new technology. IBM's entry means more open source contributors will help expand the horizons for Hadoop around the world."
The InfoSphere BigInsights distro will have some home-grown IBM software as well, including a technology preview of something called BigSheets that Spang says is basically a spreadsheet front-end running in a Web browser that is used for consolidating and visualizing the chewed data coming out of Hadoop, which can be terabytes or petabytes of Web pages and other kinds of unstructured data.
As an example of how BigSheets can interface with Hadoop, IBM is working with the British Library to archive and preserve 5 TB of Web pages culled from sites with the .co.uk domain. The BigSheets interface will let researchers, academics, and students chew on this data and search it in more sophisticated ways than is possible using a search engine.
IBM is not divulging its prices for the BigInsights Hadoop distro or what the various installation and support services cost. The BigInsights distro is available today. It is not clear when BigSheets will move from technology preview to production, but you can find out more about the software here. Spang said that IBM has other tools to make Hadoop do more tricks, but it is a fair guess that these will cost more than peanuts. ®