EMC wants to be the Linux of big data
Opens up Chorus tool, borgs agile coders Pivotal Labs
To broaden its reach in the big-data arena, disk-array maker EMC's Greenplum division, which peddles data warehousing and Hadoop appliances and software, announced that it will open source its Chorus management and collaboration tools. EMC also has acquired Pivotal Labs, experts in agile programming, to help it build better big-data software and, equally importantly, help others do so.
EMC has always been serious about data, but in case you haven't noticed it, the company is now very serious about big data and the software that is used to chew it up and regurgitate useful bits of information.
"Having database-kernel developers doing a UI was not working out really well," conceded Luke Lonergan, CTO at the Greenplum division, to El Reg in an interview after EMC made its announcements in a webcast presentation hosted in San Francisco and New York.
About a year ago, Greenplum hired Pivotal Labs, which was founded in 1989 and which has a couple hundred code-slingers that could teach the database programmers some new tricks. They got the Chorus product back on track, and then EMC pulled a Victor Kiam and liked the company so much it bought it today for an undisclosed sum.
Greenplum previewed the new Chorus 2.0 tool in December 2011 as a central feature of its Unified Analytics Platform. The idea is to take data warehouses running the Greenplum variant of PostgreSQL and Hadoop clusters running either Greenplum HD (the open source distro) or Greenplum MR (the open-core version from MapR Technologies that EMC resells) and mash them up and glue them together using the Chorus collaboration environment.
Gelsinger: Open source Chorus 'is a big step for us'
Chorus 2.0 has a Facebook-style collaboration interface to data sets and analytics tools so people can share data. It also has a full metadata search so researchers can do data exploration in either structured or unstructured data.
Equally importantly, Chorus 2.0 can spin up a sandbox inside a data warehouse or Hadoop cluster, or spin up a data mart inside of a VMware virtual machine, so different "data scientists" can chew on different parts of the data and not create physically separate data silos running on other machines.
The current Chorus 1.2 does not know how to talk to Hadoop, and it can't spin up a personal sandbox for an analyst. Chorus 2.0 will also have integrated data visualization tools to help analysts and other big-data users get a feel for the shape of the data so they know where they might need to drill down more to try to understand some aspect of their business better.
Chorus 2.0 has been in beta testing for the past four months, says Lonergan, and during a tour of the Pivotal Labs facility in San Francisco that was part of the webcast, one of the code-slingers said that the product was in release-candidate phase right now. Lonergan later confirmed to El Reg that Chorus 2.0 will ship on March 23.
During that tour of Pivotal Labs – the company also has offices in New York and had an office in Singapore for a while – the webcast showed how the company runs teams of a dozen or so people, with programmers working in pairs on parts of the code.
Every day or so, the programmers play musical chairs, and over the course of a week or so, everyone has been teamed up with everyone else on that development team – the Chorus team, for example, has ten people on it.
The idea is that both coders in a pair do some programming, and no one programmer becomes the sole subject-matter expert on any piece of the code. Everyone gets to know all of the code this way – not by studying it, but by working on it.
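For the curious, the musical-chairs rotation described above can be sketched as a round-robin schedule. This is only an illustration under stated assumptions, not how Pivotal Labs actually assigns pairs: the function name `pairing_schedule` and the `dev1` to `dev10` names are hypothetical, and the sketch assumes an even-sized team. It uses the standard circle method, in which one person stays put while everyone else shifts one seat per rotation, so a ten-person team needs nine rotations before everyone has paired with everyone else.

```python
from itertools import combinations


def pairing_schedule(team):
    """Return a list of rounds; each round is a list of (a, b) pairs.

    Circle method: one member stays fixed while the rest rotate one
    seat per round. This sketch assumes an even number of members.
    """
    if len(team) % 2 != 0:
        raise ValueError("this sketch assumes an even-sized team")
    members = list(team)
    n = len(members)
    rounds = []
    for _ in range(n - 1):
        # Pair the first seat with the last, the second with the
        # second-to-last, and so on toward the middle.
        rounds.append([(members[i], members[n - 1 - i]) for i in range(n // 2)])
        # Keep seat 0 fixed and rotate everyone else by one position.
        members = [members[0]] + [members[-1]] + members[1:-1]
    return rounds


team = [f"dev{i}" for i in range(1, 11)]  # hypothetical ten-person team
schedule = pairing_schedule(team)

# Check that every possible pair shows up somewhere in the schedule.
covered = {frozenset(pair) for rnd in schedule for pair in rnd}
assert covered == {frozenset(c) for c in combinations(team, 2)}
print(len(schedule), "rotations")
```

Each rotation here seats all ten people in five pairs, and no pair repeats until the whole cycle starts over.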
Every time the code changes, a build is run. If the build fails any tests, it is immediately flagged as failing and everyone on the team can see the issue – there is tremendous peer pressure to get the code fixed. You make iterative changes in the code, and you fix things as you go along rather than waiting until the end of a protracted development process.
EMC did not disclose the price it paid to acquire Pivotal Labs, but said that the company would remain an independent unit, much as Greenplum, VMware, RSA Security, and others have been left reasonably untouched by the EMC mothership after being acquired.
Pivotal Labs is privately held and sells a tool called Pivotal Tracker that is a scheduling system for agile programming, forcing developers to break their work down into small chunks, called stories, that they work on in teams. There are 240,000 developers using the Pivotal Tracker tool today, and EMC said in a statement that it was committed to investing in this tool and letting Pivotal Labs do what it does.
Pivotal Labs is big on Ruby on Rails. In fact, according to Lonergan, it has been instrumental in getting Greenplum to port the Chorus tool from the Java back-end used with the 1.2 release to Ruby on Rails with the 2.0 release.
Scott Yara, senior vice president of products at the Greenplum unit, said that as Greenplum got exposed to the coders at Pivotal Labs and the new techniques, its own programmers started thinking outside of the box about Chorus, social media, open source, and what the product could be.
As far as bringing social media to the Chorus tool, which the company started mulling four years ago, before EMC even came a-calling, Yara said that this "seemed like a stretch."
But as time went by, "people kept pushing us," said Yara, and they started thinking about the big platforms that have established themselves in the past couple of years – Linux, Java, Hadoop, and Android, just to name a few – and they all have one thing in common: they are open source. And thus the idea was born to take the Chorus tool open source and position it as a platform for integrating big-data applications.
"This is a big step for EMC," explained Pat Gelsinger, president and COO of EMC's Information Infrastructure Products group, which includes Greenplum and a bunch of other products. "We've helped open source, but we have never been open source."
EMC did not provide a lot of details about the OpenChorus project, but the company said that it planned to have the code open sometime in the second half of this year.
Unlike Hadoop and other big-data projects, where the open sourcing was done to solicit help with actually completing the code and ruggedizing it for commercial use, EMC said that it was taking the Java and Android models, where the development work would be done largely by the sponsoring company.
The opening up of the Chorus source code is about making companies comfortable in investing in Chorus – they know it can survive any vendor – and getting developers to code applications that work through it and bring extensions to the tool itself. EMC is not looking for help on coding Chorus per se, but it sounds like it could have used some.
Lonergan would not reveal if EMC has made a decision about the license under which the Chorus tool will be distributed, but he hinted that the kind of "open" licenses used by Apache projects were appealing and the more restrictive GNU General Public License was not. "Our objective is to have a license that makes this partner-friendly and community-building," Lonergan said.
It will be interesting to see how other big-data players – IBM, Oracle, Teradata, and a slew of other smaller players such as Cloudera, Hortonworks, and so on – will participate in the OpenChorus community and link their products into the tools. Maybe they will play, and maybe they won't. ®