Big data may or may not pan out for the users, but it is a bit of a boom for IT vendors, who are scrambling to prove their data analytics chops and go for the easiest money in the market these days. And to that end, supercomputer maker Cray is setting up a dedicated division to chase big data biz.
The division, called YarcData, is a bit of a private joke. YARC is short for "Yet Another Router Chip", and it is the architectural name that Cray slapped onto the high radix router at the heart of the experimental "BlackWidow" supercomputer. This was commercialized as none other than the "Gemini" XE interconnect inside its latest XE6 Opteron-based massively parallel supers as well as the XK6 hybrid Opteron-Tesla machines. Yarc is also Cray spelled backwards, so presumably the new division is "a tad Cray."
Cray already had a knowledge management practice, but has decided to create a proper division – pulling in employees from research and development, marketing, sales, services, and support and dedicating them towards creating and supporting hardware and software for running big data and analytics workloads (as distinct from the kinds of simulation workloads that Cray's gear generally runs).
"Cray is best known for building supercomputers that can run massive scientific and engineering simulations, and from that work we have developed unique technologies and amassed significant experience working with some of the largest data-intensive environments in the world," explained Peter Ungaro, Cray's president and CEO, in a statement announcing the new division. "This makes our entry into the big data market a natural evolution."
Cray has hired a manager from outside the company to run the division: Arvind Parthasarathi, who was named senior vice president and general manager of YarcData. Prior to joining Cray, Parthasarathi was senior vice president and general manager of Informatica's Master Data Management (MDM) business unit, and he was previously vice president of product management for the company's data quality products. (Which means, by the way, that Parthasarathi has a keen understanding of the fact that the biggest problem that big companies have with big data projects is that their information is largely garbage.)
Before joining Informatica, Parthasarathi was director of product management at i2 Technologies (now part of JDA Software), running its RFID, product information management, supply chain integration, and supply chain event management products. He started his career at Oracle, where he was a product line manager in charge of the software giant's Intel Technologies division. Parthasarathi has a BS in computer science from the Indian Institute of Technology and an MS in computer science from the Massachusetts Institute of Technology.
So here's the fun bit: Trying to figure out what Cray is actually going to do in the big data racket. Cray did not speak of such things today, of course, but here's what is obvious from El Reg's systems desk. First, Cray can build server clusters with tens of thousands of cores and wonking clustered file systems with a high-speed XE interconnect linking nodes to each other. If you could beef up a Cray XE blade with some disk drives, you could make a hell of a Hadoop cluster.
Also, the Cray Linux Environment (a variant of SUSE Linux) has a nifty feature called Cluster Compatibility Mode, which makes the XE interconnect look like a standard Ethernet controller as far as Linux applications are concerned. CLE 4.0, the latest release, supports the Java JDK 1.6.0 and can therefore run Java applications.
And Hadoop, with its MapReduce algorithm and HDFS file system, is a humongous Java app. At the moment, Hadoop tops out at around 4,000 nodes, and Cray could certainly help the open source Apache project do a better job scaling across more nodes. There's no reason why the open source R stats program could not be parallelized, as Revolution Analytics has done, and run across a Cray XE6 super – and run in conjunction with Hadoop, chewing on the reduced data.
Supercomputer rival Silicon Graphics has been going on about how its shared memory parallel supers, the UV 1000 Xeon-based machines, can scale Windows Server 2008 R2 across 256 cores and 2TB of memory – the upper limit of that Microsoft operating system – making it an ideal box for running big databases for online transaction processing and data warehousing. Since last fall, SGI has been selling variants of its Rackable rackish servers with the Cloudera CDH3 commercial Hadoop distribution. SGI has taken down a number of Hadoop deals with as many as 1,200 nodes each in the quarter ended in December.
Cray would have to do some substantial engineering to the XE interconnect to create a shared memory architecture that could match the Windows Server scalability that SGI has. But on parallel commercial workloads like Hadoop, and maybe even on NoSQL data stores, the engineering job is do-able. ®