Companies may be excited about doing Google-style analytics on all aspects of their business with Hadoop and other "big data" tools, but big businesses are bracing for bigger phone bills as big data is starting to generate big traffic across the distributed operations of enterprises.
This is music to the ears of Infineta Systems, a maker of a WAN optimization appliance that claims to have optimizations specifically tuned for big data workloads as opposed to the data-light enterprise applications that are often streamed across multiple data centers or broadcast out to remote branch offices through WAN links. To try to explain the network issues that businesses will be facing as they deploy analytics, Infineta commissioned Internet Research Group to talk to early adopters of commercial Hadoop and study the Big Traffic effect.
The Hadoop MapReduce algorithm and its Hadoop Distributed File System manage the replication and movement of data within the Hadoop cluster, and this is not where IRG says there will be a traffic jam. It is how data is staged going into the Hadoop clusters and what happens to it after it has been mapped and reduced that will cause the problem.
At companies like Google and Yahoo!, the Hadoop clusters that do the analysis and drive ad placements for those sites, targeting the best ad to a search item or a user profile, all sit within the same monster data center. But with big business, different parts of the operation are located in different parts of the world and therefore in different data centers.
Data has to be culled from back-office and other ERP, CRM, and SCM systems software (where data is generally stored in relational databases) and paired with consolidated information residing in data marts and data warehouses and with unstructured data such as log files and clickstreams from Web pages to make the correlations that get the right information in front of users' eyes and therefore help drive more business.
In some cases, departments or divisions within a company might not be retaining information that is suddenly thought of as useful for making correlations, and worse still, they may not have the bandwidth to continuously stream this data back to the central data center (or data centers) where Hadoop clusters are eagerly awaiting bits to munch upon.
Here's a concrete example. The Big Data, Big Traffic, and the WAN report from IRG, which you can read here (PDF), talks about a financial services company with an insurance division and headquarters in San Francisco, brokerage operations in New York, and banking in Chicago.
It makes sense to put a big Hadoop cluster (or multiple clusters) in one place to get economies of scale, but no matter where the Hadoop data muncher is located, data will have to come out of these three divisions and be integrated and then, after processing, passed back to the divisions for them to make use of in their applications. And this has to happen, if not at real-time speeds, then at something as close to them as batch-mode Hadoop can manage. With a company running perhaps hundreds of Hadoop jobs daily, this is a lot of back and forth.
While the IRG report identifies a problem that Hadoop adopters will eventually face, it does not offer a solution to the problems of determining what datasets are required by the jobs queuing up in the Hadoop clusters and how you control the security of those datasets as they move across the WAN. Given that Infineta sponsored the IRG report, it obviously thinks it has the solution to the Big Traffic problem.
Back in June, Infineta launched a WAN optimizer called the Data Mobility Switch, which El Reg should have covered and didn't but will now tell you about.
Infineta was founded in San Jose, California in mid-2008 and has raised $30m in two rounds of funding; Rembrandt Venture Partners, Alloy Ventures, and North Bridge Venture Partners all kicked in the dough. The company was founded by Raj Kanaya, who was vice president of product strategy and alliances at Citrix Systems and who came to Citrix through its acquisition of NetScaler, a maker of application acceleration appliances, and Ram Ramarao, who was chief platform architect for the application oriented networking business unit at Cisco Systems. Kanaya is Infineta's CEO and Ramarao is the company's CTO.
While various WAN optimization appliances on the market today are aimed at providing decent application response time and acceleration for branch offices connected back to the corporate data center – what is thought of as a north-south problem in networking – Infineta has been worrying about optimizing the links between data centers – what is commonly called the east-west problem these days and one that requires an order of magnitude more bandwidth.
Infineta is aiming its DMS appliance not just at transactional and batch workloads like Hadoop, which pass data back and forth between data centers either before it is chewed or after, but also at the replication and backup that is done over WAN links for disaster recovery and at virtual machine instances that can flit from server to server and data center to data center using live migration. For these kinds of workloads, Kanaya tells El Reg, 10 Gigabit Ethernet links are becoming "table stakes." And the network requirements are very different from a branch office WAN.
With the branch WANs, you have remote users logging into application machines in the data center and they generally don't have a lot of capacity in terms of data transfer and they don't mind high latencies. (Well, they mind, but tough.) You have to maintain thousands to tens of thousands of connections with the WAN optimizer and you can get by with a WAN link running at between 10Mb/sec and 1Gb/sec. The main thing is to be steady so users get predictable response.
With the east-west traffic typical of data centers, you have applications talking directly to other applications, and they have both high bandwidth and low latency needs. While you only have to manage hundreds of connections between data centers, the connections generally require more than 1Gb/sec of bandwidth and have a bursty nature as well.
Infineta's Data Mobility Switch appliance
To cope with this new kind of traffic, Infineta has cooked up the DMS appliance, which uses a multicore Octeon MIPS-based processor from Cavium for running the WAN links and network protocols, uses two Xilinx Virtex 5 field programmable gate arrays (FPGAs) to handle the deduplication of data that passes back and forth over the WAN links, and has a Broadcom switch embedded in it as well. (Kanaya says that Infineta is the first company that can run deduping at 10GE line speed rates.)
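To make the deduplication idea concrete, here is a minimal software sketch of how repeated data can be replaced with short references before it crosses a WAN link. It is an illustration only and not Infineta's algorithm, which runs in FPGAs and is not described here: the fixed 4KB chunk size, the use of SHA-256, and the names dedupe, rebuild, and wire_bytes are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical chunk size, chosen only for illustration


def dedupe(stream: bytes, seen: dict) -> list:
    """Sender side: split the stream into fixed-size chunks and replace
    any chunk already sent with a 32-byte reference to it."""
    tokens = []
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).digest()
        if digest in seen:
            tokens.append(("ref", digest))   # send only the fingerprint
        else:
            seen[digest] = chunk
            tokens.append(("raw", chunk))    # first sighting: send the data
    return tokens


def rebuild(tokens: list, store: dict) -> bytes:
    """Receiver side: rebuild the original stream, remembering raw chunks
    so that later references can be resolved."""
    out = bytearray()
    for kind, payload in tokens:
        if kind == "raw":
            store[hashlib.sha256(payload).digest()] = payload
            out += payload
        else:
            out += store[payload]
    return bytes(out)


def wire_bytes(tokens: list) -> int:
    """Bytes that actually cross the link (ignoring framing overhead)."""
    return sum(len(payload) for _, payload in tokens)


if __name__ == "__main__":
    # Made-up, highly repetitive data: two distinct chunks repeated five times.
    data = (b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE) * 5
    tokens = dedupe(data, {})
    assert rebuild(tokens, {}) == data
    print(f"reduction: {len(data) / wire_bytes(tokens):.2f}x")
```

How much reduction a scheme like this delivers depends entirely on how repetitive the traffic is; the sample data above is contrived to repeat, so its ratio says nothing about real workloads.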
The DMS plugs into the core data center switches on one side and the WAN routers on the other side and can be used to replace multiple routers and WAN optimization units that would have to be load balanced within both sides of the data center-to-data center links. The design objective of the DMS appliance is to provide five times the data reduction over a WAN link while running at 10GE speeds, which can effectively turn a 1Gb/sec WAN link into something that looks and smells like a 5Gb/sec link and an OC48 link running at 2.5Gb/sec into something that is moving at 10Gb/sec or more. You could get the WAN running at what effectively looks like LAN speeds.
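The arithmetic behind that claim is simple enough to sketch. The snippet below assumes the five-fold reduction is actually achieved and takes OC48 at its nominal line rate of 2.488Gb/sec; the function names and the 1TB dataset are hypothetical and used only for illustration.

```python
def effective_gbps(link_gbps: float, reduction_ratio: float) -> float:
    """Apparent throughput of a WAN link when every byte sent
    stands in for `reduction_ratio` bytes of original data."""
    return link_gbps * reduction_ratio


def transfer_seconds(dataset_gbytes: float, link_gbps: float,
                     reduction_ratio: float = 1.0) -> float:
    """Time to move a dataset (in gigabytes) over the link."""
    dataset_gbits = dataset_gbytes * 8
    return dataset_gbits / effective_gbps(link_gbps, reduction_ratio)


if __name__ == "__main__":
    ratio = 5.0  # the design objective quoted above
    for name, gbps in (("1Gb/sec link", 1.0), ("OC48", 2.488)):
        print(f"{name}: looks like {effective_gbps(gbps, ratio):.2f}Gb/sec")

    # Hypothetical 1TB (1,000GB) dataset over the 1Gb/sec link.
    plain = transfer_seconds(1000, 1.0)
    reduced = transfer_seconds(1000, 1.0, ratio)
    print(f"1TB transfer: {plain:.0f}s without reduction, {reduced:.0f}s with")
```

These are best-case figures, since the effective rate can never exceed what the reduction ratio on the actual traffic allows.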
The DMS appliance went into beta earlier this year and shipped in June. The company sells three different models with throughput ratings of 2, 5, or 10 gigabit speeds that range in price from $80,000 to $265,000 a pop. In October, Infineta launched a special rental program that will allow customers to rent a DMS in three-month increments for $25,000 to use it to move data as part of a data center migration. The rental fees can be applied toward purchase. ®