Information - which is what happens to data when you filter out useless stuff and add context so human beings can make decisions - cannot be easily generated or quickly integrated with business processes. The advent of the Internet and its various forms of media complicates the task of turning data into information, or what government people are fond of calling "actionable intelligence." Mashing up various text, video, and audio streams with databases and other data storehouses is a grand challenge, one that needs something that looks and smells like a supercomputer.
Which is why the techies at IBM Research have been working for more than six years on a project called System S. This stream computing system runs on IBM's BlueGene massively parallel supercomputing iron, but it puts the iron to work running very different kinds of software than are typically used on a supercomputer for weather or financial modeling.
IBM started talking publicly, and very sketchily, about System S back in June 2007, and this week, the company announced that TD Securities, the investment banking arm of Toronto Dominion Bank, has taken the first prototype of the System S machine, which runs a bit of software that Big Blue calls InfoSphere Streams atop a BlueGene/P supercomputer.
The streaming software - which was created at the T.J. Watson Research Center, in Hawthorne, New York - is designed to do more than run complex queries against data that is a bit more amorphous than fields stored in a database. As the streaming part of its name suggests, it is also meant to continuously update that data as it changes in real time.
As IBM explains in this whitepaper, in a normal information system, you ask a bunch of questions of a relatively static database and you get data that you need to make a decision. With a streaming system, huge amounts of raw information from as many sources as you can stomach are streamed into the box, and the InfoSphere Streams software keeps a database of your queries and constantly updates the data it provides to decision makers.
In a conventional system, you ask a database to list all the people with the last name of Smith who live within 100 miles of the center of the city. With the System S, you ask that question once, and it taps all the available information you feed into it - government databases, Web traffic, email, GPS data, sensors, badge swipes, video feeds, audio feeds, what have you - and it tells you in real time how the Smiths identified in the original query are moving around within that 100-mile radius (presumably updating the list as Smiths leave and new ones arrive).
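For the code-minded, here is a rough Python sketch of what "ask the question once" could look like. It is not InfoSphere Streams code and does not use IBM's API; the `Sighting` record, the `standing_query` function, and the haversine distance helper are all made up for illustration, and the sketch assumes every feed has already been boiled down to a name plus a latitude and longitude.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import Iterable, Iterator, Set, Tuple


@dataclass(frozen=True)
class Sighting:
    """Hypothetical record: one observation of a person from any feed."""
    person_id: str
    last_name: str
    lat: float
    lon: float


def miles_between(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles (haversine formula)."""
    earth_radius_miles = 3958.8
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_miles * asin(sqrt(a))


def standing_query(
    stream: Iterable[Sighting],
    center: Tuple[float, float],
    radius_miles: float = 100.0,
    last_name: str = "Smith",
) -> Iterator[Tuple[str, str]]:
    """Registered once; emits a change whenever the answer set changes."""
    inside: Set[str] = set()  # people currently matching the query
    for s in stream:
        if s.last_name != last_name:
            continue
        near = miles_between(center[0], center[1], s.lat, s.lon) <= radius_miles
        if near and s.person_id not in inside:
            inside.add(s.person_id)
            yield ("arrived", s.person_id)
        elif not near and s.person_id in inside:
            inside.remove(s.person_id)
            yield ("left", s.person_id)


if __name__ == "__main__":
    city_center = (40.7128, -74.0060)  # illustrative coordinates only
    feed = [
        Sighting("smith-1", "Smith", 40.7128, -74.0060),
        Sighting("jones-1", "Jones", 40.7128, -74.0060),
        Sighting("smith-1", "Smith", 34.0522, -118.2437),
        Sighting("smith-2", "Smith", 41.0000, -74.0000),
    ]
    for event in standing_query(feed, city_center):
        print(event)
```

The point of the sketch is the shape, not the scale: the query holds state between records and pushes out changes as they happen, instead of being run again from scratch against a static table.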
Personally, I can't imagine why anyone would want such information, but remember System S when you are tweeting your freaking brains out like a teenager or sending text messages over your cell phone.
Anyway, the System S super is not just about surveillance, and TD Bank isn't interested in the box for that reason. The same InfoSphere Streams software can be used to consume vast amounts of news feeds, financial information databases, and other sources of data to make decisions about stock trades, and TD Securities says that it has in fact put the System S through its paces and created an options trading system front-ended by the super that can process 21 times more information than the prior systems that the bank's securities trading experts have put together. (That doesn't mean people are 21 times smarter at using that data, unfortunately.)
According to the Financial Information Forum, the amount of data generated by the securities and options trading systems in the world has been doubling every year since 2003, and TD Securities took the System S prototype from Big Blue because it wants to create an options trading system that will be able to cope with the data streams it expects two to three years from now. And IBM slapped the InfoSphere Streams software on the BlueGene/P, its most scalable server, to give TD Securities plenty of scalability room. That said, IBM says that the software works just fine on anywhere from 50 to 500 server nodes and that it did development, testing, and production on a one-rack BlueGene/P machine.
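The doubling claim is easy to put into numbers. The back-of-envelope Python below assumes, and this is an assumption rather than anything the Financial Information Forum or IBM has said, that the yearly doubling simply continues, and that the "21 times more information" figure TD Securities quotes can be read as 21 times the capacity. The function names are invented for the sketch.

```python
import math


def growth_factor(years: float, doubling_period_years: float = 1.0) -> float:
    """How many times larger the data volume is after `years` of steady doubling."""
    return 2.0 ** (years / doubling_period_years)


def years_of_headroom(capacity_multiple: float, doubling_period_years: float = 1.0) -> float:
    """How long a given capacity multiple lasts if volume keeps doubling."""
    return math.log2(capacity_multiple) * doubling_period_years


if __name__ == "__main__":
    for years in (2, 3):
        print(f"after {years} years: {growth_factor(years):.0f}x today's volume")
    print(f"21x capacity lasts roughly {years_of_headroom(21):.1f} years")
```

Under those assumptions, the data streams two to three years out are four to eight times today's, and a 21-fold jump buys a bit more than four years of doubling.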
The BlueGene/P super, you will recall, puts four 850 MHz single-core PowerPC 450 chips onto a single processor card and then links them by symmetric multiprocessing so they can share 2 GB of DDR2 main memory. A single rack has 1,024 of these four-core processor nodes, if you can believe it, and would be rated at around 13.9 teraflops of number-crunching performance if it were running simulations.
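That 13.9 teraflops figure can be sanity-checked with a few lines of Python. The node count, core count, and clock speed come from the description above; the four floating point operations per clock per core is an assumption added here (the usual figure for a dual-pipeline fused multiply-add unit), not something stated in this article.

```python
NODES_PER_RACK = 1024   # from the article
CORES_PER_NODE = 4      # from the article
CLOCK_HZ = 850e6        # 850 MHz, from the article
FLOPS_PER_CYCLE = 4     # ASSUMPTION: two fused multiply-adds per clock per core


def peak_flops(nodes: int, cores_per_node: int, clock_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak floating point operations per second."""
    return nodes * cores_per_node * clock_hz * flops_per_cycle


if __name__ == "__main__":
    rack_peak = peak_flops(NODES_PER_RACK, CORES_PER_NODE, CLOCK_HZ, FLOPS_PER_CYCLE)
    print(f"one rack: {rack_peak / 1e12:.1f} teraflops peak")
```

This is a theoretical peak, of course, and says nothing about how fast a stream processing workload actually runs on the same iron.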
The prototype options trading system built atop the System S setup is able to crunch 5 million options valuations per second, which is 20 times the record for this kind of trading, apparently. So System S can consume 21 times the data and do options trading 20 times faster. Milliseconds are millions of dollars in this racket, so it is hard to imagine IBM won't be selling these machines to every financial services firm very shortly.
This prototype System S machine installed at TD Securities runs Red Hat's Fedora 8 development Linux for PowerPC chips, which has been tweaked to support BlueGene hardware and software extensions. And strictly speaking, the InfoSphere Streams software is not supported on BlueGene/P iron. But clearly, if you have money, IBM has the support.
You can find a little more detail about the System S and the InfoSphere Streams software here. IBM plans to offer commercial versions of this platform in the first half of 2010, and my guess is that it will be on Power7-based server platforms, not BlueGene/P or its kickers. ®