Nvidia had better watch out. Texas Instruments is not only its rival when it comes to making ARM processors that might end up in servers someday, but it is also repositioning its digital signal processors so they can be used as math coprocessors for standard x86 CPUs – and perhaps ARM processors one day.
Nvidia obviously has the pole position when it comes to offloading HPC work from CPUs to GPU coprocessors, thanks in no small part to the development and adoption of the CUDA programming environment that spans CPUs and GPUs. CUDA gives Nvidia an edge over GPUs from Advanced Micro Devices – at least for the moment – but as the history of the computing market has taught us, any advantage can be undermined, just as GPUs are eating into CPUs in hybrid clusters these days. If performing a floating point operation is cheaper on a DSP than it is on a GPU, then the DSP will win – as long as programming for DSPs is not radically more difficult than coding for CPUs and GPUs.
Trying to use DSPs to build supercomputers is not a new idea. Back at the SC92 supercomputing conference, the Swiss Federal Institute of Technology in Zurich was showing off a supercomputer called MUSIC, short for Multi-Signal-Processor System with Intelligent Communication (and yes, that abbreviation doesn't work in English particularly well).
In a paper presented at the conference, Swiss boffins described how they lashed together 60 DSPs and delivered 3.8 gigaflops of number-crunching performance within 800 watts on neural network learning and molecular dynamics code. This cluster ran five times faster than a Cray Y-MP and two times faster than an NEC SX-3, both of which were vector machines. Columbia University has been monkeying around with parallel DSP machines for a long time and also helped IBM develop its BlueGene family of massively parallel supers. BlueGene is, in essence, a parallel DSP machine that had its brains replaced with PowerPC engines.
At the SC11 event this month in Seattle, Texas Instruments launched its TMS320C66x family of multicore DSPs, adding support for the OpenMP API set to make it easier to offload calculations from the CPU to the DSP. DSPs are notoriously hard to program, as GPUs used to be before CUDA and OpenCL came along. The TMS320C66x family of DSPs needs a much easier nickname if it is to become cool and talked about; something like Fourier would seem to be most appropriate, given the use of DSPs to do fast Fourier transforms.
Block diagram of TI's C66x digital signal processors
The C66x DSPs are based on an architecture that TI calls KeyStone, which allows for anywhere from one to eight DSP cores to be put on a single chip and to share cache memory, main memory controllers, and I/O controllers – just like multicore x86 and ARM processors do. The most recent DSP out of TI is called the C6678, and it is designed to scale to eight cores on a single chip, although only the four-core version is shipping at the moment. The DSP cores run at 1GHz or 1.25GHz, and with all eight of them humming at 1.25GHz, the C6678 delivers 160 gigaflops of single-precision floating point oomph. Like early GPUs, the amount of double-precision math that the DSP chip can do is less than half of this, at 60 gigaflops. The C6678 has 32KB of L1 instruction cache and 32KB of L1 data cache per core and up to 8MB of shared L2 cache per DSP package. The chip has 12.8GB/sec of memory bandwidth into and out of the DSP and, here's the kicker, the chip only consumes 10 watts of juice.
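The peak ratings quoted for the C6678 follow the usual formula of cores times clock times flops per cycle. TI's per-cycle figures are not given here, so the sketch below backs them out of the quoted numbers; treat the results as an inference from this article, not a datasheet value. The function name is purely illustrative.

```python
# Figures quoted in the article for the eight-core C6678 at 1.25GHz.
CORES = 8
CLOCK_GHZ = 1.25
SP_GFLOPS = 160.0  # single precision, as quoted
DP_GFLOPS = 60.0   # double precision, as quoted


def flops_per_cycle_per_core(gflops, cores, clock_ghz):
    """Back out per-core, per-cycle throughput from a peak gigaflops rating."""
    return gflops / (cores * clock_ghz)


# Inferred values, not TI-published numbers.
sp_per_cycle = flops_per_cycle_per_core(SP_GFLOPS, CORES, CLOCK_GHZ)
dp_per_cycle = flops_per_cycle_per_core(DP_GFLOPS, CORES, CLOCK_GHZ)
dp_to_sp_ratio = DP_GFLOPS / SP_GFLOPS

print(f"single precision: {sp_per_cycle:g} flops per cycle per core")
print(f"double precision: {dp_per_cycle:g} flops per cycle per core")
print(f"double-to-single ratio: {dp_to_sp_ratio:.3f}")
```

On these numbers, double precision runs at 37.5 per cent of the single-precision rate, which is where the "less than half" comes from.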
The initial coprocessor board using the TI C66x DSPs is called the DSPC-8681 and it is made by Advantech. It puts four of these eight-core DSP chips (running at only 1GHz for some reason) on a single half-length PCI-Express 2.0 x8 card. The card has 1GB of DDR3 memory running at 1.33GHz and two Gigabit Ethernet ports. The DSPC-8681 delivers 512 gigaflops at single precision and 192 gigaflops at double precision. This card has a list price of around $1,100.
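The card's ratings can be cross-checked against the chip's. A rough sketch, assuming per-core throughput of 16 single-precision and 6 double-precision flops per cycle (values inferred from the C6678's 160 and 60 gigaflops at 1.25GHz, not quoted by TI here), with four eight-core chips clocked at 1GHz:

```python
# Card configuration as described in the article.
CHIPS_PER_CARD = 4
CORES_PER_CHIP = 8
CARD_CLOCK_GHZ = 1.0
LIST_PRICE_USD = 1100  # "around $1,100", so only approximate

# Assumed per-core throughput, inferred from the chip's quoted ratings.
SP_FLOPS_PER_CYCLE = 16  # 160 / (8 * 1.25)
DP_FLOPS_PER_CYCLE = 6   # 60 / (8 * 1.25)


def card_peak_gflops(flops_per_cycle):
    """Peak gigaflops for the whole card at the card's clock speed."""
    return CHIPS_PER_CARD * CORES_PER_CHIP * CARD_CLOCK_GHZ * flops_per_cycle


sp_gflops = card_peak_gflops(SP_FLOPS_PER_CYCLE)
dp_gflops = card_peak_gflops(DP_FLOPS_PER_CYCLE)
usd_per_sp_gflops = LIST_PRICE_USD / sp_gflops

print(f"single precision: {sp_gflops:g} gigaflops")
print(f"double precision: {dp_gflops:g} gigaflops")
print(f"list price per single-precision gigaflops: ${usd_per_sp_gflops:.2f}")
```

Under those assumptions the arithmetic reproduces the 512 and 192 gigaflops ratings exactly, and the list price works out to a little over $2 per peak single-precision gigaflops.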
Kenneth Nesteroff, business development manager for multicore processors at TI's DSP Systems unit, tells El Reg that in the first quarter, Advantech will come out with a full-length PCI-Express card that will deliver around 1 teraflops of single precision performance at a cost of around $2,000 and within a 110 watt thermal envelope.
Longer term, TI plans to pack the performance of the DSPC-8681 card into a single chip package called the TMS320TCI6609 – and then plunk four of these onto a single PCI-Express 2.0 card. TI is not saying how it will get that 512 gigaflops of performance out of a single chip, but it stands to reason that there will be a process shrink, a DSP core count boost, and a faster clock speed. (Or TI could just be packaging up four C6678 DSPs into a single package.)
What TI is saying is that the future TCI6609 DSP will deliver that 512 gigaflops of single precision performance at 32 watts, so a four-chip PCI-Express card will deliver 2 teraflops of single-precision oomph in under 200 watts of total power, including an unknown amount of DDR3 main memory for the DSPs.
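To put those power numbers in context, here is a back-of-the-envelope comparison of peak gigaflops per watt using only figures quoted in this article. It assumes the C6678's 10 watts applies to the 160 gigaflops configuration, takes "around 1 teraflops" as 1,000 gigaflops, and treats the 200 watt figure as a ceiling, so the four-chip card's result is a floor. The old MUSIC machine is included for scale, although its precision is not stated.

```python
# (peak gigaflops, watts) pairs taken from the article; see assumptions above.
systems = {
    "MUSIC (SC92)": (3.8, 800.0),
    "C6678 chip": (160.0, 10.0),
    "Advantech full-length card": (1000.0, 110.0),
    "TCI6609 chip": (512.0, 32.0),
    "four-chip TCI6609 card": (4 * 512.0, 200.0),  # "under 200 watts"
}


def gflops_per_watt(gflops, watts):
    """Peak gigaflops delivered per watt of power."""
    return gflops / watts


efficiency = {
    name: gflops_per_watt(gflops, watts)
    for name, (gflops, watts) in systems.items()
}

# Print the systems from most to least efficient.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {value:8.3f} gigaflops per watt")
```

At the chip level, the current and future parts come out at the same 16 gigaflops per watt, which suggests the TCI6609 is more about density than about a leap in efficiency.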
What would be even more interesting is if TI would put one of its quad-core Cortex-A8 ARM derivatives on a small form factor system board along with four of these C6678 DSPs, or if it doubled that up to a quad-core Cortex-A15 with maybe eight DSPs on the board. Slap a hybrid InfiniBand/Ethernet ConnectX-3 adapter from Mellanox Technologies on there and you could build a low-power supercomputer.
The hardware is the easy part, of course. The software stack would be a little more problematic. If TI is serious about using DSPs and ARMs in HPC, it is going to have to come up with something more than support for OpenMP and more like Nvidia's CUDA environment. ®