It's all money in the end
A lab scientist costs £100k/year, and you can double that once the cost of experiments is included. Calleja has 300 users, so they cost £60m/year. The move from Sun to Dell and a tenfold performance increase must have improved the output of his users. "It's all money in the end, taxpayers' money."
Calleja upgrades his hardware every two years on a rolling procurement and keeps hardware for four years. He delivers core hours to his users and has to continually demonstrate to them that paying for his core hours is cheaper than buying their own compute facilities. He said: "We're the only fully cost-centred HPC centre in the country not relying on subsidy. We have 80 percent paying users and we're breaking even."
Why Dell? It's cheaper and extremely reliable compared to competing suppliers. He's experienced a 1 percent electronic component failure rate in two years.
He's limited by power and space constraints. In the current upgrade Calleja is deploying 50 percent more compute power for 15 percent more electricity and 10 percent more floor space, and the new kit costs 20 percent of the original capital outlay. That lowers his cost per core and gives his users better-value core hours.
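The upgrade figures can be turned into rough efficiency ratios. The sketch below is only an illustration: it treats the existing system as 1.0 on every axis, assumes all of the old kit stays in service alongside the new, and reads "20 percent of the original capital cost" as the price of the new kit relative to the original installation. The function and key names are made up for this example.

```python
# Relative figures quoted in the article, with the existing system taken as 1.0.
COMPUTE_GAIN = 0.50   # 50 percent more compute power
POWER_GAIN = 0.15     # 15 percent more electricity
SPACE_GAIN = 0.10     # 10 percent more floor space
CAPEX_RATIO = 0.20    # new kit costs 20 percent of the original capital outlay


def upgrade_ratios(compute_gain, power_gain, space_gain, capex_ratio):
    """Return efficiency ratios for the upgraded system relative to the original.

    Assumes (not stated in the article) that the old hardware keeps running
    next to the new hardware, so totals are simply 1.0 plus the gain.
    """
    compute = 1.0 + compute_gain
    power = 1.0 + power_gain
    space = 1.0 + space_gain
    capex = 1.0 + capex_ratio
    return {
        # Values above 1.0 mean more compute per unit of the resource.
        "compute_per_watt": compute / power,
        "compute_per_floor_area": compute / space,
        # Values below 1.0 mean cheaper compute than the original system.
        "capital_per_unit_compute": capex / compute,
        "marginal_capital_per_unit_compute": capex_ratio / compute_gain,
    }


if __name__ == "__main__":
    ratios = upgrade_ratios(COMPUTE_GAIN, POWER_GAIN, SPACE_GAIN, CAPEX_RATIO)
    for name, value in ratios.items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions the combined system delivers roughly 1.3 times the compute per watt and about 1.36 times the compute per unit of floor space, while each added unit of compute costs about 40 percent of what the original capacity did. That is the sense in which the cost per core falls.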
He said there are three research pillars: experiment, theory and simulation. Simulation, using a supercomputer, enables you to go places you can't get to by experimentation. The need for simulation is horizontal across science.
Research applications now use shared and open source code that can be parameterised to provide the specific code set needed by researchers, whose time is not best spent writing code. That has become too specialised a job.
Datasets are kept in the data centre, inside the firewall. Users come to the HPC mountain rather than the mountain coming to them, which avoids massive dataset transfers across the network links between users and the HPC lab.
Calleja has two steps on his processor roadmap. He's looking forward first of all to Nehalem blade servers, 4-core Xeon 5500s, with possibilities for 6- and 8-core ones. The second step is to Sandy Bridge, Intel's next architecture after Nehalem which, he says, will run 8 operations per clock cycle instead of Nehalem's 4.
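To see what the step from 4 to 8 operations per clock cycle means for theoretical peak throughput, here is a small sketch. It assumes the quoted figures are floating-point operations per core per cycle, which the article does not spell out. The core count and clock speed are hypothetical values chosen only to make the arithmetic concrete.

```python
def peak_gflops(cores, clock_ghz, ops_per_cycle):
    """Theoretical peak in GFLOPS: cores x clock (GHz) x operations per cycle per core."""
    return cores * clock_ghz * ops_per_cycle


# Hypothetical node: 8 cores at 2.5 GHz (neither figure comes from the article).
CORES = 8
CLOCK_GHZ = 2.5

nehalem = peak_gflops(CORES, CLOCK_GHZ, ops_per_cycle=4)
sandy_bridge = peak_gflops(CORES, CLOCK_GHZ, ops_per_cycle=8)

if __name__ == "__main__":
    print(f"Nehalem peak:      {nehalem:.0f} GFLOPS")
    print(f"Sandy Bridge peak: {sandy_bridge:.0f} GFLOPS")
    print(f"Ratio:             {sandy_bridge / nehalem:.1f}x")
```

With core count and clock speed held equal, doubling the operations per cycle doubles the theoretical peak. Real applications would see less than that unless their code can use the wider execution units.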
The blade servers will provide many more cores per rack, driving up the heat output, and he's anticipating moving to back-of-rack water cooling.
He's thinking of setting up a solid state drive capacity pool for HPC applications that need the IOPS rates that SSDs can deliver, but SSD pricing has to come down to make this worthwhile. Lustre metadata might be stored in an SSD pool.