When someone provides me the funding I request for a project without fighting me tooth and nail over every bent copper, I believe that I should return the courtesy by going above and beyond to earn the trust that has been placed in me.
One result of this is that the bigger the budget of the project, the pickier I become about the hardware. I have gone over dozens of systems in the past two weeks to ensure that the hardware selection for my upcoming render farm project is the best it can be.
In the past I have often turned to whiteboxing my servers and my desktops. There is always a fair amount of heckling from the peanut gallery over the concept, but it has worked well for me in the past.
My local distributor Supercom will take my component list and build out the system for me. They will sell it to me under one of their “house brands” and warranty it for three years, or longer if you buy the warranty extension. They have been good to me as regards RMAs. Other than Supercom’s limited selection of component manufacturers, I have no legitimate reason to complain.
One of my primary reasons for whiteboxing in the past – apart from price – has been power efficiency. Until relatively recently, it has been a difficult battle to get quality power supplies in Tier 1 equipment. Even if you were buying the highest priced model, you could wind up with some real lemons.
Given that a lot of these prejudices against using Tier 1s were from the early and mid 2000s, I have taken the opportunity of having an actual prototyping budget to take a new look at what’s available.
HPC server stress testing
The prototyping phase of this project saw me test the pre-canned HPC servers from all the Tier 1 vendors. I also whiteboxed and benched six different models of my own. My mate and I even got as far as putting together a proof-of-concept rack of ten 1U dual-GPU nodes with liquid cooling before deciding the maintenance requirements were too high. My mate (a graphic artist by trade) bought the prototype equipment and now has a spectacular liquid-cooled render farm/BOINC cluster in his basement for his own 3D rendering work.
After all my testing, prototyping and benchmarking I settled on the Supermicro 1026GT-TF-FM205. I am deeply impressed by this hardware.
It comes with a 1400W 80 Plus Gold PSU. (I cannot describe how happy seeing “80 Plus Gold” makes me.) It will support two Westmere Xeons and 192GB of ECC Reg DDR3 in 12 DIMMs. It has six SATA hotswap bays running off the onboard ICH10R SATA controller. It comes with a (very basic) onboard Matrox video card, dual Intel Gig-E NICs and, more importantly, an IPMI v2.0 KVM-over-LAN interface with a dedicated Realtek NIC.
Best of all, it comes with 2 Nvidia Fermi M2050 GPUs. Supermicro has here a Dual CPU 1U server with two scoops of sexy GPU, a nice amount of RAM and a great PSU all with thermals certified to run in a 1U box. I think I’m in love.
I bought a handful and tormented them for days. I varied the temperature and the humidity while running all systems flat out. I found that if I can keep a reasonable airflow through the systems, I can run the datacenter at 25 degrees Celsius, even in dry-as-a-bone Alberta air. Once more, I am impressed; I know a few servers that would be quite upset with me if I fed them air that warm while under heavy load, and those servers aren’t running GPU space heaters.
Under the full load the rendering software can bring to bear, these systems seem to pull around 1000W from the wall. Allowing for top-of-rack networking gear, the bottom-of-rack power conditioners as well as various other accessories, I am aiming to put 25 of these units to a rack. That means dissipating a little over 25kW worth of heat per rack.
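As a rough sanity check on that per-rack figure, the arithmetic can be sketched in a few lines of Python. The 1000W per node and 25 nodes per rack come from the numbers above; the 500W allowance for networking gear, power conditioners and accessories is purely an assumed placeholder, not a measured value, and the function names are made up for illustration.

```python
# Back-of-envelope rack heat load, using the figures quoted in the text.
NODE_WATTS = 1000       # approximate wall draw per server at full render load
NODES_PER_RACK = 25     # target density per rack
OVERHEAD_WATTS = 500    # ASSUMPTION: switches, power conditioners, accessories

BTU_PER_HOUR_PER_WATT = 3.412  # standard conversion: 1 W is about 3.412 BTU/h


def rack_load_watts(nodes=NODES_PER_RACK,
                    node_watts=NODE_WATTS,
                    overhead_watts=OVERHEAD_WATTS):
    """Total electrical load per rack; nearly all of it ends up as heat."""
    return nodes * node_watts + overhead_watts


def watts_to_btu_per_hour(watts):
    """Convert watts to BTU/h, the unit cooling gear is often rated in."""
    return watts * BTU_PER_HOUR_PER_WATT


if __name__ == "__main__":
    load = rack_load_watts()
    print(f"Rack load: {load / 1000:.1f} kW")
    print(f"Heat to remove: {watts_to_btu_per_hour(load):,.0f} BTU/h")
```

Swapping in a measured overhead figure for the placeholder is the only change needed to turn this into a real estimate; the servers themselves account for 25kW before anything else in the rack is counted.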
For only $6000 Canadian each, these are truly fantastic servers. Supermicro has made my life easy by providing me exactly what I was looking for at a price I am completely willing to pay. My next challenge – designing cooling systems to handle over 25kW per rack “on the cheap” – is the daunting one. ®