HPC

Nvidia puts Tesla K20 GPU coprocessor through its paces

Early results on Hyper-Q, Dynamic Parallelism speedup


Back in May, when Nvidia divulged some of the new features of its top-end GK110 graphics chip, the company said that two new features of the GPU, Hyper-Q and Dynamic Parallelism, would help GPUs run more efficiently and without the CPU butting in all the time. Nvidia is now dribbling out some benchmark test results ahead of the GK110's shipment later this year inside the Tesla K20 GPU coprocessor for servers.

The GK110 GPU chip, sometimes called the Kepler2, is an absolute beast, with some 7.1 billion transistors etched on a die by foundry Taiwan Semiconductor Manufacturing Co using its much-sought 28 nanometer processes. It sports 15 SMX (streaming multiprocessor extreme) processing units, each with 192 single-precision CUDA cores and 64 double-precision floating point units – one DP unit tacked on to every triplet of CUDA cores. That gives you 960 DP floating point units across a maximum of 2,880 CUDA cores on the GK110 chip.

Nvidia has been vague about absolute performance, but El Reg expects the GK110 to deliver just under 2 teraflops of raw DP floating point performance at 1GHz clock speeds on the cores and maybe 3.5 teraflops at single precision. That's around three times the oomph – and three times the performance per watt, if the thermals are about the same – of the existing Fermi GF110 GPUs used in the Tesla M20 series of GPU coprocessors.
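Those estimates are easy to sanity-check: each DP unit can retire a fused multiply-add (two flops) per clock, so 960 of them at the assumed 1GHz come in just under 2 teraflops. A quick back-of-the-envelope in Python (the 1GHz clock is El Reg's guess, not a confirmed Nvidia spec):

```python
# Back-of-the-envelope peak-flops arithmetic for the GK110.
# The 1GHz core clock is an assumption, not a confirmed spec.
smx_count = 15                    # SMX units on a full GK110
sp_cores_per_smx = 192            # single-precision CUDA cores per SMX
dp_units_per_smx = 64             # double-precision units per SMX

sp_cores = smx_count * sp_cores_per_smx   # 2,880 CUDA cores
dp_units = smx_count * dp_units_per_smx   # 960 DP units

clock_hz = 1.0e9                  # assumed 1GHz core clock
flops_per_unit = 2                # one fused multiply-add = 2 flops

dp_peak = dp_units * flops_per_unit * clock_hz
print(sp_cores, dp_units)         # 2880 960
print(dp_peak / 1e12)             # 1.92 teraflops DP - "just under 2"
```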

Just having more cores is not going to boost performance. You have to use those cores more efficiently, and that's what the Hyper-Q and Dynamic Parallelism features are all about.

Interestingly, these two features are not available on the GK104 GPU chips, which are used in the Tesla K10 coprocessors that Nvidia is already shipping to customers who need single-precision flops. The Tesla K10 GPU coprocessor puts two GK104 chips on a PCI-Express card and delivers 4.58 teraflops of SP number-crunching in a 225 watt thermal envelope – a staggering 3.5X the performance of the Fermi M2090 coprocessor.

A lot of supercomputer applications run the message passing interface (MPI) protocol to dispatch work on parallel machines, and Hyper-Q allows the GPU to work in a more cooperative fashion with the CPU when handling MPI dispatches. With the Fermi cards, the GPU could only have one MPI task dispatched from the CPU and offloaded to the GPU at a time. This is an obvious bottleneck.

Nvidia's Hyper-Q feature for Kepler GPUs

With Hyper-Q, Nvidia is adding a queue to the GPU itself, and now the CPU can dispatch up to 32 different MPI tasks to the GPU at the same time. Not one line of MPI code has to be changed to take advantage of Hyper-Q; it just happens automagically as the CPU is talking to the GPU.
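The effect of those 32 work queues can be sketched with a toy model: if each MPI rank hands the GPU a small kernel, a single queue serializes them all, while 32 queues let them run side by side. This Python sketch is a simulation of the scheduling idea only, not real CUDA or MPI:

```python
import math

def timeslots(num_tasks, task_len, num_queues):
    """Toy model: equal-length tasks round-robin across hardware
    queues; queues run concurrently, tasks within a queue serialize."""
    tasks_per_queue = math.ceil(num_tasks / num_queues)
    return tasks_per_queue * task_len

# 32 MPI ranks each submit a small kernel taking one timeslot.
fermi_style = timeslots(32, 1, num_queues=1)    # one queue: fully serialized
kepler_style = timeslots(32, 1, num_queues=32)  # Hyper-Q: all overlap
print(fermi_style, kepler_style)  # 32 1
```

The model is deliberately crude – it ignores kernel size and contention for the SMX units – but it captures why a CPU-bound stream of small MPI work items leaves Fermi idle and keeps Kepler busy.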

To show how well Hyper-Q works (and that those thousands of CUDA cores won't be sitting around scratching themselves with boredom), Peter Messmer, a senior development engineer at Nvidia, grabbed some molecular simulation code called CP2K, which he said in a blog was "traditionally difficult code for GPUs" and tested how well it ran on a Tesla K20 coprocessor with Hyper-Q turned off and then on.

As Messmer explains, MPI applications "experienced reduced performance gains" when the MPI processes were limited by the CPU to small amounts of work. The CPU got hammered and the GPUs were inactive a lot of the time. And so the GPU speedup in a hybrid system was not what it could be, as you can see in this benchmark test that puts the Tesla K20 coprocessor inside the future Cray XK7 supercomputer node with a sixteen-core Opteron 6200 processor.

Hyper-Q boosts for nodes running CP2K molecular simulations by 2.5X

With this particular data set, which simulates 864 water molecules, adding node pairs of CPUs and GPUs didn't really boost the performance that much. With sixteen nodes without Hyper-Q enabled, you get twelve times the performance (for some reason, Nvidia has the Y axis as relative speedup compared to two CPU+GPU nodes). But on the same system with sixteen CPU+GPU nodes with Hyper-Q turned on, the performance is 2.5 times as high. Nvidia is not promising that all code will see a similar speedup with Hyper-Q, mind you.

El Reg asked Sumit Gupta, senior director of the Tesla business unit at Nvidia, why the CP2K tests didn't pit the Fermi and Kepler GPUs against each other, and he quipped that Nvidia had to save something for the SC12 supercomputing conference out in Salt Lake City in November.

With Dynamic Parallelism, another feature of the GK110 GPUs, the GPU is given the ability to dispatch work inside the GPU as needed, driven by the calculations the CPU dispatches to it. With the Fermi GPUs, the CPU in the system dispatched work to one or more CUDA cores, and the answer was sent back to the CPU. If further calculations were necessary, the CPU dispatched this data and the algorithms to the GPU, which sent replies back to the CPU, and so on until the calculation finished.

Dynamic Parallelism: Schedule your own work, GPU

There can be a lot of back-and-forth with the current Fermi GPUs. Dynamic Parallelism lets the GPU spawn its own work. But more importantly, it also allows for the granularity of the simulations to change dynamically, getting finer-grained where interesting things are going on and doing mostly nothing in the parts of the simulation space where nothing much is going on.

By matching the granularity of the simulation to the granularity of the data across space and time, you will get better results and do less work (in less time) than you might otherwise with fine-grained simulation in all regions and timeslices.
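This is the same idea behind adaptive mesh refinement: spend fine-grained effort only where the data says something is happening. A hedged Python sketch of the work saving – the interest scores, threshold, and cost figures are invented purely for illustration:

```python
def adaptive_work(interest, coarse_cost=1, fine_cost=8, threshold=0.5):
    """Toy adaptive-granularity model: regions whose 'interest' score
    crosses the threshold get simulated at fine resolution, the rest
    stay coarse. Returns total work units spent."""
    return sum(fine_cost if score > threshold else coarse_cost
               for score in interest)

# Ten regions, only two of which contain interesting dynamics.
interest = [0.1, 0.2, 0.9, 0.1, 0.0, 0.8, 0.1, 0.2, 0.1, 0.3]
adaptive = adaptive_work(interest)   # 8 coarse + 2 fine = 24 units
uniform = len(interest) * 8          # fine everywhere = 80 units
print(adaptive, uniform)
```

With Dynamic Parallelism, the refinement decision happens GPU-side: a kernel looking at an interesting region can launch finer-grained child kernels itself instead of asking the CPU to do it.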

Performance gains from dynamic parallelism for GK110 GPUs

The most important thing about Dynamic Parallelism is that the GPU automatically makes the decision about the coarseness of calculations, reacting to data and launching new threads as needed.

To show off early test results for Dynamic Parallelism, Nvidia did not do a fluid-mechanics simulation or anything neat like that, but rather in another blog post, Nvidia engineer Steven Jones ran a Quicksort benchmark on the K20 GPU coprocessor with the Dynamic Parallelism turned off and then on.

If you've forgotten your CompSci 101, Jones included the Quicksort code he used in the test in the blog post. Interestingly, it takes half as much code to write the Quicksort routine on the GPU with Dynamic Parallelism turned on, because you don't have to manage the bouncing back and forth between CPU and GPU.

As you can see in the chart above, if you do all of the launching for each segment of the data to be sorted from the CPU, which you must do with Fermi GPUs, then the sort takes longer. On the K20 GPU, Dynamic Parallelism boosts performance of Quicksort by a factor of two and scales pretty much with the size of the data set. It will be interesting to see how much better the K20 is at doing Quicksort compared to actual Fermi GPUs, and how other workloads and simulations do with this GPU autonomy.
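Jones's actual CUDA quicksort is in his blog post; the control-flow difference can be sketched in plain Python. With Dynamic Parallelism, each partition step spawns its own child sorts on the device, so the host launches once; without it, every segment has to be relaunched from the host. This toy counter models the launch pattern only – it is not Nvidia's code:

```python
def quicksort_launches(data, from_host):
    """Count kernel launches needed to quicksort `data`.
    from_host=True models Fermi: the CPU launches a kernel for every
    segment worth sorting. from_host=False models Dynamic Parallelism:
    segments spawn their own children GPU-side, so the host launches
    at most once."""
    def count_segments(seg):
        if len(seg) <= 1:
            return 0                      # nothing to sort, no kernel
        pivot, rest = seg[0], seg[1:]
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return 1 + count_segments(left) + count_segments(right)

    segments = count_segments(data)
    return segments if from_host else min(segments, 1)

data = [5, 3, 8, 1, 9, 2, 7]
print(quicksort_launches(data, from_host=True))   # one launch per segment
print(quicksort_launches(data, from_host=False))  # a single host launch
```

The host-launch count grows with the recursion depth and data size, which is consistent with the chart's gap widening as the data set grows.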

Gupta tells El Reg that the Tesla K20 coprocessors are on track for initial deliveries in the fourth quarter. ®
