
GPU-flingers' bash: Forget the Matrix, Neo needs his tensors

What's a tensor? Glad you asked...


HPC blog Last week, Nvidia held its biggest-ever GPU Technology Conference (GTC). The big takeaway is that GPUs are rapidly becoming an expected and standard component of computing, table stakes in many cases, right across the computing platform. That's a big deal right there, and hence the frothiness of much of the coverage.

Before the conference, I shared the top four questions I was hoping to get answered:

Q1. What's next after P100?
Q2. What's new on the client side?
Q3. What's up with OpenPower?
Q4. What's the plan to keep ahead of new AI chips?

The tweet also shows how popular I am on Twitter, with one retweet and three likes.

Here’s a quick analysis of how the questions fared.

Q1. What's next after P100?

The P100 was only announced last year, and it was quite a feat by itself. People are still clamoring for it, if the company's financial performance over the past year is any indication. So slideware and a few macho specs about the future would probably have carried the day at GTC17. Would we get more? Well, Nvidia did not disappoint. We saw actual silicon, detailed specs, and real benchmarks: a very decisive answer to the "So, what comes next?" question.

The V100 (V for Volta) is coming by Q3 of this year, and it's a nice, clear step above the P100 (P for Pascal). They follow Kepler and Maxwell, nicely ordered alphabetically. Overall, it's about 1.5x faster than the P100, except for deep learning kernels, where it is a whopping 12x faster on paper, thanks to the new Tensor cores that specifically target AI workloads. In practice, 5x faster than the P100 is what's projected/claimed for those workloads, which is a much more realistic target.

Weighing in at a robust 815mm², with oodles of flops of various kinds and a 300W power budget, it's big, it's fast, and it's hot. Still, it packs enough punch to be one of the most energy-efficient chips out there for what it delivers, and it can be run in performance-first or energy-first modes to optimize for one or the other.

Here's a quick comparison of the headline specs:

                        P100 (Pascal)         V100 (Volta)
    Die size            610mm²                815mm²
    CUDA cores          3,584                 5,120
    Tensor cores        none                  640
    Peak FP64           5.3 TFLOPS            7.5 TFLOPS
    Peak FP32           10.6 TFLOPS           15 TFLOPS
    Peak deep learning  21.2 TFLOPS (FP16)    120 TFLOPS (Tensor)
    NVLink bandwidth    160 GB/s              300 GB/s
    Power               300W                  300W

Remember that the taxonomy of computing speed goes from “on paper, but you’ll never see it”, to “guaranteed not to exceed, but if the stars are aligned you might see it”, to “possible, if you optimize things well”, to “typical, but it could be lower”. And in general, you’d better look at the minimum speed as much as you look at maximum speed.

For the workloads that are fast emerging, and with the optimized frameworks and system software that are now available, GPUs and other forms of what we call High Density Processing (HDP) are the way to go.

Our refrain these days is: digitization means lots of data, and making sense of lots of data increasingly looks like either an HPC problem or an AI problem. The chip targets that sweet spot.

So what’s a tensor?

A single number is a "scalar" (zero dimensions, or indices). A row of numbers is a "vector" (one dimension, or index). A two-dimensional row-and-column grid of numbers is a "matrix". A tensor is just a generalization of such mathematical objects: an n-dimensional object that follows specific transformation rules.

In Deep Neural Nets (DNNs), you get layers and layers of "neurons" with coefficients that must be calculated and aggregations that must be tracked, and all of that can be nicely abstracted into tensors. Tensors are common language in physics, relativity, fluid mechanics and such, but their use in AI makes them fresh territory in the IT verbal landscape. Pretty sure most of the URLs are taken already!
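To make the abstraction concrete, here's a minimal sketch in Python using NumPy; the shapes and layer sizes are illustrative assumptions, not anything from Nvidia's announcement. The number of indices (the rank) is what separates a scalar from a vector, a matrix, or a higher-dimensional tensor:

```python
import numpy as np

scalar = np.float32(3.14)                          # rank 0: a single number
vector = np.zeros(128, dtype=np.float32)           # rank 1: one index
matrix = np.zeros((128, 64), dtype=np.float32)     # rank 2: rows x columns

# A batch of RGB images, as a deep learning framework might hold it:
# (batch, height, width, channels) -- a rank-4 tensor
images = np.zeros((32, 224, 224, 3), dtype=np.float32)

# A fully connected layer is just a rank-2 tensor of weights (coefficients)
# applied to every flattened input in the batch.
weights = np.random.randn(224 * 224 * 3, 1000).astype(np.float32)
flattened = images.reshape(32, -1)                 # rank 2: (batch, features)
activations = flattened @ weights                  # rank 2: (32, 1000)

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("images", images)]:
    print(name, "has", np.ndim(t), "indices, shape", np.shape(t))
```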

And what about Moore’s Law?

Nvidia bills the chip as providing a 5x improvement over the P100/Pascal in peak teraflops, and 15x over the M40/Maxwell, which was launched two years ago. Do the math and, yep, that's a better speed improvement than Moore's Law would give you, in fact more than 4x better.
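Here's that back-of-the-envelope math as a rough sketch, assuming the usual doubling-every-two-years reading of Moore's Law (the doubling period is my assumption, not Nvidia's):

```python
# Compare claimed speedups against a performance-doubles-every-24-months baseline
def moores_law_factor(months, doubling_period=24):
    return 2 ** (months / doubling_period)

claims = [
    ("P100 -> V100", 12, 5),    # ~12 months apart, claimed 5x
    ("M40  -> V100", 24, 15),   # ~24 months apart, claimed 15x
]

for label, months, claimed in claims:
    expected = moores_law_factor(months)
    print(f"{label}: Moore's Law ~{expected:.1f}x, claimed {claimed}x "
          f"({claimed / expected:.1f}x ahead of the curve)")
# P100 -> V100: Moore's Law ~1.4x, claimed 5x  (3.5x ahead of the curve)
# M40  -> V100: Moore's Law ~2.0x, claimed 15x (7.5x ahead of the curve)
```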

Why/how is that, you might ask. And the answer is pretty much this: we used to gain speed by improving frequency, doing the same things at a faster clip and devoting more and more on-chip circuitry to helping one CPU be faster. Like building an ever more opulent palace. That all changed when multi-core and then many-core and now kilo-core chips came along. Instead of that palace, people started building condos and hotels. And as long as there are enough threads and tasks in your app to keep it all occupied, you get better throughput and faster turn-around.

With 5,120 + 2,560 + 640 = 8,320 cores of various types, the V100 is an 8-kilo-core chip. Bytes are way ahead, of course, but cores can now be counted with the same kilo prefixes that bytes get.

Q2. What's new on the client side?

Nvidia rolled out a new deskside beast, the DGX Station, which packs 4x V100s. At 1.5kW, you'd expect it to come with a big, noisy fan, but the box is liquid cooled. It's a closed loop, so you don't have to call the plumber, and it makes the machine nearly noiseless. At about $70k, it's not quite your average "client" machine; it's more of a "laptop of the gods"! "Personal AI Supercomputer" is how it was billed, but it looks like a workstation, so it counts. We didn't notice any news on a follow-on to the GeForce GTX 1080 Ti, Nvidia's flagship gaming GPU, which is based on the Pascal architecture.

Q3. What's up with OpenPower?

We still think the real battle in server architecture is between Intel's in-house coalition and what has come to be known as the Rebel Alliance: IBM's OpenPower industry coalition. Intel has its all-star team: Xeon Phi, Altera, Omni-Path (plus Nervana/Movidius), while OpenPower counters with a dream team of its own: POWER, Nvidia, Xilinx, and Mellanox (plus TrueNorth). The all-in-house model promises seamless integration and consistent design, while the extended team offers a best-of-breed approach. Both camps are pretty formidable. Both have merits. And there is real differentiation in strategy, design, and implementation.

Last year, the Rebels held their event with GTC. Not this year. And despite our continuing enthusiasm for some solid competition, and market checks that seem to indicate the Rebels are doing quite fine, we haven’t seen as much of OpenPower this past year as we had expected.

So it was quite reassuring to see the V100 come with a much faster NVLink interconnect. The second-generation NVLink moves data at 300 GB/s: that's six links, each at 25 GB/s per direction, equalling 150 GB/s each way, for a 300 GB/s total out + in data rate.
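As a quick sanity check on that arithmetic (the per-link figure is the one quoted above; the rest just multiplies it out):

```python
# NVLink 2.0 aggregate bandwidth on the V100, from the per-link numbers
links = 6
per_link_per_direction_gb_s = 25                 # GB/s each way, per link

one_way = links * per_link_per_direction_gb_s    # 150 GB/s out (or in)
total = 2 * one_way                              # 300 GB/s out + in combined
print(f"{one_way} GB/s each way, {total} GB/s total")
```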

What is also significant here is the improved scalability for multi-GPU/CPU configurations. NVLink supports CPU mastering and cache coherence with IBM's POWER9 CPUs. That's a pretty big deal and a nice boost for the Rebels.

Q4. What's the plan to keep ahead of new AI chips?

The competition in AI chips is heating up, and we expect several new AI chips and architectures to show up in the coming months. They're optimizing hard for AI workloads, which means lower-than-64-bit arithmetic, multiply-accumulate (MAC) instructions, multiple pipelines, separate integer and floating-point paths, the related register/memory design, and so on.

No doubt you noticed the new Tensor cores in the V100 and wondered what they are. Each tensor core can do 64 multiply-add ops per cycle. It multiplies 16-bit numbers into 32-bit intermediates and adds them to 32-bit numbers, producing 32-bit results, and each multiply-add counts as two floating-point ops in mixed precision. There are 640 of them (eight per streaming multiprocessor (SM), and there are 80 SMs), all running at 1.455 GHz, so 64 x 2 x 640 x 1.455 GHz = 119+ TFLOPS, and that's where the 120 Tensor TFLOPS figure comes from.
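Here's that arithmetic as a short sketch, along with a NumPy emulation of the tensor-core-style mixed-precision multiply-accumulate; the 4x4 tile shape follows the description above, but the emulation itself is just illustrative (the real hardware does this as a single fused operation):

```python
import numpy as np

# Peak tensor throughput: 64 multiply-adds per core per cycle,
# 2 FLOPs per multiply-add, 640 tensor cores, 1.455 GHz clock
peak_flops = 64 * 2 * 640 * 1.455e9
print(f"{peak_flops / 1e12:.1f} TFLOPS")     # ~119.2, rounded up to "120"

# One tensor-core-style op: D = A x B + C on 4x4 tiles
# (4x4 times 4x4 is exactly 64 multiply-adds), with FP16 inputs
# and FP32 accumulation giving an FP32 result.
A = np.random.randn(4, 4).astype(np.float16)
B = np.random.randn(4, 4).astype(np.float16)
C = np.random.randn(4, 4).astype(np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.dtype, D.shape)                      # float32 (4, 4)
```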

Now, the P100 was pretty beastly for AI work and this just raises the game in a big way. Could you use even lower precision and make it go even faster? Yes, you could. But as you reduce precision, you’re going to need either well-behaved problems or more neurons and more layers to make up for it. And the V100 is still good for a lot of other workloads, which is a clear difference between it and the chips that go all the way with AI and, in the process, make themselves unsuitable for other workloads.

Anything else?

Nvidia also announced an immersive collaboration suite called Holodeck; showed off how deep learning can help improve resolution in complex ray-traced graphics; demonstrated driverless car technology that can assist the driver; presented some cool work with its AI software stack and containers; unveiled an updated DGX server, now with 8x V100s; and introduced the HGX-1, a server for cloud computing that can easily vary the ratio of CPUs to GPUs that get provisioned.
