Thanks for the (shared) memory
To feed all those command streams, Demers says, a new memory system is needed. In previous AMD GPU architectures, the memory system was a read-only cache; in the new architecture, it's read-write. "It's a generalized cache just like we have in CPUs," he says.
Total bandwidth between the CUs and the caches is, of course, dependent upon the number of CUs and the clock speed. Assuming a clock of around one gigahertz, "If you think of a CU as the equivalent of a SIMD – which isn't the case, but today we ship with 24 of these – 24 CUs would be one and a half terabytes of bandwidth to their L1 caches," Demers says. "Pretty good numbers."
Don't expect AMD to stick to 24-CU implementations, however. Demers talked of future designs with over a hundred CUs – and it's not tough to do the math to figure out what the total cache bandwidth would be in such chips: 100 CUs would top 6 terabytes of total bandwidth.
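The arithmetic behind those figures can be sketched in a few lines. The 64-bytes-per-clock-per-CU value is an assumption, inferred by dividing Demers' 1.5 terabytes per second across 24 CUs at roughly one gigahertz; it is not a number he stated, and the function name is purely illustrative.

```python
# Rough L1 bandwidth estimate, assuming each CU moves a fixed number of
# bytes per clock to its L1 cache.
BYTES_PER_CLOCK_PER_CU = 64    # assumption: inferred from ~1.5 TB/s over 24 CUs at ~1 GHz
CLOCK_HZ = 1_000_000_000       # "around one gigahertz", per the talk


def l1_bandwidth_tb_per_s(num_cus, clock_hz=CLOCK_HZ,
                          bytes_per_clock=BYTES_PER_CLOCK_PER_CU):
    """Aggregate CU-to-L1 bandwidth in terabytes per second (1 TB = 1e12 bytes)."""
    return num_cus * clock_hz * bytes_per_clock / 1e12


if __name__ == "__main__":
    for cus in (24, 100):
        print(f"{cus} CUs: {l1_bandwidth_tb_per_s(cus):.2f} TB/s")
```

Under that assumption, 24 CUs land at about 1.5 terabytes per second and 100 CUs at 6.4, which squares with both figures quoted above.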
To add more memory-system versatility, there's a full interconnect between the L1 and L2 caches. "The L2s are more physically based. They match your memory," Demers explained. "They're also where all the coherency happens – and that's what I mean by the physical binding of the L2s."
The L1s get their data from their associated L2s, but the L2s – since they're the soul of coherency – will communicate with one another. The GCN also envisions coherency being handled between both CPU and GPU at the L2 level. "I'm talking probe traffic," Demers says, "I'm talking all the usual stuff you've come to expect on coherency."
GPU CUs and CPU cores will find coherency at the L2 level. Discrete GPUs can join over PCIe
With all the CUs having access to all the data that's in the L2 farm, time-consuming trips back and forth to and from far-off system memory would be minimized, pruning latency. Discrete GPUs will also join in the coherency mix, with all traffic being tunneled over PCIe. "Discrete GPUs and Fusion APUs will all use the same core technology," Demers explains.
x86 spoken here
x86 support, he says, means that "our GPUs have to have address-translation caches. Basically, they take virtual addresses and they translate that into physical addresses." Address-translation caches already exist in AMD GPUs, but in the new architecture, they'll be talking in x86 language.
On the CPU side, "an OS-visible IOMMU [input/output memory-management unit] – just like the CPU has an MMU, which handles physical to virtual translation on the CPU – needs to exist," Demers says.
With an IOMMU – which will be part of both AMD's discrete CPUs and APUs – the chips will be able to support address-translation requests. Demers also notes that should there be a page fault, "the GPU will be happy with that – well, not necessarily happy, but it will survive that. It will wait until that page is brought in by the operating system and made local, then – bang! – it'll keep on running."
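The translate, fault, wait, and resume sequence Demers describes can be modeled with a toy example. Everything here is hypothetical: the class and method names, the 4 KiB page size, and the way the "operating system" hands out frames are illustrative assumptions, not AMD's design or any real IOMMU interface.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the common x86 base page size


class ToyTranslator:
    """Toy address-translation cache backed by a page table (hypothetical)."""

    def __init__(self, page_table):
        # virtual page number -> physical frame number, for resident pages only
        self.page_table = dict(page_table)
        self.tlb = {}      # the address-translation cache
        self.faults = 0    # how many page faults were serviced
        self._next_frame = max(self.page_table.values(), default=-1) + 1

    def _os_page_in(self, vpn):
        # Stand-in for the operating system bringing the page in and
        # making it local; here it simply assigns the next free frame.
        self.page_table[vpn] = self._next_frame
        self._next_frame += 1

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.tlb:                # translation-cache miss
            if vpn not in self.page_table:     # page fault: wait for the OS
                self.faults += 1
                self._os_page_in(vpn)
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * PAGE_SIZE + offset   # then keep on running


if __name__ == "__main__":
    t = ToyTranslator({0: 7})                  # only virtual page 0 is resident
    print(hex(t.translate(0x0010)))            # resident page, no fault
    print(hex(t.translate(0x1004)), t.faults)  # faults once, then resolves
```

The point of the sketch is only the control flow: a miss in the translation cache falls back to the page table, and a missing page stalls the request until the page is supplied rather than killing it.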
The x86 address space will provide "all the goodness" that comes from a virtual address space, and will be available for the GPU in the new architecture, Demers said, specifically citing over-subscription. "Our plan is that eventually all these devices – whether CPUs or GPUs – are in the same unified 64-bit address space."
As might be assumed from Demers' page-fault example, OS support will be required for IOMMUs, just as it is for MMUs, so AMD is now working with operating-system designers. Although he didn't specifically say which ones, Microsoft's presence at AMD's event might well be counted as a major hint.
All these features will stretch across AMD's graphics-capable product line. "I'm not talking about an APU, I'm not talking about a GPU, I'm talking about an IP of a core that's going to be used in all our products going forward," Demers says. "Over the next few years we're going to be bringing you all of this throughout all of our products that have GPU cores."
Meat and potatoes
Despite spending a raft of development time on this fundamentally different GPU architecture, AMD also spent some time digging into such meat-and-potatoes graphics necessities as good ol' 3D performance.
Heterogeneity is all well and good, but AMD has some 3D improvement in mind, as well
"I did say that 3D and compute are starting to merge – and in my mind they already have," Demers says. "Somebody recently asked me about APIs – well, we're full of ideas for graphics. And we still love APIs and we think that developers will continue to use APIs."
He suggests that some developers will want to "go directly to compute," but he said that AMD would continue to work with partners such as Khronos – the OpenCL caretaker – and DX11-provider Microsoft to expose to devs more features that AMD provides in its hardware.
As an example of something that the new architecture will support, Demers offers partially resident textures (PRTs), which he defined as the ability to "tell an application: 'Look, create textures of any size you want, and then bring in the parts that you need when you want them'."
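As a rough illustration of the idea, the sketch below declares a texture far larger than anything that would fit in memory and loads fixed-size tiles only when a texel inside them is first sampled. The class name, the 64-texel tile size, and the loader callback are all hypothetical; real PRT support lives in hardware and drivers, not application-level Python.

```python
TILE = 64  # assumption: 64x64-texel tiles, chosen only for illustration


class SparseTexture:
    """Toy partially resident texture: tiles are loaded on first touch."""

    def __init__(self, width, height, loader):
        self.width, self.height = width, height
        self.loader = loader   # hypothetical callback: (tile_x, tile_y) -> rows of texels
        self.resident = {}     # (tile_x, tile_y) -> tile data currently in memory

    def sample(self, x, y):
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise IndexError("texel outside texture bounds")
        key = (x // TILE, y // TILE)
        if key not in self.resident:           # bring in only the part that's needed
            self.resident[key] = self.loader(*key)
        return self.resident[key][y % TILE][x % TILE]


def make_tile(tile_x, tile_y):
    # Stand-in for streaming a tile from disk: every texel records its tile.
    return [[(tile_x, tile_y)] * TILE for _ in range(TILE)]


if __name__ == "__main__":
    tex = SparseTexture(1_000_000, 1_000_000, make_tile)  # "any size you want"
    print(tex.sample(70, 5), len(tex.resident))           # one tile resident
```

Only the tiles that have actually been touched occupy memory, which is the property Demers' description hinges on.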
2013 and beyond
All this new stuff doesn't mean that all the old stuff has been jettisoned. Fixed-function elements such as Raster Ops and Z units, for example, are still there with their own caches. "We don't want to get rid of any things that are good in our core," Demers says. "We're going to continue to drive [fixed-function features] forward and continue to put more of those units [on chip] as cost and process allow us."
The read/write cache in the GCN will also be available as a texture cache. "Larger caches, higher throughputs – those are going to benefit texturing as well," Demers says. In addition, true virtual memory will enable such niftiness as being able to pre-compute massive scenes and load portions only as needed, smoothing performance.
"I really am excited about Fusion System Architecture (FSA) and 3D merging," Demers says – excitably, as one might imagine. "Compute, and graphics APIs, and hybrids of all those things – it's really cool."
Unfortunately, you'll have to wait a while to experience that coolness. AMD's Bulldozer-based APU, Trinity, which was demoed on the same Fusion Summit stage two days before Demers' presentation, will be VLIW-based when it appears next year. Best-guesstimates put GCN-based APUs somewhere in the 2013 time frame.
With the introduction of FSA and the GCN – oh, and let's not forget the Bulldozer and Bobcat CPU cores – AMD is betting the farm that the future will belong to heterogeneous computing, where tasks are given to various and sundry cores according to their ability, and distributed from apps according to their need.
For AMD's sake, let's hope that they have seen the future, and that their implementation works better than did the terrestrial analog of that to/from equation. ®