Etched looks to challenge Nvidia with an ASIC purpose-built for transformer models

Startup says Sohu chip will be 20x faster than Nvidia's H100 in Llama 70B … assuming it's actually built

Following ChatGPT's debut in late 2022, GPUs – Nvidia's in particular – have become synonymous with generative AI.

However, given the scale at which AI is now being deployed, some have questioned whether an application-specific approach to transformer models – the fundamental architecture from which large language and diffusion models are derived – could offer greater performance and efficiency than existing accelerators.

This is the bet that AI infrastructure startup Etched is making with its first inference chip, dubbed Sohu. Unlike GPUs – which, despite their name, are very much general-purpose processors – Etched's first product is designed to do one thing and one thing only: serve up transformer models, like LLMs.

The part can't run convolutional neural networks, state space models, or any other kind of AI model – just transformers. By stripping out the flexibility associated with the current crop of accelerators and focusing not just on AI, but on specific kinds of models, Etched claims to achieve a 20x performance advantage over Nvidia's H100.

"If you are willing to specialize – if you're willing to make a bet on the architecture, essentially burn that transformer architecture into the silicon – you can get way more performance, like an order of magnitude more performance," COO Robert Wachen boasted in an interview with The Register.

And in terms of performance, the startup claims the chip will achieve 500,000 tokens per second running Llama 70B, and that a single eight-Sohu server will replace 160 H100s.
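Back of the envelope, those two figures hang together: replacing 160 H100s with eight chips implies exactly the 20x per-chip advantage Etched is claiming, and spreading 500,000 tokens per second across 160 H100s implies roughly 3,100 tokens per second per GPU. A quick sanity check on the arithmetic – using only the article's claimed figures, not measured benchmarks:

```python
# Sanity-checking Etched's claimed figures (claims, not measurements)
sohu_server_tok_per_sec = 500_000  # claimed Llama 70B throughput for one server
sohu_chips_per_server = 8
h100s_replaced = 160               # claimed H100 equivalence

per_chip_speedup = h100s_replaced / sohu_chips_per_server
implied_h100_tok_per_sec = sohu_server_tok_per_sec / h100s_replaced

print(per_chip_speedup)          # 20.0 -- matches the 20x headline claim
print(implied_h100_tok_per_sec)  # 3125.0 tokens/sec per H100
```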

That's a bold claim for a chip that hasn't taped out – and, for the moment, only exists in emulation running on FPGAs. However, the idea that an ASIC would outperform a general-purpose processor, like a GPU, in a narrow enough task, shouldn't come as much of a surprise. In general, ASICs trade functionality and programmability for simplicity of design and blazing fast throughput.

More important for the two-year-old startup is that the prospect of a chip that can drive down the cost of AI inferencing is tantalizing enough to warrant $120 million in a Series A funding round led by Primary Venture Partners and Positive Sum Ventures.

However, raw compute is only one of several factors impacting inference performance. As we've seen with Nvidia's H200 and AMD's MI325X, memory bandwidth and capacity appear to be the bottlenecks to beat.

In this respect, Etched's first chip doesn't look that competitive. Even if its performance claims are to be believed, with 144GB of HBM3 across six stacks, our estimates put its maximum bandwidth somewhere in the neighborhood of 4TB/sec. That puts it well behind the H200 and MI300X – not to mention the MI325X or Blackwell.
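That 4TB/sec figure follows from the standard HBM3 interface: each stack has a 1024-bit bus, and at a pin rate toward the lower end of the HBM3 range – Etched hasn't published specs, so the pin rate here is our assumption – six stacks land right around 4TB/sec:

```python
# Rough HBM3 bandwidth estimate; the pin rate is an assumption,
# as Etched has not published Sohu's memory specs
stacks = 6
bus_width_bits = 1024   # per HBM3 stack (JEDEC standard interface width)
pin_rate_gbps = 5.2     # assumed; HBM3 parts run roughly 5.2-6.4 Gb/s per pin

per_stack_gb_per_sec = bus_width_bits * pin_rate_gbps / 8  # 665.6 GB/s
total_tb_per_sec = stacks * per_stack_gb_per_sec / 1000    # ~4.0 TB/s
print(round(total_tb_per_sec, 1))  # 4.0
```

At the top 6.4Gb/s pin rate the same six stacks would reach about 4.9TB/sec – still short of Blackwell's 8TB/sec.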

While inferencing may be memory bound at lower batch sizes, Wachen argues that this won't necessarily be the case at the levels of utilization Etched is targeting.

As we understand it, the main advantage of Etched's transformer-centric architecture is that it will allow for extremely large batch sizes far beyond what's reasonable on modern GPUs. In the context of a chatbot, you can equate batch size to the number of concurrent queries the chip can process. That's particularly important for services like ChatGPT, Gemini, or Copilot that are serving thousands – if not millions – of requests every second.

"Instead of being able to run batch size 32 and then go to 64 and have high performance degradation, we can run batch sizes in the thousands without any performance degradation," claimed Wachen.

If true, that could give it an advantage over Nvidia's current crop of GPUs.

However, it's worth noting that Etched is still constrained by the available supply of HBM3. An eight-Sohu system is only going to have about 1.1TB on board. What's more, at larger batch sizes, more memory typically needs to be dedicated to the key value (KV) cache – something that could limit how large a model a single Etched system is ultimately able to serve.
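To see why the KV cache bites, consider the published Llama 2 70B configuration (80 layers, eight KV heads via grouped-query attention, head dimension 128) with an fp16 cache – an illustrative sketch, not Etched's actual serving setup. At the "batch sizes in the thousands" Wachen describes, the cache alone can outgrow the node:

```python
# KV cache footprint for a Llama 2 70B-style model
# (published model config; fp16 cache and sequence lengths are assumptions)
layers, kv_heads, head_dim = 80, 8, 128  # Llama 2 70B with grouped-query attention
bytes_per_elem = 2                       # fp16

# K and V each store kv_heads * head_dim values per layer, per token
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 327,680 B

batch, context_len = 2048, 2048          # "batch sizes in the thousands"
cache_tb = batch * context_len * kv_bytes_per_token / 1e12
print(round(cache_tb, 2))  # ~1.37 TB -- already past the ~1.1TB on the node
```

And that's before the model weights themselves, which run to roughly 140GB at fp16 for a 70B-parameter model.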

Etched's approach isn't without its downsides, either. ASICs are great for the one thing they're designed to do – but essentially worthless if you want to do something else.

This makes Etched's business proposition a bet that transformer models will not only be deployed and run at sufficient scale to justify a fixed-function accelerator, but also that the transformer architecture won't give way to different and more efficient approaches to machine learning down the line.

For the moment, Etched's main priority is bringing Sohu to market. So far, it claims to have emulated a slice of the chip on FPGAs, with the intent to tape out in the not too distant future. When? Wachen hesitated to say, but implied heavily that the first chips were less than two years away.

In the long term, the startup believes there will be sufficient demand for even more specialized ASICs tailored to the demands of specific models. We'll see. ®
