Megachips or decoupled approach? AI chip design companies accounting for operating costs

Chip crunch pulls focus


AI chip startups are thinking more about bang-for-the-buck on their processors amid a historic semiconductor shortage and rising prices of silicon.

Billions of dollars have been poured into AI startups, and the talking point has mostly been performance. As AI training models demand ever more computing resources, the emphasis is shifting to the cost of computation on these chips.

Performance-per-dollar on AI chips has "become very important," Naveen Rao, whose AI chip company Nervana Systems was acquired by Intel for $350m in 2016, told The Register. Rao went on to run Intel's AI product group and left the company last year.

"Lots of [dollars] have gone into chip companies, and I think a lot of that has come without proper analysis," said Rao, who started an AI company this year which is still in stealth mode.

There are divergent approaches to AI chip design, and the debate is over whether an integrated megachip or a decoupled approach is more economical. To chip makers this is a familiar battle: a retread of the old argument over whether components should be integrated into a single giant chip or distributed across many processing units on a board or over a network.

Popular AI systems today harness the power of hundreds of Nvidia GPUs distributed across computers. Rao is a proponent of this distributed approach, with AI processing split over a network of cheaper chips and components, including low-cost DDR memory and PCI-Express interconnects.

"The costs of building massive chips is much higher than the tiny chips and cables used to connect multiple chips together. The interconnect cables and chips benefit from economies of scale...these aren't bespoke to AI compute, they are used in many applications," he said.

Cerebras Systems CEO Andrew Feldman threw cold water on Rao's arguments, saying that stringing together a chain of chips into an AI cluster can add to the hardware and electricity bills.

"Let's look at what Tesla did. "Did they use PCI links? No. Did they make a bigger chip? Yes," Feldman told The Register, adding that "nonlinear scaling combined with all the other infrastructure necessary to tie them together is punishingly power inefficient."

Cerebras' own WSE-2 AI megachip, which began shipping in August, is the largest processor ever built. It has 850,000 cores, twice as many as its predecessor, and pushes interconnect bandwidth to 220 Pb/s, which the company claims is more than 45,000 times the bandwidth delivered between graphics processors.
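
As a rough sanity check on those numbers (treating them as purely illustrative, and assuming the comparison point is NVLink-class GPU-to-GPU bandwidth of roughly 600 GB/s on an A100, which the company does not spell out here), the figures are at least self-consistent:

    # Back-of-the-envelope check of Cerebras' 45,000x bandwidth claim.
    # Assumption: the GPU figure being compared against is NVLink-class
    # GPU-to-GPU bandwidth (roughly 600 GB/s on an A100); illustrative only.
    wse2_fabric_bits_per_s = 220e15   # 220 Pb/s on-wafer fabric bandwidth
    claimed_ratio = 45_000            # Cerebras' claimed advantage
    implied_gpu_bytes_per_s = wse2_fabric_bits_per_s / claimed_ratio / 8
    print(f"Implied GPU link: {implied_gpu_bytes_per_s / 1e9:.0f} GB/s")  # ~611 GB/s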

"Our units are expensive, but so is buying [Nvidia] 12 DGX A100s. At every phase, we are less expensive or same as than a comparable amount of GPUs and we use less power," Feldman said.

There are other hidden costs, too, like buying 50 CPUs to connect 200 GPUs. "How do you put those GPUs together? You need giant InfiniBand or Ethernet switches, and each of those has optics pulling power," Feldman said.
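
A minimal sketch of the performance-per-dollar accounting the two camps are arguing over; every figure below is a made-up placeholder rather than a real price or benchmark, and the only point is that host CPUs, switches, and power all land in the bill alongside the accelerators themselves:

    # Toy total-cost-of-ownership model for an accelerator cluster.
    # All numbers are hypothetical placeholders for illustration only.
    def cluster_cost(accelerators, price_each, host_cpus, cpu_price,
                     switches, switch_price, watts, years=3,
                     dollars_per_kwh=0.10):
        hardware = (accelerators * price_each + host_cpus * cpu_price
                    + switches * switch_price)
        energy = watts / 1000 * 24 * 365 * years * dollars_per_kwh
        return hardware + energy

    # Hypothetical 200-GPU cluster with 50 host CPUs and fabric switches.
    gpu_cluster = cluster_cost(200, 15_000, 50, 5_000, 10, 20_000, 120_000)
    # Hypothetical wafer-scale setup: far fewer boxes, higher unit price.
    megachip = cluster_cost(4, 800_000, 4, 5_000, 0, 0, 80_000)
    print(f"GPU cluster ~${gpu_cluster/1e6:.1f}m, megachip ~${megachip/1e6:.1f}m")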

AI chip development is so diverse that you can't take a one-size-fits-all approach, others said. The software may define the hardware, and some chips may facilitate processing on the edge before feeding relevant data into neural nets.

Hardware platforms are developed without much consideration of the software and how it will scale across platforms, said Rehan Hameed, chief technology officer at Deep Vision, in a chat at last month's AI Hardware Summit. The company develops a software development kit that maps AI models onto various hardware.

AI chip design may also play out along the lines of Koomey's Law, the observation that the electrical efficiency of computation doubled roughly every 1.5 years over six decades. The trend also covers sensors, smartphones, and other devices, said Dan Hutcheson, industry analyst at VLSI Research.
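
Stated as a formula (using the roughly 1.5-year doubling period quoted above, and treating it only as an illustrative historical trend rather than a guarantee), Koomey's Law says computations-per-joule grow exponentially with time:

    # Koomey's Law as an illustrative trend: computations per joule
    # roughly double every ~1.5 years (the period cited above).
    DOUBLING_YEARS = 1.5  # assumption taken from the figure in the article

    def efficiency_multiplier(years):
        """Relative improvement in computations per joule after `years`."""
        return 2 ** (years / DOUBLING_YEARS)

    print(f"Implied gain over a decade: ~{efficiency_multiplier(10):.0f}x")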

The CPU battle that raged for decades shifted to energy efficiency after chip makers stopped cranking up clock frequencies. AI applications are getting more complex, and there will be a limit to the amount of electricity that can be thrown at solving those problems, Hutcheson said.

"The problem is with self driving cars and electric cars. The car's AI system should not consume half power of electrical mileage," Hutcheson said. ®
