Intel finds cure for CPU old age
The self-healing chip
ISSCC Intel has developed a research microprocessor that it claims can improve throughput of degraded chips or chip environments by over 40 per cent. Such degradation might involve variations in supply voltage, temperature changes, or simply aging transistors.
As described by Intel staff research scientist Keith Bowman on Tuesday at the International Solid-State Circuits Conference (ISSCC) in San Francisco, these degradations can affect a chip's signal timing. To shield a processor from timing muck-ups, chip designers generally insert guardbands into the data flow, but Intel's new research chip works to minimize the use of guardbands.
The term "guardband" has different meanings in telecom and magnetic-media recording, but in microprocessor design, it refers to a timing differential between, say, data and clock signals. Since signal rates can vary, an extra slice of time - the guardband - is inserted into the design to allow for signals to communicate without being perfectly aligned.
The problem with the guardband method is that it wastes energy and time - and, in processor design, wasted time equals reduced throughput. The Intel research chip works to minimize the need for guardbands by detecting timing errors as they occur and correcting them dynamically.
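To make the cost concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes a simplified model in which the clock period must cover a critical-path delay plus the guardband, and throughput scales with clock frequency. The function names and the numbers in the usage note are illustrative assumptions, not figures from Intel's presentation.

```python
def clock_frequency_ghz(path_delay_ns: float, guardband_ns: float = 0.0) -> float:
    """Highest clock frequency (GHz) whose period covers path delay plus guardband.

    Simplified model (an assumption): period = path delay + guardband.
    """
    period_ns = path_delay_ns + guardband_ns
    if period_ns <= 0:
        raise ValueError("clock period must be positive")
    return 1.0 / period_ns


def throughput_loss(path_delay_ns: float, guardband_ns: float) -> float:
    """Fraction of the guardband-free clock frequency given up to the guardband."""
    ideal = clock_frequency_ghz(path_delay_ns)
    guarded = clock_frequency_ghz(path_delay_ns, guardband_ns)
    return 1.0 - guarded / ideal
```

With a hypothetical 1.0ns critical path and a 0.25ns guardband, `throughput_loss(1.0, 0.25)` comes out at roughly 0.2: a fifth of the potential clock rate is spent on margin that is only needed under worst-case conditions.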
To mitigate guardband inefficiency, Bowman and his team developed a chip that contains "resilient and adaptive circuits." The keys to the design are embedded error-detection sequentials (EDS) and tunable replica circuits (TRC).
Boiled down to their essentials, EDS and TRC work together to detect errors, replay the affected instructions as necessary to get correct results, and dynamically tune the chip's clock speed to allow for more error-free operation.
"The key research contributions of this work lie in the error detection, the error-correction circuits, as well as the adaptive clock controller," Bowman said.
"For error detection, we implement error-detection sequentials on actual critical paths in the core. And then we introduce the first implementation where we use a tunable replica circuit combined with error recovery that allows us to detect both fast-changing and slow-changing dynamic variations."
These types of error-reduction technologies have been around for a while - Bowman himself has previously presented research results on both EDS and TRC. Among the refinements this time around, he said, is a new recovery algorithm.
In addition, if the error-control unit sees the error rate pass a selected threshold - meaning the chip is being forced to replay many instructions - it will trigger the adaptive clock controller to slow the chip's clock until the errors and replays decrease. When all is well again, it will then boost the clock back up.
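The threshold behaviour described above can be sketched as a simple software model. This is a rough Python illustration only, not Intel's circuit: the class name, the fixed frequency step, the one per cent threshold, and the frequency limits are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveClockController:
    """Toy model: slow the clock when replays are frequent, speed it up otherwise."""

    freq_mhz: float                 # current clock frequency
    max_freq_mhz: float             # upper limit when errors are rare
    min_freq_mhz: float             # lower limit the controller will not go below
    step_mhz: float = 50.0          # hypothetical adjustment per interval
    error_threshold: float = 0.01   # hypothetical: errors per instruction

    def update(self, errors: int, instructions: int) -> float:
        """Adjust the clock after one monitoring interval and return the new frequency."""
        if instructions <= 0:
            raise ValueError("instructions must be positive")
        error_rate = errors / instructions
        if error_rate > self.error_threshold:
            # Too many replays: back the clock off, but not below the floor.
            self.freq_mhz = max(self.min_freq_mhz, self.freq_mhz - self.step_mhz)
        else:
            # Errors are rare again: creep back up toward the ceiling.
            self.freq_mhz = min(self.max_freq_mhz, self.freq_mhz + self.step_mhz)
        return self.freq_mhz
```

In this sketch, a controller started at 3,000MHz that sees 50 errors in 1,000 instructions steps down to 2,950MHz, then climbs back to 3,000MHz once an interval passes without errors. A real controller would likely use hysteresis or finer-grained steps; the fixed step here just keeps the example short.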
Among the examples of the combined EDS and TRC effect, Bowman presented results comparing his team's chip with a garden-variety chip without these enhancements to show that his team's techniques can be used to either improve throughput or save energy.
Bowman said that if his team lowers the supply voltage of a conventional chip design so that it uses the same energy as the design with EDS and TRC, the resilient circuits achieve a 41 per cent throughput gain over the conventional design. And if they raise the supply voltage of the conventional design so that the two have equal throughput, the resilient design achieves a 22 per cent energy reduction. ®