Computer chips have advanced to the point that they’re no longer reliable: they’ve become “mercurial,” as Google puts it, and may not perform their calculations in a predictable manner. Not that they were ever completely reliable. CPU errors have been around as long as CPUs themselves.
They arise not only from design oversights but also from environmental conditions and from physical system failures that produce faults.
But these errors have tended to be rare enough that only the most sensitive calculations are subjected to extensive verification when systems appear to be operating as expected. Mostly, computer chips are treated as trustworthy.
Lately, however, two of the world’s larger CPU stressors, Google and Facebook, have been detecting CPU misbehavior more frequently, enough that they’re now urging technology companies to work together to better understand how to spot these errors and remediate them.
“Our adventure began as vigilant production teams increasingly complained of recidivist machines corrupting data,” said Peter Hochschild, a Google engineer, in a video presented as a part of the Hot Topics in Operating Systems (HotOS) 2021 conference this week.
“These machines were credibly accused of corrupting multiple different stable well-debugged large-scale applications. Each machine was accused repeatedly by independent teams but conventional diagnostics found nothing wrong with them.”
Looking more deeply at the code involved and operational telemetry from their machines, Google engineers began to suspect problems with their hardware. Their investigation found that the incidence of hardware errors was higher than expected, and that these issues showed themselves sporadically, long after installation, and on specific, individual CPU cores rather than entire chips or a family of parts.
The Google researchers examining these silent corrupt execution errors (CEEs) concluded “mercurial cores” were to blame – CPUs that miscalculated occasionally, under different circumstances, in a way that defied prediction. (That’s mercurial as in unpredictable, not Mercurial as in the version control system of the same name.)
The errors are not the result of chip architecture design missteps, and they are not detected during manufacturing tests. Rather, Google engineers theorize, the errors have arisen because we’ve pushed semiconductor manufacturing to a point where failures have become more frequent and we lack the tools to identify them in advance.
In a paper titled “Cores that don’t count,” Hochschild and colleagues Paul Turner, Jeffrey Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David Culler, and Amin Vahdat cite several plausible reasons why the unreliability of computer cores is only now receiving attention, including larger server fleets that make rare problems more visible, increased attention to overall reliability, and software development improvements that reduce the rate of software bugs.
“But we believe there is a more fundamental cause: ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design,” the researchers state, noting that existing verification methods are ill-suited for spotting flaws that occur sporadically or as a result of physical deterioration after deployment.
Google’s not alone
Facebook has noticed the errors, too. In February, the social ad biz published a related paper, “Silent Data Corruption at Scale,” that states, “Silent data corruptions are becoming a more common phenomena in data centers than previously observed.” The paper proposes mitigation strategies, though it doesn’t address the root cause.
As Google’s researchers see it, Facebook spotted a symptom of unreliable cores – silent data corruption. But identifying the cause of the problem, and coming up with a fix, will require further work.
The risks posed by misbehaving cores include not only crashes, which the existing fail-stop model for error handling can accommodate, but also incorrect calculations and data loss, which may go unnoticed and pose a particular risk at scale.
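To make the distinction concrete: a crash announces itself, while a wrong answer does not unless something checks it. One generic way to turn a silent miscalculation into a visible failure is to run the same computation more than once and compare the results. The following is a minimal sketch of that idea in Python, not a description of what Google or Facebook actually run; the names `run_redundantly` and `SilentCorruptionSuspected` are hypothetical. It also assumes the computation is deterministic, and in practice the repeat runs would need to land on different cores, since a faulty core may repeat the same wrong answer.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class SilentCorruptionSuspected(RuntimeError):
    """Raised when redundant runs of the same computation disagree."""


def run_redundantly(compute: Callable[[], T], runs: int = 2) -> T:
    """Run a deterministic computation several times and compare results.

    Hypothetical helper for illustration only. A disagreement is converted
    into an exception, which is the kind of loud failure that fail-stop
    error handling can deal with.
    """
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    results = [compute() for _ in range(runs)]
    first = results[0]
    if any(result != first for result in results[1:]):
        raise SilentCorruptionSuspected(f"results disagree: {results!r}")
    return first


if __name__ == "__main__":
    # A healthy computation passes through unchanged.
    print(run_redundantly(lambda: sum(range(1000))))
```

The cost is obvious: every checked computation is paid for at least twice, which is one reason such checks tend to be reserved for the most sensitive work.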
Hochschild recounted an instance where Google’s errant hardware conducted what might be described as an auto-erratic ransomware attack.
“One of our mercurial cores corrupted encryption,” he explained. “It did it in such a way that only it could decrypt what it had wrongly encrypted.”
Google’s researchers declined to reveal the CEE rates detected at the company’s data centers, citing “business reasons,” though they provided a ballpark figure “on the order of a few mercurial cores per several thousand machines – similar to the rate reported by Facebook.”
Ideally, Google would like to see automated methods to identify mercurial cores and has suggested strategies like CPU testing throughout the chip’s lifecycle rather than relying only on burn-in testing prior to deployment. The mega-corp is currently relying on human-driven core integrity interrogation, which is not particularly accurate, because tools and techniques for identifying dubious cores remain works in progress.
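As a rough illustration of what testing individual cores after deployment could look like, the sketch below pins the current process to each available core in turn and runs a small known-answer test there. It is an assumption-laden toy, not Google's tooling: the function names and the workload are hypothetical, the core pinning relies on `os.sched_setaffinity`, which Python exposes only on some platforms such as Linux, and a single arithmetic loop would be unlikely to catch defects that appear only under particular instructions, data, or operating conditions.

```python
import contextlib
import os
from typing import Dict

# Sum of i*i for i in 1..1000, i.e. 1000 * 1001 * 2001 / 6, precomputed by hand.
EXPECTED_SUM_OF_SQUARES = 333_833_500


def known_answer_test() -> bool:
    """Hypothetical toy workload: compare a computed value to a stored answer."""
    total = 0
    for i in range(1, 1001):
        total += i * i
    return total == EXPECTED_SUM_OF_SQUARES


def screen_cores() -> Dict[int, bool]:
    """Run the known-answer test once per available core.

    Returns a mapping of core id to pass/fail. If pinning is unavailable,
    the test runs once wherever the scheduler puts it, reported as core -1.
    """
    results: Dict[int, bool] = {}
    if hasattr(os, "sched_setaffinity"):
        original = os.sched_getaffinity(0)
        try:
            for core in sorted(original):
                try:
                    os.sched_setaffinity(0, {core})
                except OSError:
                    continue  # pinning refused for this core; skip it
                results[core] = known_answer_test()
        finally:
            # Restore the original affinity mask, ignoring refusal.
            with contextlib.suppress(OSError):
                os.sched_setaffinity(0, original)
    if not results:
        results[-1] = known_answer_test()
    return results


if __name__ == "__main__":
    for core, passed in screen_cores().items():
        print(f"core {core}: {'ok' if passed else 'SUSPECT'}")
```

Running a screen like this periodically, rather than once before deployment, matches the lifecycle testing described above in spirit, though a failing result here would only mark a core as a suspect for further investigation.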
“In our recent experience, roughly half of these human identified suspects are actually proven, on deeper investigation, to be mercurial cores – we must extract ‘confessions’ via further testing (often after first developing a new automatable test),” Google’s researchers explain. “The other half is a mix of false accusations and limited reproducibility.”