Cores that don't count

submitted by

Style Pass

2021-06-26 07:30:06

So Google found fail-silent corrupt execution errors (CEEs) in CPU cores. This is interesting because we assumed that tested CPUs do not have logic errors, and that if they did, the error would be fail-stop, or at least a fail-noisy hardware error that triggers a machine check. We already knew about fail-silent storage and network errors due to bit flips, but CEEs are new because they are computation errors. While it is easy to detect data corruption due to bit flips, it is hard to detect CEEs because they are rare and require expensive methods to detect and correct in real time.
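To make the contrast concrete, here is a minimal Python sketch, not anything from the paper. It assumes a CRC32 checksum as the bit-flip detector and models a defective core with a hypothetical `faulty_sum` function that is silently off by one for some inputs. The names `checksum_ok`, `flip_bit`, `healthy_sum`, `faulty_sum`, and `checked` are all invented for illustration. The checksum catches the flipped bit, but it happily validates a wrong result that was computed before the checksum was taken. Only running the computation twice and comparing catches that, which is where the expense comes from.

```python
import zlib


def checksum_ok(payload: bytes, expected_crc: int) -> bool:
    """Return True if the payload still matches the CRC32 recorded earlier."""
    return zlib.crc32(payload) == expected_crc


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Simulate a storage or network bit flip at the given bit position."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def healthy_sum(values):
    """Stand-in for a computation on a correctly working core."""
    return sum(values)


def faulty_sum(values):
    """Hypothetical stand-in for a defective core (an assumption for
    illustration): silently off by one when the input length is a
    multiple of 4, correct otherwise."""
    total = sum(values)
    return total + 1 if len(values) % 4 == 0 else total


def checked(compute_a, compute_b, values):
    """Redundant execution: run the computation twice (ideally on
    different cores) and refuse to return a result on disagreement.
    This doubles the cost of every protected computation."""
    a = compute_a(values)
    b = compute_b(values)
    if a != b:
        raise RuntimeError(f"results disagree: {a} != {b}")
    return a


if __name__ == "__main__":
    data = b"some stored record"
    crc = zlib.crc32(data)
    # A bit flip in the data is caught by the checksum.
    print(checksum_ok(flip_bit(data, 3), crc))  # False

    # A wrong computation is checksummed *after* it went wrong,
    # so the checksum validates the corrupt result.
    wrong = str(faulty_sum([1, 2, 3, 4])).encode()
    print(checksum_ok(wrong, zlib.crc32(wrong)))  # True

    # Only the redundant run notices the disagreement.
    try:
        checked(healthy_sum, faulty_sum, [1, 2, 3, 4])
    except RuntimeError as err:
        print("detected:", err)
```

Note that the redundant check only helps when the two runs do not share the same defect, and a real faulty core rarely fails as predictably as this toy one does.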

These errors are mostly due to ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design. Together, these create new challenges for the verification methods that chip makers use to detect diverse manufacturing defects, especially defects that manifest only in corner cases (under certain voltage, frequency, or temperature) or only after post-deployment aging. Chip manufacturing is magic, and with 5nm technology some gates are only about 10 atoms long, which can lead to flaky behavior.

As the paper puts it: CEEs are harder to root-cause than software bugs, which we usually assume we can debug by reproducing on a different machine. In just a few cases, we can reproduce the errors deterministically; usually the implementation-level and environmental details have to line up. Data patterns can affect corruption rates, but it’s often hard for us to tell. Some specific examples where we have seen CEE: