Ecache SRAM Data Parity Error

This page was built using information found on everything2.com.
 
This error message does in fact indicate an ecache problem on an UltraSPARC-II based server. Some little-known facts about the problem:
  • It affected all UltraSPARC-II based systems, though it was more likely to appear on machines with larger numbers of CPUs and on CPUs with larger caches. As a result, the problem did not become widespread until a lot of Enterprise class servers with CPUs with 8 MB of ecache were in the field.
  • It was a "soft error", in the sense that a CPU module that had produced the error once was no more likely to produce it a second time than any other module was to produce it the first time.
  • No one inside Sun has been able to find a better explanation for the cause of the problem than cosmic rays. And actually, when you think about it, 8 MB of non-error-correcting SRAM is a pretty good cosmic ray detector.
  • A number of factors seemed to make CPU modules more likely to have this error. In at least one case, a machine that had been experiencing the problem was found to be in a "hot spot" in the computer room, where a quirk of the HVAC system didn't circulate enough cold air to keep the air exiting the machine within specification. After the machine was moved, the problems went away. In another case, moving the machine away from an elevator shaft stopped the problems; no one is sure whether the contributing factor there was electromagnetic fields or vibration from the elevator. And, annoyingly enough, modules that had been installed in the field were more likely to exhibit the problem than modules installed in the factory. This led to situations where Sun was trying to explain to customers that they'd be better off not replacing CPU modules that had failed, while the customers were insisting on replacements.
  • The problem was more likely to occur on lightly loaded systems. This makes sense once you consider that cache entries on busy systems are invalidated and reloaded much more frequently, so any given entry spends less time sitting in the cache where it can pick up a flipped bit.
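The lightly-loaded observation can be made concrete with a toy model. This is only a sketch under stated assumptions: upsets are treated as a Poisson process hitting each cache line independently, and a line is considered exposed only for as long as it stays resident before being reloaded. The flip rate and residency times below are made-up illustrative numbers, not measurements from any Sun system.

```python
import math

def p_upset_during_residency(flip_rate_per_line_s: float, residency_s: float) -> float:
    """Probability that at least one upset hits a line while it sits in the cache.

    Assumes upsets arrive as a Poisson process, so
    P(at least one) = 1 - exp(-rate * time).
    expm1 keeps precision when rate * time is tiny.
    """
    return -math.expm1(-flip_rate_per_line_s * residency_s)

# Hypothetical numbers, chosen only to show the shape of the effect.
FLIP_RATE = 1e-9          # upsets per cache line per second (assumed)
BUSY_RESIDENCY = 0.01     # seconds a line survives on a busy system (assumed)
IDLE_RESIDENCY = 100.0    # seconds a line survives on an idle system (assumed)

busy = p_upset_during_residency(FLIP_RATE, BUSY_RESIDENCY)
idle = p_upset_during_residency(FLIP_RATE, IDLE_RESIDENCY)

print(f"busy system: {busy:.3e} per line residency")
print(f"idle system: {idle:.3e} per line residency")
print(f"idle / busy: {idle / busy:.0f}x")
```

In this model the chance that a line is corrupt when it is finally read grows almost linearly with how long it has been sitting there, which is the intuition behind the bullet above.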
The UltraSPARC-III CPUs from Sun have ECC protection on the ecache.
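To see why ECC matters here, compare what parity and an error-correcting code can each do with a single flipped bit. The sketch below uses a textbook Hamming(7,4) code purely because it is short; the source does not say which ECC scheme the UltraSPARC-III ecache uses, and all function names are illustrative.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") & 1

def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword (bit i = position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # checks positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # checks positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # checks positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))

def hamming74_decode(code: int) -> tuple:
    """Return (nibble, error_position); error_position 0 means no error seen."""
    bits = [(code >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos           # XOR of the positions of all set bits
    if syndrome:
        bits[syndrome - 1] ^= 1       # the syndrome names the flipped position
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome

# Parity: a single flip is detected, but nothing says which bit to fix.
word = 0b10110010
stored_parity = parity_bit(word)
corrupted = word ^ (1 << 3)
parity_detects = parity_bit(corrupted) != stored_parity

# Hamming(7,4): the same kind of single flip is located and repaired.
codeword = hamming74_encode(0b1011)
repaired, flipped_at = hamming74_decode(codeword ^ (1 << 4))

print("parity detected the flip:", parity_detects)
print("ECC recovered nibble:", bin(repaired), "after a flip at position", flipped_at)
```

With parity alone, the only safe response to a mismatch is to report an error, which is what the message in the title does. A correcting code can repair the single-bit case silently.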

Cosmic rays actually do cause bit rot. A study in the 1980s by IBM placed RAM testers in Boulder, Colorado; Leadville, Colorado; New York City; and underground in Kansas City. Boulder had five times as many errors as New York, and Leadville had ten times as many. The elevation of the towns has a lot to do with it: Leadville, at 10,152 ft, has less atmosphere above it to absorb sub-atomic particles. Boulder is at about 5,000 ft, and New York is at sea level. However, the shape of the earth's magnetic field has a lot to do with it, too. La Paz has a similar altitude to Leadville's, but is at a different latitude.
The sub-atomic particles that make up cosmic rays knock electrons out of orbit, generating just enough voltage to send a gate into the wrong digital state.
The effects get worse with smaller components. Makers of modern microprocessors have to be very careful about terrestrial radiation, as well. If their chip fab becomes radioactive, it will not turn out working chips.
All of this can be found in the IBM Journal of Research and Development, Volume 40, Number 1.
So keep your smoke detector away from your chips!

cosmic rays n.

Notionally, the cause of bit rot. However, this is a semi-independent usage that may be invoked as a humorous way to handwave away any minor randomness that doesn't seem worth the bother of investigating. "Hey, Eric -- I just got a burst of garbage on my tube, where did that come from?" "Cosmic rays, I guess." Compare sunspots, phase of the moon. The British seem to prefer the usage `cosmic showers'; `alpha particles' is also heard, because stray alpha particles passing through a memory chip can cause single-bit errors (this becomes increasingly more likely as memory sizes and densities increase).

Factual note: Alpha particles cause bit rot, cosmic rays do not (except occasionally in spaceborne computers). Intel could not explain random bit drops in their early chips, and one hypothesis was cosmic rays. So they created the World's Largest Lead Safe, using 25 tons of the stuff, and used two identical boards for testing. One was placed in the safe, one outside. The hypothesis was that if cosmic rays were causing the bit drops, they should see a statistically significant difference between the error rates on the two boards. They did not observe such a difference. Further investigation demonstrated conclusively that the bit drops were due to alpha particle emissions from thorium (and to a much lesser degree uranium) in the encapsulation material. Since it is impossible to eliminate these radioactives (they are uniformly distributed through the earth's crust, with the statistically insignificant exception of uranium lodes) it became obvious that one has to design memories to withstand these hits.
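The two-board comparison boils down to asking whether two error counts could plausibly come from the same underlying rate. One simple way to check, sketched below, is an exact conditional test: if both boards ran for the same time and have the same true error rate, then given the total number of errors, the count on one board follows a Binomial(n, 1/2) distribution. The counts used here are hypothetical; the source does not report the actual numbers, nor which statistical test Intel used.

```python
from math import comb

def two_sided_p_equal_rates(count_a: int, count_b: int) -> float:
    """Exact two-sided p-value for 'both boards have the same error rate'.

    Assumes equal exposure time on both boards. Conditional on the total
    n = count_a + count_b, count_a ~ Binomial(n, 0.5) under the null.
    The p-value sums every outcome no more likely than the observed one.
    """
    n = count_a + count_b
    if n == 0:
        return 1.0
    observed = comb(n, count_a)
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n

# Hypothetical error counts, for illustration only.
shielded, unshielded = 48, 52
p = two_sided_p_equal_rates(shielded, unshielded)
print(f"shielded={shielded}, unshielded={unshielded}, p={p:.3f}")
print("significant at 5%:", p < 0.05)
```

Counts as close as these give a large p-value, which is the "no statistically significant difference" outcome described above. A lopsided split such as 10 versus 50 would give a very small one.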


Last changes: Friday, February 09, 2007 03:32:13 PM,
:P 2003 filibeto.org