Internal Error Indications
Hardware may detect a variety of problems during operation, ranging
from soft errors which have already been corrected by the time they are
reported, to hard errors of such severity that the OS (and perhaps the
hardware) cannot meaningfully continue operation. The mechanisms
described in
are used to report such errors to the
OS. This section describes the architectural sources of errors and a
method that platforms can use to report them. All OSs
need to be prepared to encounter the errors reported as they are
described here. However, on some platforms more sophisticated handling
may be introduced via RTAS, and the OS may not have to handle the error
directly. More robust error detection, reporting, and correction are at
the option of the hardware vendor.
The
primary architectural mechanism for indicating hardware errors to an OS
is the machine check interrupt. Surfacing an error condition by
placing the system in checkstop precludes any immediate participation
by the OS in handling the error (that is, no error capture, logging,
recovery, analysis, or notification by the OS). For this reason, the
machine check interrupt is preferred over going to the checkstop state.
However, checkstop may be necessary in certain situations where further
processing represents an exposure to loss of data integrity. To better
handle such cases, a special hardware mechanism may be provided to gather
and store residual error data, to be analyzed when the system goes
through a subsequent successful reboot.
Less critical internal errors may also be signaled to the OS
through a platform-specific interrupt in the “Internal
Errors” class, or by periodic polling with the
event-scan RTAS function.
Architecture Note: The machine check interrupt will not be listed
in the OF node for the “Internal Errors” class, since it is a
standard architectural mechanism. The machine check interrupt mechanism
is enabled from software or firmware by setting the MSR[ME] bit to 1. Upon
the occurrence of a machine check interrupt, bits in SRR1 will indicate
the source of the interrupt and SRR0 will contain the address of the next
instruction that would have been executed if the interrupt had not
occurred. Depending on where the error is detected, the machine check
interrupt may be surfaced from within the processor, via logical
connection to the processor machine check interrupt pin, or via a system
bus error indicator (for example, Transfer Error Acknowledge -
TEA).
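The behavior in the note above can be modeled in a few lines. This is only a sketch under stated assumptions: the mask value for MSR[ME] and the function names are illustrative, the SRR1 source bits are treated as an opaque value because their layout is processor-specific, and the checkstop outcome when MSR[ME] = 0 is taken from general Power ISA behavior rather than from this section.

```python
# Illustrative model only; not firmware or OS code.
MSR_ME = 1 << 12  # assumed mask for the MSR[ME] (Machine Check Enable) bit


def enable_machine_check(msr: int) -> int:
    """Return the MSR value with MSR[ME] set to 1."""
    return msr | MSR_ME


def on_machine_check(msr: int, next_insn_addr: int, source_bits: int) -> dict:
    """Model what the processor does when a machine check condition occurs.

    With MSR[ME] = 1 the interrupt is taken: SRR0 holds the address of the
    next instruction that would have executed, and SRR1 carries bits that
    identify the source. With MSR[ME] = 0 this model assumes checkstop.
    """
    if not msr & MSR_ME:
        return {"outcome": "checkstop"}
    return {
        "outcome": "machine check interrupt",
        "SRR0": next_insn_addr,
        "SRR1": source_bits,
    }
```

For example, `on_machine_check(enable_machine_check(0), 0x2004, 0x8)` reports a machine check interrupt with SRR0 equal to 0x2004, while the same call with an MSR of 0 reports checkstop.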
R1--1.
OSs must set
MSR[ME]=1 prior to the occurrence of a machine check interrupt in order to
enable machine check processing via the
check-exception RTAS function.
Architecture Note: Requirement
is not applicable when the FWNMI
option is used.
R1--2.
For
hardware-detected errors, platforms must generate error indications as
described in
, unless the error can be handled
through a less severe platform-specific interrupt, or the nature of the
error forces a checkstop condition.
R1--3.
Platforms which detect and report the
errors described in
must provide information to the OS
via the RTAS
check-exception function, using the reporting format
described in
.
R1--4.
To prevent error
propagation and allow for synchronization of error handling, all
processors in a multi-processor system must receive any machine check
interrupt signaled via the external machine check interrupt pin.
Platform Implementation Notes:
The intent of Requirement
is to define standard error
notification mechanisms for different hardware error types. For most
hardware errors, the signaling mechanism is the machine check interrupt,
although this requirement hints at the use of a less severe
platform-specific interrupt for some errors. The important point here is
actually whether the error can be handled through that interrupt. Simply
using an external interrupt to signal the error is not sufficient. The
hardware and RTAS also need to limit the propagation of corrupted data,
prevent loss of error state data, and support the cleanup and recovery of
such an error. Since the response to an external interrupt may be
significantly slower than a machine check, and in fact may be masked, the
error should not require immediate action on the part of the OS and/or
RTAS. In addition, external interrupts (except external machine check
interrupts) are reported to only one processor, so operations by the
other processors in an MP system should not be impacted by this error
unless they specifically try to access the failing hardware element. To
summarize, platforms should not use platform-specific interrupts to
signal hardware errors unless there is a complete hardware/RTAS platform
solution for handling such errors.
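The conditions summarized in the preceding paragraph can be read as a checklist. The sketch below expresses that checklist as a predicate; the class and field names are hypothetical and exist only to make the conditions explicit, not to describe any real platform interface.

```python
from dataclasses import dataclass


@dataclass
class ErrorHandlingProfile:
    """Hypothetical description of how a platform handles one error type."""
    limits_corrupt_data_propagation: bool
    preserves_error_state: bool
    supports_cleanup_and_recovery: bool
    needs_immediate_action: bool
    impacts_other_processors: bool


def may_use_platform_interrupt(p: ErrorHandlingProfile) -> bool:
    """True only if every condition from the implementation note holds."""
    return (
        p.limits_corrupt_data_propagation
        and p.preserves_error_state
        and p.supports_cleanup_and_recovery
        and not p.needs_immediate_action
        and not p.impacts_other_processors
    )
```

If any one condition fails, the predicate returns False, which corresponds to signaling the error through the machine check interrupt instead.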
The intent of Requirement
is that most hardware errors would be
signaled simultaneously to all processors, so that processors could
synchronize the error handling process; that is, one processor would be
chosen to do the call to
check-exception, while the other processors remained
idle so that they would not interfere with RTAS while it gathered machine
check error data. While this is a straightforward wiring solution for
errors signaled via the external machine check interrupt pin, that is not
the case for internal processor errors or processor bus errors.
Typically, only one processor will see such errors. An internal processor
error can be identified with just the contents of SRR1, and so can be
handled without synchronization with other processors. Processor bus
errors, however, can be more difficult, especially if the error is
propagated up to the processor bus from a lower-level bus. In general,
such propagation should be avoided, and such errors should be signaled
through the machine check interrupt pin to ensure proper error
handling.
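The synchronization described above, in which one processor is chosen to call check-exception while the others remain idle, can be sketched with threads standing in for processors. This is a simulation under stated assumptions: `check_exception_stub` is a hypothetical placeholder for the RTAS call, and the election rule (whichever thread the first barrier ranks as 0) is one possible choice, not something this section prescribes.

```python
import threading


def handle_machine_check(n_cpus: int):
    """Simulate n_cpus processors receiving the same machine check."""
    arrive = threading.Barrier(n_cpus)  # all processors enter the handler
    done = threading.Barrier(n_cpus)    # all wait until the RTAS call ends
    callers = []
    released = []

    def check_exception_stub(cpu: int) -> None:
        # Hypothetical stand-in for the check-exception RTAS call.
        callers.append(cpu)

    def cpu_handler(cpu: int) -> None:
        rank = arrive.wait()   # each thread receives a unique rank
        if rank == 0:          # exactly one processor is elected
            check_exception_stub(cpu)
        done.wait()            # the others idle here until the call is done
        released.append(cpu)

    threads = [threading.Thread(target=cpu_handler, args=(c,))
               for c in range(n_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return callers, sorted(released)
```

Calling `handle_machine_check(4)` returns a caller list with exactly one entry, and all four simulated processors are released only after that single call completes.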
Error Indication Mechanisms
The table “Error Indications for System Operations” describes the mechanisms by
which software will be notified of the occurrence of operational failures
during the types of data transfer operations listed below. The assumption
here is that the error notification can occur only if a hardware
mechanism for error detection (for example, a parity checker) is present.
In cases where there is no specific error detection mechanism, the
resulting condition, and whether the software will eventually recognize
that condition as a failure, is undefined.
Error Indications for System Operations

| Initiator | Target | Operation | Error Type (if detected) | Indication to Software | Comments |
|---|---|---|---|---|---|
| Processor | N/A | Internal | Various | Machine check | Some may cause checkstop |
| Processor | Memory | Load | Invalid address | Machine check | |
| Processor | Memory | Load | System bus time-out | Machine check | |
| Processor | Memory | Load | Address parity on system bus | Machine check | |
| Processor | Memory | Load | Data parity on system bus | Machine check | |
| Processor | Memory | Load | Memory parity or uncorrectable ECC | Machine check | |
| Processor | Memory | Store | Invalid address | Machine check | |
| Processor | Memory | Store | System bus time-out | Machine check | |
| Processor | Memory | Store | Address parity on system bus | Machine check | |
| Processor | Memory | Store | Data parity on system bus | Machine check | |
| Processor | Memory | External cache load | Memory parity or uncorrectable ECC | Machine check | Associated with Instruction Fetch or Data Load |
| Processor | Memory | External cache flush | Cache parity or uncorrectable ECC | Machine check | |
| Processor | Memory | External cache access | Cache parity or uncorrectable ECC | Machine check | Associated with Instruction Fetch or Data Transfer |
| Processor | I/O | Load or Store | Various errors between the processor and the I/O fabric | Machine check | I/O fabrics include hubs and bridges and interconnecting buses or links. |
| Processor | I/O bus configuration space | Read | Various, except no response from IOA | Firmware receives a machine check, OS receives all-1’s data along with a Status of -1 from the RTAS call | If EEH is implemented and enabled, firmware does not get a machine check and the PE is in the EEH Stopped State on return from the RTAS call |
| Processor | I/O bus configuration space | Read | No response from an IOA | All-1’s data returned, along with a “Success” Status from the RTAS call | If EEH is implemented and enabled, the PE is not in the EEH Stopped State on return from the RTAS call |
| Processor | I/O bus configuration space | Write | Various, except no response from IOA | Firmware receives a machine check, OS receives a Status of -1 from the RTAS call | If EEH is implemented and enabled, firmware does not get a machine check and the PE is in the EEH Stopped State on return from the RTAS call |
| Processor | I/O bus configuration space | Write | No response from an IOA | Operation ignored, OS receives a “Success” Status from the RTAS call | If EEH is implemented and enabled, the PE is in the EEH Stopped State on return from the RTAS call |
| Processor | I/O bus; I/O Space or Memory Space | Load | Various, except no response from IOA | Machine check | If EEH is implemented and enabled, no machine check is received, all-1’s data is returned, and the PE enters the EEH Stopped State |
| Processor | I/O bus; I/O Space or Memory Space | Load | No response from an IOA | All-1’s data returned | Invalid address, broken IOA, or configuration cycle to non-existent IOA; if EEH is implemented and enabled, the PE enters the EEH Stopped State |
| Processor | I/O bus; I/O Space or Memory Space | Store | Various, except no response from IOA | Machine check | If EEH is implemented and enabled, no machine check is received and the PE enters the EEH Stopped State |
| Processor | I/O bus; I/O Space or Memory Space | Store | No response from IOA | Ignore Store | Invalid address, broken IOA, or configuration cycle to non-existent IOA; if EEH is implemented and enabled, the PE enters the EEH Stopped State |
| Processor | Invalid address (addressing an “undefined” address area) | Load or Store | No response from system | Machine check | |
| I/O | Memory | DMA - either direction | Various, including but not limited to: invalid address accepted by a bridge; TCE extent; TCE Page Mapping and Control mis-match or invalid TCE | Machine check unless reportable directly to the IOA in a way that allows the IOA to signal the error to its device driver | If EEH is implemented and enabled, no machine check is received and the PE enters the EEH Stopped State |
| I/O | I/O | DMA - either direction | Various | Machine check unless reportable to master of the transfer in a way that allows master to recover | |
| I/O | Invalid address | DMA - either direction | No response from any IOA | PCI IOA master-aborts | Signal device driver using an external interrupt |
| PCI IOA | - | Any | Internal, indicated by SERR# or ERR_FATAL | SERR# or ERR_FATAL, causing machine check | If EEH is implemented and enabled, no machine check is received and the PE enters the EEH Stopped State |

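As an illustration of how an OS might interpret the configuration space Read rows of the table, the sketch below classifies the result of such a read. It is a minimal sketch under stated assumptions: the function and constant names are hypothetical, the “Success” Status is assumed to be 0, and the `pe_stopped` flag stands in for whatever means the OS uses to learn that the PE is in the EEH Stopped State.

```python
RTAS_SUCCESS = 0    # assumed numeric value of the "Success" Status
RTAS_HW_ERROR = -1  # Status of -1, as listed in the table


def all_ones(width_bytes: int) -> int:
    """All-1's data pattern for a read of the given width in bytes."""
    return (1 << (8 * width_bytes)) - 1


def classify_config_read(data: int, status: int, width_bytes: int = 4,
                         pe_stopped: bool = False) -> str:
    """Map a configuration space read result onto the table's rows."""
    if pe_stopped or status == RTAS_HW_ERROR:
        # "Various" errors: Status of -1, or the PE stopped when EEH is on.
        return "error"
    if status != RTAS_SUCCESS:
        return "other status"  # statuses not covered by this table
    if data == all_ones(width_bytes):
        # Success with all-1's data: consistent with no response from an IOA.
        return "no response from IOA"
    return "ok"
```

Note that all-1’s data alone does not distinguish the two Read rows; the Status value (or, with EEH enabled, the PE state) is what separates an error from an absent IOA.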
Implementation Note: IOAs should, whenever possible, detect the
occurrence of PCI errors on DMA and report them via an external interrupt
(for possible device driver recovery) or retry the operation. Since
system state has not been lost, reporting these errors via a machine
check to the CPUs is inappropriate. Some devices or device drivers may
cause a catastrophic error. Systems which wish to recover from these
types of errors should choose devices and device drivers which are
designed to handle them correctly.