Internal Error Indications

Hardware may detect a variety of problems during operation, ranging from soft errors that have already been corrected by the time they are reported, to hard errors so severe that the OS (and perhaps the hardware) cannot meaningfully continue operation. The mechanisms described in are used to report such errors to the OS. This section describes the architectural sources of errors and a method that platforms can use to report them. All OSs need to be prepared to encounter the errors reported as they are described here. However, on some platforms more sophisticated handling may be introduced via RTAS, and the OS may not have to handle the error directly. More robust error detection, reporting, and correction are at the option of the hardware vendor.

The primary architectural mechanism for indicating hardware errors to an OS is the machine check interrupt. If an error condition is surfaced by placing the system in checkstop, the OS cannot take any immediate part in handling the error (that is, no error capture, logging, recovery, analysis, or notification by the OS). For this reason, the machine check interrupt is preferred over going to the checkstop state. However, checkstop may be necessary in certain situations where further processing represents an exposure to loss of data integrity. To better handle such cases, a special hardware mechanism may be provided to gather and store residual error data, to be analyzed when the system goes through a subsequent successful reboot.

Less critical internal errors may also be signaled to the OS through a platform-specific interrupt in the “Internal Errors” class, or by periodic polling with the event-scan RTAS function.

Architecture Note: The machine check interrupt will not be listed in the OF node for the “Internal Errors” class, since it is a standard architectural mechanism.
The machine check interrupt mechanism is enabled from software or firmware by setting the MSR[ME] bit to 1. Upon the occurrence of a machine check interrupt, bits in SRR1 indicate the source of the interrupt, and SRR0 contains the address of the next instruction that would have been executed if the interrupt had not occurred. Depending on where the error is detected, the machine check interrupt may be surfaced from within the processor, via logical connection to the processor machine check interrupt pin, or via a system bus error indicator (for example, Transfer Error Acknowledge, TEA).

R1--1. OSs must set MSR[ME]=1 prior to the occurrence of a machine check interrupt in order to enable machine check processing via the check-exception RTAS function.

Architecture Note: Requirement R1--1 is not applicable when the FWNMI option is used.

R1--2. For hardware-detected errors, platforms must generate error indications as described in , unless the error can be handled through a less severe platform-specific interrupt, or the nature of the error forces a checkstop condition.

R1--3. Platforms which detect and report the errors described in must provide information to the OS via the RTAS check-exception function, using the reporting format described in .

R1--4. To prevent error propagation and allow for synchronization of error handling, all processors in a multi-processor system must receive any machine check interrupt signaled via the external machine check interrupt pin.

Platform Implementation Notes: The intent of Requirement R1--2 is to define standard error notification mechanisms for different hardware error types. For most hardware errors, the signaling mechanism is the machine check interrupt, although this requirement hints at the use of a less severe platform-specific interrupt for some errors. The important point here is whether the error can actually be handled through that interrupt. Simply using an external interrupt to signal the error is not sufficient.
The hardware and RTAS also need to limit the propagation of corrupted data, prevent loss of error state data, and support cleanup and recovery from such an error. Since the response to an external interrupt may be significantly slower than the response to a machine check, and the interrupt may in fact be masked, the error should not require immediate action on the part of the OS and/or RTAS. In addition, external interrupts (except external machine check interrupts) are reported to only one processor, so operations by the other processors in an MP system should not be impacted by this error unless they specifically try to access the failing hardware element. To summarize, platforms should not use platform-specific interrupts to signal hardware errors unless there is a complete hardware/RTAS platform solution for handling such errors.

The intent of Requirement R1--4 is that most hardware errors would be signaled simultaneously to all processors, so that processors could synchronize the error handling process; that is, one processor would be chosen to make the call to check-exception, while the other processors remained idle so that they would not interfere with RTAS while it gathered machine check error data. While this is a straightforward wiring solution for errors signaled via the external machine check interrupt pin, that is not the case for internal processor errors or processor bus errors. Typically, only one processor will see such errors. An internal processor error can be identified with just the contents of SRR1, and so can be handled without synchronization with other processors. Processor bus errors, however, can be more difficult, especially if the error is propagated up to the processor bus from a lower-level bus. In general, such propagation should be avoided, and such errors should be signaled through the machine check interrupt pin to ensure proper error handling.
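
The synchronization described above (one processor calls check-exception while the others stay idle) can be sketched as a small simulation. This is a minimal model, not platform code: Python threads stand in for processors, and `check_exception` is a hypothetical callable standing in for the RTAS check-exception call. The election rule used here (the first processor to take a lock is chosen) is an assumption; the text only says that one processor would be chosen.

```python
import threading


def handle_machine_check(num_cpus, check_exception):
    """Simulate num_cpus processors taking the same machine check interrupt.

    Returns (elected_cpus, error_log). `check_exception` is a hypothetical
    stand-in for the RTAS check-exception call, not a real binding.
    """
    elected = []                              # CPUs that won the election
    result = {}                               # error log gathered by the winner
    election = threading.Lock()               # taken once and never released
    rendezvous = threading.Barrier(num_cpus)  # all CPUs have entered the handler
    done = threading.Event()                  # set once error data is gathered

    def cpu(cpu_id):
        rendezvous.wait()
        if election.acquire(blocking=False):
            # Assumed election rule: the first CPU to take the lock is chosen.
            elected.append(cpu_id)
            try:
                result["log"] = check_exception(cpu_id)
            finally:
                done.set()                    # release idle CPUs even on failure
        else:
            done.wait()                       # stay idle while error data is gathered

    threads = [threading.Thread(target=cpu, args=(i,)) for i in range(num_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return elected, result.get("log")
```

Calling `handle_machine_check(4, some_callable)` in this model invokes `some_callable` exactly once, from whichever simulated processor wins the election; the other three only wait for the `done` event.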

Error Indication Mechanisms

The table “Error Indications for System Operations” describes the mechanisms by which software will be notified of the occurrence of operational failures during the types of data transfer operations listed below. The assumption here is that the error notification can occur only if a hardware mechanism for error detection (for example, a parity checker) is present. In cases where there is no specific error detection mechanism, the resulting condition, and whether the software will eventually recognize that condition as a failure, is undefined.

Error Indications for System Operations

| Initiator | Target | Operation | Error Type (if detected) | Indication to Software | Comments |
|---|---|---|---|---|---|
| Processor | N/A | Internal | Various | Machine check | Some may cause checkstop |
| Processor | Memory | Load | Invalid address | Machine check | |
| Processor | Memory | Load | System bus time-out | Machine check | |
| Processor | Memory | Load | Address parity on system bus | Machine check | |
| Processor | Memory | Load | Data parity on system bus | Machine check | |
| Processor | Memory | Load | Memory parity or uncorrectable ECC | Machine check | |
| Processor | Memory | Store | Invalid address | Machine check | |
| Processor | Memory | Store | System bus time-out | Machine check | |
| Processor | Memory | Store | Address parity on system bus | Machine check | |
| Processor | Memory | Store | Data parity on system bus | Machine check | |
| Processor | Memory | External cache load | Memory parity or uncorrectable ECC | Machine check | Associated with Instruction Fetch or Data Load |
| Processor | Memory | External cache flush | Cache parity or uncorrectable ECC | Machine check | |
| Processor | Memory | External cache access | Cache parity or uncorrectable ECC | Machine check | Associated with Instruction Fetch or Data Transfer |
| Processor | I/O | Load or Store | Various errors between the processor and the I/O fabric | Machine check | I/O fabrics include hubs and bridges and interconnecting buses or links. |
| Processor | I/O bus configuration space | Read | Various, except no response from IOA | Firmware receives a machine check, OS receives all-1’s data along with a Status of -1 from the RTAS call | If EEH is implemented and enabled, firmware does not get a machine check and the PE is in the EEH Stopped State on return from the RTAS call |
| Processor | I/O bus configuration space | Read | No response from an IOA | All-1’s data returned, along with a “Success” Status from the RTAS call | If EEH is implemented and enabled, the PE is not in the EEH Stopped State on return from the RTAS call |
| Processor | I/O bus configuration space | Write | Various, except no response from IOA | Firmware receives a machine check, OS receives a Status of -1 from the RTAS call | If EEH is implemented and enabled, firmware does not get a machine check and the PE is in the EEH Stopped State on return from the RTAS call |
| Processor | I/O bus configuration space | Write | No response from an IOA | Operation ignored, OS receives a “Success” Status from the RTAS call | If EEH is implemented and enabled, the PE is in the EEH Stopped State on return from the RTAS call |
| Processor | I/O bus; I/O Space or Memory Space | Load | Various, except no response from IOA | Machine check | If EEH is implemented and enabled, no machine check is received, all-1’s data is returned, and the PE enters the EEH Stopped State |
| Processor | I/O bus; I/O Space or Memory Space | Load | No response from an IOA | All-1’s data returned | Invalid address, broken IOA, or configuration cycle to non-existent IOA; if EEH is implemented and enabled, the PE enters the EEH Stopped State |
| Processor | I/O bus; I/O Space or Memory Space | Store | Various, except no response from IOA | Machine check | If EEH is implemented and enabled, no machine check is received and the PE enters the EEH Stopped State |
| Processor | I/O bus; I/O Space or Memory Space | Store | No response from IOA | Ignore Store | Invalid address, broken IOA, or configuration cycle to non-existent IOA; if EEH is implemented and enabled, the PE enters the EEH Stopped State |
| Processor | Invalid address (addressing an “undefined” address area) | Load or Store | No response from system | Machine check | |
| I/O | Memory | DMA - either direction | Various, including but not limited to: invalid addr accepted by a bridge; TCE extent; TCE Page Mapping and Control mis-match or invalid TCE | Machine check unless reportable directly to the IOA in a way that allows the IOA to signal the error to its device driver | If EEH is implemented and enabled, no machine check is received and the PE enters the EEH Stopped State |
| I/O | I/O | DMA - either direction | Various | Machine check unless reportable to master of the transfer in a way that allows master to recover | |
| I/O | Invalid address | DMA - either direction | No response from any IOA | PCI IOA master-aborts | Signal device driver using an external interrupt |
| PCI IOA | - | Any | Internal, indicated by SERR# or ERR_FATAL | SERR# or ERR_FATAL, causing machine check | If EEH is implemented and enabled, no machine check is received and the PE enters the EEH Stopped State |
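
Several entries above return all-1's data on a failed load, so software cannot tell a failure from legitimate data by the value alone. The sketch below shows one way a caller might combine the returned value with the PE state when EEH is enabled. The function names and the `pe_is_stopped` callback are hypothetical; how the PE state is actually queried is platform-defined and is not shown here.

```python
def is_all_ones(value, width_bytes):
    """True if value has every bit set for a load of width_bytes bytes."""
    return value == (1 << (8 * width_bytes)) - 1


def classify_load(value, width_bytes, pe_is_stopped):
    """Classify the result of a load from an I/O bus with EEH enabled.

    pe_is_stopped is a hypothetical callback reporting whether the PE is in
    the EEH Stopped State; it is consulted only when the data is all-1's.
    """
    if not is_all_ones(value, width_bytes):
        return "data"           # ordinary data, no error indicated
    if pe_is_stopped():
        return "eeh-stopped"    # error case: all-1's returned and PE stopped
    # PE not stopped: legitimate all-1's data, or (for configuration reads)
    # no IOA responded.
    return "all-ones-data"
```

Checking the PE state only when the value is all-1's keeps the common path free of extra queries, which is the reason for ordering the tests this way in the sketch.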

Implementation Note: IOAs should, whenever possible, detect the occurrence of PCI errors on DMA and report them via an external interrupt (for possible device driver recovery) or retry the operation. Since system state has not been lost, reporting these errors via a machine check to the CPUs is inappropriate. Some devices or device drivers may cause a catastrophic error. Systems which wish to recover from these types of errors should choose devices and device drivers which are designed to handle them correctly.
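
The preference stated in this note (retry, or notify the device driver, rather than raise a machine check) can be illustrated with a small model. This is a sketch under assumptions: `transfer` and `notify_driver` are hypothetical callables, and retrying a fixed number of times before raising the external interrupt is only one possible way of combining the two options the note mentions.

```python
def run_dma(transfer, notify_driver, max_retries=2):
    """Attempt a DMA transfer; on persistent failure, signal the device driver.

    transfer() returns True on success and False when a PCI error was detected.
    notify_driver() models raising an external interrupt for the device driver.
    Returns (outcome, number_of_attempts).
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if transfer():
            return "completed", attempts
    notify_driver()             # external interrupt, never a machine check
    return "reported-to-driver", attempts
```

In this model the error never escalates beyond the device driver, which matches the note's reasoning that system state has not been lost.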