<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2016, 2020 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="Internal_Error_Indications">
<title>Internal Error Indications</title>
<para>Hardware may detect a variety of problems during operation, ranging
from soft errors which have already been corrected by the time they are
reported, to hard errors of such severity that the OS (and perhaps the
hardware) cannot meaningfully continue operation. The mechanisms
described in
<xref linkend="dbdoclet.50569337_66258"/> are used to report such errors to the
OS. This section describes the architectural sources of errors and
a method that platforms can use to report them. All OSs
need to be prepared to encounter the errors reported as they are
described here. However, in some platforms more sophisticated handling
may be introduced via RTAS, and the OS may not have to handle the error
directly. More robust error detection, reporting, and correcting are at
the option of the hardware vendor.</para>
<para>
<anchor xml:id="dbdoclet.50569337_marker-13885" xreflabel="" />
<anchor xml:id="dbdoclet.50569337_marker-13886" xreflabel="" />The
primary architectural mechanism for indicating hardware errors to an OS
is the machine check interrupt. If an error condition is surfaced by
placing the system in checkstop, it precludes any immediate participation
by the OS in handling the error (that is, no error capture, logging,
recovery, analysis, or notification by the OS). For this reason, the
machine check interrupt is preferred over going to the checkstop state.
However, checkstop may be necessary in certain situations where further
processing represents an exposure to loss of data integrity. To better
handle such cases, a special hardware mechanism may be provided to gather
and store residual error data, to be analyzed when the system goes
through a subsequent successful reboot.</para>
<para>Less critical internal errors may also be signaled to the OS
through a platform-specific interrupt in the &#8220;Internal
Errors&#8221; class, or by periodic polling with the
<emphasis>event-scan</emphasis> RTAS function.</para>
<para><emphasis role="bold">Architecture Note:</emphasis> The machine check interrupt will not be listed
in the OF node for the &#8220;Internal Errors&#8221; class, since it is a
standard architectural mechanism. The machine check interrupt mechanism
is enabled by software or firmware by setting the MSR<subscript>ME</subscript> bit to 1. Upon
the occurrence of a machine check interrupt, bits in SRR1 will indicate
the source of the interrupt and SRR0 will contain the address of the next
instruction that would have been executed if the interrupt had not
occurred. Depending on where the error is detected, the machine check
interrupt may be surfaced from within the processor, via logical
connection to the processor machine check interrupt pin, or via a system
bus error indicator (for example, Transfer Error Acknowledge -
TEA).</para>
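<para>As a minimal sketch of the enable step described in this note, the
following C fragment computes an MSR image with machine checks enabled.
The mask value assumes the 64-bit Power ISA MSR layout, in which ME is
bit 51 in IBM bit numbering (mask 0x1000). The names
<literal>MSR_ME</literal>, <literal>msr_enable_machine_check</literal>, and
<literal>msr_machine_check_enabled</literal> are illustrative only and are
not defined by this architecture. Writing the value into the MSR requires
privileged instructions and is omitted here.</para>
<programlisting language="c"><![CDATA[
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64-bit Power ISA MSR layout, where ME is bit 51 in
 * IBM (big-endian) bit numbering, that is, mask 1 << 12. */
#define MSR_ME (UINT64_C(1) << 12)

/* Return the given MSR image with machine check interrupts enabled. */
static inline uint64_t msr_enable_machine_check(uint64_t msr)
{
    return msr | MSR_ME;
}

/* Report whether an MSR image has machine check interrupts enabled. */
static inline bool msr_machine_check_enabled(uint64_t msr)
{
    return (msr & MSR_ME) != 0;
}
]]></programlisting>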
<variablelist>
<varlistentry xml:id="dbdoclet.50569337_38528">
<term><emphasis role="bold">R1-<xref linkend="Internal_Error_Indications"
xrefstyle="select: labelnumber nopage"/>-1.</emphasis></term>
<listitem>
<para>OSs must set
MSR<subscript>ME</subscript>=1 prior to the occurrence of a machine check interrupt in order to
enable machine check processing via the
<emphasis>check-exception</emphasis> RTAS function.</para>
<para>
<emphasis role="bold">Architecture Note:</emphasis> Requirement
<xref linkend="dbdoclet.50569337_38528"/> is not applicable when the FWNMI
option is used.</para>
</listitem>
</varlistentry>
<varlistentry xml:id="dbdoclet.50569337_36603">
<term><emphasis role="bold">R1-<xref linkend="Internal_Error_Indications"
xrefstyle="select: labelnumber nopage"/>-2.</emphasis></term>
<listitem>
<para>For
hardware-detected errors, platforms must generate error indications as
described in
<xref linkend="dbdoclet.50569337_30773"/>, unless the error can be handled
through a less severe platform-specific interrupt, or the nature of the
error forces a checkstop condition.</para>
</listitem>
</varlistentry>
<varlistentry>
<term><emphasis role="bold">R1-<xref linkend="Internal_Error_Indications"
xrefstyle="select: labelnumber nopage"/>-3.</emphasis></term>
<listitem>
<para>Platforms which detect and report the
errors described in
<xref linkend="dbdoclet.50569337_30773"/> must provide information to the OS
via the RTAS
<emphasis>check-exception</emphasis> function, using the reporting format
described in
<xref linkend="dbdoclet.50569337_21249"/>.</para>
</listitem>
</varlistentry>
<varlistentry xml:id="dbdoclet.50569337_16228">
<term><emphasis role="bold">R1-<xref linkend="Internal_Error_Indications"
xrefstyle="select: labelnumber nopage"/>-4.</emphasis></term>
<listitem>
<para>To prevent error
propagation and allow for synchronization of error handling, all
processors in a multi-processor system must receive any machine check
interrupt signaled via the external machine check interrupt pin.</para>
</listitem>
</varlistentry>
</variablelist>
<para>
<emphasis role="bold">Platform Implementation Notes:</emphasis>
</para>
<orderedlist>
<listitem>
<para>The intent of Requirement
<xref linkend="dbdoclet.50569337_36603"/> is to define standard error
notification mechanisms for different hardware error types. For most
hardware errors, the signaling mechanism is the machine check interrupt,
although this requirement hints at the use of a less severe
platform-specific interrupt for some errors. The important point here is
actually whether the error can be handled through that interrupt. Simply
using an external interrupt to signal the error is not sufficient. The
hardware and RTAS also need to limit the propagation of corrupted data,
prevent loss of error state data, and support the cleanup and recovery of
such an error. Since the response to an external interrupt may be
significantly slower than a machine check, and in fact may be masked, the
error should not require immediate action on the part of the OS and/or
RTAS. In addition, external interrupts (except external machine check
interrupts) are reported to only one processor, so operations by the
other processors in an MP system should not be impacted by this error
unless they specifically try to access the failing hardware element. To
summarize, platforms should not use platform-specific interrupts to
signal hardware errors unless there is a complete hardware/RTAS platform
solution for handling such errors.</para>
</listitem>
<listitem>
<para>The intent of Requirement
<xref linkend="dbdoclet.50569337_16228"/> is that most hardware errors would be
signaled simultaneously to all processors, so that processors could
synchronize the error handling process; that is, one processor would be
chosen to do the call to
<emphasis>check-exception</emphasis>, while the other processors remained
idle so that they would not interfere with RTAS while it gathered machine
check error data. While this is a straightforward wiring solution for
errors signaled via the external machine check interrupt pin, that is not
the case for internal processor errors or processor bus errors.
Typically, only one processor will see such errors. An internal processor
error can be identified with just the contents of SRR1, and so can be
handled without synchronization with other processors. Processor bus
errors, however, can be more difficult, especially if the error is
propagated up to the processor bus from a lower-level bus. In general,
such propagation should be avoided, and such errors should be signaled
through the machine check interrupt pin to ensure proper error
handling.</para>
</listitem>
</orderedlist>
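<para>The synchronization described in the second note can be sketched
with a C11 atomic flag: every processor that takes the machine check tries
to claim the handler role, exactly one succeeds and would go on to call
<emphasis>check-exception</emphasis>, and the others stay idle until the
role is released. The function names below are hypothetical. The RTAS call
itself and the idle loop of the non-elected processors are
platform-specific and are not shown.</para>
<programlisting language="c"><![CDATA[
#include <stdatomic.h>
#include <stdbool.h>

/* Set while one processor owns machine check handling. */
static atomic_flag mc_owner_claimed = ATOMIC_FLAG_INIT;

/* Returns true for exactly one caller until the role is released.
 * The winner would call check-exception; losers should idle. */
bool mc_try_claim_owner(void)
{
    return !atomic_flag_test_and_set(&mc_owner_claimed);
}

/* Called by the owner once the error data has been gathered. */
void mc_release_owner(void)
{
    atomic_flag_clear(&mc_owner_claimed);
}
]]></programlisting>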
<section xml:id="dbdoclet.50569337_52952">
<title>Error Indication Mechanisms</title>
<para>
<xref linkend="dbdoclet.50569337_30773"/> describes the mechanisms by
which software will be notified of the occurrence of operational failures
during the types of data transfer operations listed below. The assumption
here is that the error notification can occur only if a hardware
mechanism for error detection (for example, a parity checker) is present.
In cases where there is no specific error detection mechanism, the
resulting condition, and whether the software will eventually recognize
that condition as a failure, is undefined.</para>
<table frame="all" pgwide="1" xml:id="dbdoclet.50569337_30773">
<title>Error Indications for System Operations</title>
<tgroup cols="6">
<colspec colname="c1" colwidth="10*" align="center" />
<colspec colname="c2" colwidth="10*" align="center" />
<colspec colname="c3" colwidth="20*" align="center" />
<colspec colname="c4" colwidth="20*" align="center" />
<colspec colname="c5" colwidth="20*" align="center" />
<colspec colname="c6" colwidth="20*" align="center" />
<thead>
<row>
<entry>
<para>
<emphasis role="bold">Initiator</emphasis>
</para>
</entry>
<entry>
<para>
<emphasis role="bold">Target</emphasis>
</para>
</entry>
<entry>
<para>
<emphasis role="bold">Operation</emphasis>
</para>
</entry>
<entry>
<para>
<emphasis role="bold">Error Type (if detected)</emphasis>
</para>
</entry>
<entry>
<para>
<emphasis role="bold">Indication to Software</emphasis>
</para>
</entry>
<entry>
<para>
<emphasis role="bold">Comments</emphasis>
</para>
</entry>
</row>
</thead>
<tbody valign="middle" >
<row>
<entry>
<para>Processor</para>
</entry>
<entry>
<para>N/A</para>
</entry>
<entry>
<para>Internal</para>
</entry>
<entry>
<para>Various</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>Some may cause checkstop</para>
</entry>
</row>
<row>
<entry morerows="11">
<para>Processor</para>
</entry>
<entry morerows="11">
<para>Memory</para>
</entry>
<entry morerows="4">
<para>Load</para>
</entry>
<entry>
<para>Invalid address</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>System bus time-out</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>Address parity on system bus</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>Data parity on system bus</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>Memory parity or uncorrectable ECC</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry morerows="3">
<para>Store</para>
</entry>
<entry>
<para>Invalid address</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>System bus time-out</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>Address parity on system bus</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>Data parity on system bus</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>External cache load</para>
</entry>
<entry>
<para>Memory parity or uncorrectable ECC</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>Associated with Instruction Fetch or Data Load</para>
</entry>
</row>
<row>
<entry>
<para>External cache flush</para>
</entry>
<entry>
<para>Cache parity or</para>
<para>uncorrectable ECC</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>External cache access</para>
</entry>
<entry>
<para>Cache parity or</para>
<para>uncorrectable ECC</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>Associated with Instruction Fetch or Data Transfer</para>
</entry>
</row>
<row>
<entry>
<para>Processor</para>
</entry>
<entry>
<para>I/O</para>
</entry>
<entry>
<para>Load or Store</para>
</entry>
<entry>
<para>Various errors between the processor and the I/O
fabric</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>I/O fabrics include hubs and bridges and interconnecting
buses or links.</para>
</entry>
</row>
<row>
<entry morerows="3">
<para>Processor</para>
</entry>
<entry morerows="3">
<para>I/O bus configuration space</para>
</entry>
<entry morerows="1">
<para>Read</para>
</entry>
<entry>
<para>Various, except no response from IOA</para>
</entry>
<entry>
<para>Firmware receives a machine check, OS receives
all-1&#8217;s data along with a
<emphasis>Status</emphasis> of -1 from the RTAS call</para>
</entry>
<entry>
<para>If EEH is implemented and enabled, firmware does not get
a machine check and the PE is in the EEH Stopped State on
return from the RTAS call</para>
</entry>
</row>
<row>
<entry>
<para>No response from an IOA</para>
</entry>
<entry>
<para>All-1&#8217;s data returned, along with a
&#8220;Success&#8221;
<emphasis>Status</emphasis> from the RTAS call</para>
</entry>
<entry>
<para>If EEH is implemented and enabled, the PE is not in the
EEH Stopped State on return from the RTAS call</para>
</entry>
</row>
<row>
<entry morerows="1">
<para>Write</para>
</entry>
<entry>
<para>Various, except no response from IOA</para>
</entry>
<entry>
<para>Firmware receives a machine check, OS receives a
<emphasis>Status</emphasis> of -1 from the RTAS call</para>
</entry>
<entry>
<para>If EEH is implemented and enabled, firmware does not get
a machine check and the PE is in the EEH Stopped State on
return from the RTAS call</para>
</entry>
</row>
<row>
<entry>
<para>No response from an IOA</para>
</entry>
<entry>
<para>Operation ignored, OS receives a &#8220;Success&#8221;
<emphasis>Status</emphasis> from the RTAS call</para>
</entry>
<entry>
<para>If EEH is implemented and enabled, the PE is in the EEH
Stopped State on return from the RTAS call</para>
</entry>
</row>
<row>
<entry morerows="3">
<para>Processor</para>
</entry>
<entry morerows="3">
<para>I/O bus;</para>
<para>I/O Space</para>
<para>or</para>
<para>Memory Space</para>
</entry>
<entry morerows="1">
<para>Load</para>
</entry>
<entry>
<para>Various, except no response from IOA</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>If EEH is implemented and enabled, no machine check is
received, all-1&#8217;s data is returned, and the PE enters the
EEH Stopped State</para>
</entry>
</row>
<row>
<entry>
<para>No response from an IOA</para>
</entry>
<entry>
<para>All-1&#8217;s data returned</para>
</entry>
<entry>
<para>Invalid address, broken IOA, or configuration cycle to
non-existent IOA; if EEH is implemented and enabled, the PE
enters the EEH Stopped State</para>
</entry>
</row>
<row>
<entry morerows="1">
<para>Store</para>
</entry>
<entry>
<para>Various, except no response from IOA</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>If EEH is implemented and enabled, no machine check is
received and the PE enters the EEH Stopped State</para>
</entry>
</row>
<row>
<entry>
<para>No response from IOA</para>
</entry>
<entry>
<para>Ignore Store</para>
</entry>
<entry>
<para>Invalid address, broken IOA, or configuration cycle to
non-existent IOA; if EEH is implemented and enabled, the PE
enters the EEH Stopped State</para>
</entry>
</row>
<row>
<entry>
<para>Processor</para>
</entry>
<entry>
<para>Invalid address (addressing an &#8220;undefined&#8221;
address area)</para>
</entry>
<entry>
<para>Load or Store</para>
</entry>
<entry>
<para>No response from system</para>
</entry>
<entry>
<para>Machine check</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>I/O</para>
</entry>
<entry>
<para>Memory</para>
</entry>
<entry>
<para>DMA - either direction</para>
</entry>
<entry>
<para>Various, including but not limited to:</para>
<itemizedlist>
<listitem>
<para>Invalid address accepted by a bridge</para>
</listitem>
<listitem>
<para>TCE extent</para>
</listitem>
<listitem>
<para>TCE Page Mapping and Control mismatch or invalid
TCE</para>
</listitem>
</itemizedlist>
</entry>
<entry>
<para>Machine check unless reportable directly to the IOA in a
way that allows the IOA to signal the error to its device
driver</para>
</entry>
<entry>
<para>If EEH is implemented and enabled, no machine check is
received and the PE enters the EEH Stopped State</para>
</entry>
</row>
<row>
<entry>
<para>I/O</para>
</entry>
<entry>
<para>I/O</para>
</entry>
<entry>
<para>DMA - either direction</para>
</entry>
<entry>
<para>Various</para>
</entry>
<entry>
<para>Machine check unless reportable to master of the transfer
in a way that allows master to recover</para>
</entry>
<entry>
<para>&#160;</para>
</entry>
</row>
<row>
<entry>
<para>I/O</para>
</entry>
<entry>
<para>Invalid address</para>
</entry>
<entry>
<para>DMA - either direction</para>
</entry>
<entry>
<para>No response from any IOA</para>
</entry>
<entry>
<para>PCI IOA</para>
<para>master-aborts</para>
</entry>
<entry>
<para>Signal device driver using an external interrupt</para>
</entry>
</row>
<row>
<entry>
<para>PCI IOA</para>
</entry>
<entry>
<para>-</para>
</entry>
<entry>
<para>Any</para>
</entry>
<entry>
<para>Internal, indicated by SERR# or ERR_FATAL</para>
</entry>
<entry>
<para>SERR# or ERR_FATAL, causing machine check</para>
</entry>
<entry>
<para>If EEH is implemented and enabled, no machine check is
received and the PE enters the EEH Stopped State</para>
</entry>
</row>
</tbody>
</tgroup>
</table>
<para><emphasis role="bold">Implementation Note:</emphasis> IOAs should, whenever possible, detect the
occurrence of PCI errors on DMA and report them via an external interrupt
(for possible device driver recovery) or retry the operation. Since
system state has not been lost, reporting these errors via a machine
check to the CPUs is inappropriate. Some devices or device drivers may
cause a catastrophic error. Systems which wish to recover from these
types of errors should choose devices and device drivers which are
designed to handle them correctly.</para>
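<para>From the OS point of view, the configuration space read rows of
<xref linkend="dbdoclet.50569337_30773"/> can be summarized as a small
classifier over the RTAS <emphasis>Status</emphasis> and the returned
data. The following is a sketch under stated assumptions: a
<emphasis>Status</emphasis> of 0 denotes Success and -1 denotes a hardware
error, and the enumeration and function names are hypothetical. All-1&#8217;s
data with a Success <emphasis>Status</emphasis> is ambiguous, since a
register may legitimately read as all 1&#8217;s, so the sketch only flags
that case for follow-up rather than treating it as a failure.</para>
<programlisting language="c"><![CDATA[
#include <stdint.h>

enum cfg_read_result {
    CFG_READ_DATA_VALID,        /* Success, data is not all 1's */
    CFG_READ_MAYBE_NO_RESPONSE, /* Success, data is all 1's */
    CFG_READ_HW_ERROR,          /* Status of -1 */
    CFG_READ_OTHER_STATUS       /* any other Status, not covered here */
};

/* Classify a configuration space read.
 * size_bytes is assumed to be 1, 2, or 4. */
enum cfg_read_result classify_cfg_read(int32_t status, uint32_t data,
                                       unsigned size_bytes)
{
    /* Mask covering exactly the bytes that were read. */
    uint32_t all_ones = (size_bytes >= 4)
        ? UINT32_MAX
        : ((UINT32_C(1) << (8 * size_bytes)) - 1);

    if (status == -1)
        return CFG_READ_HW_ERROR;
    if (status != 0)
        return CFG_READ_OTHER_STATUS;
    if ((data & all_ones) == all_ones)
        return CFG_READ_MAYBE_NO_RESPONSE;
    return CFG_READ_DATA_VALID;
}
]]></programlisting>
<para>A caller that sees the flagged case might, where EEH is implemented
and enabled, go on to check the state of the PE before trusting the
data.</para>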
</section>
</section>