Linux-Architecture-Reference/LoPAR/sec_rtas_error_indications.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright (c) 2016, 2020 OpenPOWER Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

-->
<section xmlns="http://docbook.org/ns/docbook"
         xmlns:xi="http://www.w3.org/2001/XInclude"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         version="5.0"
         xml:id="Internal_Error_Indications">

  <title>Internal Error Indications</title>

  <para>Hardware may detect a variety of problems during operation, ranging
  from soft errors which have already been corrected by the time they are
  reported, to hard errors of such severity that the OS (and perhaps the
  hardware) cannot meaningfully continue operation. The mechanisms
  described in
  <xref linkend="dbdoclet.50569337_66258"/> are used to report such errors to the
  OS. This section describes the architectural sources of errors, and
  describes a method that platforms can use to report the error. All OSs
  need to be prepared to encounter the errors reported as they are
  described here. However, in some platforms more sophisticated handling
  may be introduced via RTAS, and the OS may not have to handle the error
  directly. More robust error detection, reporting, and correcting are at
  the option of the hardware vendor.</para>

  <para>
  <anchor xml:id="dbdoclet.50569337_marker-13885" xreflabel="" />
  <anchor xml:id="dbdoclet.50569337_marker-13886" xreflabel="" />The
  primary architectural mechanism for indicating hardware errors to an OS
  is the machine check interrupt. If an error condition is surfaced by
  placing the system in checkstop, it precludes any immediate participation
  by the OS in handling the error (that is, no error capture, logging,
  recovery, analysis, or notification by the OS). For this reason, the
  machine check interrupt is preferred over going to the checkstop state.
  However, checkstop may be necessary in certain situations where further
  processing represents an exposure to loss of data integrity. To better
  handle such cases, a special hardware mechanism may be provided to gather
  and store residual error data, to be analyzed when the system goes
  through a subsequent successful reboot.</para>

  <para>Less critical internal errors may also be signaled to the OS
  through a platform-specific interrupt in the &#8220;Internal
  Errors&#8221; class, or by periodic polling with the
  <emphasis>event-scan</emphasis> RTAS function.</para>

  <para><emphasis role="bold">Architecture Note:</emphasis> The machine check interrupt will not be listed
  in the OF node for the &#8220;Internal Errors&#8221; class, since it is a
  standard architectural mechanism. The machine check interrupt mechanism
  is enabled from software or firmware by setting the MSRME bit =1. Upon
  the occurrence of a machine check interrupt, bits in SRR1 will indicate
  the source of the interrupt and SRR0 will contain the address of the next
  instruction that would have been executed if the interrupt had not
  occurred. Depending on where the error is detected, the machine check
  interrupt may be surfaced from within the processor, via logical
  connection to the processor machine check interrupt pin, or via a system
  bus error indicator (for example, Transfer Error Acknowledge -
  TEA).</para>

  <variablelist>

  	<varlistentry xml:id="dbdoclet.50569337_38528">
  		<term><emphasis role="bold">R1-<xref linkend="Internal_Error_Indications"
			xrefstyle="select: labelnumber nopage"/>-1.</emphasis></term>
  			<listitem>
		      <para>OSs must set
		      MSRME=1 prior to the occurrence of a machine check interrupt in order to
		      enable machine check processing via the
		      <emphasis>check-exception</emphasis> RTAS function.</para>

		      <para>
		      <emphasis role="bold">Architecture Note:</emphasis> Requirement
		      <xref linkend="dbdoclet.50569337_38528"/> is not applicable when the FWNMI
		      option is used.</para>

  			</listitem>
  	</varlistentry>

  	<varlistentry xml:id="dbdoclet.50569337_36603">
  		<term><emphasis role="bold">R1-<xref linkend="Internal_Error_Indications"
			xrefstyle="select: labelnumber nopage"/>-2.</emphasis></term>
  			<listitem>
		      <para>For
		      hardware-detected errors, platforms must generate error indications as
		      described in
		      <xref linkend="dbdoclet.50569337_30773"/>, unless the error can be handled
		      through a less severe platform-specific interrupt, or the nature of the
		      error forces a checkstop condition.</para>
  			</listitem>
  	</varlistentry>

  	<varlistentry>
  		<term><emphasis role="bold">R1-<xref linkend="Internal_Error_Indications"
			xrefstyle="select: labelnumber nopage"/>-3.</emphasis></term>
  			<listitem>
		      <para>Platforms which detect and report the
		      errors described in
		      <xref linkend="dbdoclet.50569337_30773"/> must provide information to the OS
		      via the RTAS
		      <emphasis>check-exception</emphasis> function, using the reporting format
		      described in
		      <xref linkend="dbdoclet.50569337_21249"/>.</para>
  			</listitem>
  	</varlistentry>

  	<varlistentry xml:id="dbdoclet.50569337_16228">
  		<term><emphasis role="bold">R1-<xref linkend="Internal_Error_Indications"
			xrefstyle="select: labelnumber nopage"/>-4.</emphasis></term>
  			<listitem>
		      <para>To prevent error
		      propagation and allow for synchronization of error handling, all
		      processors in a multi-processor system must receive any machine check
		      interrupt signaled via the external machine check interrupt pin.</para>
  			</listitem>
  	</varlistentry>

  </variablelist>

  <para>
    <emphasis role="bold">Platform Implementation Notes:</emphasis>
  </para>

  <orderedlist>
    <listitem>

      <para>The intent of Requirement
      <xref linkend="dbdoclet.50569337_36603"/> is to define standard error
      notification mechanisms for different hardware error types. For most
      hardware errors, the signaling mechanism is the machine check interrupt,
      although this requirement hints at the use of a less severe
      platform-specific interrupt for some errors. The important point here is
      actually whether the error can be handled through that interrupt. Simply
      using an external interrupt to signal the error is not sufficient. The
      hardware and RTAS also need to limit the propagation of corrupted data,
      prevent loss of error state data, and support the cleanup and recovery of
      such an error. Since the response to an external interrupt may be
      significantly slower than a machine check, and in fact may be masked, the
      error should not require immediate action on the part of the OS and/or
      RTAS. In addition, external interrupts (except external machine check
      interrupts) are reported to only one processor, so operations by the
      other processors in an MP system should not be impacted by this error
      unless they specifically try to access the failing hardware element. To
      summarize, platforms should not use platform-specific interrupts to
      signal hardware errors unless there is a complete hardware/RTAS platform
      solution for handling such errors.</para>

    </listitem>
    <listitem>

      <para>The intent of Requirement
      <xref linkend="dbdoclet.50569337_16228"/> is that most hardware errors would be
      signaled simultaneously to all processors, so that processors could
      synchronize the error handling process; that is, one processor would be
      chosen to do the call to
      <emphasis>check-exception</emphasis>, while the other processors remained
      idle so that they would not interfere with RTAS while it gathered machine
      check error data. While this is a straightforward wiring solution for
      errors signaled via the external machine check interrupt pin, that is not
      the case for internal processor errors or processor bus errors.
      Typically, only one processor will see such errors. An internal processor
      error can be identified with just the contents of SRR1, and so can be
      handled without synchronization with other processors. Processor bus
      errors, however, can be more difficult, especially if the error is
      propagated up to the processor bus from a lower-level bus. In general,
      such propagation should be avoided, and such errors should be signaled
      through the machine check interrupt pin to ensure proper error
      handling.</para>

    </listitem>
  </orderedlist>

  <section xml:id="dbdoclet.50569337_52952">
    <title>Error Indication Mechanisms</title>

    <para>
    <xref linkend="dbdoclet.50569337_30773"/> describes the mechanisms by
    which software will be notified of the occurrence of operational failures
    during the types of data transfer operations listed below. The assumption
    here is that the error notification can occur only if a hardware
    mechanism for error detection (for example, a parity checker) is present.
    In cases where there is no specific error detection mechanism, the
    resulting condition, and whether the software will eventually recognize
    that condition as a failure, is undefined.</para>
    <para>&#160;</para>
    <table frame="all" pgwide="1" xml:id="dbdoclet.50569337_30773">
      <title>Error Indications for System Operations</title>
      <tgroup cols="6">
        <colspec colname="c1" colwidth="10*" align="center" />
        <colspec colname="c2" colwidth="10*" align="center" />
        <colspec colname="c3" colwidth="20*" align="center" />
        <colspec colname="c4" colwidth="20*" align="center" />
        <colspec colname="c5" colwidth="20*" align="center" />
        <colspec colname="c6" colwidth="20*" align="center" />
        <thead>
          <row>
            <entry>
              <para>
                <emphasis role="bold">Initiator</emphasis>
              </para>
            </entry>
            <entry>
              <para>
                <emphasis role="bold">Target</emphasis>
              </para>
            </entry>
            <entry>
              <para>
                <emphasis role="bold">Operation</emphasis>
              </para>
            </entry>
            <entry>
              <para>
                <emphasis role="bold">Error Type(if detected)</emphasis>
              </para>
            </entry>
            <entry>
              <para>
                <emphasis role="bold">Indication to Software</emphasis>
              </para>
            </entry>
            <entry>
              <para>
                <emphasis role="bold">Comments</emphasis>
              </para>
            </entry>
          </row>
        </thead>
        <tbody valign="middle" >
          <row>
            <entry>
              <para>Processor</para>
            </entry>
            <entry>
              <para>N/A</para>
            </entry>
            <entry>
              <para>Internal</para>
            </entry>
            <entry>
              <para>Various</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>Some may cause checkstop</para>
            </entry>
          </row>
          <row>
            <entry morerows="11">
              <para>Processor</para>
            </entry>
            <entry morerows="11">
              <para>Memory</para>
            </entry>
            <entry morerows="4">
              <para>Load</para>
            </entry>
            <entry>
              <para>Invalid address</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>System bus time-out</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>Address parity on system bus</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>Data parity on system bus</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>Memory parity or uncorrectable ECC</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry morerows="3">
              <para>Store</para>
            </entry>
            <entry>
              <para>Invalid address</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>System bus time-out</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>Address parity on system bus</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>Data parity on system bus</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>External cache load</para>
            </entry>
            <entry>
              <para>Memory parity or uncorrectable ECC</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>Associated with Instruction Fetch or Data Load</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>External cache flush</para>
            </entry>
            <entry>
              <para>Cache parity or</para>
              <para>uncorrectable ECC</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>External cache access</para>
            </entry>
            <entry>
              <para>Cache parity or</para>
              <para>uncorrectable ECC</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>Associated with Instruction Fetch or Data Transfer</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>Processor</para>
            </entry>
            <entry>
              <para>I/O</para>
            </entry>
            <entry>
              <para>Load or Store</para>
            </entry>
            <entry>
              <para>Various errors between the processor and the I/O
              fabric</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>I/O fabrics include hubs and bridges and interconnecting
              buses or links.</para>
            </entry>
          </row>
          <row>
            <entry morerows="3">
              <para>Processor</para>
            </entry>
            <entry morerows="3">
              <para>I/O bus configuration space</para>
            </entry>
            <entry morerows="1">
              <para>Read</para>
            </entry>
            <entry>
              <para>Various, except no response from IOA</para>
            </entry>
            <entry>
              <para>Firmware receives a machine check, OS receives
              all=1&#8217;s data along with a
              <emphasis>Status</emphasis> of -1 from the RTAS call</para>
            </entry>
            <entry>
              <para>If EEH is implemented and enabled, firmware does not get
              a machine check and the PE is in the EEH Stopped State on
              return from the RTAS call</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>No response from an IOA</para>
            </entry>
            <entry>
              <para>All-1&#8217;s data returned, along with a
              &#8220;Success&#8221;
              <emphasis>Status</emphasis> from the RTAS call</para>
            </entry>
            <entry>
              <para>If EEH is implemented and enabled, the PE is not in the
              EEH Stopped State on return from the RTAS call</para>
            </entry>
          </row>
          <row>
            <entry morerows="1">
              <para>Write</para>
            </entry>
            <entry>
              <para>Various, except no response from IOA</para>
            </entry>
            <entry>
              <para>Firmware receives a machine check, OS receives a
              <emphasis>Status</emphasis> of -1 from the RTAS call</para>
            </entry>
            <entry>
              <para>If EEH is implemented and enabled, firmware does not get
              a machine check and the PE is in the EEH Stopped State on
              return from the RTAS call</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>No response from an IOA</para>
            </entry>
            <entry>
              <para>Operation ignored, OS receives a &#8220;Success&#8221;
              <emphasis>Status</emphasis> from the RTAS call</para>
            </entry>
            <entry>
              <para>If EEH is implemented and enabled, the PE is in the EEH
              Stopped State on return from the RTAS call</para>
            </entry>
          </row>
          <row>
            <entry morerows="3">
              <para>Processor</para>
            </entry>
            <entry morerows="3">
              <para>I/O bus;</para>
              <para>I/O Space</para>
              <para>or</para>
              <para>Memory Space</para>
            </entry>
            <entry morerows="1">
              <para>Load</para>
            </entry>
            <entry>
              <para>Various, except no response from IOA</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>If EEH is implemented and enabled, no machine check is
              received, all-1&#8217;s data is returned, and the PE enters the
              EEH Stopped State</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>No response from an IOA</para>
            </entry>
            <entry>
              <para>All-1&#8217;s data returned</para>
            </entry>
            <entry>
              <para>Invalid address, broken IOA, or configuration cycle to
              non-existent IOA; if EEH is implemented and enabled, the PE
              enters the EEH Stopped State</para>
            </entry>
          </row>
          <row>
            <entry morerows="1">
              <para>Store</para>
            </entry>
            <entry>
              <para>Various, except no response from IOA</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>If EEH is implemented and enabled, no machine check is
              received and the PE enters the EEH Stopped State</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>No response from IOA</para>
            </entry>
            <entry>
              <para>Ignore Store</para>
            </entry>
            <entry>
              <para>Invalid address, broken IOA, or configuration cycle to
              non-existent IOA; if EEH is implemented and enabled, the PE
              enters the EEH Stopped State</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>Processor</para>
            </entry>
            <entry>
              <para>Invalid address (addressing an &#8220;undefined&#8221;
              address area)</para>
            </entry>
            <entry>
              <para>Load or Store</para>
            </entry>
            <entry>
              <para>No response from system</para>
            </entry>
            <entry>
              <para>Machine check</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>I/O</para>
            </entry>
            <entry>
              <para>Memory</para>
            </entry>
            <entry>
              <para>DMA - either direction</para>
            </entry>
            <entry>
              <para>Various, including but not limited to:</para>

              <itemizedlist>
                <listitem>
                  <para>Invalid addr accepted by a bridge</para>
                </listitem>

                <listitem>
                  <para>TCE extent</para>
                </listitem>

                <listitem>
                  <para>TCE Page Mapping and Control mis-match or invalid
                  TCE</para>
                </listitem>
              </itemizedlist>

            </entry>
            <entry>
              <para>Machine check unless reportable directly to the IOA in a
              way that allows the IOA to signal the error to its device
              driver</para>
            </entry>
            <entry>
              <para>If EEH is implemented and enabled, no machine check is
              received and the PE enters the EEH Stopped State</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>I/O</para>
            </entry>
            <entry>
              <para>I/O</para>
            </entry>
            <entry>
              <para>DMA - either direction</para>
            </entry>
            <entry>
              <para>Various</para>
            </entry>
            <entry>
              <para>Machine check unless reportable to master of the transfer
              in a way that allows master to recover</para>
            </entry>
            <entry>
              <para>&#160;</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>I/O</para>
            </entry>
            <entry>
              <para>Invalid address</para>
            </entry>
            <entry>
              <para>DMA - either direction</para>
            </entry>
            <entry>
              <para>No response from any IOA</para>
            </entry>
            <entry>
              <para>PCI IOA</para>
              <para>master-aborts</para>
            </entry>
            <entry>
              <para>Signal device driver using an external interrupt</para>
            </entry>
          </row>
          <row>
            <entry>
              <para>PCI IOA</para>
            </entry>
            <entry>
              <para>-</para>
            </entry>
            <entry>
              <para>Any</para>
            </entry>
            <entry>
              <para>Internal, indicated by SERR# or ERR_FATAL</para>
            </entry>
            <entry>
              <para>SERR# or ERR_FATAL, causing machine check</para>
            </entry>
            <entry>
              <para>If EEH is implemented and enabled, no machine check is
              received and the PE enters the EEH Stopped State</para>
            </entry>
          </row>
        </tbody>
      </tgroup>
    </table>

    <para><emphasis role="bold">Implementation Note:</emphasis> IOAs should, whenever possible, detect the
    occurrence of PCI errors on DMA and report them via an external interrupt
    (for possible device driver recovery) or retry the operation. Since
    system state has not been lost, reporting these errors via a machine
    check to the CPUs is inappropriate. Some devices or device drivers may
    cause a catastrophic error. Systems which wish to recover from these
    types of errors should choose devices and device drivers which are
    designed to handle them correctly.</para>

  </section>

</section>