<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2016, 2020 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<appendix xmlns="http://docbook.org/ns/docbook"
xmlns:xl="http://www.w3.org/1999/xlink"
version="5.0"
xml:lang="en"
xml:id="app_coherent_platform_facility_error_handling_recovery">
<title>Coherent Platform Facility Error Handling and Recovery</title>
<para>During the course of operation, a coherent platform function can encounter errors.
Some possible reasons for errors are:</para>
<itemizedlist>
<listitem>
<para>Hardware recoverable and unrecoverable errors</para>
</listitem>
<listitem>
<para>Transient and over-threshold correctable errors</para>
</listitem>
</itemizedlist>
<para>This architecture is not meant to contain an exhaustive list of errors;
implementations can vary.</para>
<section xml:id="app_capi_error_states">
<title>Error States</title>
<informaltable frame="all" pgwide="1">
<tgroup cols="3">
<colspec colname="c1" colwidth="25*" align="center" />
<colspec colname="c2" colwidth="10*" align="center" />
<colspec colname="c3" colwidth="65*" />
<thead valign="middle">
<row>
<entry>
<para>
<emphasis role="bold">State</emphasis>
</para>
</entry>
<entry>
<para>
<emphasis role="bold">Value</emphasis>
</para>
</entry>
<entry align="center">
<para>
<emphasis role="bold">Description</emphasis>
</para>
</entry>
</row>
</thead>
<tbody valign="middle">
<row>
<entry>
<para>Normal</para>
</entry>
<entry>
<para>1</para>
</entry>
<entry>
<para>Coherent platform function is operating normally. This is the
default state.</para>
</entry>
</row>
<row>
<entry>
<para>Disabled</para>
</entry>
<entry>
<para>2</para>
</entry>
<entry>
<para>Coherent platform function is operating, but all process
execution is disabled.</para>
</entry>
</row>
<row>
<entry>
<para>Temporarily Unavailable</para>
</entry>
<entry>
<para>3</para>
</entry>
<entry>
<para>Platform has initiated error recovery and the coherent
platform function is temporarily not available.</para>
</entry>
</row>
<row>
<entry>
<para>Permanently Unavailable</para>
</entry>
<entry>
<para>4</para>
</entry>
<entry>
<para>Platform has encountered a fatal error with the coherent platform
function. Recovery is not possible without a partition reboot, DLPAR
re-assignment, system reboot, or hardware replacement.</para>
</entry>
</row>
</tbody>
</tgroup>
</informaltable>
</section>
<section xml:id="app_coherent_platform_function_state_transitions">
<title>Coherent Platform Function State Transitions</title>
<informaltable frame="all" pgwide="1" >
<tgroup cols="5">
<colspec colname="c1" colwidth="20*" align="center" />
<colspec colname="c2" colwidth="20*" align="center" />
<colspec colname="c3" colwidth="20*" align="center" />
<colspec colname="c4" colwidth="20*" align="center" />
<colspec colname="c5" colwidth="20*" align="center" />
<thead valign="middle">
<row>
<entry morerows="1">
<para><emphasis role="bold">Initial State</emphasis></para>
</entry>
<entry namest='c2' nameend='c5'>
<para><emphasis role="bold">Final State</emphasis></para>
</entry>
</row>
<row>
<entry>
<para><emphasis role="bold">1</emphasis><?linebreak?>
<emphasis role="bold">Normal</emphasis></para>
</entry>
<entry>
<para><emphasis role="bold">2</emphasis><?linebreak?>
<emphasis role="bold">Disabled</emphasis></para>
</entry>
<entry>
<para><emphasis role="bold">3</emphasis><?linebreak?>
<emphasis role="bold">Temporarily Unavailable</emphasis></para>
</entry>
<entry>
<para><emphasis role="bold">4</emphasis><?linebreak?>
<emphasis role="bold">Permanently Unavailable</emphasis></para>
</entry>
</row>
</thead>
<tbody valign="middle">
<row>
<entry>
<para>1<?linebreak?>Normal</para>
</entry>
<entry>
<?dbhtml bgcolor="#87878a" ?><?dbfo bgcolor="#87878a" ?>
<para>&#160;</para>
</entry>
<entry>
<para>Platform initiated action</para>
</entry>
<entry>
<para>Platform initiated action during recovery /
H_CONTROL_<?linebreak?>CA_FUNCTION Read Error State</para>
</entry>
<entry>
<para>Platform detected permanent error</para>
</entry>
</row>
<row>
<entry>
<para>2<?linebreak?>Disabled</para>
</entry>
<entry>
<para>H_CONTROL_<?linebreak?>CA_FUNCTION / Reset or Partition Reboot /
DLPAR re-assignment</para>
</entry>
<entry>
<?dbhtml bgcolor="#87878a" ?><?dbfo bgcolor="#87878a" ?>
<para>&#160;</para>
</entry>
<entry>
<para>Platform initiated action</para>
</entry>
<entry>
<para>Platform detected permanent error</para>
</entry>
</row>
<row>
<entry>
<para>3<?linebreak?>Temporarily Unavailable</para>
</entry>
<entry>
<?dbhtml bgcolor="#dec69e" ?><?dbfo bgcolor="#dec69e" ?>
<para>Not a valid transition; must go through state 2</para>
</entry>
<entry>
<para>Platform initiated action after recovery</para>
</entry>
<entry>
<?dbhtml bgcolor="#87878a" ?><?dbfo bgcolor="#87878a" ?>
<para>&#160;</para>
</entry>
<entry>
<para>Platform detected permanent error</para>
</entry>
</row>
<row>
<entry>
<para>4<?linebreak?>Permanently Unavailable</para>
</entry>
<entry>
<para>Partition reboot or DLPAR re-assignment</para>
</entry>
<entry>
<?dbhtml bgcolor="#dec69e" ?><?dbfo bgcolor="#dec69e" ?>
<para>Not a valid transition</para>
</entry>
<entry>
<para>H_CONTROL_<?linebreak?>CA_FACILITY / Reset</para>
</entry>
<entry>
<?dbhtml bgcolor="#87878a" ?><?dbfo bgcolor="#87878a" ?>
<para>&#160;</para>
</entry>
</row>
</tbody>
</tgroup>
</informaltable>
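<para>As an illustration only, the table above can be encoded as a lookup keyed by
(initial state, final state). The Python names <literal>ErrorState</literal>,
<literal>TRANSITIONS</literal>, and <literal>is_valid_transition</literal> are
hypothetical and are not defined by this architecture; the trigger strings are copied
from the table.</para>
<programlisting language="python"><![CDATA[
```python
from enum import IntEnum


class ErrorState(IntEnum):
    """Error state values from the Error States table."""
    NORMAL = 1
    DISABLED = 2
    TEMPORARILY_UNAVAILABLE = 3
    PERMANENTLY_UNAVAILABLE = 4


N, D, T, P = ErrorState  # short aliases, in definition order

# (initial, final) -> trigger text from the transition table.
# Missing pairs are the shaded diagonal or "Not a valid transition".
TRANSITIONS = {
    (N, D): "Platform initiated action",
    (N, T): "Platform initiated action during recovery / "
            "H_CONTROL_CA_FUNCTION Read Error State",
    (N, P): "Platform detected permanent error",
    (D, N): "H_CONTROL_CA_FUNCTION / Reset or Partition Reboot / "
            "DLPAR re-assignment",
    (D, T): "Platform initiated action",
    (D, P): "Platform detected permanent error",
    (T, D): "Platform initiated action after recovery",
    (T, P): "Platform detected permanent error",
    (P, N): "Partition reboot or DLPAR reassignment",
    (P, T): "H_CONTROL_CA_FACILITY / Reset",
}


def is_valid_transition(initial, final):
    """Return True when the table lists a trigger for initial -> final."""
    return (ErrorState(initial), ErrorState(final)) in TRANSITIONS
```
]]></programlisting>
<para>With this encoding, <literal>is_valid_transition(3, 1)</literal> is false,
which matches the table entry stating that Temporarily Unavailable must pass
through state 2 before returning to Normal.</para>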
</section>
<section xml:id="app_general_error_recovery_procedure">
<title>General Error Recovery Procedure</title>
<para>The following flow is a description of the general error recovery steps
required to reset operation of the coherent platform function. This recovery
is initiated by platform firmware or after an error is detected by the OS.</para>
<orderedlist>
<listitem>
<para>If necessary, platform firmware freezes OS MMIO access for the coherent
platform function. Error information is collected and cached in platform
firmware for later retrieval by the OS.</para>
</listitem>
<listitem>
<para>Platform firmware terminates and removes all processes and disables
coherent platform function if possible.</para>
</listitem>
<listitem>
<para>The error state for the coherent platform function changes to
Temporarily Unavailable.</para>
</listitem>
<listitem>
<para>Platform firmware resets and reconfigures hardware associated with
coherent platform function.</para>
</listitem>
<listitem>
<para>Platform firmware unfreezes OS MMIO access and sets coherent
platform function to Disabled.</para>
</listitem>
<listitem>
<para>OS calls H_CONTROL_CA_FUNCTION with operation of “Read Error State”
until it observes state of Disabled.</para>
</listitem>
<listitem>
<para>Optionally, OS collects any coherent platform function error data via
H_CONTROL_CA_FUNCTION with operation of “Get Error Buffer” and “Get Error
Log”.</para>
</listitem>
<listitem>
<para>OS calls H_CONTROL_CA_FUNCTION with operation of “Get Download Status”
to determine whether a download is required for the coherent platform
function. If so, OS performs the download. After the download, the coherent
platform function is still in the Disabled error state.</para>
</listitem>
<listitem>
<para>OS calls <emphasis>ibm,update-nodes</emphasis> and
<emphasis>ibm,update-properties</emphasis> for the affected
coherent platform facility in order to receive current configuration values.</para>
</listitem>
<listitem>
<para>OS uses H_CONTROL_CA_FUNCTION with operation of “Reset” to change the
coherent platform function error state back to Normal.</para>
</listitem>
</orderedlist>
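<para>The OS side of this flow (steps 6 through 10) can be sketched as follows.
This is a minimal model under stated assumptions, not an implementation:
<literal>FakePlatform</literal> and <literal>os_recover</literal> are hypothetical
names, the operations are passed as plain strings rather than real hcall arguments,
the error buffer and log contents are not modelled, and the download in step 8 is
assumed to use H_DOWNLOAD_CA_FUNCTION as in the sections that follow.</para>
<programlisting language="python"><![CDATA[
```python
NORMAL, DISABLED, TEMPORARILY_UNAVAILABLE = 1, 2, 3  # Error States table values


class FakePlatform:
    """Hypothetical stand-in for platform firmware; not a real hcall interface."""

    def __init__(self, polls_until_disabled=2, download_required=True):
        self.state = TEMPORARILY_UNAVAILABLE   # result of steps 1 to 3
        self._polls_left = polls_until_disabled
        self.download_required = download_required
        self.calls = []                        # ordered record of OS requests

    def control_ca_function(self, operation):
        """Models H_CONTROL_CA_FUNCTION with the named operation."""
        self.calls.append(operation)
        if operation == "Read Error State":
            # Firmware completes steps 4 and 5 after a few polls.
            if self.state == TEMPORARILY_UNAVAILABLE:
                self._polls_left -= 1
                if self._polls_left <= 0:
                    self.state = DISABLED
            return self.state
        if operation == "Get Download Status":
            return self.download_required
        if operation == "Reset":
            if self.state == DISABLED:
                self.state = NORMAL
            return self.state
        return None  # "Get Error Buffer" / "Get Error Log": payload not modelled

    def download_ca_function(self):
        """Models the download; the function stays Disabled afterwards."""
        self.calls.append("H_DOWNLOAD_CA_FUNCTION")
        self.download_required = False

    def call(self, name):
        """Models ibm,update-nodes and ibm,update-properties."""
        self.calls.append(name)


def os_recover(platform):
    """Run steps 6 through 10 and return the final error state."""
    while platform.control_ca_function("Read Error State") != DISABLED:
        pass  # a real OS would sleep or yield between polls (step 6)
    platform.control_ca_function("Get Error Buffer")          # step 7, optional
    platform.control_ca_function("Get Error Log")
    if platform.control_ca_function("Get Download Status"):   # step 8
        platform.download_ca_function()
    platform.call("ibm,update-nodes")                         # step 9
    platform.call("ibm,update-properties")
    return platform.control_ca_function("Reset")              # step 10
```
]]></programlisting>
<para>In this model, the “Reset” operation is issued only after the Disabled
state has been observed, so the only path back to Normal is through state 2.</para>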
</section>
<section xml:id="app_os_application_detected_error">
<title>OS Application Detected Error</title>
<para>An application detects an error using implementation-dependent methods.</para>
<orderedlist>
<listitem>
<para>Application detects an error and determines a reset is necessary.</para>
</listitem>
<listitem>
<para>Application asks the OS to perform a reset to the AFU.</para>
</listitem>
<listitem>
<para>OS calls H_CONTROL_CA_FUNCTION with operation of “Reset”.</para>
</listitem>
<listitem>
<para>Platform firmware performs a reset to the coherent platform function.
See H_CONTROL_CA_FUNCTION with “Reset” operation for details.</para>
</listitem>
<listitem>
<para>OS receives return code from H_CONTROL_CA_FUNCTION indicating if the
reset was successful.</para>
</listitem>
<listitem>
<para>If necessary, the OS can perform a download of the coherent platform
function via H_DOWNLOAD_CA_FUNCTION.</para>
</listitem>
<listitem>
<para>OS then re-attaches process contexts as necessary and the application
resumes normal operation.</para>
</listitem>
<listitem>
<para>If the Reset operation does not clear the error, the OS should signal a
serviceable error to the HMC and discontinue use of the coherent platform function.</para>
</listitem>
</orderedlist>
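<para>The decision made by the OS in steps 3 through 8 can be sketched as
follows. All names here are hypothetical: each argument is a caller-supplied
callable standing in for the corresponding hcall or OS service, and a return
code of zero is assumed to indicate a successful reset.</para>
<programlisting language="python"><![CDATA[
```python
H_SUCCESS = 0  # assumption: zero is treated as the success return code


def handle_reset_request(reset, download_needed, download, reattach,
                         report_error):
    """Model steps 3 to 8; every argument is a hypothetical callable.

    reset           -- H_CONTROL_CA_FUNCTION with operation "Reset"
    download_needed -- OS check for whether a download is necessary
    download        -- H_DOWNLOAD_CA_FUNCTION
    reattach        -- re-attach process contexts
    report_error    -- signal a serviceable error to the HMC
    """
    if reset() != H_SUCCESS:   # steps 3 to 5: reset and check return code
        report_error()         # step 8: reset did not clear the error
        return "discontinued"
    if download_needed():      # step 6
        download()
    reattach()                 # step 7
    return "resumed"


# Usage: record which actions ran when the reset succeeds.
log = []
outcome = handle_reset_request(
    reset=lambda: H_SUCCESS,
    download_needed=lambda: True,
    download=lambda: log.append("download"),
    reattach=lambda: log.append("reattach"),
    report_error=lambda: log.append("serviceable error"),
)
```
]]></programlisting>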
</section>
<section xml:id="app_os_application_download">
<title>Application Download</title>
<para>There are some instances in which the OS would like to re-download a
coherent platform function in operation. This could be due to unexpected
behavior or bad state.</para>
<orderedlist>
<listitem>
<para>OS resets the coherent platform function by calling
H_CONTROL_CA_FUNCTION with operation of “Reset”. The reset
clears the download state.</para>
</listitem>
<listitem>
<para>OS performs the download to the coherent platform function using
H_DOWNLOAD_CA_FUNCTION.</para>
</listitem>
<listitem>
<para>After a successful download, OS calls <emphasis>ibm,update-nodes</emphasis> and
<emphasis>ibm,update-properties</emphasis> for the affected
coherent platform facility in order to receive current configuration values.</para>
</listitem>
<listitem>
<para>OS can attach processes and proceed with operation.</para>
</listitem>
</orderedlist>
</section>
</appendix>