You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Linux-Architecture-Reference/LoPAR/sec_error_introduction.xml

86 lines
4.4 KiB
XML

<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2016, 2020 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="dbdoclet.50569337_66258">
<title>Introduction</title>
<para>RTAS provides a mechanism which helps OSs avoid the need for
platform-dependent code that checks for, or recovers from, errors or
exceptional conditions. The mechanism is used to return information about
hardware errors which have occurred as well as information about non-error
events, such as environmental conditions (for example, temperature or
voltage out-of-bounds) which may need OS attention. This permits RTAS to
pass hardware event information to the OS in a way which is abstracted from
the platform hardware. This mechanism primarily presents itself to the OS
via two RTAS functions,
<emphasis>event-scan</emphasis> and
<emphasis>check-exception</emphasis>, which are described further in
<xref linkend="dbdoclet.50569332_16852"/>. A further RTAS function,
<emphasis>rtas-last-error</emphasis>, is also provided to return
information about hardware failures detected specifically within an RTAS
call.</para>
<para>The
<emphasis>event-scan</emphasis> function is called periodically to check for
the presence or past occurrence of a hardware event, such as a soft failure
or voltage condition, which did not cause a program exception or interrupt
(for example, an ECC error detected and corrected by background scrubbing
activity). The
<emphasis>check-exception</emphasis> function is called to provide further
detail on what platform event has occurred when certain exceptions or
interrupts are signaled. The events reported by these two functions are
mutually exclusive on any given platform; that is, a platform may choose to
notify the OS of a particular event type either through
<emphasis>event-scan</emphasis> or through an interrupt and
<emphasis>check-exception</emphasis>, but not both.</para>
<para>Since firmware is platform-specific, it can examine hardware
registers, can often diagnose many kinds of hardware errors down to a root
cause, and may even perform some very limited kinds of error recovery on
behalf of the OS. The reporting format, described in this chapter, permits
firmware to report the type of error which has occurred, what entities in
the platform were involved in the error, and whether firmware has
successfully recovered from the error without the need for further OS
involvement. Firmware may not, in many cases, be able to determine all the
details of an error, so there are also returned values which indicate this
fact. Firmware may optionally provide extended error diagnostic
information, as described in
<xref linkend="dbdoclet.50569337_79682"/>.</para>
<para>The abstractions provided by this architecture enable the handling of
most platform errors and events without integrating platform-specific code
into each supported OS.</para>
<para><emphasis role="bold">Architecture Note:</emphasis> It is not a goal of the firmware to diagnose all
hardware failures. Most I/O device failures, for example, will be detected
and recovered by an associated device driver. Firmware attempts to
determine the cause of a problem and report what it finds, to aid the end
user (by providing meaningful diagnostic data for messages) and to prevent
the loss of error syndrome information. Firmware is never required to
correct any problem, but in some cases may attempt to do so. System vendors
who want more extensive error diagnosis may create OS error handlers which
contain specific hardware knowledge, or could use firmware to collect a
minimum set of error information which could then be used by diagnostics to
further analyze the cause of the error.</para>
</section>