EEH Error Processing
This appendix describes the architectural intent for EEH error processing.
This appendix does not attempt to illustrate all possible scenarios, and other
implementations are possible.
General Scenarios
In general, device driver recovery begins with an
ibm,read-slot-reset-state2 call, made before any recovery action,
to determine (1) whether the IOA is in the MMIO Stopped and DMA Stopped state (that is,
an error has occurred that put it into this state), and (2) whether
the PE was reset by the platform in the process of entering the MMIO
Stopped and DMA Stopped state. The driver then does one of the following:
Simplest approach:
Reset the PE
Reconfigure the PE
Most general approach (detailed further in More Detail on the Most General Approach):
Release the PE for Load/Store
Issue Load/Store instructions
to get any desired state information from the IOA
Call the ibm,slot-error-detail RTAS call to get the
platform error information
Log the error information
Reset the PE
Reconfigure the PE
Most robust (no reset unless necessary):
Release the PE for Load/Store
Issue Load/Store instructions
to get any desired state information from the IOA
Call the ibm,slot-error-detail RTAS call to get the
platform error information
Log the error information
Device driver does IOA cleanup
Release the PE for DMA and restart operations (no reset)
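The three approaches differ only in which steps they include and whether a reset is part of the sequence. As a rough illustration, the step lists above could be encoded as data, as in the following Python sketch. The names `Step`, `RECOVERY_PLANS`, and `needs_reset` are hypothetical labels chosen for this example and are not part of any platform interface.

```python
from enum import Enum, auto


class Step(Enum):
    """Recovery steps named after the lists above (hypothetical labels)."""
    RELEASE_FOR_LOAD_STORE = auto()
    READ_IOA_STATE = auto()           # Load/Store instructions to the IOA
    GET_PLATFORM_ERROR_INFO = auto()  # ibm,slot-error-detail
    LOG_ERROR = auto()
    RESET_PE = auto()
    RECONFIGURE_PE = auto()
    DRIVER_CLEANUP = auto()
    RELEASE_FOR_DMA = auto()


# Steps shared by the two approaches that capture error information.
_CAPTURE = [
    Step.RELEASE_FOR_LOAD_STORE,
    Step.READ_IOA_STATE,
    Step.GET_PLATFORM_ERROR_INFO,
    Step.LOG_ERROR,
]

RECOVERY_PLANS = {
    "simplest": [Step.RESET_PE, Step.RECONFIGURE_PE],
    "most_general": _CAPTURE + [Step.RESET_PE, Step.RECONFIGURE_PE],
    "most_robust": _CAPTURE + [Step.DRIVER_CLEANUP, Step.RELEASE_FOR_DMA],
}


def needs_reset(plan_name):
    """Return True if the named plan includes a PE reset step."""
    return Step.RESET_PE in RECOVERY_PLANS[plan_name]
```

In this encoding, `needs_reset("most_robust")` is false, which mirrors the "no reset unless necessary" intent of the third approach.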
In any scenario, after several retries of a recoverable operation, the OS
may determine that further recovery efforts should cease. In such a case,
calling ibm,slot-error-detail with Function 2
(Permanent Error) both returns error information and marks the PE as
no longer accessible due to previous errors.
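One way an OS might track retries and choose the ibm,slot-error-detail function is sketched below in Python. This is a minimal sketch under stated assumptions: the `RetryTracker` class and its method names are hypothetical, and the retry limit is a policy choice that this appendix does not specify. The Function 1 (Temporary Error) and Function 2 (Permanent Error) values are those named in the error logging steps of this appendix.

```python
TEMPORARY_ERROR = 1  # ibm,slot-error-detail Function 1
PERMANENT_ERROR = 2  # ibm,slot-error-detail Function 2


class RetryTracker:
    """Counts recovery attempts for one PE (hypothetical helper)."""

    def __init__(self, max_retries):
        if max_retries < 0:
            raise ValueError("max_retries must be non-negative")
        self.max_retries = max_retries  # assumed OS policy value
        self.attempts = 0

    def record_attempt(self):
        """Call once for each EEH event seen on the recoverable operation."""
        self.attempts += 1

    @property
    def max_retries_exceeded(self):
        return self.attempts > self.max_retries

    def slot_error_detail_function(self):
        """Pick the Function argument for ibm,slot-error-detail."""
        if self.max_retries_exceeded:
            return PERMANENT_ERROR
        return TEMPORARY_ERROR
```

With `max_retries=2`, the first two recorded attempts select Function 1 and the third selects Function 2.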
More Detail on the Most General Approach
The following gives a more detailed look at the most general approach
listed under General Scenarios. The steps are divided into two groups of operations:
error logging and error recovery.
These scenarios assume that:
The ibm,configure-pe RTAS call is implemented.
The attempts at recovery stop when Max_Retries_Exceeded is true.
Error Logging
If the device driver is going to capture internal IOA-specific information
as a part of the error logging process or if the IOA controlled by the device driver
requires a longer wait after reset than the normal PCI specified minimum wait time,
then the device driver determines whether its IOA has been reset as a result of entering
EEH Stopped State, by looking at the PE Recovery Info output of
the ibm,read-slot-reset-state2 RTAS call.
The OS or device driver ensures that all MMIOs to the IOA(s) in the PE are finished.
If the IOA requires a longer wait time after reset than the specified minimum,
and the PE was reset (see step #1) as a result of the EEH event, then wait the
additional necessary time before continuing.
The OS or device driver
enables PE MMIOs by calling the ibm,set-eeh-option RTAS call
with Function 2.
The OS or device driver calls the ibm,configure-pe RTAS call.
If the PCI fabric does not need configuring (the PE was not reset before the
call, or was reset but was previously configured with ibm,configure-pe),
then the call returns without doing anything; otherwise it attempts to configure
the fabric up to but not including the endpoint IOA configuration registers.
If an EEH event occurs as a result of probing during the
ibm,configure-pe RTAS call that results in a
reset of the PE, the PE will be returned in the PE state of 2. Software
does not necessarily need to check this on return from the call. The case
where this occurs is expected to be rare, and probably signals a non-transient error.
In this case the software can continue on with the recovery phase of the EEH processing,
and will eventually hit the same EEH event on further processing.
If the PE was
reset (see step #1) as a result of the EEH event and the device driver is
going to gather IOA-specific information for logging, the driver needs to finish the
configuration of the IOA by restoring the PCI
configuration space registers of the IOA(s) in the PE (for example, BARs and
Memory Space Enable).
If desired, the
device driver gathers IOA-specific information by doing MMIOs to its IOA.
The OS or device driver calls ibm,slot-error-detail.
Any IOA-specific data captured in the preceding step is passed in the
call. Note that in some cases the maximum amount of data is captured only when the
ibm,slot-error-detail call is made with the PE not in the MMIO Stopped
State (as it should be at this point in the sequence).
If Max_Retries_Exceeded is true, then call ibm,slot-error-detail
with Function 2 (Permanent Error).
If Max_Retries_Exceeded is not true, then call ibm,slot-error-detail
with Function 1 (Temporary Error).
The ibm,slot-error-detail RTAS call captures whatever
PCI config space registers it can between the configuration address passed in the call
and the PHB, including those at the configuration address and at the PHB, and
returns them along with the device-specific data in an error log in the return information
from the call. This call may encounter another EEH event, in which case it returns what
information it can, with a Status of 0 (Success).
The OS or device driver logs the log entry returned from the
ibm,slot-error-detail RTAS call.
If Max_Retries_Exceeded is not true, then the next step is PE Recovery,
otherwise stop and mark the IOA(s) in the PE as unusable.
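The ordering of the error logging steps can be sketched in Python as follows. This is an illustration only: `RecordingRtas` and `RecordingDriver` are hypothetical stand-ins that record calls rather than invoking firmware or hardware, their method signatures are simplified and do not reflect the real RTAS argument lists, and the sketch always reads the reset state even though the text makes that step conditional.

```python
class RecordingRtas:
    """Hypothetical stand-in for the RTAS calls; records call order."""

    def __init__(self, pe_was_reset):
        self.pe_was_reset = pe_was_reset
        self.calls = []

    def read_slot_reset_state2(self):
        self.calls.append("ibm,read-slot-reset-state2")
        return self.pe_was_reset  # simplified view of PE Recovery Info

    def set_eeh_option(self, function):
        self.calls.append(f"ibm,set-eeh-option:{function}")

    def configure_pe(self):
        self.calls.append("ibm,configure-pe")

    def slot_error_detail(self, function, driver_data):
        self.calls.append(f"ibm,slot-error-detail:{function}")
        return {"function": function, "driver_data": driver_data}


class RecordingDriver:
    """Hypothetical device driver hooks; records which hooks ran."""

    def __init__(self):
        self.actions = []

    def finish_mmio(self):
        self.actions.append("finish_mmio")

    def wait_extra_reset_time(self):
        self.actions.append("wait_extra_reset_time")

    def restore_config_space(self):
        self.actions.append("restore_config_space")

    def gather_state(self):
        self.actions.append("gather_state")
        return b"ioa-state"


def log_eeh_error(rtas, driver, max_retries_exceeded, log):
    """Run the logging steps; return True if PE Recovery should follow."""
    pe_was_reset = rtas.read_slot_reset_state2()
    driver.finish_mmio()
    if pe_was_reset:
        driver.wait_extra_reset_time()
    rtas.set_eeh_option(2)          # Function 2: enable PE MMIOs
    rtas.configure_pe()
    if pe_was_reset:
        driver.restore_config_space()
    data = driver.gather_state()
    function = 2 if max_retries_exceeded else 1  # Permanent / Temporary
    log.append(rtas.slot_error_detail(function, data))
    return not max_retries_exceeded
```

A caller would proceed to PE Recovery when `log_eeh_error` returns true, and otherwise mark the IOA(s) in the PE as unusable.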
PE Recovery
The OS or device driver performs a PE reset sequence. Note that this step is
required even if the PE was reset as a result of the initial EEH event, because
the error logging steps (for example, the ibm,configure-pe
or ibm,slot-error-detail calls) could have encountered
another EEH event.
The device driver or OS calls ibm,set-slot-reset with
Function 1 or 3 to activate the reset.
The device driver or OS waits the minimum reset active time.
The device driver or OS calls ibm,set-slot-reset with
Function 0 to deactivate the reset.
The device driver or OS waits the minimum time from reset inactive to the first
configuration cycle. If the IOA requires more than the standard PCI specified time,
then wait that longer time instead.
The OS or device driver calls ibm,configure-pe.
If an EEH event occurs as a result of probing
during the ibm,configure-pe RTAS call that results in a reset
of the PE, the PE will be returned in the PE state of 2. Software does not necessarily
need to check this on return from the call. The case where this occurs is expected to
be rare, and probably signals a non-transient error. In this case the software can
continue on with the recovery phase of the EEH processing, and will eventually hit
the same EEH event on further processing.
The device driver restores the PCI configuration spaces of the IOA(s) in the PE.
The device driver initializes the IOA for operations.
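The PE recovery steps above can be sketched in Python as follows. This is a sketch under stated assumptions, not a definitive implementation: `Recorder` is a hypothetical stand-in for both the RTAS interface and the device driver, the method signatures are simplified, and the wait durations are caller-supplied parameters because this appendix does not give their values.

```python
import time

DEACTIVATE_RESET = 0  # ibm,set-slot-reset Function 0


class Recorder:
    """Hypothetical stand-in for RTAS and driver hooks; records events."""

    def __init__(self):
        self.events = []

    def set_slot_reset(self, function):
        self.events.append(("ibm,set-slot-reset", function))

    def configure_pe(self):
        self.events.append(("ibm,configure-pe",))

    def restore_config_space(self):
        self.events.append(("restore_config_space",))

    def initialize(self):
        self.events.append(("initialize",))


def recover_pe(rtas, driver, reset_function, reset_active_s,
               reset_settle_s, ioa_settle_s=0.0, sleep=time.sleep):
    """Reset, reconfigure, and reinitialize the IOA(s) in a PE."""
    if reset_function not in (1, 3):
        raise ValueError("reset is activated with Function 1 or 3")
    rtas.set_slot_reset(reset_function)     # activate the reset
    sleep(reset_active_s)                   # minimum reset active time
    rtas.set_slot_reset(DEACTIVATE_RESET)   # deactivate the reset
    # Wait the longer of the standard time and the IOA-specific time.
    sleep(max(reset_settle_s, ioa_settle_s))
    rtas.configure_pe()
    driver.restore_config_space()
    driver.initialize()
```

Passing `sleep` as a parameter lets the sequence be exercised without real delays; a caller would supply the reset timings required by the platform and the IOA.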