EEH Error Processing

This appendix describes the architectural intent for EEH error processing. It does not attempt to illustrate all possible scenarios, and other implementations are possible.
General Scenarios

In general, device driver recovery consists of issuing an ibm,read-slot-reset-state2 call before doing any recovery to determine (1) whether the IOA is in the MMIO Stopped and DMA Stopped state (that is, whether an error has occurred which has put it into this state), and (2) whether the PE has been reset by the platform in the process of entering the MMIO Stopped and DMA Stopped state, and then doing one of the following:

1. Simplest approach:
   a. Reset the PE
   b. Reconfigure the PE
2. Most general approach (detailed further in "More Detail on the Most General Approach"):
   a. Release the PE for Load/Store
   b. Issue Load/Store instructions to get any desired state information from the IOA
   c. Call the ibm,slot-error-detail RTAS call to get the platform error information
   d. Log the error information
   e. Reset the PE
   f. Reconfigure the PE
3. Most robust (no reset unless necessary):
   a. Release the PE for Load/Store
   b. Issue Load/Store instructions to get any desired state information from the IOA
   c. Call the ibm,slot-error-detail RTAS call to get the platform error information
   d. Log the error information
   e. Device driver does IOA cleanup
   f. Release the PE for DMA and restart operations (no reset)

In any scenario, after several retries of a recoverable operation, the OS may determine that further recovery efforts should cease. In such a case, calling ibm,slot-error-detail with Function 2 (Permanent Error) not only returns error information but also marks the PE as no longer accessible due to previous errors.
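The choice among the three approaches can be pictured as selecting an ordered plan of steps. The following is only an illustrative sketch: the `Approach` enum, the step labels, and the `recovery_plan` function are hypothetical names introduced here, and the rule "stop once the retry count reaches a limit" is an assumed policy, since the text leaves that decision to the OS.

```python
from enum import Enum
from typing import List


class Approach(Enum):
    SIMPLEST = "simplest"
    MOST_GENERAL = "most general"
    MOST_ROBUST = "most robust"


# Step labels are informal summaries of the text, not RTAS identifiers.
_CAPTURE_AND_LOG = [
    "release PE for Load/Store",
    "read IOA state via Load/Store",
    "call ibm,slot-error-detail",
    "log error information",
]

PLANS = {
    Approach.SIMPLEST: ["reset PE", "reconfigure PE"],
    Approach.MOST_GENERAL: _CAPTURE_AND_LOG + ["reset PE", "reconfigure PE"],
    Approach.MOST_ROBUST: _CAPTURE_AND_LOG
    + ["device driver IOA cleanup", "release PE for DMA and restart"],
}

PERMANENT_STEP = "call ibm,slot-error-detail with Function 2 (Permanent Error)"


def recovery_plan(pe_stopped: bool, approach: Approach,
                  retries: int, max_retries: int) -> List[str]:
    """Return the ordered steps for one recovery attempt.

    pe_stopped stands for the outcome of ibm,read-slot-reset-state2
    (PE in the MMIO Stopped and DMA Stopped state).
    """
    if not pe_stopped:
        return []  # no EEH event, so nothing to recover
    if retries >= max_retries:  # assumed policy; the OS decides when to give up
        return [PERMANENT_STEP]
    return list(PLANS[approach])
```

For example, `recovery_plan(True, Approach.MOST_ROBUST, 0, 3)` yields a plan that contains no reset step, which reflects the "no reset unless necessary" intent of the third approach.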
More Detail on the Most General Approach

The following gives a more detailed look at the most general approach (the second scenario in General Scenarios). The description is broken into two groups of operations: error logging and error recovery. These scenarios assume that:

- The ibm,configure-pe RTAS call is implemented.
- The attempts at recovery stop when Max_Retries_Exceeded is true.
Error Logging

1. If the device driver is going to capture internal IOA-specific information as part of the error logging process, or if the IOA controlled by the device driver requires a longer wait after reset than the normal PCI-specified minimum wait time, then the device driver determines whether its IOA has been reset as a result of entering the EEH Stopped State by looking at the PE Recovery Info output of the ibm,read-slot-reset-state2 RTAS call.
2. The OS or device driver ensures that all MMIOs to the IOA(s) in the PE are finished.
3. If the IOA requires a longer wait after reset than the specified minimum, and the PE was reset (see step 1) as a result of the EEH event, then wait the additional necessary time before continuing.
4. The OS or device driver enables PE MMIOs by calling the ibm,set-eeh-option RTAS call with Function 2.
5. The OS or device driver calls the ibm,configure-pe RTAS call. If the PCI fabric does not need configuring (the PE was not reset before the call, or was reset but has already been configured with ibm,configure-pe), then the call returns without doing anything; otherwise it attempts to configure the fabric up to but not including the endpoint IOA configuration registers.
   If an EEH event occurs as a result of probing during the ibm,configure-pe RTAS call and results in a reset of the PE, the PE will be returned in the PE state of 2. Software does not necessarily need to check this on return from the call. This case is expected to be rare, and probably signals a non-transient error. In this case the software can continue with the recovery phase of the EEH processing, and will eventually hit the same EEH event on further processing.
6. If the PE was reset (see step 1) as a result of the EEH event, and the device driver is going to gather IOA-specific information for logging, then the device driver needs to finish the configuration of the IOA by restoring the PCI configuration space registers of the IOA(s) in the PE (for example, BARs, Memory Space Enable, etc.).
7. If desired, the device driver gathers IOA-specific information by doing MMIOs to its IOA.
8. The OS or device driver calls ibm,slot-error-detail. Any data captured in step 7 is passed in the call. Note that in some cases the maximum amount of data will be captured only when the ibm,slot-error-detail call is made with the PE not in the MMIO Stopped State (as it should be after step 4).
   a. If Max_Retries_Exceeded is true, then call ibm,slot-error-detail with Function 2 (Permanent Error).
   b. If Max_Retries_Exceeded is not true, then call ibm,slot-error-detail with Function 1 (Temporary Error).
   The ibm,slot-error-detail RTAS call captures whatever PCI config space registers it can between the configuration address passed in the call and the system (PHB), including those at the configuration address and at the PHB, and returns them along with the device-specific data in an error log in the return information from the call. This call may encounter another EEH event, in which case it returns what information it can, with a Status of 0 (Success).
9. The OS or device driver logs the log entry returned from the ibm,slot-error-detail RTAS call.
10. If Max_Retries_Exceeded is not true, then the next step is PE Recovery; otherwise stop and mark the IOA(s) in the PE as unusable.
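The ordering of the error logging steps can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation: `FakePlatform` is a hypothetical stand-in that only records what would be called, `log_eeh_error` and its parameters are invented for this sketch, and no real RTAS binding is used. The function codes (2 for enabling MMIO, 1 and 2 for temporary and permanent errors) are the ones named in the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

EEH_ENABLE_MMIO = 2   # ibm,set-eeh-option Function 2
ERR_TEMPORARY = 1     # ibm,slot-error-detail Function 1
ERR_PERMANENT = 2     # ibm,slot-error-detail Function 2


@dataclass
class FakePlatform:
    """Hypothetical recorder; it invokes no firmware."""
    pe_was_reset: bool  # stands for the PE Recovery Info output
    calls: List[Tuple] = field(default_factory=list)

    def rtas(self, name, *args):
        self.calls.append((name,) + args)

    def driver(self, action):
        self.calls.append(("driver", action))


def log_eeh_error(p, wants_ioa_data, needs_longer_wait, max_retries_exceeded):
    """Walk the error logging steps in order and return the next phase."""
    # The sketch always queries the reset state; the text requires it only
    # when IOA data capture or a longer post-reset wait is involved.
    p.rtas("ibm,read-slot-reset-state2")
    p.driver("wait for outstanding MMIOs")
    if needs_longer_wait and p.pe_was_reset:
        p.driver("extra post-reset wait")
    p.rtas("ibm,set-eeh-option", EEH_ENABLE_MMIO)
    p.rtas("ibm,configure-pe")
    if p.pe_was_reset and wants_ioa_data:
        p.driver("restore PCI config space")
    if wants_ioa_data:
        p.driver("gather IOA data via MMIO")
    function = ERR_PERMANENT if max_retries_exceeded else ERR_TEMPORARY
    p.rtas("ibm,slot-error-detail", function)
    p.driver("log returned error entry")
    return "mark unusable" if max_retries_exceeded else "pe recovery"
```

Recording the calls rather than performing them keeps the sketch runnable anywhere and makes the ordering easy to inspect through `p.calls`.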
PE Recovery

1. The OS or device driver does a PE reset sequence. Note that this step is required even if the PE was reset as a result of the initial EEH event, because the error logging steps (for example, the ibm,configure-pe or ibm,slot-error-detail calls) could have encountered another EEH event.
   a. The device driver or OS calls ibm,set-slot-reset with Function 1 or 3 to activate the reset.
   b. The minimum reset active time is waited.
   c. The device driver or OS calls ibm,set-slot-reset with Function 0 to deactivate the reset.
   d. The minimum time from reset inactive to the first configuration cycle is waited. If the IOA requires more than the standard PCI-specified time, then wait that longer time instead.
2. The OS or device driver calls ibm,configure-pe.
   If an EEH event occurs as a result of probing during the ibm,configure-pe RTAS call and results in a reset of the PE, the PE will be returned in the PE state of 2. Software does not necessarily need to check this on return from the call. This case is expected to be rare, and probably signals a non-transient error. In this case the software can continue with the recovery phase of the EEH processing, and will eventually hit the same EEH event on further processing.
3. The device driver restores the PCI configuration spaces of the IOA(s) in the PE.
4. The device driver initializes the IOA for operations.
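The reset sequence at the start of PE recovery can be sketched as below. The function name `pe_reset_sequence` and its parameters are hypothetical, the RTAS invocation and the delay are injected as callables so that no real firmware interface is assumed, and the wait times are deliberately left as arguments because the text refers only to the PCI-specified minimums without giving values.

```python
RESET_DEACTIVATE = 0             # ibm,set-slot-reset Function 0
RESET_ACTIVATE_CHOICES = (1, 3)  # per the text: Function 1 or 3


def pe_reset_sequence(rtas_call, sleep, activate_function,
                      min_active_s, min_settle_s, ioa_settle_s=0.0):
    """Activate reset, wait, deactivate, wait, then reconfigure the PE.

    rtas_call(name, *args) and sleep(seconds) are supplied by the caller.
    ioa_settle_s is the IOA's own post-reset requirement, if any.
    """
    if activate_function not in RESET_ACTIVATE_CHOICES:
        raise ValueError("reset must be activated with Function 1 or 3")
    rtas_call("ibm,set-slot-reset", activate_function)
    sleep(min_active_s)
    rtas_call("ibm,set-slot-reset", RESET_DEACTIVATE)
    # Use the longer of the standard minimum and the IOA's requirement.
    sleep(max(min_settle_s, ioa_settle_s))
    rtas_call("ibm,configure-pe")
    # The device driver then restores PCI configuration space and
    # initializes the IOA; those steps are outside this sketch.
```

Passing `time.sleep` as `sleep` gives real delays, while passing a recording function, as a test might, lets the ordering be checked without waiting.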