Coherent Platform Facility Error Handling and Recovery
During the course of operation, a coherent platform function can encounter errors.
Some possible reasons for errors are:
Hardware recoverable and unrecoverable errors
Transient and over-threshold correctable errors
This architecture is not meant to contain an exhaustive list of errors;
implementations can vary.
Error States
State                    Value  Description
Normal                   1      Coherent platform function is operating normally.
                                This is the default state.
Disabled                 2      Coherent platform function is operating, but all
                                process execution is disabled.
Temporarily Unavailable  3      Platform has initiated error recovery and the
                                coherent platform function is temporarily not
                                available.
Permanently Unavailable  4      Platform has encountered a fatal error with the
                                coherent platform function. Recovery is impossible
                                without partition reboot, DLPAR re-assignment,
                                system reboot, or hardware replacement.
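As an illustration only, the error state values in the table above could be modeled on the OS side as follows. This is a minimal sketch: the names `ErrorState`, `can_execute_processes`, and `needs_external_recovery` are hypothetical and are not defined by this architecture. Only the numeric values and their meanings come from the table.

```python
from enum import IntEnum


class ErrorState(IntEnum):
    """Error state values from the Error States table (class name is hypothetical)."""
    NORMAL = 1
    DISABLED = 2
    TEMPORARILY_UNAVAILABLE = 3
    PERMANENTLY_UNAVAILABLE = 4


def can_execute_processes(state):
    # Per the table, process execution is available only in the Normal state.
    return ErrorState(state) is ErrorState.NORMAL


def needs_external_recovery(state):
    # Permanently Unavailable requires a partition reboot, DLPAR re-assignment,
    # system reboot, or hardware replacement.
    return ErrorState(state) is ErrorState.PERMANENTLY_UNAVAILABLE
```

Both helpers accept either an `ErrorState` member or the raw integer value returned by the platform.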
Coherent Platform Function State Transitions
Initial State              Final State                Cause
1 Normal                   2 Disabled                 Platform initiated action
1 Normal                   3 Temporarily Unavailable  Platform initiated action during recovery /
                                                      H_CONTROL_CA_FUNCTION Read Error State
1 Normal                   4 Permanently Unavailable  Platform detected permanent error
2 Disabled                 1 Normal                   H_CONTROL_CA_FUNCTION / Reset or Partition Reboot /
                                                      DLPAR re-assignment
2 Disabled                 3 Temporarily Unavailable  Platform initiated action
2 Disabled                 4 Permanently Unavailable  Platform detected permanent error
3 Temporarily Unavailable  1 Normal                   Not a valid transition, must go through state 2
3 Temporarily Unavailable  2 Disabled                 Platform initiated action after recovery
3 Temporarily Unavailable  4 Permanently Unavailable  Platform detected permanent error
4 Permanently Unavailable  1 Normal                   Partition reboot or DLPAR reassignment
4 Permanently Unavailable  2 Disabled                 Not a valid transition
4 Permanently Unavailable  3 Temporarily Unavailable  H_CONTROL_CA_FACILITY / Reset
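The transition table can also be read as a lookup keyed by (initial state, final state). The sketch below is illustrative only. It assumes that each row of the table lists its entries in final-state order with the same-state entry omitted, and the names `TRANSITIONS`, `transition_cause`, and `is_valid_transition` are hypothetical.

```python
NORMAL, DISABLED, TEMP_UNAVAILABLE, PERM_UNAVAILABLE = 1, 2, 3, 4

# (initial, final) -> cause, as listed in the state transition table.
# Entries the table marks "Not a valid transition" are deliberately absent.
TRANSITIONS = {
    (NORMAL, DISABLED): "Platform initiated action",
    (NORMAL, TEMP_UNAVAILABLE): "Platform initiated action during recovery / "
                                "H_CONTROL_CA_FUNCTION Read Error State",
    (NORMAL, PERM_UNAVAILABLE): "Platform detected permanent error",
    (DISABLED, NORMAL): "H_CONTROL_CA_FUNCTION / Reset or Partition Reboot / "
                        "DLPAR re-assignment",
    (DISABLED, TEMP_UNAVAILABLE): "Platform initiated action",
    (DISABLED, PERM_UNAVAILABLE): "Platform detected permanent error",
    (TEMP_UNAVAILABLE, DISABLED): "Platform initiated action after recovery",
    (TEMP_UNAVAILABLE, PERM_UNAVAILABLE): "Platform detected permanent error",
    (PERM_UNAVAILABLE, NORMAL): "Partition reboot or DLPAR reassignment",
    (PERM_UNAVAILABLE, TEMP_UNAVAILABLE): "H_CONTROL_CA_FACILITY / Reset",
}


def transition_cause(initial, final):
    """Return the listed cause, or None when the table has no valid entry."""
    return TRANSITIONS.get((initial, final))


def is_valid_transition(initial, final):
    return (initial, final) in TRANSITIONS
```

For example, `is_valid_transition(TEMP_UNAVAILABLE, NORMAL)` is false because that path must go through state 2 (Disabled) first.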
General Error Recovery Procedure
The following flow is a description of the general error recovery steps
required to reset operation of the coherent platform function. This recovery
is initiated by platform firmware or after an error is detected by the OS.
1. If necessary, platform firmware freezes OS MMIO access for the coherent
   platform function. Error information is collected and cached in platform
   firmware for later retrieval by the OS.
2. Platform firmware terminates and removes all processes and disables the
   coherent platform function if possible.
3. The error state for the coherent platform function changes to
   Temporarily Unavailable.
4. Platform firmware resets and reconfigures the hardware associated with
   the coherent platform function.
5. Platform firmware unfreezes OS MMIO access and sets the coherent
   platform function to Disabled.
6. OS calls H_CONTROL_CA_FUNCTION with operation of “Read Error State”
   until it observes a state of Disabled.
7. Optionally, OS collects any coherent platform function error data via
   H_CONTROL_CA_FUNCTION with operations of “Get Error Buffer” and “Get Error
   Log”.
8. OS calls H_CONTROL_CA_FUNCTION with operation of “Get Download Status”
   to determine whether a download is required for the coherent platform
   function. If so, OS performs the download. After the download, the
   coherent platform function is still in the Disabled error state.
9. OS calls ibm,update-nodes and ibm,update-properties for the affected
   coherent platform facility in order to receive current configuration
   values.
10. OS uses H_CONTROL_CA_FUNCTION with operation of “Reset” to change the
    coherent platform function error state back to Normal.
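The OS-side portion of this flow (polling the error state through the final Reset) can be sketched as follows. This is a simulation under stated assumptions, not a binding to the real hypervisor interface: `FakePlatform`, `control`, `download`, `update_device_tree`, and `os_recover` are hypothetical names, and the early exit on Permanently Unavailable is an assumption the procedure itself does not describe.

```python
NORMAL, DISABLED, TEMP_UNAVAILABLE, PERM_UNAVAILABLE = 1, 2, 3, 4


class FakePlatform:
    """Hypothetical stand-in for platform firmware; records every call made."""

    def __init__(self, states, download_required=False):
        self._states = list(states)  # values returned by successive state reads
        self.download_required = download_required
        self.calls = []

    def control(self, operation):
        # Models H_CONTROL_CA_FUNCTION with the named operation.
        self.calls.append(operation)
        if operation == "Read Error State":
            # Consume queued states; keep returning the last one once drained.
            return self._states.pop(0) if len(self._states) > 1 else self._states[0]
        if operation == "Get Download Status":
            return self.download_required
        if operation == "Reset":
            self._states = [NORMAL]
            return NORMAL
        return None  # "Get Error Buffer" / "Get Error Log": payload omitted

    def download(self):
        self.calls.append("H_DOWNLOAD_CA_FUNCTION")

    def update_device_tree(self):
        self.calls.extend(["ibm,update-nodes", "ibm,update-properties"])


def os_recover(platform, max_polls=100):
    """Poll for Disabled, collect error data, download if required, then Reset."""
    for _ in range(max_polls):  # a real OS would sleep between polls
        state = platform.control("Read Error State")
        if state == DISABLED:
            break
        if state == PERM_UNAVAILABLE:  # assumption: give up, no OS recovery
            return PERM_UNAVAILABLE
    else:
        raise TimeoutError("function never reached the Disabled state")
    platform.control("Get Error Buffer")  # optional error data collection
    platform.control("Get Error Log")
    if platform.control("Get Download Status"):
        platform.download()
    platform.update_device_tree()
    return platform.control("Reset")
```

Calling `os_recover(FakePlatform([3, 3, 2], download_required=True))` polls three times before the state reads Disabled, then performs the remaining operations in the order the procedure lists them.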
OS Application Detected Error
1. Application detects an error, using implementation-dependent methods.
2. Application determines that a reset is necessary.
3. Application asks the OS to perform a reset of the AFU.
4. OS calls H_CONTROL_CA_FUNCTION with operation of “Reset”.
5. Platform firmware performs a reset of the coherent platform function.
   See H_CONTROL_CA_FUNCTION with “Reset” operation for details.
6. OS receives a return code from H_CONTROL_CA_FUNCTION indicating whether
   the reset was successful.
7. If necessary, the OS can perform a download of the coherent platform
   function via H_DOWNLOAD_CA_FUNCTION.
8. OS then reattaches process contexts as necessary and the application
   resumes normal operation.
9. If the Reset operation does not clear the error, OS should signal a
   serviceable error to the HMC and discontinue use of the coherent
   platform function.
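The OS handling of an application-requested reset might be structured as in the sketch below. It is illustrative only. The function `handle_reset_request` and its callable parameters are hypothetical, and it assumes that a return code of 0 (H_SUCCESS) indicates success and that any other value means the reset did not clear the error.

```python
H_SUCCESS = 0  # assumed success return code for the hypervisor call


def handle_reset_request(reset, download_needed, download, reattach, report_error):
    """Run the OS side of an application-requested reset.

    Every argument is a caller-supplied callable (all hypothetical):
      reset()           issues H_CONTROL_CA_FUNCTION "Reset", returns its code
      download_needed() reports whether a download is necessary
      download()        performs H_DOWNLOAD_CA_FUNCTION
      reattach()        reattaches process contexts
      report_error(rc)  signals a serviceable error (for example, to the HMC)
    """
    rc = reset()
    if rc != H_SUCCESS:
        # Reset did not clear the error: report it and stop using the function.
        report_error(rc)
        return False
    if download_needed():
        download()
    reattach()
    return True
```

Passing the operations in as callables keeps the control flow separate from any particular hypervisor binding, so the same logic can be exercised with stubs.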
Application Download
There are some instances in which the OS needs to re-download a
coherent platform function that is in operation. This could be due to
unexpected behavior or a bad state.
1. OS resets the coherent platform function by calling
   H_CONTROL_CA_FUNCTION with operation of “Reset”. The reset
   clears the download state.
2. OS performs the download to the coherent platform function using
   H_DOWNLOAD_CA_FUNCTION.
3. After a successful download, OS calls ibm,update-nodes and
   ibm,update-properties for the affected coherent platform facility in
   order to receive current configuration values.
4. OS can attach processes and proceed with operation.
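The re-download sequence can be summarized in a short sketch. The function `redownload` and the `hcall`, `rtas`, and `attach_processes` callables are hypothetical placeholders for whatever mechanism the OS uses to reach the hypervisor and RTAS. A return code of 0 is assumed to indicate success.

```python
H_SUCCESS = 0  # assumed success return code


def redownload(hcall, rtas, attach_processes):
    """Reset, download, refresh configuration, then attach processes.

    hcall(name, operation=None) and rtas(name) are caller-supplied callables
    (hypothetical); hcall returns the hypervisor return code.
    """
    # The reset clears the download state.
    if hcall("H_CONTROL_CA_FUNCTION", "Reset") != H_SUCCESS:
        return False
    if hcall("H_DOWNLOAD_CA_FUNCTION") != H_SUCCESS:
        return False
    # Only after a successful download: pick up current configuration values.
    rtas("ibm,update-nodes")
    rtas("ibm,update-properties")
    attach_processes()
    return True
```

A failed download returns before the configuration refresh, which matches the requirement that the update calls follow a successful download.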