Coherent Platform Facility Error Handling and Recovery

During the course of operation, a coherent platform function can encounter errors. Some possible reasons for errors are:

- Hardware recoverable and unrecoverable errors
- Transient and over-threshold correctable errors

This architecture is not meant to contain an exhaustive list of errors; implementations can vary.
Error States

State                    Value  Description
Normal                   1      Coherent platform function is operating normally. This is the default state.
Disabled                 2      Coherent platform function is operating, but all process execution is disabled.
Temporarily Unavailable  3      Platform has initiated error recovery and the coherent platform function is temporarily not available.
Permanently Unavailable  4      Platform has encountered a fatal error with the coherent platform function. Recovery is impossible without partition reboot, DLPAR re-assignment, system reboot or hardware replacement.
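The state values above map naturally onto an enumeration. The following is a minimal sketch, not part of the architecture: the class name `ErrorState` and the helpers `can_run_processes` and `describe` are hypothetical names introduced for illustration, and the rule that only Normal permits process execution is inferred from the state descriptions.

```python
from enum import IntEnum


class ErrorState(IntEnum):
    """Error states of a coherent platform function (values from the Error States table)."""
    NORMAL = 1                   # operating normally; the default state
    DISABLED = 2                 # operating, but all process execution is disabled
    TEMPORARILY_UNAVAILABLE = 3  # platform has initiated error recovery
    PERMANENTLY_UNAVAILABLE = 4  # fatal error; needs reboot, DLPAR re-assignment or replacement


def can_run_processes(state: ErrorState) -> bool:
    """Hypothetical helper: assume only the Normal state allows process execution."""
    return state is ErrorState.NORMAL


def describe(raw_value: int) -> str:
    """Turn a raw state value into a readable name; raises ValueError if unknown."""
    return ErrorState(raw_value).name.replace("_", " ").title()
```

An OS component could call `describe()` on the raw value it reads back before logging it, so that an unexpected value fails loudly instead of being silently treated as Normal.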
Coherent Platform Function State Transitions

Initial State              Final State                Transition
1 Normal                   2 Disabled                 Platform initiated action
1 Normal                   3 Temporarily Unavailable  Platform initiated action during recovery / H_CONTROL_CA_FUNCTION Read Error State
1 Normal                   4 Permanently Unavailable  Platform detected permanent error
2 Disabled                 1 Normal                   H_CONTROL_CA_FUNCTION / Reset or Partition Reboot / DLPAR re-assignment
2 Disabled                 3 Temporarily Unavailable  Platform initiated action
2 Disabled                 4 Permanently Unavailable  Platform detected permanent error
3 Temporarily Unavailable  1 Normal                   Not a valid transition, must go through state 2
3 Temporarily Unavailable  2 Disabled                 Platform initiated action after recovery
3 Temporarily Unavailable  4 Permanently Unavailable  Platform detected permanent error
4 Permanently Unavailable  1 Normal                   Partition reboot or DLPAR reassignment
4 Permanently Unavailable  2 Disabled                 Not a valid transition
4 Permanently Unavailable  3 Temporarily Unavailable  H_CONTROL_CA_FACILITY / Reset
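The transition table can be held as a lookup keyed by (initial state, final state). The sketch below is illustrative only: it assumes the table is read with rows as the initial state and columns as the final state, and the names `TRANSITIONS`, `transition_trigger` and `path_is_valid` are hypothetical.

```python
# State values as defined in the Error States table.
NORMAL, DISABLED, TEMP_UNAVAILABLE, PERM_UNAVAILABLE = 1, 2, 3, 4

# (initial state, final state) -> trigger text taken from the table.
# Pairs the table marks "Not a valid transition" are deliberately left out.
TRANSITIONS = {
    (NORMAL, DISABLED): "Platform initiated action",
    (NORMAL, TEMP_UNAVAILABLE): "Platform initiated action during recovery / "
                                "H_CONTROL_CA_FUNCTION Read Error State",
    (NORMAL, PERM_UNAVAILABLE): "Platform detected permanent error",
    (DISABLED, NORMAL): "H_CONTROL_CA_FUNCTION / Reset or Partition Reboot / "
                        "DLPAR re-assignment",
    (DISABLED, TEMP_UNAVAILABLE): "Platform initiated action",
    (DISABLED, PERM_UNAVAILABLE): "Platform detected permanent error",
    (TEMP_UNAVAILABLE, DISABLED): "Platform initiated action after recovery",
    (TEMP_UNAVAILABLE, PERM_UNAVAILABLE): "Platform detected permanent error",
    (PERM_UNAVAILABLE, NORMAL): "Partition reboot or DLPAR reassignment",
    (PERM_UNAVAILABLE, TEMP_UNAVAILABLE): "H_CONTROL_CA_FACILITY / Reset",
}


def transition_trigger(initial, final):
    """Return the trigger text for a transition, or None if it is not valid."""
    return TRANSITIONS.get((initial, final))


def path_is_valid(states):
    """Check that every consecutive pair in a sequence of states is a valid transition."""
    return all((a, b) in TRANSITIONS for a, b in zip(states, states[1:]))
```

For example, `path_is_valid([3, 2, 1])` holds because recovery passes through Disabled, while `transition_trigger(3, 1)` returns `None` because Temporarily Unavailable cannot move directly to Normal.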
General Error Recovery Procedure

The following flow describes the general error recovery steps required to reset operation of the coherent platform function. This recovery is initiated by platform firmware or after an error is detected by the OS.

1. If necessary, platform firmware freezes OS MMIO access for the coherent platform function. Error information is collected and cached in platform firmware for later retrieval by the OS.
2. Platform firmware terminates and removes all processes and disables the coherent platform function if possible. The error state for the coherent platform function changes to Temporarily Unavailable.
3. Platform firmware resets and reconfigures hardware associated with the coherent platform function.
4. Platform firmware unfreezes OS MMIO access and sets the coherent platform function to Disabled.
5. OS calls H_CONTROL_CA_FUNCTION with operation of “Read Error State” until it observes a state of Disabled.
6. Optionally, OS collects any coherent platform function error data via H_CONTROL_CA_FUNCTION with operations of “Get Error Buffer” and “Get Error Log”.
7. OS calls H_CONTROL_CA_FUNCTION with operation of “Get Download Status” to determine whether a download is required for the coherent platform function. If so, OS performs the download. After the download, the coherent platform function is still in the Disabled error state.
8. OS calls ibm,update-nodes and ibm,update-properties for the affected coherent platform facility in order to receive current configuration values.
9. OS uses H_CONTROL_CA_FUNCTION with operation of “Reset” to change the coherent platform function error state back to Normal.
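The OS-side portion of the procedure above (from polling “Read Error State” through the final “Reset”) can be sketched as follows. This is a simulation under stated assumptions, not a real hypervisor interface: `control(op)` is a hypothetical wrapper that stands in for H_CONTROL_CA_FUNCTION and takes the operation name as a string, a nonzero “Get Download Status” result is assumed to mean a download is required, the bounded polling policy is an assumption, and `FakePlatform` is a test double.

```python
import time
from typing import Callable

DISABLED = 2  # "Disabled" value from the Error States table


def os_recover(control: Callable[[str], int],
               download: Callable[[], None],
               refresh_config: Callable[[], None],
               poll_interval_s: float = 0.0,
               max_polls: int = 100) -> bool:
    """Run the OS half of the recovery flow; True once Reset has been issued."""
    # Poll "Read Error State" until the platform reports Disabled.
    for _ in range(max_polls):
        if control("Read Error State") == DISABLED:
            break
        time.sleep(poll_interval_s)
    else:
        return False  # never observed Disabled; giving up is an assumed policy

    # Optional step: collect cached error data.
    control("Get Error Buffer")
    control("Get Error Log")

    # Download only if required (nonzero == required is an assumption).
    if control("Get Download Status"):
        download()

    refresh_config()   # stands in for ibm,update-nodes / ibm,update-properties
    control("Reset")   # Disabled -> Normal
    return True


class FakePlatform:
    """Test double: reports Temporarily Unavailable twice, then Disabled."""

    def __init__(self):
        self.states = [3, 3, 2]
        self.calls = []

    def control(self, op: str) -> int:
        self.calls.append(op)
        if op == "Read Error State":
            # Consume queued states, then keep returning the last one.
            return self.states.pop(0) if len(self.states) > 1 else self.states[0]
        if op == "Get Download Status":
            return 1  # pretend a download is required
        return 0
```

Passing the hypervisor call in as a callable keeps the ordering logic testable without real firmware; a production OS would replace `FakePlatform.control` with its own call path.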
OS Application Detected Error

Application detects an error, using implementation dependent methods.

1. Application detects an error and determines a reset is necessary.
2. Application asks the OS to perform a reset to the AFU.
3. OS calls H_CONTROL_CA_FUNCTION with operation of “Reset”.
4. Platform firmware performs a reset to the coherent platform function. See H_CONTROL_CA_FUNCTION with “Reset” operation for details.
5. OS receives a return code from H_CONTROL_CA_FUNCTION indicating whether the reset was successful.
6. If necessary, the OS can perform a download of the coherent platform function via H_DOWNLOAD_CA_FUNCTION.
7. OS then re-attaches process contexts as necessary and the application resumes normal operation.

If the Reset operation does not clear the error, OS should signal a serviceable error to the HMC and discontinue use of the coherent platform function.
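The branching in the flow above, where a successful reset leads to an optional download and re-attach and a failed reset leads to a serviceable error, can be sketched as below. All callables are hypothetical stand-ins for OS internals, and treating a return code of 0 as success is an assumption of this sketch; the actual return codes are defined by the H_CONTROL_CA_FUNCTION “Reset” operation.

```python
from typing import Callable

RESET_OK = 0  # assumed "success" return code for this sketch


def handle_app_reset_request(reset: Callable[[], int],
                             download_needed: Callable[[], bool],
                             download: Callable[[], None],
                             reattach_contexts: Callable[[], None],
                             report_serviceable_error: Callable[[], None]) -> bool:
    """Handle an application's reset request; True if operation can resume."""
    rc = reset()  # stands in for H_CONTROL_CA_FUNCTION with "Reset"
    if rc != RESET_OK:
        # Reset did not clear the error: report it and stop using the function.
        report_serviceable_error()
        return False

    if download_needed():
        download()  # stands in for H_DOWNLOAD_CA_FUNCTION

    reattach_contexts()  # re-attach process contexts before the application resumes
    return True
```

Returning a boolean lets the caller decide how to tell the application whether it may resume or must stop using the function.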
Application Download

There are some instances in which the OS would like to re-download a coherent platform function in operation. This could be due to unexpected behavior or bad state.

1. OS resets the coherent platform function by calling H_CONTROL_CA_FUNCTION with operation of “Reset”. The reset clears the download state.
2. OS performs a download to the coherent platform function using H_DOWNLOAD_CA_FUNCTION. After a successful download, OS calls ibm,update-nodes and ibm,update-properties for the affected coherent platform facility in order to receive current configuration values.
3. OS can attach processes and proceed with operation.