RTAS Error/Event Return Format

This section describes in detail the return value retrieved by an RTAS call to either the event-scan or the check-exception function. The return value consists of a fixed part and an optional Extended Error Report, described in the next section, which contains full details of the error. The fixed part reports the most common problems compactly, which keeps error detection and recovery straightforward for OSs that implement only a minimal error handling strategy. At the same time, the mechanism can provide full disclosure of the error syndrome information for OSs that have a more complete error handling strategy. RTAS can return at most one return code per invocation. If multiple conditions exist, RTAS returns them in descending order of severity on successive calls.

Reporting and Recovery Philosophy, and Description of Fields

All firmware implementations use a common error and event reporting scheme, as described in detail below. It is not required that error recovery be present in firmware implementations, nor is it required that a high degree of error recovery or survival be undertaken by OSs. If such behavior is desired, then specific platform-dependent handlers can be loaded into the OS. However, this section defines return result codes and a philosophy which can be used if aggressive error handling is implemented in firmware.

This section describes the fields of the Error Report format, and the philosophy which should be applied in generating return values from firmware or interpreting such return codes in an OS. In general, an OS would look at the Disposition field first to see if an error has already been corrected by firmware. If the error has not been corrected to the OS’s satisfaction, the OS would examine the Severity field. Based on that value, and optionally on any information it can use from the Type and other fields, the OS determines whether to continue or to halt operations. In either case, it may choose to log information regarding the error, using the remaining fields and the optional Extended Error Log. The following sections describe the field values in the RTAS Event Return Format (Fixed Part) table.

Version

This field is used only to distinguish among present and potential future formats for the remainder of the error report. This value will be incremented if extensions are made to the format described here. The primary function of this field is for future OSs to identify whether an error report may contain some (unknown at present) feature that was added after the initial version of this specification.

Severity

This field represents the value judgment of firmware of how serious the problem being reported should be considered by the OS.

Errors which are believed to represent a permanent hardware failure affecting the entire system are considered “FATAL.” OSs would not attempt to continue normal operation after receiving notice of such an error. OSs may not even be able to perform an orderly shutdown in the presence of a Fatal error, though they may make a policy decision to try.

Less serious errors that still cause a loss of data or state are considered “ERRORs.” In general, continuing after such an error is questionable, since details of what has failed may not be available, or if available, may not map nicely onto any ongoing activity with which the OS can associate it. However, OSs may make a policy decision (for example, based on the error Type, the Initiator, or the Target) to continue operation after an Error.

There are some types of errors, such as parity errors in memory or a parity error on a transfer between CPU and memory, which occur synchronously with the current process execution context. Such errors are sometimes fatal only to the current thread of execution; that is, they affect only the current CPU state and possibly that of any memory locations being currently referenced. If that context of execution is not essential to the system’s operation (for example, if an error trap mechanism is available in the OS and can be triggered to recover the OS to a known state), recovery and continuation may be possible. At the least, since the memory of the machine is in an undamaged state, the system may be able to be brought down in an orderly fashion. Such errors are reported as having Severity “ERROR_SYNC”. It is OS dependent whether recovery is possible after such an error, or whether the OS will treat it as a fatal problem.

The “WARNING” return value indicates that a non-state-losing error, either fully recovered by firmware or not needing recovery, has occurred. No OS action is required, and full operation is expected to continue unhindered by the error. Examples include corrected ECC errors or bus transfer failures which were re-tried successfully.

The “EVENT” return value is the mechanism firmware uses to communicate event information to the OS. The event may have been detected by polling using event-scan or on the occurrence of an interrupt by calling check-exception. In either case, the Type field of the Error Return value indicates which event has occurred. See the Type description below for a description of specific events and their expected handling.

The “NO_ERROR” return value indicates that no error was present. In this case, the remainder of the Error Return fields are not valid and should not be referenced.

RTAS Disposition

An aggressive firmware implementation may choose to attempt recovery for some classes of error so an OS can continue operation in the face of recoverable errors. If firmware detects an error for which it has recovery code, it attempts such action before it returns a value to the OS (that is, the mechanisms are linked in RTAS and cannot be separately accessed).

Note that Disposition is nearly independent from Severity. Severity says how serious an error was, and Disposition says, regardless of severity, whether or not the OS has to even look at it. In general, an OS will first examine Disposition, then Severity.

A return value of “FULLY RECOVERED” means that RTAS was able to completely recover the machine state after the error, and OS operation can continue unhindered. The severity of the problem in this case is irrelevant, though for consistency a “FATAL” error can never be “FULLY RECOVERED.”

A return value of “LIMITED RECOVERY” means that RTAS was able to recover the state of the machine, but that some feature of the machine has been disabled or lost (for example, error checking), or performance may suffer (for example, a failing cache has been disabled). The RTAS implementation may return further information in the extended error log format regarding what action was done or what corrective action failed. In general, a conservative OS will treat this return the same as “NOT RECOVERED,” and initiate shutdown. A less conservative OS may choose to let the user decide whether to continue or to shut down.

A value of “NOT RECOVERED” indicates that the RTAS either did not attempt recovery, or that it attempted recovery but was unsuccessful.
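The order of examination described above (Disposition first, then Severity) can be sketched as a small decision function. This is only one possible policy, not a required one: the function name `os_action`, the returned action strings, and the choice to shut down on ERROR and ERROR_SYNC are assumptions made for illustration.

```python
def os_action(severity: str, disposition: str, conservative: bool = True) -> str:
    """Hypothetical OS policy: returns 'ignore', 'handle_event', 'continue',
    'ask_user', or 'shutdown' for a decoded Severity/Disposition pair."""
    if severity == "NO_ERROR":
        return "ignore"          # remaining fields are not valid
    if severity == "EVENT":
        return "handle_event"    # events are dispatched on the Type field
    # Disposition is examined before Severity.
    if disposition == "FULLY_RECOVERED":
        return "continue"
    if disposition == "LIMITED_RECOVERY":
        # A conservative OS treats this like NOT_RECOVERED and shuts down;
        # a less conservative one may let the user decide.
        return "shutdown" if conservative else "ask_user"
    # NOT_RECOVERED: fall back to Severity.
    if severity == "WARNING":
        return "continue"
    # FATAL, ERROR, ERROR_SYNC: this simple policy halts (assumption; an OS
    # may instead try to recover from ERROR or ERROR_SYNC).
    return "shutdown"
```

An OS with an error trap mechanism could replace the final branch with an attempt to recover from ERROR_SYNC before shutting down.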

Optional Part Presence

This is a single flag, valid only if the 32-bit Error Return value is located in memory, which indicates whether or not an Extended Error Log Length field and the Extended Error Log follow it in memory. It will be set on an in-memory return result from RTAS if and only if the RTAS call indicated sufficient space to return the Extended Error Log, and the RTAS implementation supports the Extended Error Log.

Initiator

This field indicates, to the best ability of RTAS to determine it, the initiator of a failed transaction. (Note that in the “Initiator” field of , the value “I/O” indicates one of the defined I/O buses or IOAs. This field contains finer-grained details of which type of I/O bus failed, if known, and “UNKNOWN” if RTAS cannot tell.)

In many of the newer LoPAR platforms, the platform error notification and handling flow is asynchronous to the OS and software execution flow; therefore, the context of Initiator is not applicable to the platform firmware. In those cases, the value of “(0) Unknown or Not Applicable” is used for Initiator. In logs created with Version 6 or later, more detailed information about the error is provided in the Platform Event log format.

Target

If RTAS can determine it, this field indicates the target of a failed transaction.

In many of the newer LoPAR platforms, the platform error notification and handling flow is asynchronous to the OS and software execution flow; therefore, the context of Target is not applicable to the platform firmware. In those cases, the value of “(0) Unknown or Not Applicable” is used for Target. In logs created with Version 6 or later, more detailed information about the error is provided in the Platform Event log format.

Type

This field identifies the general type of the error or event. In some cases (for example, INTERN_DEV_FAIL), multiple possible events are grouped together under a common return value. In such cases, platform-aware software may use the Extended Error Log to distinguish them. Non-platform-aware software will generally treat all errors of a given type the same, so it generally will not need to access the Extended Error Log information.

In the table, the EPOW values are associated with a Severity of EVENT. All other values will be associated with Severity values of FATAL, ERROR, ERROR_SYNC, or WARNING, and may or may not be corrected by RTAS.

EPOW is an event type which indicates the potential loss of power or environmental conditions outside the limits of safe operation of the platform. See for more information.

The “Platform Error (224)”, introduced for Version 6, generalizes that the error is identified by the platform and the specific details are encoded in the Platform Event log format itself.

The “ibm,io-events (225)” defines a set of event notifications which requires special handling by the OS. For this type of event notification, the Platform Event Log contains the “IO Events” section which identifies additional details associated with the event.

The “Platform information event (226)” indicates the return log should be logged as “Information Log”. These logs indicate key platform events and can be used for reference purposes.

The “Resource deallocation event (227)” indicates an event notification to the OS that a specific hardware resource has experienced recurring recoverable errors with a trend toward unrecoverable. The OS should take action to deallocate the resource from usage to prevent unrecoverable errors. For these types of event notification, the Platform Event Log contains the “Logical Resource Identification” section which identifies the “Logical Entity” by Resource Type and Resource ID, associated with the event.

The “Dump notification event (228)” indicates that a dump file is present in the platform and is available for retrieval by the OS. For this type of event notification, the Platform Event Log contains the “Dump Locator” section which contains additional event specific information.

Additional Type values will be added in future revisions of the specification. If an OS does not recognize a particular event type, it can examine the severity first, and then choose to ignore the event if it is not serious.

Extended Event Log Length / Change Scope

This optional 32-bit field is present in memory following the 32-bit Event Return value if the Optional Part Presence flag is “PRESENT”, and it indicates the length in bytes of the Extended Event Log information which immediately follows it in memory. The length does not include this field or the Event Return field, so it may be zero. The field is also present for a resource change “Hot Plug” event, such as a PRRN event, and then represents the scope of a resource change.
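As an illustration of the in-memory layout just described, the following sketch splits a returned buffer into the 32-bit Event Return value and the Extended Event Log bytes. It assumes the buffer is big-endian and that bit 0 is the most significant bit of the 32-bit value; the function name `split_event_buffer` is hypothetical. The Change Scope interpretation of the second word is not handled here.

```python
import struct

PRESENCE_SHIFT = 31 - 13  # Optional Part Presence is bit 13, counted from the MSB

def split_event_buffer(buf: bytes):
    """Return (event_return_word, extended_log_bytes) for an in-memory result."""
    (word,) = struct.unpack_from(">I", buf, 0)
    if not (word >> PRESENCE_SHIFT) & 1:
        return word, b""                      # no optional part follows
    (length,) = struct.unpack_from(">I", buf, 4)
    # The length excludes the Event Return word and the length field itself.
    return word, buf[8:8 + length]
```

Because the length may legitimately be zero, callers should distinguish "flag not set" from "flag set, empty log" by testing the flag rather than the length of the returned bytes.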

RTAS Event Return Format Fixed Part

The summary portion of the error return is designed to fit into a single 32-bit integer. When used as a data return format in memory, an optional Length field and Extended Error Log data may follow the summary. The fixed part contains a “presence” flag which identifies whether an extended report is present. In the table below, the location of each field within the integer is included in parentheses after its name. Numerical field values are indicated in decimal unless noted otherwise.

RTAS Event Return Format (Fixed Part)

Bit Field Name (bit number(s))    Description, Values (Described in )

Version (0:7)
    A distinct value used to identify the architectural version of message.

Severity (8:10)
    Severity level of error/event being reported:
    ALREADY_REPORTED (6)
    FATAL (5)
    ERROR (4)
    ERROR_SYNC (3)
    WARNING (2)
    EVENT (1)
    NO_ERROR (0)
    reserved for future use (7)

RTAS Disposition (11:12)
    Degree of recovery which RTAS has performed prior to return after an error (value is FULLY_RECOVERED if no error is being reported):
    FULLY_RECOVERED (0) Note: Cannot be used when Severity is “FATAL”.
    LIMITED_RECOVERY (1)
    NOT_RECOVERED (2)
    reserved for future use (3)

Optional_Part_Presence (13)
    Indicates if an Extended Error Log Length and Extended Error Log follows this 32-bit quantity in memory:
    PRESENT (1): The optional Extended Error Log is present.
    NOT_PRESENT (0): The optional Extended Error Log is not present.

Reserved (14:15)
    Reserved for future use (0:3)

Initiator (16:19)
    Abstract entity that initiated the event or the failed operation:
    UNKNOWN (0): Unknown or Not Applicable
    CPU (1): A CPU failure (in an MP system, the specific CPU is not differentiated here)
    PCI (2): PCI host bridge or PCI IOA
    Reserved -- do not reuse (3)
    MEMORY (4): Memory subsystem, including any caches
    Reserved -- do not reuse (5)
    HOT PLUG (6)
    Reserved for future use (7-15)

Target (20:23)
    Abstract entity that was apparent target of failed operation (UNKNOWN if Not Applicable):
    Same values as Initiator field

Type (24:31)
    General event or error type being reported:
    Internal Errors:
        RETRY (1): too many tries failed, and a retry count expired
        TCE_ERR (2): range or access type error in an access through a TCE
        INTERN_DEV_FAIL (3): some RTAS-abstracted device has failed (for example, TOD clock)
        TIMEOUT (4): intended target did not respond before a time-out occurred
        DATA_PARITY (5): Parity error on data
        ADDR_PARITY (6): Parity error on address
        CACHE_PARITY (7): Parity error on external cache
        ADDR_INVALID (8): access to reserved or undefined address, or access of an unacceptable type for an address
        ECC_UNCORR (9): uncorrectable ECC error
        ECC_CORR (10): corrected ECC error
        RESERVED (11-63): Reserved for future use
    Environmental and Power Warnings:
        EPOW (64): See Extended Error Log for sensor value
        RESERVED (65-95): Reserved for future use
    Reserved -- do not reuse (96-159)
    Platform Resource Reassignment (160) -- includes Change Scope in bits 32-63
    Reserved for future use (161-223)
    Platform Error (224) (for Version 6 or later)
    ibm,io-events (225) (for Version 6 or later)
    Platform information event (226) (for Version 6 or later)
    Resource deallocation event (227) (for Version 6 or later)
    Dump notification event (228) (for Version 6 or later)
    Hot-plug-events (229) (for Version 6 or later)
    Vendor-specific events (230-255): Non-architected
    Other (0): none of the above

Extended Event Log Length / Change Scope (32:63)
    Length in bytes of Extended Event Log information which follows (see ) OR the scope parameter to be input to the ibm,update-nodes RTAS call to retrieve the nodes that were changed by selected “Hot Plug” events.
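The bit positions in the table can be turned into a small decoder. The sketch below assumes that bit 0 is the most significant bit of the 32-bit value (consistent with the Version field occupying the first byte); the names `bits`, `FixedPart` and `decode_fixed_part` are illustrative and not defined by this specification.

```python
from typing import NamedTuple

SEVERITY = {0: "NO_ERROR", 1: "EVENT", 2: "WARNING", 3: "ERROR_SYNC",
            4: "ERROR", 5: "FATAL", 6: "ALREADY_REPORTED"}
DISPOSITION = {0: "FULLY_RECOVERED", 1: "LIMITED_RECOVERY", 2: "NOT_RECOVERED"}

class FixedPart(NamedTuple):
    version: int
    severity: str
    disposition: str
    extended_present: bool
    initiator: int
    target: int
    type: int

def bits(word: int, first: int, last: int) -> int:
    """Extract bits first..last of a 32-bit word, where bit 0 is the MSB."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)

def decode_fixed_part(word: int) -> FixedPart:
    return FixedPart(
        version=bits(word, 0, 7),
        severity=SEVERITY.get(bits(word, 8, 10), "reserved"),
        disposition=DISPOSITION.get(bits(word, 11, 12), "reserved"),
        extended_present=bool(bits(word, 13, 13)),
        initiator=bits(word, 16, 19),
        target=bits(word, 20, 23),
        type=bits(word, 24, 31),
    )
```

For example, a Version 6 report of an uncorrectable ECC error (Type 9) initiated by a CPU (1) against memory (4) would decode with `initiator == 1`, `target == 4` and `type == 9`.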

Typically, most OSs care about, and have handlers for, only a few specific errors. Since coding of an error is unique in the above scheme, an OS can check for specific errors, then if nothing matches exactly, look at more generic parts of the error message. This permits generic error message generation for the user, providing the basic information that RTAS delivered to the OS. Platforms may provide more complete error diagnosis and reporting in RTAS, combined with off-line diagnostics which take advantage of the information reported from previous failures.

Version 6 Extensions of Event Log Format

RTAS General Extended Event Log Format, Version 6

The following section defines new extensions to the event log format which are identified by a Version number 0x06 in the first byte in the returned buffer (byte 0 of the fixed-part information). The following tables define extended error log formats for Version 6, by which the RTAS can optionally return detailed information to the software about a hardware error condition. Other versions will be defined in following sections of this chapter. This format is also intended to be usable as residual error log data in NVRAM, so that the OS could alternatively retrieve error data after an error event which caused a reboot.

Platforms indicate the maximum length of the error log buffer in the “rtas-error-log-max” RTAS property in the OF device tree, so that the OS can allocate a buffer large enough to hold the extended error log data when calling the RTAS event-scan or check-exception functions. If the allocated buffer is not large enough to hold all the error log data, the data is truncated to the size of the buffer.

Requirement and require that four bytes of the vendor-specific format contain a unique identifier for the company that has defined the format. The description of the “name” string in provides alternatives for defining this identifier. Examples of these unique identifiers include stock ticker labels and Organizationally Unique Identifiers (OUIs). Since the different options in IEEE 1275 provide different guarantees of uniqueness and different identifier lengths, the company should use its best judgement in selecting a unique identifier that fits the four character field. The length of this field is limited to 4 bytes to conserve available log data space. As an example, if Allied Information Monitoring (a fictional name for the purposes of this example) were to create a vendor-specific log format 12, then bytes 12-15 of such a log may contain “AIM<NULL>”.

This identifier is intended to apply to the company that defines the specific format, and may be used by other companies that wish to be compatible with that format. For example, if another company wanted to take advantage of existing support in one of the OSs by using an AIM-specific error log format for logs generated on their own platform, their log would have to contain an identifier of “AIM<NULL>”.

R1--1. Platforms which support Version 6 of the Extended Event Log Format must do so by including a 0x06 value in the first byte of the RTAS Event Return Format (Fixed Part) and using the formats described in (and all subsections under that section).

Software Implementation Note: OSs running on platforms which support Version 6 of the Extended Event Log Format must ensure that the length parameter passed in the event-scan RTAS call be at least 2 KB.

R1--2. If the length parameter on the RTAS event-scan call for returning data using Version 6 of the Extended Event Log Format is insufficient to return all the data the platform would otherwise make available, the platform must truncate the data by eliminating optional sections entirely rather than truncating a section.

R1--3. All event logs returning a Version 6 Platform Event Log format must include the Main-A and Main-B Sections. Other sections are optional depending on the specific event type as specified in Requirement .

R1--4. The following sections must be provided as indicated:
    For the Platform error Type, the Primary Service Reference Code (SRC) section must be provided.
    For the ibm,io-events Type, the IO Events section must be provided.
    For the Resource deallocation event Type, the Logical Resource Identification section must be provided.
    For the Dump notification event Type, the Dump Locator section must be provided.
    For the EPOW Type, the EPOW section must be provided.
    For the HOTPLUG Type, the Hotplug section must be provided.
Software Implementation Note: All fields in the Platform Event Log marked “Platform specific information” or “Other platform specific information sections” contain information reserved for platform or platform Service Application use only. That information is not defined in this document. Information in these fields should be ignored by the OS.

Software Implementation and Architecture Note: All fields currently marked “Reserved” are set to zero by RTAS and are ignored by the OS. The reserved values in the defined fields in the Platform Event Log may be defined in the future in this architecture document for platform specific usage without change to this architecture.

RTAS General Extended Event Log Format, Version 6

Byte       Bit    Description
0          0      1 = Log Valid
           1      1 = Unrecoverable Error
           2      1 = Recoverable (correctable or successfully retried) Error
           3      1 = Unrecoverable Error, Bypassed - Degraded operation (e.g. CPU/memory taken off-line, bad cache bypassed, etc.)
           4      1 = Predictive Error - Error is recoverable, but indicates a trend toward unrecoverable failure (e.g. correctable ECC error threshold, etc.)
           5      1 = “New” Log (always 1 for data returned from RTAS)
           6      1 = Big-Endian
           7      Reserved
1          0:7    Reserved
2          0      Set to 1 - (Indicating log is in PowerPC format)
           1:3    Reserved
           4:7    Log format indicator, defined format used for byte 12-2047:
                      0-13, 15: Reserved
                      14: Platform Event Log
3          0:7    Reserved
4-11              Reserved
12-15             Company identifier of the company that has defined the format for this vendor specific log type.
16-2047           Detail vendor specific log data. If byte 2, bits 4:7, above, are a value of 14 (Platform Event Log) and bytes 12-15 are “IBM<NULL>”, then see for the content of this field.
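The header bytes in the table above can be decoded as follows. This is a minimal sketch that assumes bit 0 of each byte is its most significant bit, matching the numbering used for the fixed part; the function name `decode_ext_header` and the dictionary keys are illustrative only.

```python
def decode_ext_header(log: bytes) -> dict:
    """Decode bytes 0, 2 and 12-15 of a Version 6 extended event log."""
    b0, b2 = log[0], log[2]
    return {
        "valid":          bool(b0 & 0x80),  # byte 0, bit 0
        "unrecoverable":  bool(b0 & 0x40),  # byte 0, bit 1
        "recoverable":    bool(b0 & 0x20),  # byte 0, bit 2
        "bypassed":       bool(b0 & 0x10),  # byte 0, bit 3
        "predictive":     bool(b0 & 0x08),  # byte 0, bit 4
        "new":            bool(b0 & 0x04),  # byte 0, bit 5
        "big_endian":     bool(b0 & 0x02),  # byte 0, bit 6
        "powerpc_format": bool(b2 & 0x80),  # byte 2, bit 0
        "log_format":     b2 & 0x0F,        # byte 2, bits 4:7
        "company_id":     bytes(log[12:16]),
    }

def is_platform_event_log(log: bytes) -> bool:
    """True when the log format is 14 and the company identifier is IBM<NULL>."""
    hdr = decode_ext_header(log)
    return hdr["log_format"] == 14 and hdr["company_id"] == b"IBM\x00"
```

A caller would normally check `valid` before trusting any other field, and check `is_platform_event_log` before walking the sections that start at byte 16.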

Platform Event Log Format, Version 6

This format is used when byte 2, bits 4:7, of the RTAS General Extended Event Log Version 6 are a value of 14 (Platform Event Log).

Overview of Platform Event Log Format, Version 6

Byte        Length in Bytes    Description
12-15       4                  Contains ASCII characters “IBM<NULL>”.
16-63       48                 Main-A section (ID = 'PH'). Required section. See for the format.
64-87       24                 Main-B section (ID = 'UH'). Required, always follows Main-A section. See for the format.
88-103      16                 Logical Resource Identification section (ID = 'LR'). Optional, present only for Resource deallocation event notification. If present, this section always follows Main-B section. See for the format.
104-        80 + optional FRU call out sub-section
                               Primary SRC section (ID = 'PS'). Required for “Platform Error” event type, optional for other event types. If present, this section always follows Main-B section. See for the format.
            64                 Dump Locator section (ID = 'DH'). Optional, present only for dump event notification. If present, this section follows Main-B or Primary SRC section. See for the format.
            20                 EPOW section (ID = 'EP'). Optional, present only for “EPOW” interrupt event notification. If present, this section follows Main-B section. See for the format.
            Variable           IO Events section (ID = 'IE'). Optional, present only for “ibm,io-events” interrupt event notification. If present, this section follows Main-B section. See for the format.
            28                 Failing Enclosure MTMS section (ID = 'MT'). Required for errors only. If present, this section follows Main-B section or Primary SRC. See for the format.
            28                 Impacted partition description section (ID = 'LP'). Required for errors only. If present, this section follows Main-B section or Primary SRC.
            40                 Machine Check Interrupt section (ID = 'MC'). Optional for “Platform Error” event types with ERROR_SYNC severity caused by a machine check interrupt. If present, this section follows the Main-B. See .
            ???                Hotplug section (ID = 'HP'). Optional, present only for Hotplug event notification. If present, this section follows Main-B section. See .
... - 2047  Variable           Other platform specific information sections. Optional.
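Since sections after Main-B are optional and variable in position, an OS typically has to walk the log rather than index fixed offsets. The sketch below assumes that every section begins with the same 2-byte ASCII ID and 2-byte big-endian length that the Main-A and Main-B sections use; that is stated here only for those two sections, so treat it as an assumption for the others. The name `iter_sections` is hypothetical.

```python
import struct

def iter_sections(log: bytes, start: int = 16):
    """Yield (section_id, section_bytes) for each section of a Platform Event Log.

    `log` is the extended event log; sections begin at byte 16, after the
    4-byte company identifier at bytes 12-15.
    """
    off = start
    while off + 4 <= len(log):
        sec_id, length = struct.unpack_from(">2sH", log, off)
        # Stop on padding or on a section that would run past the buffer.
        if length < 4 or off + length > len(log):
            break
        yield sec_id.decode("ascii", "replace"), log[off:off + length]
        off += length

def find_section(log: bytes, wanted: str):
    """Return the bytes of the first section with the given ID, or None."""
    for sec_id, data in iter_sections(log):
        if sec_id == wanted:
            return data
    return None
```

Stopping at the first malformed length keeps the walk safe on a truncated buffer, where whole optional sections may have been dropped.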

Platform Event Log Format, Main-A Section

Platform Event Log Format, Version 6, Main-A Section

Offset    Length in Bytes    Description
0x00      2                  Section ID: A two-ASCII character field which uniquely identifies the type of section. value = 'PH'
0x02      2                  Section length: Length in bytes of the section, including the section ID. value = 48
0x04      1                  Section Version
0x05      1                  Section sub-type
0x06      2                  Creator Component ID
0x08      4                  Log creation date in BCD format: YYYYMMDD, where YYYY = year, MM = month 01 - 12, DD = day 01 - 31.
0x0C      4                  Log creation time in BCD format: HHMMSS00, where HH = hour 00 - 23, MM = minutes 00 - 59, SS = seconds 00 - 59, 00 = hundredths of a second 00 - 99.
0x10      8                  Platform specific information
0x18      1                  Creator ID -- subsystem creating the log entry represented as a single ASCII character:
                                 'E' = Service Processor
                                 'H' = Hypervisor
                                 'W' = Power Control
                                 'L' = Partition Firmware
0x19      2                  Reserved
0x1B      1                  Section count -- number of sections comprising log entry, including this section
0x1C      4                  Reserved
0x20      8                  Platform specific information
0x28      4                  Platform Log ID (PLID): Unique identifier for a single event. Note that it is possible for multiple log entries to be made for a single error/event. The entries are linked to the same event by using the same PLID.
0x2C      4                  Platform specific information
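The BCD date and time fields are easy to misread as plain integers, so a short decoding sketch may help. It assumes multi-byte fields are big-endian and that the PLID is an unsigned 32-bit value; the function names `bcd` and `parse_main_a` are illustrative.

```python
from datetime import datetime

def bcd(byte: int) -> int:
    """Convert one packed-BCD byte (two decimal digits) to an integer."""
    return (byte >> 4) * 10 + (byte & 0x0F)

def parse_main_a(sec: bytes) -> dict:
    """Decode selected fields of a 48-byte Main-A ('PH') section."""
    if sec[0:2] != b"PH":
        raise ValueError("not a Main-A section")
    year = bcd(sec[0x08]) * 100 + bcd(sec[0x09])
    created = datetime(
        year, bcd(sec[0x0A]), bcd(sec[0x0B]),          # YYYY MM DD
        bcd(sec[0x0C]), bcd(sec[0x0D]), bcd(sec[0x0E]),  # HH MM SS
        bcd(sec[0x0F]) * 10000,                          # hundredths -> microseconds
    )
    return {
        "created": created,
        "creator_id": chr(sec[0x18]),        # 'E', 'H', 'W' or 'L'
        "section_count": sec[0x1B],
        "plid": int.from_bytes(sec[0x28:0x2C], "big"),
    }
```

Entries that share a `plid` describe the same event and can be grouped by that value when several log entries are returned for one error.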

Platform Event Log Format, Main-B Section

Platform Event Log Format, Version 6, Main-B Section

Byte      Length in Bytes    Description
0x00      2                  Section ID: A two-ASCII character field which uniquely identifies the type of section. value = 'UH'
0x02      2                  Section length: Length in bytes of the section, including the section ID. value = 24
0x04      1                  Section Version
0x05      1                  Section subtype
0x06      2                  Creator Component ID
0x08      1                  Subsystem ID: For error events, this is the failing subsystem. For non-error events, this is the subsystem associated with the event.
                                 0x10 - 0x1F = Processor subsystem including internal cache
                                 0x20 - 0x2F = Memory subsystem including external cache
                                 0x30 - 0x3F = I/O subsystem (hub, bridge, bus)
                                 0x40 - 0x4F = I/O adapter, device and peripheral
                                 0x50 - 0x5F = CEC hardware
                                 0x60 - 0x6F = Power/Cooling subsystem
                                 0x70 - 0x79 = Others subsystem
                                 0x7A - 0x7F = Surveillance Error
                                 0x80 - 0x8F = Platform Firmware
                                 0x90 - 0x9F = Software
                                 0xA0 - 0xAF = External environment
                                 0xB0 - 0xFF = Reserved
0x09      1                  Platform specific information
0x0A      1                  Event/Error Severity (see additional description following the table)
                                 0x00 = Informational or non-error Event. This field must be 0x00 for non-error event. Use Event Sub-type field to specify unique event.
                                 0x1X = Recovered Error
                                     0x10 = Recovered Error, general
                                     0x14 = Recovered Error, spare capacity utilized
                                     0x15 = Recovered Error, loss of entitled capacity
                                 0x2X = Predictive Error
                                     0x20 = Predictive Error, general
                                     0x21 = Predictive Error, degraded performance
                                     0x22 = Predictive Error, fault may be corrected after platform re-boot
                                     0x23 = Predictive Error, fault may be corrected after boot, degraded performance
                                     0x24 = Predictive Error, loss of redundancy
                                 0x4X = Unrecoverable Error
                                     0x40 = Unrecoverable Error, general
                                     0x41 = Unrecoverable Error, bypassed with degraded performance
                                     0x44 = Unrecoverable Error, bypassed with loss of redundancy
                                     0x45 = Unrecoverable Error, bypassed with loss of redundancy and performance
                                     0x48 = Unrecoverable Error, bypassed with loss of function
                                 0x6X = Error on diagnostic test
                                     0x60 = Error on diagnostic test, general
                                     0x61 = Error on diagnostic test, resource may produce incorrect results
                                 All other values = reserved
0x0B      1                  Event Sub-Type (primarily used when Event Severity = 0x00, see additional description following the table)
                                 0x00 = not applicable.
                                 0x01 = Miscellaneous, Information Only
                                 0x08 = Dump Notification (Dump may also be reported on Error event)
                                 0x10 = Previously reported error has been corrected by system
                                 0x20 = System resources manually deconfigured by user
                                 0x21 = System resources deconfigured by system due to prior error event
                                 0x22 = Resource deallocation event notification
                                 0x30 = Customer environmental problem has returned to normal (e.g. input power restored, ambient temperature back within limits)
                                 0x40 = Concurrent Maintenance Event
                                 0x60 = Capacity Upgrade Event
                                 0x70 = Resource Sparing Event
                                 0x80 = Dynamic Reconfiguration Event (generated by RTAS)
                                 0xD0 = Normal system/platform shutdown or powered off
                                 0xE0 = Platform powered off by user without normal shutdown (abnormal power off)
                                 All other values = reserved
0x0C      4                  Platform specific information
0x10      2                  Reserved
0x12      2                  Error Action Flags (see additional description following the table)
                                 bit 0 (0x8000) = 1, Service Action (customer notification) Required
                                 bit 1 (0x4000) = 1, Hidden Error - exclusive with SA Required (bit 0)
                                 bit 2 (0x2000) = 1, Report Externally (send to HMC and hypervisor)
                                 bit 3 (0x1000) = 1, Don't report to hypervisor (only report to HMC) (only meaningful when (bit 2) Report Externally is set)
                                 bit 4 (0x0800) = 1, Call Home Required (only valid if (bit 0) SA Required is set)
                                 bit 5 (0x0400) = 1, Error Isolation Incomplete. Further analysis required.
                                 bit 6 (0x0200) = 1, Deprecated.
                                 bit 7 (0x0100) = 1, Reserved
                                 bit 8, 9 = Platform specific information
                                 bit 10-15 = Reserved
0x14      4                  Reserved
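The Subsystem ID ranges and the Error Action Flags masks in the table lend themselves to a table-driven decoder. The following is a sketch under the assumption that the 2-byte flags field is big-endian, as the masks in the table suggest; `parse_main_b`, `subsystem_name` and the flag names are hypothetical.

```python
SUBSYSTEMS = [
    (0x10, 0x1F, "Processor subsystem including internal cache"),
    (0x20, 0x2F, "Memory subsystem including external cache"),
    (0x30, 0x3F, "I/O subsystem (hub, bridge, bus)"),
    (0x40, 0x4F, "I/O adapter, device and peripheral"),
    (0x50, 0x5F, "CEC hardware"),
    (0x60, 0x6F, "Power/Cooling subsystem"),
    (0x70, 0x79, "Others subsystem"),
    (0x7A, 0x7F, "Surveillance Error"),
    (0x80, 0x8F, "Platform Firmware"),
    (0x90, 0x9F, "Software"),
    (0xA0, 0xAF, "External environment"),
]

ACTION_FLAGS = {
    0x8000: "service_action_required",
    0x4000: "hidden_error",
    0x2000: "report_externally",
    0x1000: "dont_report_to_hypervisor",
    0x0800: "call_home_required",
    0x0400: "error_isolation_incomplete",
}

def subsystem_name(sid: int) -> str:
    for lo, hi, name in SUBSYSTEMS:
        if lo <= sid <= hi:
            return name
    return "Reserved or undefined"

def parse_main_b(sec: bytes) -> dict:
    """Decode selected fields of a 24-byte Main-B ('UH') section."""
    if sec[0:2] != b"UH":
        raise ValueError("not a Main-B section")
    flags = int.from_bytes(sec[0x12:0x14], "big")
    return {
        "subsystem": subsystem_name(sec[0x08]),
        "severity": sec[0x0A],
        "sub_type": sec[0x0B],
        "action_flags": [n for mask, n in ACTION_FLAGS.items() if flags & mask],
    }
```

A consumer that honors the constraints in the table would ignore `call_home_required` unless `service_action_required` is also set; the sketch reports the raw bits and leaves that check to the caller.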

Error/Event Severity

This field indicates the severity of the error event and the impact of the error to the platform (if applicable).

Non-error or Informational Event: This value indicates an event that is a non-error event. Informational or user action event log entries must use this value. The Event Sub-Type field provides additional event information.

Recovered Error, general: This value indicates an error event that has been automatically recovered or corrected by the platform hardware and/or firmware, e.g. ECC, internal spare or redundancy, cache line delete, boot time array repair, etc. No service action is required for this type of error. In general, when this value is used, the Error Action Flags has the value of “Hidden Error”. An event log with this value is used primarily for error thresholding design and code debug or as a record to indicate error frequency or trend.

Recovered Error, spare capacity utilized: This value indicates that an error on a resource has been recovered by utilizing another resource not currently assigned for use (spare). The failing component is to be considered permanently in an error state. For example, a faulty instruction on one processor may be checkpointed and loaded into a spare processor, continuing the operations of the faulty one.

Recovered Error, loss of entitled capacity: This value indicates that an error on a resource has been recovered by utilizing another resource already in use by the system. The failing component is to be considered permanently in an error state. This results in a loss of capacity in the partition that receives the error. For example, a processor already in use may take over the operations of a faulty one. Loss of the faulty processor in the system then results in less capacity being available to the partition receiving the error event. Typically this event would have an event sub-type of “Resource deallocation event notification” and the revised amount of entitled capacity would be found in the Logical Resource Identification Section, Entitled Capacity field.

Predictive Error, general: This value indicates an event that has been automatically recovered or corrected by the platform hardware and/or firmware. However, the frequency of the errors indicates a trend toward (or potential) platform unrecoverable error. A deferred service or repair action is required. The automatic platform recovery actions have no impact to system performance (e.g. ECC, CRC, etc.), or the impact is unknown.

Predictive Error, degraded performance: This value indicates an error event that has been automatically recovered or corrected by the platform hardware and/or firmware. However, the frequency of the errors (i.e. over threshold) indicates a trend toward (or potential) platform unrecoverable error. A deferred service or repair action is required. The automatic platform recovery actions are impacting/degrading system performance.

Predictive Error, fault may be corrected after platform re-boot: This value indicates an error event that has been automatically recovered or corrected by the platform hardware and/or firmware. However, the frequency of the errors (i.e. over threshold) indicates a trend toward (or potential) platform unrecoverable error. A deferred service or repair action is required. The hardware fault may be corrected after platform re-boot as part of the repair action. If the fault cannot be corrected after re-boot, then a part replacement is required. The automatic platform recovery actions have no impact to system performance (e.g. ECC, CRC, etc.), or the impact is unknown.

Predictive Error, fault may be corrected after platform re-boot, degraded performance: This value indicates an error event that has been automatically recovered or corrected by the platform hardware and/or firmware. However, the frequency of the errors (i.e. over threshold) indicates a trend toward (or potential) platform unrecoverable error. A deferred service or repair action is required. The hardware fault may be corrected after platform re-boot as part of the repair action. If the fault cannot be corrected after re-boot, then a part replacement is required. The automatic platform recovery actions are impacting/degrading the system performance.

Predictive Error, loss of redundancy: This value indicates an error event that has been automatically recovered or corrected by the platform hardware and/or firmware. However, the frequency of the errors (i.e. over threshold) caused a loss in hardware redundancy. A future error in this subsystem may cause a platform unrecoverable error. A deferred service or repair action is required to restore redundancy. The loss of redundancy may or may not impact system performance.

Unrecoverable Error, general: This value indicates an error event that is unrecoverable or uncorrectable by the platform hardware and/or firmware. The hardware or platform resource with the error cannot be deconfigured from the system. If the error is intermittent or soft, the platform may be able to re-boot successfully and resume. A service or repair action is required as soon as possible to correct the error.

Unrecoverable Error, bypassed with degraded performance: This value indicates an error event that is unrecoverable or uncorrectable by the platform hardware and/or firmware. However, the hardware or platform resource with the error has been deconfigured from the system. The platform can be IPLed or re-IPLed with the error bypassed. System performance is degraded due to the deconfigured platform resource(s) e.g. processor, cache, memory, etc. A deferred service or repair action is required.

Unrecoverable Error, bypassed with loss of redundancy: This value indicates an error event that is unrecoverable or uncorrectable by the platform hardware and/or firmware. However, the hardware or platform resource with the error can be deconfigured from the system. The platform can be IPLed or re-IPLed with the error bypassed. The deconfigured platform resource(s) resulted in loss of redundancy (e.g. Redundant FSP with static fail-over) with no loss of system performance. A deferred service or repair action is required.

Unrecoverable Error, bypassed with loss of redundancy + performance: This value indicates an error event that is unrecoverable or uncorrectable by the platform hardware and/or firmware. However, the hardware or platform resource with the error can be deconfigured from the system. The platform can be IPLed or re-IPLed with the error bypassed. The deconfigured platform resource(s) resulted in loss of redundancy and system performance. A deferred service or repair action is required.

Unrecoverable Error, bypassed with loss of function: This value indicates an error event that is unrecoverable or uncorrectable by the platform hardware and/or firmware. However, the hardware or platform resource with the error can be deconfigured from the system. The platform can be IPLed or re-IPLed with the error bypassed. The deconfigured platform resource(s) resulted in loss of platform or system function. A deferred service or repair action is required.

Error on diagnostic test, general: This value indicates an error event that is detected during a diagnostic test. Impact to the system is undefined or unknown.

Error on diagnostic test, resource may produce incorrect results: This value indicates an error event that is detected during a diagnostic test. The error may produce incorrect computational results (e.g. processor floating point unit test error).

Event Sub-Type
This field provides additional information on the non-error event type.
Not applicable: This value is used when the event is associated with an error. The Error/Event Severity field and SRC section provide additional error information.
Miscellaneous, Information Only: This value is used when the event is “for information only” or the event description doesn't fit into any other defined values in this field.
Dump Notification: This value is used by the hypervisor or partition firmware as a “Dump Notification” event to the OS that a dump file is present in the platform for retrieval by the OS. This value is also used by the HMC as a “Dump Notification” event to the Service Application to indicate a dump file is present for transmission to the manufacturer.
Previously reported error has been corrected by system: This value is used by the platform firmware to indicate that the error event that was previously reported has been corrected by the platform. On a subsequent platform boot, this event type is logged to indicate that the array was successfully repaired.
System resources manually deconfigured by user: This value is used by the platform firmware to indicate that a subset of platform resources was deconfigured at the user's request (e.g. via the platform ASM menu). The deconfigured resources are not associated with an error detected by the platform. The event is a reminder to the user that the platform is running with partial capacity. Note: The platform provides this user option for platform performance testing purposes.
System resources deconfigured by system due to prior error event: This value is used by the platform firmware to indicate that the platform is IPLed with resource(s) deconfigured due to an error detected and reported previously. The event is a reminder to the user that the platform requires service.
Resource deconfiguration notification: This value is used by partition firmware as an “Event Notification” to the OS that a specified resource (e.g. processor, memory page, etc.) currently used by the OS should be deallocated due to a predictive error. A Logical Resource Identification section is included in the event log to indicate the Resource Type and ID.
Customer environmental problem has returned to normal: This value is used by the platform firmware to indicate that a customer environmental problem (e.g. utility power, room ambient temperature, etc.), detected and reported previously, has returned to normal.
Concurrent Maintenance: This value is used by the platform firmware to indicate any non-error event associated with concurrent maintenance activity.
Capacity Upgrade Event: This value is used by the platform firmware to indicate any non-error event associated with capacity upgrade activity.
Resource Sparing Event: This value is used by the platform firmware to indicate any non-error event associated with platform resource sparing activity.
Dynamic Reconfiguration Event: This value is used by the partition firmware to indicate any significant but non-error event associated with dynamic reconfiguration activity. Implementation Note: Due to limited platform storage resources, a non-error event log associated with a logical partition will be reported to the OS but may not be stored in the platform.
Normal system/platform shutdown or powered off: This value is used by the platform firmware to indicate any non-error event associated with normal system/platform shutdown or power-off activity initiated by the user.
Platform powered off by user without normal shutdown (abnormal powered off): This value is used by the platform firmware to indicate that the platform was abnormally powered off by the user.

Error Action Flags
The following are the definitions of the actions taken for the various Error Action Flags.
Report Externally - This flag instructs the service processor (error logger component) to send the error to the service application (e.g. service focal point(s) or FNM error analyzer). If this flag is set, the SP always sends the error to the “managing HMC(s)”, if one or more exist, and to the hypervisor (unless the “Don't report to hypervisor” flag is also set).
Service Action Required - This flag instructs the Service application that some service action is required by either the customer or by the manufacturer’s service personnel. This is equivalent to saying Customer Notification is required. Contrast this flag with the “Call Home Required” flag.
Call Home Required - This flag indicates that the error requires service and a Call Home Operation is to be performed. Additional policies are used in combination with this flag: what subsystem performs the Call Home, what is sent, and where it is sent.
Hidden Error - This flag allows errors to be placed in a partition's OS error log, but still remain hidden from the customer. This is a legacy function; the partition firmware must filter errors marked “Hidden” and not forward them to the OS. Note that this flag has no impact on the SP reporting errors to either the HMC or hypervisor, or on the hypervisor reporting errors to partitions.
Don't report Error to hypervisor - While a partition is booting and before it is functional (e.g. no OS error logging available), partition errors may be sent through the hypervisor to the Service Processor. These partition errors (and only partition errors) may be marked with this flag to indicate that they need not be sent back to the hypervisor, because the error scope is limited to the failing partition and the hypervisor has already taken the appropriate actions.
Incomplete Information for Error Isolation - Some errors are not contained to a single enclosure and require error isolation from an entity with a broader system view / scope.
Software Error - This flag is used by the partition error logger to indicate that the error is most likely caused by software. When both the Software Error and Hardware Error flags are set, the error is caused by either software or hardware. The Software Error and Hardware Error flags are used to trigger the manufacturer’s support system to automatically download software or firmware fixes.
Hardware Error - This flag is used by the partition error logger to indicate that the error is most likely caused by hardware. The Software Error and Hardware Error flags are used to trigger the manufacturer’s support system to automatically download software or firmware fixes.

Platform Event Log Format, Logical Resource Identification section
Platform Event Log Format, Version 6, Logical Resource Identification Section
Offset  Length in Bytes  Description
0x00    2                Section ID: A two-ASCII character field which uniquely identifies the type of section. value = 'LR'
0x02    2                Section length: Length in bytes of the section, including the section ID. value = 20
0x04    1                Section Version
0x05    1                Section subtype
0x06    2                Creator Component ID
0x08    1                Resource Type
                           0x10: Processor
                           0x11: Shared processor
                           0x40: Memory page
                           0x41: Memory LMB
                           All other values = reserved
0x09    1                Reserved
0x0A    2                Entitled Capacity: Hundredths of a CPU (only used for Resource Type = Shared processor, value = 0x0000 for others)
0x0C    4                Logical CPU ID, for resource type = processor
                         DRC Index, for resource type = memory LMB
                         Memory Logical Address (bit 0-31), for resource type = memory page
0x10    4                Memory Logical Address (bit 32-63), for resource type = memory page
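The layout above maps directly onto a fixed 20-byte record. The following is a minimal sketch of decoding it, not a reference implementation. It assumes multi-byte fields are big-endian and that "bit 0-31" denotes the high-order word of the 64-bit address; the function and dictionary names are illustrative only.

```python
import struct

# Assumption: big-endian fields. 'x' skips the reserved byte at offset 0x09.
LR_FORMAT = ">2sHBBHBxHII"  # 20 bytes in total

RESOURCE_TYPES = {0x10: "Processor", 0x11: "Shared processor",
                  0x40: "Memory page", 0x41: "Memory LMB"}

def parse_lr_section(buf):
    """Decode a Logical Resource Identification ('LR') section (hypothetical helper)."""
    (sec_id, length, _version, _subtype, _creator,
     rtype, capacity, word_0c, word_10) = struct.unpack_from(LR_FORMAT, buf)
    if sec_id != b"LR" or length != 20:
        raise ValueError("not a Logical Resource Identification section")
    info = {"resource_type": RESOURCE_TYPES.get(rtype, "reserved")}
    if rtype == 0x10:
        info["logical_cpu_id"] = word_0c
    elif rtype == 0x11:
        # Entitled Capacity is expressed in hundredths of a CPU.
        info["entitled_capacity_cpus"] = capacity / 100
    elif rtype == 0x41:
        info["drc_index"] = word_0c
    elif rtype == 0x40:
        # Assumption: offset 0x0C is the high word, offset 0x10 the low word.
        info["logical_address"] = (word_0c << 32) | word_10
    return info
```

Only the fields that the table defines for a given Resource Type are returned, so a caller cannot accidentally read the Entitled Capacity of a memory page.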

Platform Event Log Format, Primary SRC Section
Platform Event Log Format, Version 6, Primary SRC Section
Offset  Length in Bytes  Description
0x00    2                Section ID: A two-ASCII character field which uniquely identifies the type of section. value = 'PS'
0x02    2                Section length: Length in bytes of the section, including the section ID. value = 80 + optional FRU call out sub section
0x04    1                Section Version
0x05    1                Section Subtype
0x06    2                Creator component ID
0x08    1                SRC Version
0x09    1                SRC Flags
                           bit 0:6 = Platform specific information
                           bit 7 = 1: Additional/Optional sub-sections present
0x0A    6                Platform specific information
0x10    4                Extended Reference Code hex data word 2 (required)
0x14    4                Extended Reference Code hex data word 3 (optional)
0x18    4                Extended Reference Code hex data word 4 (optional)
0x1C    4                Extended Reference Code hex data word 5 (optional)
0x20    4                Extended Reference Code hex data word 6 (optional)
0x24    4                Extended Reference Code hex data word 7 (optional)
0x28    4                Extended Reference Code hex data word 8 (optional)
0x2C    4                Extended Reference Code hex data word 9 (optional)
0x30    32               Primary Reference Code: 32 byte ASCII character (required)
Additional/Optional Sub section for FRU call out (present only for “Platform Error” event type)
0x00    1                Sub section ID = C0 for FRU call out
0x01    1                Platform specific information
0x02    2                Length of sub section: expressed in # of words (4 bytes), from Sub section ID field
The sub section header is followed by the FRU call outs, each occupying its own FRU call out structure length:
                           FRU call out 1 (see FRU call out structure format below)
                           FRU call out 2 (call outs 2-10 are optional)
                           ...
                           FRU call out 10 (maximum)
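As an illustration of the fixed 80-byte part of this section, the sketch below unpacks the header, the eight hex data words, and the reference code. It assumes big-endian fields and IBM bit numbering (bit 7 is the least significant bit, mask 0x01); stripping trailing NULs and spaces from the reference code is also an assumption, and all names are hypothetical.

```python
import struct

# Assumption: big-endian fields. 8 + 2 + 6 + 8*4 + 32 = 80 bytes.
SRC_FORMAT = ">2sHBBHBB6s8I32s"

def parse_primary_src(buf):
    """Decode the fixed part of a Primary SRC ('PS') section (hypothetical helper)."""
    fields = struct.unpack_from(SRC_FORMAT, buf)
    sec_id, _length, _version, _subtype, _creator, src_version, flags = fields[:7]
    if sec_id != b"PS":
        raise ValueError("not a Primary SRC section")
    words = fields[8:16]   # hex data words 2 through 9
    refcode = fields[16]   # 32-byte ASCII Primary Reference Code
    return {
        "src_version": src_version,
        # Bit 7 of SRC Flags: additional/optional sub-sections present.
        "has_subsections": bool(flags & 0x01),
        "hex_words": {2 + i: w for i, w in enumerate(words)},
        # Assumption: unused trailing bytes are NULs or spaces.
        "reference_code": refcode.decode("ascii").rstrip("\0 "),
    }
```

When `has_subsections` is true, the FRU call out sub section would start at offset 80 of the section; walking it is left out of this sketch.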

Platform Event Log Format, Version 6, FRU Call-out Structure
Offset  Length in Bytes  Description
0x00    1                Call-out Structure length, in bytes, including all fields (this one included).
0x01    1                Call-out Type / Flags
                           bits 0-3: Call-out structure type; 0b0010 = this structure
                           bit 4 = 1: FRU Identity (ID) Substructure field included in this FRU Call-out structure
                           bit 5 = 1: Other platform-only use substructure field present following FRU ID substructure
                           bit 6-7 = 0b11: Other platform-only use substructure field present following FRU ID substructure
0x02    1                FRU Replacement or Maintenance Procedure Priority (expressed as an ASCII character, see additional description following the table)
                           'H' = High priority and mandatory call-out.
                           'M' = Medium priority.
                           'A' = Medium priority group A (1st group).
                           'B' = Medium priority group B (2nd group).
                           'C' = Medium priority group C (3rd group).
                           'L' = Low priority.
0x03    1                Length of Location Code field - must be a multiple of 4.
0x04    variable max=80  Location Code: NULL-terminated ASCII string. May be up to 80 characters including the NULL. Padded with extra NULLs to a 4-byte boundary.
The FRU Identity Substructure follows:
0x00    2                Substructure Type (2 ASCII characters); 'ID' = FRU Identity Substructure
0x02    1                Substructure length (variable, several optional fields - see flags below)
0x03    1                Flags
                           bits 0-3: Failing component Type (see additional description following the table)
                             0b0000: reserved
                             0b0001: “normal” hardware FRU
                             0b0010: code FRU
                             0b0011: configuration error, configuration procedure required
                             0b0100: Maintenance Procedure required
                             0b1001: External FRU
                             0b1010: External code FRU
                             0b1011: Tool FRU
                             0b1100: Symbolic FRU
                             0b1111: Reserved for expansion
                             all other values reserved
                           bit 4 (0x08) = 0b1: FRU Stocking Part Number supplied (mutually exclusive with bit 6)
                           bit 5 (0x04) = 0b1: CCIN supplied (only valid if bit 4 = 0b1)
                           bit 6 (0x02) = 0b1: Maintenance procedure call out supplied (mutually exclusive with FRU p/n)
                           bit 7 (0x01) = 0b1: FRU Serial Number supplied (only valid if bit 4 = 0b1)
0x04    8, if present    FRU Stocking Part Number (VPD FN keyword) or Procedure ID. This field is present if Flags bit 4 = 0b1 or Flags bit 6 = 0b1. It contains a NULL-terminated ASCII character string. If Flags bit 4 = 0b1, this field contains a 7 ASCII character part number. If Flags bit 6 = 0b1, this field contains a 5 ASCII character procedure ID.
0x0C    4, if present    CCIN (VPD CC keyword) (optional, only supplied if Part Number also supplied). This field is present if Flags bit 5 = 0b1. It contains the CCIN of the failing FRU (VPD CC keyword), represented as 4 ASCII characters (not a NULL-terminated string).
        12, if present   FRU Serial Number (VPD SE keyword) (optional). This field is present if Flags bit 7 = 0b1. It contains the serial number of the failing FRU (VPD SE keyword), represented as 12 ASCII characters (not a NULL-terminated string).
End of FRU Identity Substructure
        variable         Other platform-only use substructure field
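The Flags byte of the FRU Identity Substructure packs a 4-bit component type and four presence bits, with validity rules between them. A minimal sketch of decoding that byte follows, using the masks given in the table (bit 4 = 0x08 through bit 7 = 0x01); the function name and the returned keys are illustrative, not part of the format.

```python
# Failing component types from bits 0-3 (the high-order nibble).
COMPONENT_TYPES = {
    0b0001: "normal hardware FRU",
    0b0010: "code FRU",
    0b0011: "configuration error",
    0b0100: "maintenance procedure required",
    0b1001: "external FRU",
    0b1010: "external code FRU",
    0b1011: "tool FRU",
    0b1100: "symbolic FRU",
}

def decode_fru_id_flags(flags):
    """Decode the Flags byte at offset 0x03 (hypothetical helper)."""
    part_number = bool(flags & 0x08)   # bit 4
    procedure = bool(flags & 0x02)     # bit 6
    if part_number and procedure:
        raise ValueError("bits 4 and 6 are mutually exclusive")
    return {
        "component_type": COMPONENT_TYPES.get(flags >> 4, "reserved"),
        "part_number": part_number,
        # Bits 5 and 7 are only valid when bit 4 is set.
        "ccin": part_number and bool(flags & 0x04),
        "procedure_id": procedure,
        "serial_number": part_number and bool(flags & 0x01),
    }
```

For example, a Flags value of 0x1D describes a normal hardware FRU with part number, CCIN, and serial number all supplied.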

FRU Replacement or Maintenance Procedure Priority
This field defines the service priority of the specific call-out, i.e., replacing the FRU part number or performing the maintenance procedure ID as given in the FRU/Procedure Identity substructure. The priorities are:
'H' = High priority and mandatory call-out. Replacing the FRU (or performing the maintenance procedure) is mandatory. If multiple call-outs with 'H' priority are given, all must be replaced or performed as a group.
'M' = Medium priority. Replace the FRUs (or perform the maintenance procedures) with 'M' priority one at a time, in the order given, after all call-outs prior to this one, if present, have been performed.
'A' = Medium priority group A (1st group). Replace all the FRUs with 'A' priority as a group after all call-outs prior to this group, if present, have been performed.
'B' = Medium priority group B (2nd group). Replace all the FRUs with 'B' priority as a group after all call-outs prior to this group, if present, have been performed.
'C' = Medium priority group C (3rd group). Replace all the FRUs with 'C' priority as a group after all call-outs prior to this group, if present, have been performed.
'L' = Low priority. If the problem still persists after all the prior call-outs, if present, have been performed, replace the FRUs with this priority one at a time in the order given.
The list of FRU/Procedure call-outs in the “call-out” subsection of the SRC structure must be in the order defined above, i.e. High, Medium, Low. 'M' has the same medium priority level as 'A', 'B', or 'C', and a call-out with 'M' priority can precede or follow 'A', 'B' or 'C'. A group call-out must be contiguous in the list. Within the medium priority level, follow the call-out order in the list. A list without High or Medium priority is also valid.
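The ordering rules for a call-out list can be checked mechanically. The sketch below is one possible validator, written under the reading that levels must not decrease (High, then Medium, then Low) and that each of the groups 'A', 'B', 'C' must be contiguous; it does not enforce any order among the medium priorities themselves. The function name is hypothetical.

```python
# 'M', 'A', 'B' and 'C' all share the medium level.
LEVEL = {"H": 0, "M": 1, "A": 1, "B": 1, "C": 1, "L": 2}
GROUPS = {"A", "B", "C"}

def callout_order_is_valid(priorities):
    """Return True if the priority characters follow the ordering rules."""
    level = 0
    finished = set()   # groups that have already ended
    previous = None
    for p in priorities:
        # Reject unknown characters, a drop back to a higher priority level,
        # and a group that resumes after being interrupted.
        if p not in LEVEL or LEVEL[p] < level or p in finished:
            return False
        if previous in GROUPS and p != previous:
            finished.add(previous)
        level, previous = LEVEL[p], p
    return True
```

For instance, the sequence H, H, M, A, A, B, L passes, while A, M, A fails because group A is not contiguous.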

Failing Component Type Description Normal Hardware FRU: Hardware FRU in the platform which the platform firmware or code can positively identify, and its VPD contains the part number and associated information. Code FRU: Some layer of platform firmware or OS code is suspected. The procedure ID field provides additional information about which code(s) is/are the potential problem. Configuration error: The problem may be related to how hardware or code is configured. For example, an adapter is plugged in a slot that cannot support it. The FRU could be a procedure or a symbolic FRU. The reason to use one of these is if the analysis can provide more information to the customer and service provider by giving a location code. Maintenance procedure required: Further isolation of the problem is required by performing the procedure as identified in the Procedure ID field. Procedures are designed to help to isolate problems and guide the service provider through identifying which FRUs to replace in which order. Symbolic FRU: Used for a single FRU where the analysis code knows exactly what the part is but there is no part number, or the part number cannot be pulled from VPD, or when there is something special (like a procedure) for acquiring the FRU or working with it. Examples are cables, or FRUs without VPD (so a part number cannot be filled in). The term “Symbolic” simply means “not an actual part number”. External FRU: A failing part(s) which is/are not in the system, e.g. attached storage sub-system, network hubs/switches, external drives like CD/DVD boxes. External Code: Code not running in the platform but is the potential source of the error. This could be something like storage subsystem code or even another system in the same cluster. Tool FRU: This is a special tool that will be required by one of the FRUs in the list. 
Tools are only added as FRUs when they are not part of the CE tool kit and therefore the repair action could be delayed if the CE did not know to bring it. Examples are Optical Cleaning Kits for fiber channel, and special tools for torque or reach or weight considerations.

Platform Event Log Format, Dump Locator Section
Platform Event Log Format, Version 6, Dump Locator Section
Offset  Length in Bytes  Description
0x00    2                Section ID: A two-ASCII character field which uniquely identifies the type of section. value = 'DH'
0x02    2                Section length: Length in bytes of the section, including the section ID. value = 64
0x04    1                Section Version
0x05    1                Section Sub-Type
                           0x00 = Log truncated, complete log received by another service entity.
                           0x01 = FSP Dump
                           0x02 = Platform System Dump
                           0x03 = Reserved
                           0x04 = Power Subsystem Dump
                           0x05 = Platform Event Log Entry Dump (when distinguishing between dump types, the term “Log Dump” is typically used)
                           0x06 = Partition-initiated resource dump
                           0x07 = Platform-initiated resource dump
                           All other values reserved
0x06    2                Creator component ID
0x08    4                Dump ID
0x0C    1                Flags
                           bit 0 (0x80) = 0, Dump sent to partition
                           bit 0 = 1, Dump sent to HMC
                           bit 1 (0x40) = 0, File name in ASCII
                           bit 1 = 1, Dump file name is hex
                           bit 2 (0x20) = 1, Dump size field valid
0x0D    2                Reserved
0x0F    1                Length of OS assigned Dump ID field in bytes, must be a multiple of 4. May be 0.
0x10    8                Dump Size
0x18    40               OS-Assigned Dump ID. As the flag field indicates, this field may either be an ASCII string or a hex number. When an ASCII string (AIX, Linux, HMC), this is a NULL-terminated ASCII string representing the dump file name (leaf name only, does not include path). Field may be up to 40 characters including the NULL.
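The Flags byte controls how two other fields are read: whether Dump Size is valid and whether the OS-Assigned Dump ID is an ASCII name or a hex number. A minimal sketch of that dependency follows, assuming big-endian fields and the Creator component ID at offset 0x06; the helper name and result keys are illustrative.

```python
import struct

# Assumption: big-endian fields. '2x' skips the reserved bytes at 0x0D.
DUMP_FORMAT = ">2sHBBHIB2xBQ40s"  # 64 bytes in total

def parse_dump_locator(buf):
    """Decode a Dump Locator ('DH') section (hypothetical helper)."""
    (sec_id, _length, _version, subtype, _creator, dump_id,
     flags, id_len, dump_size, os_id) = struct.unpack_from(DUMP_FORMAT, buf)
    if sec_id != b"DH":
        raise ValueError("not a Dump Locator section")
    os_id = os_id[:id_len]
    if flags & 0x40:
        # Bit 1 set: the dump file name is a hex number.
        name = os_id.hex()
    else:
        # Bit 1 clear: NULL-terminated ASCII leaf name.
        name = os_id.split(b"\0", 1)[0].decode("ascii")
    return {
        "sub_type": subtype,
        "dump_id": dump_id,
        "sent_to_hmc": bool(flags & 0x80),
        # Bit 2 marks the Dump Size field as valid.
        "dump_size": dump_size if flags & 0x20 else None,
        "os_dump_id": name,
    }
```

Returning `None` for an invalid Dump Size keeps callers from treating an unset field as a zero-length dump.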

Platform Event Log Format, EPOW Section
Platform Event Log Format, Version 6, EPOW Section
Offset  Length in Bytes  Description
0x00    2                Section ID: A two-ASCII character field which uniquely identifies the type of section. value = 'EP'
0x02    2                Section length: Length in bytes of the section, including the section ID.
0x04    1                Section Version
0x05    1                Section subtype
0x06    2                Creator Component ID
0x08    1                EPOW Sensor Value (low-order 4 bits contain the action code).
0x09    1                EPOW Event Modifier (low-order 4 bits contain the event modifier value)
                           0x00 = Not applicable
                           For EPOW sensor value = 3:
                             0x01 = Normal system shutdown with no additional delay
                             0x02 = Loss of utility power, system is running on UPS/Battery
                             0x03 = Loss of system critical functions, system should be shutdown
                             0x04 = Ambient temperature too high
                           All other values = reserved
0x0A    1                Extended Modifier for Section Version 2 and higher
                           For EPOW Sensor Value = 3:
                             0x00 = System wide shutdown
                             0x01 = Partition specific shutdown
                             0x02 - 0xFF = Reserved
                           All other situations: Reserved = 0x00
0x0B    1                Reserved
0x0C    8                Platform specific reason code
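Because both the sensor value and the event modifier live in the low-order 4 bits of their bytes, and the modifier and extended modifier are only defined for sensor value 3, decoding takes a little care. The following sketch shows one way to do it, assuming big-endian fields; the names and the result layout are illustrative.

```python
import struct

# Assumption: big-endian fields. 'x' skips the reserved byte at 0x0B.
EPOW_FORMAT = ">2sHBBHBBBx8s"  # 20 bytes in total

MODIFIERS = {
    0x1: "normal system shutdown with no additional delay",
    0x2: "loss of utility power, system is running on UPS/battery",
    0x3: "loss of system critical functions, system should be shut down",
    0x4: "ambient temperature too high",
}
SCOPES = {0x00: "system wide shutdown", 0x01: "partition specific shutdown"}

def parse_epow(buf):
    """Decode an EPOW ('EP') section (hypothetical helper)."""
    (sec_id, _length, version, _subtype, _creator,
     sensor, modifier, extended, reason) = struct.unpack_from(EPOW_FORMAT, buf)
    if sec_id != b"EP":
        raise ValueError("not an EPOW section")
    action = sensor & 0x0F      # low-order 4 bits hold the action code
    mod = modifier & 0x0F       # low-order 4 bits hold the modifier
    if mod == 0:
        mod_text = "not applicable"
    elif action == 3:
        mod_text = MODIFIERS.get(mod, "reserved")
    else:
        mod_text = "reserved"   # non-zero modifiers are defined only for sensor value 3
    event = {"action_code": action, "event_modifier": mod_text, "reason_code": reason}
    if action == 3 and version >= 2:
        event["shutdown_scope"] = SCOPES.get(extended, "reserved")
    return event
```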

Platform Event Log Format, IO Events Section
Platform Event Log Format, Version 6, IO Events Section
Offset  Length in Bytes  Description
0x00    2                Section ID: A two-ASCII character field which uniquely identifies the type of section. value = 'IE'
0x02    2                Section length: Length in bytes of the section, including the section ID.
0x04    1                Section Version
0x05    1                Section subtype
0x06    2                Creator Component ID
0x08    1                IO-Event Type:
                           0x01 = Error Detected
                           0x02 = Error Recovered
                           0x03 = Event
                           0x04 = RPC Pass Through
                           All other values = Reserved
0x09    1                Offset 0x10 Field Length: For IO Event Type of RPC Pass Through, this field specifies the length of the data field which begins at offset 0x10; otherwise the value in this field is 0. Must be a multiple of 4 to maintain 4-byte alignment.
0x0A    1                Error/Event Scope:
                           0x00 = Not Applicable (use for IO-Event type 0x02, 0x03, 0x04)
                           0x36 = Reserved
                           0x37 = Reserved
                           0x38 = PHB
                           0x39 = Reserved
                           0x3A = Reserved
                           0x3B = Reserved
                           0x51 = Service Processor
                           All other values = Reserved
0x0B    1                I/O-Event Sub-Type:
                           0x00 = Not Applicable (use for IO-Event type 0x01, 0x02, 0x04)
                           0x01 = Rebalance request
                           0x03 = Node online
                           0x04 = Node off-line
                           0x05 = platform-dump-max-size change
                           0x08 = Generic Notification
                           0x09 = Platform protection of NVDIMM contents enabled
                           0x0A = Platform protection of NVDIMM contents disabled
                           All other values = Reserved
0x0C    4                DRC Index
0x10    0-216            For the RPC Pass Through IO Event Type: RPC data. Variable length data. Must be padded to 4-byte alignment.
                         For the platform-dump-max-size change I/O-Event Sub-Type: 8 bytes for the new value of the platform-dump-max-size system parameter (specifying the sum (in bytes) of the maximum size of each unique platform dump type that the ibm,platform-dump RTAS call could return).
                         For the Generic Notification I/O-Event Sub-Type: Scoped Data Generic Notification Event Section. Must be padded to 4-byte alignment.

Platform Event Log Format, Failing Enclosure MTMS
Platform Event Log Format, Version 6, Failing Enclosure MTMS
Offset  Length in Bytes  Description
0x00    2                Section ID: A two-ASCII character field which uniquely identifies the type of section. value = 'MT'
0x02    2                Section length: Length in bytes of the section, including the section ID. value = 28
0x04    1                Section Version
0x05    1                Section subtype
0x06    2                Creator Component ID
0x08    8                Machine Type and Model: 8 ASCII characters, in the form “tttt-mmm”, where tttt = Machine Type and mmm = Model Number
0x10    12               Serial Number: 12 ASCII characters. (If fewer than 12 characters are used, the string is left justified (stored in the field starting with the lowest address) and padded with NULLs.)
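A short sketch of reading this 28-byte section is shown below. It assumes big-endian header fields, and it only splits the Machine Type and Model field when it contains the “tttt-mmm” separator, since the field may instead carry a left-justified Feature Code. The helper name and the sample values used with it are made up for illustration.

```python
import struct

# Assumption: big-endian header fields. 8 + 8 + 12 = 28 bytes.
MTMS_FORMAT = ">2sHBBH8s12s"

def parse_mtms(buf):
    """Decode a Failing Enclosure MTMS ('MT') section (hypothetical helper)."""
    (sec_id, length, _version, _subtype, _creator,
     mtm, serial) = struct.unpack_from(MTMS_FORMAT, buf)
    if sec_id != b"MT" or length != 28:
        raise ValueError("not a Failing Enclosure MTMS section")
    # Assumption: unused trailing bytes of both fields are NULs.
    mtm = mtm.rstrip(b"\0").decode("ascii")
    machine_type, sep, model = mtm.partition("-")
    return {
        "machine_type_model": mtm,
        # Only meaningful when the field has the "tttt-mmm" form.
        "machine_type": machine_type if sep else None,
        "model": model if sep else None,
        # The serial number is left justified and NULL padded.
        "serial_number": serial.rstrip(b"\0").decode("ascii"),
    }
```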

The Failing Enclosure Machine Type, Model, and Serial Number (MTMS) that is associated with the error is important for service and support. The source of information for the MTMS fields varies according to the following: For CEC errors, it is the CEC enclosure MTMS. For errors in I/O enclosures (drawers and towers) that have their own MTMS and are sold as separate MTMS from the CEC, we use the I/O Drawer MTMS. For I/O enclosures that were sold as a feature, this section contains the Feature Code and Serial Number of the I/O enclosure. When the Feature Code is used, it is left justified in the Machine Type and Model field.

Platform Event Log Format, Impacted Partitions
Platform Event Log Format, Version 6, Impacted Partitions
Offset  Length    Contents (Byte 0 through Byte 3)
0       8         Section Header
0x10    4         Primary Partition ID; Length of LP name (must be a multiple of 4); Target LP Count
0x14    4         Logical Partition ID
0x18    variable  Primary Partition (LP) Name: NULL-terminated ASCII string, padded to a 4-byte boundary
        variable  Target LP 1, Target LP 2, and so on (padded to a 4-byte boundary)

This section describes partitions that are impacted by an error. When this section is supplied, the partitions in this list (and only these partitions) are notified of the error.

Platform Event Log Format, Failing Memory Address
Platform Error Event Log Format, Version 6, Failing Memory Address
Offset  Length in Bytes  Description
0x00    2                Section ID: A two-ASCII character field which uniquely identifies the type of section. value = 'MC'
0x02    2                Section length: Length in bytes of the section, including the section ID. value = 32
0x04    1                Section Version
0x05    1                Section Subtype
0x06    2                Creator Component ID
0x08    4                FRU ID -- identifies the FRU on which the machine check interrupt occurred
0x0C    4                Processor ID -- identifies the physical CPU on which the machine check occurred
0x10    1                Type of machine check interrupt
                           0x00 = Uncorrectable Memory Error (UE)
                           0x01 = SLB error
                           0x02 = ERAT Error
                           0x04 = TLB error
                           0x05 = D-Cache error
                           0x07 = I-Cache error
0x11    23               Information specific to machine check interrupt type. This field is binary zeroes if the platform does not provide specific information for the type of interrupt.

UE Error Information
Offset  Length in Bytes  Description
0x11    1                Type of UE
                           Bit 0 = 0: Permanent UE. The UE may be cleared with a DCBZ instruction.
                           Bit 0 = 1: Transient UE. The UE cannot be cleared with a DCBZ instruction. The contents of the entire logical page are not accessible for this type of UE.
                         64 bit effective address is provided
                           Bit 1 = 0: 64 bit effective address is not provided by the log
                           Bit 1 = 1: 64 bit effective address is provided by the log. Offset 0x18 provides the effective address if this bit is 1
                         64 bit logical address is provided
                           Bit 2 = 0: 64 bit logical address of logical page is not provided by the log
                           Bit 2 = 1: 64 bit logical address of logical page is provided by the log. Offset 0x20 provides the logical address of the page if this bit is 1
                         Bit 3-4: Reserved
                         Bit 5-7: Type of UE machine check interrupt. The value of the field is 0b000 for a permanent UE
                           0b000 = Platform cannot determine the processor unit that detected the error
                           0b001 = Error detected by instruction fetch unit of the processor
                           0b010 = Error during page table search for instruction fetch
                           0b011 = Error detected by load/store unit of the processor
                           0b100 = Error detected during page table search for load/store type of instruction
                           All other values are reserved.
0x12    6                Reserved
0x18    8                64 bit effective address
0x20    8                64 bit logical address
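The byte at offset 0x11 combines three single-bit flags with a 3-bit detector code. The sketch below decodes it under the assumption that bit 0 is the most significant bit (mask 0x80), which matches the bit/mask pairs given for other flag bytes in this format; the helper and key names are illustrative.

```python
UE_DETECTORS = {
    0b000: "undetermined processor unit",
    0b001: "instruction fetch unit",
    0b010: "page table search for instruction fetch",
    0b011: "load/store unit",
    0b100: "page table search for load/store",
}

def decode_ue_type(value):
    """Decode the 'Type of UE' byte at offset 0x11 (hypothetical helper)."""
    return {
        "transient": bool(value & 0x80),              # bit 0
        "has_effective_address": bool(value & 0x40),  # bit 1, address at 0x18
        "has_logical_address": bool(value & 0x20),    # bit 2, address at 0x20
        "detected_by": UE_DETECTORS.get(value & 0x07, "reserved"),  # bits 5-7
    }
```

A consumer would read the addresses at offsets 0x18 and 0x20 only when the corresponding flag is set.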

SLB Error Information
Offset  Length in Bytes  Description
0x11    1                64 bit effective address is provided
                           Bit 0 = 0: 64 bit effective address not provided by the log
                           Bit 0 = 1: 64 bit effective address provided by the log. Offset 0x18 provides the effective address if bit 0 is 1
                         Bit 1-5: Reserved
                         Bit 6-7: Type of SLB error
                           0b00 = Parity error in the SLB array or on the access path to the SLB
                           0b01 = Multiple hit error. There are two or more entries in the SLB that translate the same effective address
                           0b10 = Multiple hit error or parity error. Platform does not have enough information to disambiguate between the two cases.
                           All other values are reserved.
0x12    6                Reserved
0x18    8                64 bit effective address
0x20    8                Reserved

ERAT Error Information
Offset  Length in Bytes  Description
0x11    1                64 bit effective address is provided
                           Bit 0 = 0: 64 bit effective address not provided by the log
                           Bit 0 = 1: 64 bit effective address provided by the log. Offset 0x18 provides the effective address if bit 0 is 1
                         Bit 1-5: Reserved
                         Bit 6-7: Type of ERAT error
                           0b01 = Parity error in the ERAT array
                           0b10 = Multiple hit error. There are two or more entries in the ERAT array that translate the same effective address
                           0b11 = Multiple hit error or parity error in the ERAT array. Platform does not have enough information to disambiguate between the two cases.
                           All other values are reserved.
0x12    6                Reserved
0x18    8                64 bit effective address
0x20    8                Reserved

TLB Error Information
Offset  Length in Bytes  Description
0x11    1                64 bit effective address is provided
                           Bit 0 = 0: 64 bit effective address not provided by the log
                           Bit 0 = 1: 64 bit effective address provided by the log. Offset 0x18 provides the effective address if bit 0 is 1
                         Bit 1-5: Reserved
                         Bit 6-7: Type of TLB error
                           0b01 = Parity error in the TLB array
                           0b10 = Multiple hit error. There are two or more entries in the TLB that translate the same effective address
                           0b11 = Multiple hit error or parity error in the TLB array. Platform does not have enough information to disambiguate between the two cases.
                           All other values are reserved.
0x12    6                Reserved
0x18    8                64 bit effective address
0x20    8                Reserved
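The SLB, ERAT, and TLB tables share one byte layout, but the 2-bit error codes differ: the SLB encoding starts at 0b00, while the ERAT and TLB encodings start at 0b01. A decoder therefore needs a table per error kind, as in the following sketch. It assumes bit 0 is the most significant bit (mask 0x80); the names are illustrative.

```python
# Error-type encodings for bits 6-7, keyed by the kind of translation error.
TRANSLATION_ERRORS = {
    "SLB": {0b00: "parity error", 0b01: "multiple hit error",
            0b10: "multiple hit error or parity error"},
    "ERAT": {0b01: "parity error", 0b10: "multiple hit error",
             0b11: "multiple hit error or parity error"},
    "TLB": {0b01: "parity error", 0b10: "multiple hit error",
            0b11: "multiple hit error or parity error"},
}

def decode_translation_error(kind, value):
    """Decode the byte at offset 0x11 for an SLB, ERAT or TLB error (hypothetical helper)."""
    codes = TRANSLATION_ERRORS[kind]
    return {
        "has_effective_address": bool(value & 0x80),     # bit 0, address at 0x18
        "error": codes.get(value & 0x03, "reserved"),    # bits 6-7
    }
```

The same byte value can thus mean different things: 0x01 is a multiple hit error for the SLB but a parity error for the ERAT and TLB.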

For an error log that has the machine check interrupt section filled out, the platform is not required to provide the date and time stamp in the main-a section. The fields will be binary zeroes if the date and time stamp is not provided.

Platform Event Log Format, Hotplug Section
Platform Error Event Log Format, Version 6, Hotplug Section
Offset  Length in Bytes  Description
0x00    2                Section ID: A two-ASCII character field which uniquely identifies the type of section. Value = “HP”.
0x02    2                Section length: Length in bytes of the section, including the section ID.
0x04    1                Section Version
0x05    1                Section subType
0x06    2                Creator Component ID
0x08    1                Hotplug Resource Type
                           0x01 = CPU
                           0x02 = Memory
                           0x03 = SLOT
                           0x04 = PHB
                           0x05 = PCI
0x09    1                Hotplug Action
                           0x01 = Add
                           0x02 = Remove
0x0A    1                Hotplug Identifier Type
                           0x01 = drc name, resource is identified by drc name
                           0x02 = drc index, resource is identified by drc index
                           0x03 = drc count, number of resources to act upon
                           0x04 = drc count indexed, number of resources to act upon beginning at the specified drc index
0x0B    1                Hotplug Event Capability, by bit:
                           Bit 0: 1 = Hotplug Token Present
                           Bit 1: 0 = Transactional Request: When using “drc count” or “drc count indexed” as the Hotplug Identifier, the OS should take steps to verify the entirety of the request can be satisfied before proceeding with the hotplug / unplug operations. If only a partial count can be satisfied, the OS should ignore the entirety of the request. If the OS cannot determine this beforehand, it should satisfy the hotplug / unplug request for as many of the requested resources as possible, and attempt to revert to the original OS / DRC state.
                           Bit 1: 1 = Non-transactional Request: When using “drc count” or “drc count indexed” as the Hotplug Identifier, the OS should attempt to satisfy as much of the request as possible, even if it cannot be satisfied for all the DRCs specified.
                           Bit 2:7: Reserved
0x0C    Variable         Hotplug Identifier: Variable length field depending on the Hotplug Identifier Type specified.
                           For drc name, this field is a null-terminated ASCII character field containing the drc name of the resource to hotplug.
                           For drc index, this is a 4 byte field with the drc index of the resource to hotplug.
                           For drc count, this is a 4 byte field with the number of resources to hotplug.
                           For drc count indexed, this is two 4 byte fields, the first being the number of resources to hotplug and the second being the drc index at which to start.
(Section Length - 4)  4  Hotplug Token: Present only if the corresponding Hotplug Event Capability bit is set. Integer value that can be used in conjunction with other fields of the hotplug event structure (Hotplug Identifier, Hotplug Type, etc.) to allow the OS to associate a hotplug event with the request which generated it, for the purposes of providing feedback to the requestor, such as debugging or error information.
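The Hotplug Identifier is the only variable-length field here, and its shape depends on the Hotplug Identifier Type. The sketch below illustrates one way to dispatch on that type and to pick up the optional token from the last four bytes of the section. It assumes big-endian fields and that bit 0 is the most significant bit of the capability byte; the helper name and result keys are illustrative.

```python
import struct

RESOURCES = {1: "CPU", 2: "Memory", 3: "SLOT", 4: "PHB", 5: "PCI"}
ACTIONS = {1: "Add", 2: "Remove"}

def parse_hotplug(buf):
    """Decode a Hotplug ('HP') section (hypothetical helper)."""
    (sec_id, length, _version, _subtype, _creator,
     rtype, action, id_type, caps) = struct.unpack_from(">2sHBBHBBBB", buf)
    if sec_id != b"HP":
        raise ValueError("not a Hotplug section")
    if id_type == 1:      # drc name: NULL-terminated ASCII string
        identifier = buf[12:length].split(b"\0", 1)[0].decode("ascii")
    elif id_type in (2, 3):   # drc index or drc count: one 4-byte value
        (identifier,) = struct.unpack_from(">I", buf, 12)
    elif id_type == 4:    # drc count indexed: (count, starting drc index)
        identifier = struct.unpack_from(">II", buf, 12)
    else:
        raise ValueError("unknown Hotplug Identifier Type")
    token = None
    if caps & 0x80:       # bit 0: token occupies the last 4 bytes of the section
        (token,) = struct.unpack_from(">I", buf, length - 4)
    return {
        "resource": RESOURCES.get(rtype, "unknown"),
        "action": ACTIONS.get(action, "unknown"),
        "identifier": identifier,
        "transactional": not caps & 0x40,   # bit 1 = 0 means transactional
        "token": token,
    }
```

An OS handler could use the `transactional` flag to decide whether a partially satisfiable “drc count” request should be attempted or ignored as a whole.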