From edb08c49e897492d8fe0cade865d6922b1819917 Mon Sep 17 00:00:00 2001 From: Jeff Scheel Date: Thu, 16 Apr 2020 21:22:26 -0500 Subject: [PATCH] Add 'OCC online/offline' events to 'IE' error log subsection Signed-off-by: Jeff Scheel --- ...sec_rtas_error_reporting_return_format.xml | 583 +++++++++--------- 1 file changed, 292 insertions(+), 291 deletions(-) diff --git a/Error Handling/sec_rtas_error_reporting_return_format.xml b/Error Handling/sec_rtas_error_reporting_return_format.xml index 771aaf9..7a33768 100644 --- a/Error Handling/sec_rtas_error_reporting_return_format.xml +++ b/Error Handling/sec_rtas_error_reporting_return_format.xml @@ -1,7 +1,7 @@ -
RTAS Error/Event Return Format - + This section describes in detail the return value retrieved by an - RTAS call to either the - event-scan or + RTAS call to either the + event-scan or check-exception function. - + The return value consists of a fixed part and an optional Extended Error Report, described in the next section, which contains full details of the error. The fixed part is intended to allow reporting the most @@ -36,15 +36,15 @@ strategy. At the same time, the mechanism is capable of providing full disclosure of the error syndrome information for OSs which have a more complete error handling strategy. - + RTAS can return at most one return code per invocation. If multiple conditions exist, RTAS returns them in descending order of severity on successive calls. - +
Reporting and Recovery Philosophy, and Description of Fields - + All firmware implementations use a common error and event reporting scheme, as described in detail below. It is not required that error recovery be present in firmware implementations, nor is it required that @@ -56,7 +56,7 @@ Report format, and the philosophy which should be applied in generating return values from firmware or interpreting such return codes in an OS. - + In general, an OS would look at the Disposition field first to see if an error has been corrected already by firmware. If not corrected to the OS’s satisfaction, the OS would examine the Severity field. @@ -65,35 +65,35 @@ continue or to halt operations. In either case, it may choose to log information regarding the error, using the remaining fields and optional Extended Error Log. - - The following sections describe the field values in + + The following sections describe the field values in . - +
Version - + This field is used only to distinguish among present and potential future formats for the remainder of the error report. This value will be incremented if extensions are made to the format described here. The primary function of this field is for future OSs to identify whether an error report may contain some (unknown at present) feature that was added after the initial version of this specification. - +
- +
Severity - + This field represents the value judgment of firmware of how serious the problem being reported should be considered by the OS. - + Errors which are believed to represent a permanent hardware failure affecting the entire system are considered “FATAL.” OSs would not attempt to continue normal operation after receiving notice of such an error. OSs may not even be able to perform an orderly shutdown in the presence of a Fatal error, though they may make a policy decision to try. - + Less serious errors, but still causing a loss of data or state, are considered “ERRORs.” In general, continuing after such an error is questionable, since details of what has failed may not be @@ -101,7 +101,7 @@ with which the OS can associate it. However, OSs may make a policy decision (for example, based on the error Type, the Initiator, or the Target) to continue operation after an Error. - + There are some types of errors, such as parity errors in memory or a parity error on a transfer between CPU and memory, which occur synchronously with the current process execution context. Such errors are @@ -116,33 +116,33 @@ are reported as having Severity “ERROR_SYNC”. It is OS dependent whether recovery is possible after such an error, or whether the OS will treat it as a fatal problem. - + The “WARNING” return value indicates that a non-state-losing error, either fully recovered by firmware or not needing recovery, has occurred. No OS action is required, and full operation is expected to continue unhindered by the error. Examples include corrected ECC errors or bus transfer failures which were re-tried successfully. - + The “EVENT” return value is the mechanism firmware uses to communicate event information to the OS. The event may have been - detected by polling using + detected by polling using event-scan or on the occurrence of an interrupt by - calling + calling check-exception. In either case, the Error Return value indicates the event which has occurred in the Type field. 
See the Type description below for a description of specific events and their expected handling. - + The “NO_ERROR” return value indicates that no error was present. In this case, the remainder of the Error Return fields are not valid and should not be referenced. - +
- +
RTAS Disposition - + An aggressive firmware implementation may choose to attempt recovery for some classes of error so an OS can continue operation in the face of recoverable errors. If firmware detects an error for which it has @@ -152,13 +152,13 @@ Severity says how serious an error was, and Disposition says, regardless of severity, whether or not the OS has to even look at it. In general, an OS will first examine Disposition, then Severity. - + A return value of “FULLY RECOVERED” means that RTAS was able to completely recover the machine state after the error, and OS operation can continue unhindered. The severity of the problem in this case is irrelevant, though for consistency a “FATAL” error can never be “FULLY RECOVERED.” - + A return value of “LIMITED RECOVERY” means that RTAS was able to recover the state of the machine, but that some feature of the machine has been disabled or lost (for example, error checking), or @@ -169,36 +169,36 @@ “NOT RECOVERED,” and initiate shutdown. A less conservative OS may choose to let the user decide whether to continue or to shut down. - + A value of “NOT RECOVERED” indicates that the RTAS either did not attempt recovery, or that it attempted recovery but was unsuccessful. - +
- +
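The "Disposition first, then Severity" ordering described above can be sketched in C as follows. This is a minimal illustration, not a definitive implementation. The bit positions and numeric field encodings are assumptions, because the fixed-part table that defines them is not reproduced in this excerpt. The names `rtas_fixed_part`, `decode_fixed_part`, `triage`, and the `conservative` policy flag are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encodings, for illustration only; the authoritative values
 * are in the "RTAS Event Return Format (Fixed Part)" table. */
enum severity {
    SEV_NO_ERROR = 0, SEV_EVENT = 1, SEV_WARNING = 2,
    SEV_ERROR_SYNC = 3, SEV_ERROR = 4, SEV_FATAL = 5
};
enum disposition {
    DISP_FULLY_RECOVERED = 0, DISP_LIMITED_RECOVERY = 1, DISP_NOT_RECOVERED = 2
};
/* Hypothetical OS-side outcomes of the triage. */
enum action {
    ACT_IGNORE, ACT_HANDLE_EVENT, ACT_LOG_AND_CONTINUE,
    ACT_POLICY_DECISION, ACT_HALT
};

struct rtas_fixed_part {
    uint8_t version, severity, disposition, extended;
    uint8_t initiator, target, type;
};

/* Assumed layout, most significant byte first: version (8 bits),
 * severity (3), disposition (2), presence flag (1), reserved (2),
 * initiator (4), target (4), type (8). */
struct rtas_fixed_part decode_fixed_part(uint32_t w)
{
    struct rtas_fixed_part f;
    f.version     = (uint8_t)((w >> 24) & 0xFF);
    f.severity    = (uint8_t)((w >> 21) & 0x7);
    f.disposition = (uint8_t)((w >> 19) & 0x3);
    f.extended    = (uint8_t)((w >> 18) & 0x1);
    f.initiator   = (uint8_t)((w >> 12) & 0xF);
    f.target      = (uint8_t)((w >> 8) & 0xF);
    f.type        = (uint8_t)(w & 0xFF);
    return f;
}

/* Examine Disposition first, then Severity. A conservative OS treats
 * LIMITED RECOVERY the same as NOT RECOVERED. */
enum action triage(const struct rtas_fixed_part *f, int conservative)
{
    assert(f != NULL);
    if (f->severity == SEV_NO_ERROR)
        return ACT_IGNORE;              /* remaining fields are not valid */
    if (f->severity == SEV_EVENT)
        return ACT_HANDLE_EVENT;        /* dispatch on the Type field */
    if (f->disposition == DISP_FULLY_RECOVERED)
        return ACT_LOG_AND_CONTINUE;
    if (f->disposition == DISP_LIMITED_RECOVERY && !conservative)
        return ACT_LOG_AND_CONTINUE;
    /* Not recovered to the OS's satisfaction: fall back to Severity. */
    switch (f->severity) {
    case SEV_FATAL:   return ACT_HALT;
    case SEV_WARNING: return ACT_LOG_AND_CONTINUE;
    default:          return ACT_POLICY_DECISION; /* ERROR, ERROR_SYNC */
    }
}
```

Checking NO_ERROR before anything else reflects the statement that the remaining fields are not valid in that case. Whether an ERROR or ERROR_SYNC report leads to a halt is left as a policy decision, as the text allows.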
Optional Part Presence - + This is a single flag, valid only if the 32-bit Error Return value is located in memory, which indicates whether or not an Extended Error Log Length field and the Extended Error Log follow it in memory. It will be set on an in-memory return result from RTAS if and only if the RTAS call indicated sufficient space to return the Extended Error Log, and the RTAS implementation supports the Extended Error Log. - +
- +
Initiator - + This field indicates, to the best ability of RTAS to determine it, the initiator of a failed transaction. (Note that in the - “Initiator” field of + “Initiator” field of , the value “I/O” indicates one of the defined I/O buses or IOAs. This field contains finer-grained details of which type of I/O bus failed, if known, and “UNKNOWN” if RTAS cannot tell.) - + In many of the newer LoPAR platforms, the platform error notification and handling flow is asynchronous to the OS and software execution flow, therefore the context of Initiator is not applicable to @@ -206,15 +206,15 @@ Not Applicable” is used for Initiator. In logs created with Version 6 or later, more detailed information about the error is provided in the Platform Event log format. - +
- +
Target - + If RTAS can determine it, this field indicates the target of a failed transaction. - + In many of the newer LoPAR platforms, the platform error notification and handling flow is asynchronous to the OS and software execution flow, therefore the context of Target is not applicable to the @@ -222,12 +222,12 @@ Applicable” is used for Target. In logs created with Version 6 or later, more detailed information about the error is provided in the Platform Event log format. - +
- +
Type - + This field identifies the general type of the error or event. In some cases (for example, INTERN_DEV_FAIL), multiple possible events are grouped together under a common return value. In such cases, @@ -235,17 +235,17 @@ them. Non-platform-aware software will generally treat all errors of a given type the same, so it generally will not need to access the Extended Error Log information. - + In the table, the EPOW values are associated with a Severity of EVENT. All other values will be associated with Severity values of FATAL, ERROR, ERROR_SYNC, or WARNING, and may or may not be corrected by RTAS. - + EPOW is an event type which indicates the potential loss of power or environmental conditions outside the limits of safe operation of the - platform. See + platform. See for more information. - + The “Platform Error (224)”, introduced for Version 6, generalizes that the error is identified by the platform and the specific details are encoded in the Platform Event log format itself. @@ -254,12 +254,12 @@ event notification, the Platform Event Log contains the “IO Events” section which identifies additional details associated with the event. - + The “Platform information event (226)” indicates the return log should be logged as “Information Log”. These logs indicate key platform events and can be used for reference purposes. - + The “Resource deallocation event (227)” indicates an event notification to the OS that a specific hardware resource has experienced recurring recoverable errors with a trend toward @@ -269,23 +269,23 @@ Identification” section which identifies the “Logical Entity” by Resource Type and Resource ID, associated with the event. - + The “Dump notification event (228)” indicates that a dump file is present in the platform and is available for retrieval by the OS. For this type of event notification, the Platform Event Log contains the “Dump Locator” section which contains additional event specific information. 
- + Additional Type values will be added in future revisions of the specification. If an OS does not recognize a particular event type, it can examine the severity first, and then choose to ignore the event if it is not serious. - +
- +
Extended Event Log Length / Change Scope - + This optional 32-bit field is present in memory following the 32-bit Event Return value if the Optional Part Presence flag is “PRESENT”, and it indicates the length in bytes of the @@ -294,24 +294,24 @@ may be zero. The field is also present for a resource change “Hot Plug” event, such as a PRRN event, and then represents the scope of a resource change. - +
- +
RTAS Event Return Format Fixed Part - + The summary portion of the error return is designed to fit into a single 32-bit integer. When used as a data return format in memory, an optional Length field and Extended Error Log data may follow the summary. The fixed part contains a “presence” flag which identifies whether an extended report is present. - + In the table below, the location of each field within the integer is included in parentheses after its name. Numerical field values are indicated in decimal unless noted otherwise. - RTAS Event Return Format (Fixed Part) + <title>RTAS Event Return Format (Fixed Part) <emphasis>(Continued)</emphasis> @@ -326,7 +326,7 @@ - Description, Values (Described in + Description, Values (Described in ) @@ -480,9 +480,9 @@ Length in bytes of Extended Event Log information which - follows (see + follows (see ) OR the scope - parameter to be input the + parameter to be input the ibm,update-nodes RTAS to retrieve the nodes that were changed by selected “Hot Plug” events. @@ -491,7 +491,7 @@
- + Typically, most OSs care about, and have handlers for, only a few specific errors. Since coding of an error is unique in the above scheme, an OS can check for specific errors, then if nothing matches exactly, @@ -500,17 +500,17 @@ that RTAS delivered to the OS. Platforms may provide more complete error diagnosis and reporting in RTAS, combined with off-line diagnostics which take advantage of the information reported from previous failures. - +
- +
Version 6 Extensions of Event Log Format
RTAS General Extended Event Log Format, Version 6 - + The following section defines new extensions to the event log format which are identified by a Version number 0x06 in the first byte in the returned buffer (byte 0 of the fixed-part information). The following @@ -520,23 +520,23 @@ sections of this chapter. This format is also intended to be usable as residual error log data in NVRAM, so that the OS could alternatively retrieve error data after an error event which caused a reboot. - + Platforms indicate the maximum length of the error log buffer in - the + the “rtas-error-log-max” RTAS property in the OF device tree, so that the OS can allocate a buffer large enough to hold - the extended error log data when calling the RTAS - event-scan or + the extended error log data when calling the RTAS + event-scan or check-exception functions. If the allocated buffer is not large enough to hold all the error log data, the data is truncated to the size of the buffer. - - Requirement - and + + Requirement + and require that four bytes of the vendor-specific format contain a unique identifier for the company that has defined the format. The description of the “name” string - in + in provides alternatives for defining this identifier. Examples of these unique identifiers include stock ticker labels and Organizationally Unique Identifiers (OUIs). Since @@ -548,7 +548,7 @@ Monitoring (a fictional name for the purposes of this example) were to create a vendor-specific log format 12, then bytes 12-15 of such a log may contain “AIM<NULL>”. - + This identifier is intended to apply to the company that defines the specific format, and may be used by other companies that wish to be compatible with that format. For example, if another company wanted to @@ -556,33 +556,33 @@ AIM-specific error log format for logs generated on their own platform, their log would have to contain an identifier of “AIM<NULL>”. - + - + - R1-R1--1. 
Platforms which support Version 6 of the Extended Event Log Format must do so by including a 0x06 value in the first byte of the RTAS Event Return Format - (Fixed Part) and using the formats described in + (Fixed Part) and using the formats described in (and all subsections under that section). - + Software Implementation Note: OSs running on platforms which support Version 6 of the Extended Event Log Format must - ensure that the length parameter passed in the + ensure that the length parameter passed in the event-scan RTAS call be at least 2 KB. - + - R1-R1--2. - If the length parameter on the RTAS + If the length parameter on the RTAS event-scan call for returning data using Version 6 of the Extended Event Log Format is insufficient to return all the data the platform would otherwise make available, the platform must truncate the @@ -590,26 +590,26 @@ section. - + - R1-R1--3. All event logs returning a Version 6 Platform Event Log format must include the Main-A and Main-B Sections. Other sections are optional depending on the specific event type as - specified in Requirement + specified in Requirement . - + - R1-R1--4. The following sections must be provided as indicated: - + For the Platform error Type, the Primary Service Reference Code @@ -639,12 +639,12 @@ For the HOTPLUG Type, the Hotplug section must be provided. - + - + - + Software Implementation Note: All fields in the Platform Event Log marked “Platform specific information” or @@ -652,7 +652,7 @@ information reserved for platform or platform Service Application use only. That information is not defined in this document. Information in these fields should be ignored by the OS. - + Software Implementation and Architecture Note: All fields currently marked “Reserved” are set to zero by RTAS @@ -660,7 +660,7 @@ the Platform Event Log may be defined in the future in this architecture document for platform specific usage without change to this architecture. 
- + RTAS General Extended Event Log Format, Version 6 @@ -839,7 +839,7 @@ Detail vendor specific log data. If byte 2, bits 4:7, above, are a value of 14 (Platform Event Log) and bytes 12-15 - are “IBM ”, then see + are “IBM ”, then see for the content of this field. @@ -847,16 +847,16 @@
- +
- +
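The format check described in the table above (log format 14 in byte 2, plus a four-byte company identifier starting at byte 12) can be sketched in C as follows. This is a hedged illustration: the function name `is_platform_event_log` is hypothetical, `buf` is assumed to point at byte 0 of the General Extended Event Log, and bits are assumed to be numbered with bit 0 as the most significant, so that bits 4:7 are the low nibble of the byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_FORMAT_PLATFORM_EVENT 14  /* from the text: byte 2, bits 4:7 */
#define COMPANY_ID_OFFSET 12          /* from the text: identifier starts at byte 12 */
#define COMPANY_ID_LEN 4              /* from the text: four-byte identifier */

/* Returns 1 when buf holds a Platform Event Log whose format is defined
 * by the company identified by company_id, 0 otherwise. */
int is_platform_event_log(const uint8_t *buf, size_t len,
                          const char company_id[COMPANY_ID_LEN])
{
    assert(buf != NULL && company_id != NULL);
    if (len < COMPANY_ID_OFFSET + COMPANY_ID_LEN)
        return 0;                     /* log truncated before the identifier */
    /* Assumption: bit 0 is the MSB, so bits 4:7 are the low nibble. */
    if ((buf[2] & 0x0F) != LOG_FORMAT_PLATFORM_EVENT)
        return 0;
    return memcmp(buf + COMPANY_ID_OFFSET, company_id, COMPANY_ID_LEN) == 0;
}
```

The length check matters because the platform truncates the log when the caller's buffer is too small, so a reader cannot assume the identifier bytes are present.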
Platform Event Log Format, Version 6 - + This format is used when byte 2, bits 4:7, of the RTAS General Extended Event Log Version 6 are a value of 14 (Platform Event Log). - + Overview of Platform Event Log Format, Version 6 @@ -898,7 +898,7 @@ 48 - Main-A section (ID = 'PH'). Required section. See + Main-A section (ID = 'PH'). Required section. See for the format. @@ -912,7 +912,7 @@ Main-B section (ID = 'UH'). Required, always follow - Main-A section. See + Main-A section. See for the format. @@ -928,7 +928,7 @@ Logical Resource Identification section (ID = 'LR'). Optional, present only for Resource deallocation event notification. If present, this section always follows Main-B - section. See + section. See for the format. @@ -945,7 +945,7 @@ Primary SRC section (ID = 'PS'). Required for “Platform Error” event type, optional for other event types. If present, this section always follows Main-B - section. See + section. See for the format. @@ -960,7 +960,7 @@ Dump Locator section (ID = 'DH') Optional, present only for dump event notification. If present, this section follows - Main-B or Primary SRC section. See + Main-B or Primary SRC section. See for the format. @@ -975,7 +975,7 @@ EPOW section (ID = 'EP'). Optional, present only for “EPOW” interrupt event notification. If present, - this section follows Main-B section. See + this section follows Main-B section. See for the format. @@ -990,7 +990,7 @@ IO Events section (ID = 'IE'). Optional, present only for “ibm,io-events” interrupt event notification. If - present, this section follows Main-B section. See + present, this section follows Main-B section. See for the format. @@ -1005,7 +1005,7 @@ Failing Enclosure MTMS section (ID = 'MT'). Required for errors only. If present, this section follows Main-B section or - Primary SRC. See + Primary SRC. See for the format. @@ -1034,7 +1034,7 @@ Machine Check Interrupt section (ID = 'MC'). 
Optional for “Platform Error” event types with ERROR_SYNC severity caused by a machine check interrupt. If present, this - section follows the Main-B. See + section follows the Main-B. See . @@ -1046,8 +1046,8 @@ ??? - Hotplug Section (ID = “HP”). Optional, present only for - Hotplug event notification. If present, this section follows + Hotplug Section (ID = “HP”). Optional, present only for + Hotplug event notification. If present, this section follows Main-B section. See . @@ -1066,12 +1066,12 @@
- +
- +
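A reader of the Platform Event Log typically walks the sections listed above by their two-character IDs. The sketch below illustrates one way to do that in C. It assumes that every section starts with a 4-byte header made of the two-character ASCII ID followed by a 2-byte big-endian length that counts the whole section, header included. That header layout is an assumption, since the per-section tables are not reproduced in this excerpt. The function names `find_section` and `has_required_sections` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTION_HDR_LEN 4  /* assumed: 2-byte ID + 2-byte big-endian length */

/* Returns a pointer to the first section with the given two-character ID,
 * or NULL if it is absent or the log is malformed or truncated. */
const uint8_t *find_section(const uint8_t *log, size_t len, const char id[2])
{
    size_t off = 0;
    assert(log != NULL && id != NULL);
    while (len - off >= SECTION_HDR_LEN) {
        const uint8_t *s = log + off;
        size_t slen = ((size_t)s[2] << 8) | s[3];
        if (slen < SECTION_HDR_LEN || slen > len - off)
            return NULL;              /* bad length or section cut short */
        if (s[0] == (uint8_t)id[0] && s[1] == (uint8_t)id[1])
            return s;
        off += slen;                  /* advance to the next section */
    }
    return NULL;
}

/* Main-A ('PH') must come first and Main-B ('UH') must directly follow it. */
int has_required_sections(const uint8_t *log, size_t len)
{
    size_t alen;
    if (find_section(log, len, "PH") != log)
        return 0;
    alen = ((size_t)log[2] << 8) | log[3];   /* validated by find_section */
    return find_section(log + alen, len - alen, "UH") == log + alen;
}
```

Rejecting a section whose declared length runs past the end of the buffer keeps the walker safe on logs that were truncated to fit the caller's buffer.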
Platform Event Log Format, Main-A Section - + Platform Event Log Format, Version 6, Main-A Section @@ -1278,12 +1278,12 @@
- +
- +
Platform Event Log Format, Main-B Section - + Platform Event Log Format, Version 6, Main-B Section @@ -1547,20 +1547,20 @@
- +
Error/Event Severity - + This field indicates the severity of the error event and the impact of the error to the platform (if applicable). - - Non-error or Informational Event: + + Non-error or Informational Event: This value indicates an event that is a non-error event. Informational or user action event log entries must use this value. The Event Type field provides additional event information. - - Recovered Error, general: + + Recovered Error, general: This value indicates an error event that has been automatically recovered or corrected by the platform hardware and/or firmware, e.g. ECC, internal spare or redundancy, cache line @@ -1569,9 +1569,9 @@ Flags has the value of “Hidden Error”. An event log with this value is used primarily for error thresholding design and code debug or as a record to indicate error frequency or trend. - + - Recovered Error, spare capacity utilized: + Recovered Error, spare capacity utilized: This value indicates that an error on a resource has been recovered by utilizing another resource not currently assigned for use (spare). The failing @@ -1580,9 +1580,9 @@ a spare processor, continuing the operations of the faulty one. In this case the failing component is considered permanently in an error state. - + - Recovered Error, loss of entitled capacity: + Recovered Error, loss of entitled capacity: This value indicates that an error on a resource has been recovered by utilizing another resource already in use by the system. The failing component is to be considered permanently in an error state. This results @@ -1594,7 +1594,7 @@ deallocation event notification” and the revised amount of entitled capacity would be found in the Logical Resource Identification Section, Entitled Capacity field. - + Predictive Error, general: This value indicates an event that has been automatically recovered or corrected by the platform hardware and/or @@ -1603,7 +1603,7 @@ action is required. 
The automatic platform recovery actions have no impact to system performance (e.g. ECC, CRC, etc.), or the impact is unknown. - + Predictive Error, degraded performance: This value indicates an error event that has been automatically recovered or corrected by the @@ -1612,8 +1612,8 @@ unrecoverable error. A deferred service or repair action is required. The automatic platform recovery actions are impacting/degrading system performance. - - Predictive Error, fault may be corrected after + + Predictive Error, fault may be corrected after platform re-boot: This value indicates an error event that has been automatically recovered or corrected by the platform hardware and/or firmware. However, the @@ -1624,8 +1624,8 @@ after re-boot, then a part replacement is required. The automatic platform recovery actions have no impact to system performance (e.g. ECC, CRC, etc.), or the impact is unknown. - - Predictive Error, fault may be corrected + + Predictive Error, fault may be corrected after platform re-boot, degraded performance: This value indicates an error event that has been automatically recovered or corrected by the platform hardware and/or @@ -1636,7 +1636,7 @@ fault cannot be corrected after re-boot, then a part replacement is required. The automatic platform recovery actions are impacting/degrading the system performance. - + Predictive Error, loss of redundancy: This value indicates an error event that has been automatically recovered or corrected by the platform @@ -1645,7 +1645,7 @@ subsystem may cause a platform unrecoverable error. A deferred service or repair action is required to restore redundancy. The loss of redundancy may or may not impact system performance. - + Unrecoverable Error, general: This value indicates an error event that is unrecoverable or uncorrectable by the platform hardware and/or @@ -1654,8 +1654,8 @@ platform may be able to re-boot successfully and resume. A service or repair action is required as soon as possible to correct the error.
- - Unrecoverable Error, bypassed with degraded + + Unrecoverable Error, bypassed with degraded performance: This value indicates an error event that is unrecoverable or uncorrectable by the platform hardware and/or firmware. However, the hardware or platform @@ -1664,8 +1664,8 @@ performance is degraded due to the deconfigured platform resource(s) e.g. processor, cache, memory, etc. A deferred service or repair action is required. - - Unrecoverable Error, bypassed with loss + + Unrecoverable Error, bypassed with loss of redundancy: This value indicates an error event that is unrecoverable or uncorrectable by the platform hardware and/or firmware. However, the hardware or platform @@ -1674,8 +1674,8 @@ platform resource(s) resulted in loss of redundancy (e.g. Redundant FSP with static fail-over) with no loss of system performance. A deferred service or repair action is required. - - Unrecoverable Error, bypassed with loss + + Unrecoverable Error, bypassed with loss of redundancy + performance: This value indicates an error event that is unrecoverable or uncorrectable by the platform hardware and/or firmware. However, the @@ -1684,8 +1684,8 @@ The deconfigured platform resource(s) resulted in loss of redundancy and system performance. A deferred service or repair action is required. - - Unrecoverable Error, bypassed with loss + + Unrecoverable Error, bypassed with loss of function: This value indicates an error event that is unrecoverable or uncorrectable by the platform hardware and/or firmware. However, the hardware or platform @@ -1693,36 +1693,36 @@ can be IPLed or re-IPLed with the error bypassed. The deconfigured platform resource(s) resulted in loss of platform or system function. A deferred service or repair action is required. - + Error on diagnostic test, general: This value indicates an error event that is detected during a diagnostic test. Impact to the system is undefined or unknown. 
- - Error on diagnostic test, resource may + + Error on diagnostic test, resource may produce incorrect results: This value indicates an error event that is detected during a diagnostic test. The error may produce incorrect computational results (e.g. processor floating point unit test error). - +
- +
Event Sub-Type - + This field provides additional information on the non-error event type. - + Not applicable: This value is used when the event is associated with an error. Error/Event Severity field and SRC section provide additional error information. - + Miscellaneous, Information Only: This value is used when the event is “for information only” or the event description doesn't fit into any other defined values in this field. - + Dump Notification: This value is used by the hypervisor or partition firmware as a “Dump Notification” event to the OS @@ -1730,32 +1730,32 @@ value is used by the HMC as a “Dump Notification” event to the Service Application to indicate a dump file is present for transmission to the manufacturer. - - Previously reported error has been + + Previously reported error has been corrected by system: This value is used by the platform firmware to indicate that the error event that was previously reported has been corrected by the platform. On a subsequent platform boot, this event type is logged to indicate that the array was successfully repaired. - - System resources manually deconfigured + + System resources manually deconfigured by user: This value is used by the platform firmware to indicate that a subset of platform resource(s) was/were deconfigured due to user's request (e.g. via platform ASM menu). The deconfigured resource(s) is/are not associated with error detected by the platform. The event is a reminder to the user - that the platform is running with partial capacity. + that the platform is running with partial capacity. Note: The platform provides this user option for platform performance testing purpose. - - System resources deconfigured by + + System resources deconfigured by system due to prior error event: This value is used by the platform firmware to indicate that the platform is IPLed with resource(s) deconfigured due to error detected and reported previously. 
The event is a reminder to the user that the platform requires service. - + Resource deconfiguration notification: This value is used by partition firmware as an “Event Notification” to the OS that @@ -1763,90 +1763,90 @@ by the OS should be deallocated due to predictive error. A Logical Resource Identification section is included in the event log to indicate the Resource Type and ID. - - Customer environmental problem has + + Customer environmental problem has returned to normal: This value is used by the platform firmware to indicate that a customer environmental problem (e.g. utility power, room ambient temperature, etc.) detected and reported previously, has returned to normal. - + Concurrent Maintenance: This value is used by the platform firmware to indicate any non-error event associated with concurrent maintenance activity. - + Capacity Upgrade Event: This value is used by the platform firmware to indicate any non-error event associated with capacity upgrade activity. - + Resource Sparing Event: This value is used by the platform firmware to indicate any non-error event associated with platform resource sparing activity. - + Dynamic Reconfiguration Event: This value is used by the partition firmware to indicate any significant but non-error event associated with - dynamic reconfiguration activity. + dynamic reconfiguration activity. Implementation Note: Due to limited platform storage resource, non-error event log associated with a logical partition will be reported to the OS but may not be stored in the platform. - - Normal system/platform shutdown or + + Normal system/platform shutdown or powered off: This value is used by the platform firmware to indicate any non-error event associated with normal system/platform shutdown or powered off activity initiated by the user. 
- - Platform powered off by user without + + Platform powered off by user without normal shutdown (abnormal powered off): This value is used by the platform firmware to indicate that the platform is abnormally powered off by the user. - +
- +
Error Action Flags - + The following are the definitions of the actions taken for the various Error Action Flags. - + Report Externally - This flag instructs the service processor (error logger component) to send the error to the service application (e.g. service focal point(s) or FNM error analyzer). If this flag is set, the SP always sends the error: - + - + To the “managing HMC(s)” if one (or multiple) exists. - + - + And to the hypervisor (unless the “Don't report to hypervisor” flag is also set). - + - + Service Action Required - This flag instructs the Service application that some service action is required by either the customer or by the manufacturer’s service personnel. This is equivalent to saying Customer Notification is required. Contrast this flag with the “Call Home Required” flag. - + Call Home Required - This flag indicates that the error requires service and a Call Home Operation is to be performed. There are additional policies used in combination with this flag: what subsystem performs the Call Home, what is sent and where it is sent. - + Hidden Error - This flag allows errors to be placed in a partition's OS error log, but still remain hidden from the customer. This @@ -1855,7 +1855,7 @@ flag to the OS. Note that this flag has no impact on the SP reporting errors to either the HMC or hypervisor or for the hypervisor reporting errors to partitions. - + Don't report Error to hypervisor - While a partition is booting and before it is functional (e.g. no OS error logging available), partition @@ -1864,12 +1864,12 @@ this flag to indicate that they need not be sent back to the hypervisor. This is due to the error scope being limited to the failing partition and the hypervisor has already taken the appropriate actions. - - Incomplete Information for Error + + Incomplete Information for Error Isolation - Some errors are not contained to a single enclosure and require error isolation from an entity with broader system view / scope. 
- + Software Error - This flag is used by the partition error logger to indicate that the error is most likely caused by software. When @@ -1877,22 +1877,22 @@ by either software or hardware. The Software Error and Hardware Error flags are used to trigger the manufacturer’s support system to automatically download software or firmware fixes. - + Hardware Error - This flag is used by the partition error logger to indicate that the error is most likely caused by hardware. The Software Error and Hardware Error flags are used to trigger the manufacturer’s support system to automatically download software or firmware fixes. - +
- +
- +
Platform Event Log Format, Logical Resource Identification section - + Platform Event Log Format, Version 6, Logical Resource Identification Section @@ -2042,12 +2042,12 @@
- +
- +
Platform Event Log Format, Primary SRC Section - + Platform Event Log Format, Version 6, Primary SRC Section @@ -2352,7 +2352,7 @@
- + Platform Event Log Format, Version 6, FRU Call-out Structure @@ -2582,68 +2582,68 @@
- +
FRU Replacement or Maintenance Procedure Priority - + This field defines the service priority of the specific call-out, i.e., replacing the FRU part number or performing the maintenance procedure ID as given in the FRU/Procedure Identity substructure. Here are the priority descriptions: - + - - 'H' = High priority and + + 'H' = High priority and mandatory call-out. Replacing the FRU (or performing the maintenance procedure) is mandatory. If multiple call-outs with 'H' priority are given, all must be replaced or performed as a group. - + - - 'M' = Medium priority. + + 'M' = Medium priority. Replace the FRU (or perform the maintenance procedure) with 'M' priority one at a time, in the order given, after all call-outs prior to this one, if present, are performed. - + - - 'A' = Medium priority group A + + 'A' = Medium priority group A (1st group). Replace all the FRUs with 'A' priority as a group after all call-outs prior to this group, if present, are performed. - + - - 'B' = Medium priority group B + + 'B' = Medium priority group B (2nd group). Replace all the FRUs with 'B' priority as a group after all call-outs prior to this group, if present, are performed. - + - - 'C' = Medium priority group C + + 'C' = Medium priority group C (3rd group). Replace all the FRUs with 'C' priority as a group after all call-outs prior to this group, if present, are performed. - + - - 'L' = Low priority. After + + 'L' = Low priority. After all prior call-outs, if present, have been performed and the problem still persists, replace the FRU with this priority one at a time in the order given. - + - + The list of FRU/Procedure call-outs in the “call-out” subsection of the SRC structure must be in order as defined above, i.e. High, Medium, Low. 'M' has the same medium priority level as 'A', 'B', or @@ -2651,31 +2651,31 @@ 'C'. A group call-out must be contiguous in the list. 
Within the medium priority level, follow the call-out order in the list. A list without High or Medium priority is also valid. - +
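The ordering rules above (High, then Medium, then Low, with each lettered group contiguous) can be illustrated with a small checker. This is only a sketch under stated assumptions: the function name `callout_order_is_valid` and the representation of a call-out list as a sequence of one-character priority codes are hypothetical and are not defined by this specification.

```python
# Hypothetical helper, not part of the specification: checks that a list of
# call-out priority codes follows the ordering rules described above.

# 'M', 'A', 'B' and 'C' all share the medium priority level.
LEVEL = {"H": 0, "M": 1, "A": 1, "B": 1, "C": 1, "L": 2}
GROUPS = {"A", "B", "C"}


def callout_order_is_valid(priorities):
    """Return True if the priority codes are ordered High, Medium, Low
    and every lettered group ('A', 'B', 'C') is contiguous."""
    last_level = 0
    closed_groups = set()  # groups that have already ended
    previous = None
    for p in priorities:
        if p not in LEVEL:
            return False  # unknown priority code
        if LEVEL[p] < last_level:
            return False  # a higher level appears after a lower one
        last_level = LEVEL[p]
        if p != previous:
            if p in closed_groups:
                return False  # group resumed after an interruption
            if previous in GROUPS:
                closed_groups.add(previous)
        previous = p
    return True
```

The sketch deliberately does not require group A to precede group B, because the text above only requires following the order given in the list within the medium level.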
- +
Failing Component Type Description - + - + Normal Hardware FRU: Hardware FRU in the platform which the platform firmware or code can positively identify, and its VPD contains the part number and associated information. - + - + Code FRU: Some layer of platform firmware or OS code is suspected. The procedure ID field provides additional information about which code(s) is/are the potential problem. - + - + Configuration error: The problem may be related to how hardware or code is configured. For example, an adapter is plugged in a slot that @@ -2683,21 +2683,21 @@ reason to use one of these is if the analysis can provide more information to the customer and service provider by giving a location code. - + - + Maintenance procedure required: Further isolation of the problem is required by performing the procedure as identified in the Procedure ID field. Procedures are designed to help to isolate problems and guide the service provider through identifying which FRUs to replace in which order. - + - - Symbolic FRU: Used for a single + + Symbolic FRU: Used for a single FRU where the analysis code knows exactly what the part is but there is no part number, or the part number cannot be pulled from VPD, or when there is something special (like a @@ -2705,41 +2705,41 @@ or FRUs without VPD (so a part number cannot be filled in). The term “Symbolic” simply means “not an actual part number”. - + - - External FRU: A failing part(s) + + External FRU: A failing part(s) which is/are not in the system, e.g. attached storage sub-system, network hubs/switches, external drives like CD/DVD boxes. - + - - External Code: Code not running + + External Code: Code not running in the platform but is the potential source of the error. This could be something like storage subsystem code or even another system in the same cluster. - + - - Tool FRU: This is a special + + Tool FRU: This is a special tool that will be required by one of the FRUs in the list. 
Tools are only added as FRUs when they are not part of the CE tool kit and therefore the repair action could be delayed if the CE did not know to bring it. Examples are Optical Cleaning Kits for fiber channel, and special tools for torque or reach or weight considerations. - + - +
- +
- +
Platform Event Log Format, Dump Locator Section   @@ -2918,12 +2918,12 @@ - +
- +
Platform Event Log Format, EPOW Section - + Platform Event Log Format, Version 6, EPOW Section @@ -3079,12 +3079,12 @@
- +
- +
Platform Event Log Format, IO Events Section - + Platform Event Log Format, Version 6, IO Events Section @@ -3233,7 +3233,8 @@ 0x04 = Node off-line 0x05 = platform-dump-max-size change 0x08 = Generic Notification - 0x09 = NVDIMM status change + 0x09 = Platform protection of NVDIMM contents enabled + 0x0A = Platform protection of NVDIMM contents disabled All other values = Reserved @@ -3263,7 +3264,7 @@ For the platform-dump-max-size change I/O-Event Sub-Type: 8 bytes for the new value of the platform-dump-max-size system parameter (specifying the sum (in bytes) of the maximum size of - each unique platform dump type that the + each unique platform dump type that the ibm,platform-dump RTAS call could return).  @@ -3275,13 +3276,13 @@
- +
- +
Platform Event Log Format, Failing Enclosure MTMS - + Platform Event Log Format, Version 6, Failing Enclosure MTMS @@ -3395,25 +3396,25 @@ The Failing Enclosure Machine Type, Model, and Serial Number (MTMS) that is associated with the error is important for service and support. - + The source of information for the MTMS fields varies according to the following: - + For CEC errors, it is the CEC enclosure MTMS. - + For errors in I/O enclosures (drawers and towers) that have their own MTMS and are sold as separate MTMS from the CEC, we use the I/O Drawer MTMS. - + For I/O enclosures that were sold as a feature, this section contains the Feature Code and Serial Number of the I/O enclosure. When the Feature Code is used, it is left justified in the Machine Type and @@ -3421,12 +3422,12 @@ - + - +
Platform Event Log Format, Impacted Partitions - +
Platform Event Log Format, Version 6, Impacted Partitions @@ -3568,17 +3569,17 @@
- + This section describes partitions that are impacted by an error. When this section is supplied, the partitions in this list (and only these partitions) are notified of the error. - +
- +
Platform Event Log Format, Failing Memory Address - + Platform Error Event Log Format, Version 6, Failing Memory Address @@ -3716,7 +3717,7 @@
- + UE Error Information @@ -4060,17 +4061,17 @@
- + For an error log that has the machine check interrupt section filled out, the platform is not required to provide the date and time stamp in the main-a section. The fields will be binary zeroes if the date and time stamp is not provided. - +
Platform Event Log Format, Hotplug Section - + Platform Error Event Log Format, Version 6, Hotplug Section @@ -4100,7 +4101,7 @@ 2 - Section ID: A two-ASCII character field which uniquely + Section ID: A two-ASCII character field which uniquely identifies the type of section. Value = “HP”. @@ -4112,7 +4113,7 @@ 2 - Section length: Length in bytes of the section, including + Section length: Length in bytes of the section, including the section ID. @@ -4221,14 +4222,14 @@ 1 - 0 = Transactional Request: When using “drc count”or “drc count indexed”as the Hotplug - Identifier, the OS should take steps to verify the entirety of the request can be satisfied - before proceeding with the hotplug / unplug operations. If only a partial count can be - satisfied, the OS should ignore the entirety of the request. If the OS cannot determine - this beforehand, it should satisfy the hotplug / unplug request for as many of the + 0 = Transactional Request: When using “drc count” or “drc count indexed” as the Hotplug + Identifier, the OS should take steps to verify the entirety of the request can be satisfied + before proceeding with the hotplug / unplug operations. If only a partial count can be + satisfied, the OS should ignore the entirety of the request. If the OS cannot determine + this beforehand, it should satisfy the hotplug / unplug request for as many of the requested resources as possible, and attempt to revert to the original OS / DRC state. - 1 = Non-transactional Request: When using “drc count”or “drc count indexed”as the - Hotplug Identifier, the OS should attempt to satisfy as much of the request as possible, + 1 = Non-transactional Request: When using “drc count” or “drc count indexed” as the + Hotplug Identifier, the OS should attempt to satisfy as much of the request as possible, even if it cannot be satisfied for all the DRCs specified. 
@@ -4240,7 +4241,7 @@ Reserved - + 0x0C @@ -4251,11 +4252,11 @@ Hotplug Identifier Variable length field depending on the Hotplug Identifier Type specified. - For drc name, this field is a null-terminated ASCII character field containing + For drc name, this field is a null-terminated ASCII character field containing the drc name of the resource to hotplug. For drc index, this is a 4 byte field with the drc index of the resource to hotplug. For drc count, this is a 4 byte field with the number of resources to hotplug. - For drc count indexed, this is two 4 byte fields the first being the number of resources + For drc count indexed, this is two 4 byte fields, the first being the number of resources to hotplug and the second being the drc index at which to start. @@ -4269,9 +4270,9 @@ Hotplug Token Present only if corresponding Hotplug Event Capability bit is set. - Integer value that can be used in conjunction with other fields of the hotplug - event structure (Hotplug Indentifier, Hotplug Type, etc.) to allow OS to associate - hotplug event with the request which generated it for the purposes of providing + Integer value that can be used in conjunction with other fields of the hotplug + event structure (Hotplug Identifier, Hotplug Type, etc.) to allow the OS to associate + the hotplug event with the request which generated it for the purposes of providing feedback to the requestor, such as debugging or error information. @@ -4280,5 +4281,5 @@
- +
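The variable-length Hotplug Identifier field described in the Hotplug Section table can be decoded along the following lines. This is a minimal sketch, not a reference implementation: the function name `parse_hotplug_identifier` and the returned dictionary keys are hypothetical, the identifier type is passed as a descriptive string because the numeric Hotplug Identifier Type encodings are not shown in this excerpt, and big-endian byte order is an assumption.

```python
import struct


def parse_hotplug_identifier(id_type, data):
    """Decode the Hotplug Identifier bytes for a given identifier type.

    id_type is a descriptive string (hypothetical; the numeric type codes
    are not reproduced here). Multi-byte integers are assumed big-endian.
    """
    if id_type == "drc name":
        # Null-terminated ASCII string holding the drc name.
        end = data.index(b"\x00")
        return {"drc_name": data[:end].decode("ascii")}
    if id_type == "drc index":
        # Single 4 byte drc index.
        (index,) = struct.unpack_from(">I", data)
        return {"drc_index": index}
    if id_type == "drc count":
        # Single 4 byte count of resources.
        (count,) = struct.unpack_from(">I", data)
        return {"count": count}
    if id_type == "drc count indexed":
        # Two 4 byte fields: resource count, then the starting drc index.
        count, first_index = struct.unpack_from(">II", data)
        return {"count": count, "first_drc_index": first_index}
    raise ValueError("unknown Hotplug Identifier type: %r" % (id_type,))
```

For example, under these assumptions the eight bytes `00 00 00 03 30 00 00 10` with type "drc count indexed" decode to a count of 3 starting at drc index 0x30000010.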