diff --git a/Error Handling/sec_rtas_error_reporting_return_format.xml b/Error Handling/sec_rtas_error_reporting_return_format.xml
index 771aaf9..7a33768 100644
--- a/Error Handling/sec_rtas_error_reporting_return_format.xml
+++ b/Error Handling/sec_rtas_error_reporting_return_format.xml
@@ -1,7 +1,7 @@
-
RTAS Error/Event Return Format
-
+
This section describes in detail the return value retrieved by an
- RTAS call to either the
- event-scan or
+ RTAS call to either the
+ event-scan or
check-exception function.
-
+
The return value consists of a fixed part and an optional Extended
Error Report, described in the next section, which contains full details
of the error. The fixed part is intended to allow reporting the most
@@ -36,15 +36,15 @@
strategy. At the same time, the mechanism is capable of providing full
disclosure of the error syndrome information for OSs which have a more
complete error handling strategy.
-
+
RTAS can return at most one return code per invocation. If multiple
conditions exist, RTAS returns them in descending order of severity on
successive calls.
-
+
Reporting and Recovery Philosophy, and Description of
Fields
-
+
All firmware implementations use a common error and event reporting
scheme, as described in detail below. It is not required that error
recovery be present in firmware implementations, nor is it required that
@@ -56,7 +56,7 @@
Report format, and the philosophy which should be applied in generating
return values from firmware or interpreting such return codes in an
OS.
-
+
In general, an OS would look at the Disposition field first to see
if an error has been corrected already by firmware. If not corrected to
the OS’s satisfaction, the OS would examine the Severity field.
@@ -65,35 +65,35 @@
continue or to halt operations. In either case, it may choose to log
information regarding the error, using the remaining fields and optional
Extended Error Log.
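The following C sketch illustrates the Disposition-then-Severity order described above by mapping a decoded Severity and Disposition to a coarse OS action. Several parts of it are assumptions rather than values from this section:

- The numeric encodings of both fields follow the constants in the Linux kernel's `asm/rtas.h`.
- The action names are hypothetical.
- The `accept_limited_recovery` policy switch is hypothetical.

```c
#include <stdbool.h>

/* Assumed encodings (from Linux asm/rtas.h); this section does not list them. */
enum rtas_severity {
    SEV_NO_ERROR = 0, SEV_EVENT = 1, SEV_WARNING = 2,
    SEV_ERROR_SYNC = 3, SEV_ERROR = 4, SEV_FATAL = 5
};

enum rtas_disposition {
    DISP_FULLY_RECOVERED = 0, DISP_LIMITED_RECOVERY = 1, DISP_NOT_RECOVERED = 2
};

/* Hypothetical coarse actions an OS might take. */
enum os_action {
    OS_CONTINUE,          /* nothing to do */
    OS_HANDLE_EVENT,      /* dispatch on the Type field */
    OS_LOG_AND_CONTINUE,  /* record the error, keep running */
    OS_POLICY_DECISION,   /* OS specific: recover, end the process, or halt */
    OS_HALT               /* stop normal operation */
};

enum os_action choose_action(enum rtas_severity sev,
                             enum rtas_disposition disp,
                             bool accept_limited_recovery)
{
    /* For NO_ERROR the remaining fields are not valid. */
    if (sev == SEV_NO_ERROR)
        return OS_CONTINUE;
    /* Events are acted on through their Type, not their recovery status. */
    if (sev == SEV_EVENT)
        return OS_HANDLE_EVENT;

    /* Disposition first. A conservative OS passes false and so treats
     * LIMITED RECOVERY like NOT RECOVERED. FATAL always halts: OSs do not
     * continue after it, and it can never be FULLY RECOVERED. */
    bool recovered = disp == DISP_FULLY_RECOVERED ||
                     (disp == DISP_LIMITED_RECOVERY && accept_limited_recovery);
    if (recovered && sev != SEV_FATAL)
        return OS_LOG_AND_CONTINUE;

    /* Then Severity. */
    switch (sev) {
    case SEV_FATAL:
        return OS_HALT;
    case SEV_WARNING:
        return OS_LOG_AND_CONTINUE;   /* non-state-losing */
    default:
        return OS_POLICY_DECISION;    /* ERROR or ERROR_SYNC */
    }
}
```

NO_ERROR and EVENT are screened out before Disposition is examined. For NO_ERROR, the remaining fields are not valid. An EVENT is handled through its Type field.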
-
- The following sections describe the field values in
+
+ The following sections describe the field values in
.
-
+
Version
-
+
This field is used only to distinguish among present and potential
future formats for the remainder of the error report. This value will be
incremented if extensions are made to the format described here. The
primary function of this field is for future OSs to identify whether an
error report may contain some (unknown at present) feature that was added
after the initial version of this specification.
-
+
-
+
Severity
-
+
This field represents the value judgment of firmware of how serious
the problem being reported should be considered by the OS.
-
+
Errors which are believed to represent a permanent hardware failure
affecting the entire system are considered “FATAL.” OSs would
not attempt to continue normal operation after receiving notice of such
an error. OSs may not even be able to perform an orderly shutdown in the
presence of a Fatal error, though they may make a policy decision to
try.
-
+
Less serious errors that still cause a loss of data or state are
considered “ERRORs.” In general, continuing after such an
error is questionable, since details of what has failed may not be
@@ -101,7 +101,7 @@
with which the OS can associate it. However, OSs may make a policy
decision (for example, based on the error Type, the Initiator, or the
Target) to continue operation after an Error.
-
+
There are some types of errors, such as parity errors in memory or
a parity error on a transfer between CPU and memory, which occur
synchronously with the current process execution context. Such errors are
@@ -116,33 +116,33 @@
are reported as having Severity “ERROR_SYNC”. It is OS
dependent whether recovery is possible after such an error, or whether
the OS will treat it as a fatal problem.
-
+
The “WARNING” return value indicates that a
non-state-losing error, either fully recovered by firmware or not needing
recovery, has occurred. No OS action is required, and full operation is
expected to continue unhindered by the error. Examples include corrected
ECC errors or bus transfer failures which were re-tried
successfully.
-
+
The “EVENT” return value is the mechanism firmware uses
to communicate event information to the OS. The event may have been
- detected by polling using
+ detected by polling using
event-scan or on the occurrence of an interrupt by
- calling
+ calling
check-exception. In either case, the Error Return
value indicates the event which has occurred in the Type field. See the
Type description below for a description of specific events and their
expected handling.
-
+
The “NO_ERROR” return value indicates that no error was
present. In this case, the remainder of the Error Return fields are not
valid and should not be referenced.
-
+
-
+
RTAS Disposition
-
+
An aggressive firmware implementation may choose to attempt
recovery for some classes of error so an OS can continue operation in the
face of recoverable errors. If firmware detects an error for which it has
@@ -152,13 +152,13 @@
Severity says how serious an error was, and Disposition says, regardless
of severity, whether or not the OS has to even look at it. In general, an
OS will first examine Disposition, then Severity.
-
+
A return value of “FULLY RECOVERED” means that RTAS was
able to completely recover the machine state after the error, and OS
operation can continue unhindered. The severity of the problem in this
case is irrelevant, though for consistency a “FATAL” error
can never be “FULLY RECOVERED.”
-
+
A return value of “LIMITED RECOVERY” means that RTAS
was able to recover the state of the machine, but that some feature of
the machine has been disabled or lost (for example, error checking), or
@@ -169,36 +169,36 @@
“NOT RECOVERED,” and initiate shutdown. A less conservative
OS may choose to let the user decide whether to continue or to shut
down.
-
+
A value of “NOT RECOVERED” indicates that RTAS
either did not attempt recovery, or that it attempted recovery but was
unsuccessful.
-
+
-
+
Optional Part Presence
-
+
This is a single flag, valid only if the 32-bit Error Return value
is located in memory, which indicates whether or not an Extended Error
Log Length field and the Extended Error Log follow it in memory. It will
be set on an in-memory return result from RTAS if and only if the RTAS
call indicated sufficient space to return the Extended Error Log, and the
RTAS implementation supports the Extended Error Log.
-
+
-
+
Initiator
-
+
This field indicates, as far as RTAS is able to determine it,
the initiator of a failed transaction. (Note that in the
- “Initiator” field of
+ “Initiator” field of
, the value “I/O”
indicates one of the defined I/O buses or IOAs. This field contains
finer-grained details of which type of I/O bus failed, if known, and
“UNKNOWN” if RTAS cannot tell.)
-
+
In many of the newer LoPAR platforms, the platform error
notification and handling flow is asynchronous to the OS and software
execution flow; therefore, the context of Initiator is not applicable to
@@ -206,15 +206,15 @@
Not Applicable” is used for Initiator. In logs created with Version
6 or later, more detailed information about the error is provided in the
Platform Event log format.
-
+
-
+
Target
-
+
If RTAS can determine it, this field indicates the target of a
failed transaction.
-
+
In many of the newer LoPAR platforms, the platform error
notification and handling flow is asynchronous to the OS and software
execution flow; therefore, the context of Target is not applicable to the
@@ -222,12 +222,12 @@
Applicable” is used for Target. In logs created with Version 6 or
later, more detailed information about the error is provided in the
Platform Event log format.
-
+
-
+
Type
-
+
This field identifies the general type of the error or event. In
some cases (for example, INTERN_DEV_FAIL), multiple possible events are
grouped together under a common return value. In such cases,
@@ -235,17 +235,17 @@
them. Non-platform-aware software will generally treat all errors of a
given type the same, so it generally will not need to access the Extended
Error Log information.
-
+
In the table, the EPOW values are associated with a Severity of
EVENT. All other values will be associated with Severity values of FATAL,
ERROR, ERROR_SYNC, or WARNING, and may or may not be corrected by
RTAS.
-
+
EPOW is an event type which indicates the potential loss of power
or environmental conditions outside the limits of safe operation of the
- platform. See
+ platform. See
for more information.
-
+
The “Platform Error (224)” Type, introduced for Version 6,
indicates generically that the error was identified by the platform; the specific
details are encoded in the Platform Event log format itself.
@@ -254,12 +254,12 @@
event notification, the Platform Event Log contains the “IO
Events” section which identifies additional details associated with
the event.
-
+
The “Platform information event (226)” indicates that the
returned log should be logged as an “Information Log”. These logs
record key platform events and can be used for reference
purposes.
-
+
The “Resource deallocation event (227)” indicates an
event notification to the OS that a specific hardware resource has
experienced recurring recoverable errors with a trend toward
@@ -269,23 +269,23 @@
Identification” section which identifies the “Logical
Entity” by Resource Type and Resource ID, associated with the
event.
-
+
The “Dump notification event (228)” indicates that a
dump file is present in the platform and is available for retrieval by
the OS. For this type of event notification, the Platform Event Log
contains the “Dump Locator” section which contains additional
event specific information.
-
+
Additional Type values will be added in future revisions of the
specification. If an OS does not recognize a particular event type, it
can examine the severity first, and then choose to ignore the event if it
is not serious.
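Below is a minimal C sketch of that fallback. It first recognizes the specific Type values named in this section. For any other Type, it decides from Severity alone.

The Type values 224, 226, 227 and 228 are taken from the text above. The following are assumptions:

- The Severity encoding (WARNING = 2, with NO_ERROR and EVENT below it) comes from the Linux kernel's `asm/rtas.h`.
- The handler names are hypothetical.

```c
#include <stdint.h>

/* Type values as given (in decimal) in this section. */
#define TYPE_PLATFORM_ERROR    224
#define TYPE_PLATFORM_INFO     226
#define TYPE_RESOURCE_DEALLOC  227
#define TYPE_DUMP_NOTIFY       228

/* Assumed Severity encoding (Linux asm/rtas.h): NO_ERROR = 0, EVENT = 1,
 * WARNING = 2; ERROR_SYNC, ERROR and FATAL are larger. */
#define SEV_WARNING 2u

/* Hypothetical handler selection. */
enum handler {
    H_PLATFORM_ERROR, H_INFO_LOG, H_RESOURCE_DEALLOC, H_DUMP,
    H_IGNORE,         /* unknown Type, not serious */
    H_GENERIC_ERROR   /* unknown Type, serious: take the generic error path */
};

enum handler dispatch_type(uint8_t type, unsigned severity)
{
    switch (type) {
    case TYPE_PLATFORM_ERROR:   return H_PLATFORM_ERROR;
    case TYPE_PLATFORM_INFO:    return H_INFO_LOG;
    case TYPE_RESOURCE_DEALLOC: return H_RESOURCE_DEALLOC;
    case TYPE_DUMP_NOTIFY:      return H_DUMP;
    default:
        /* Unrecognized Type: examine Severity and ignore it if not serious. */
        return severity <= SEV_WARNING ? H_IGNORE : H_GENERIC_ERROR;
    }
}
```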
-
+
-
+
Extended Event Log Length / Change Scope
-
+
This optional 32-bit field is present in memory following the
32-bit Event Return value if the Optional Part Presence flag is
“PRESENT”, and it indicates the length in bytes of the
@@ -294,24 +294,24 @@
may be zero. The field is also present for a resource change “Hot
Plug” event, such as a PRRN event, and then represents the scope of
a resource change.
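For an in-memory return, the optional word can be located as sketched below. The sketch rests on two assumptions:

- Returned data uses the big-endian byte order that RTAS uses.
- The Optional Part Presence flag is bit 13 of the fixed part, where bit 0 is the most significant bit. This matches the Linux kernel's `struct rtas_error_log`.

`rtas_optional_word` is a hypothetical helper name. Whether the value is a length or a Hot Plug change scope is left to the caller, based on the Type field.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value. */
static uint32_t be32_at(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Return the Extended Event Log Length / Change Scope word that follows the
 * 32-bit fixed part, or -1 if the buffer is too short or the presence flag
 * (assumed to be bit 13, i.e. 1 << 18 in the host value) is clear. */
int64_t rtas_optional_word(const uint8_t *buf, size_t len)
{
    if (len < 8)
        return -1;                    /* no room for the optional field */
    uint32_t fixed = be32_at(buf);
    if (((fixed >> 18) & 1u) == 0)
        return -1;                    /* Optional Part Presence: not present */
    return (int64_t)be32_at(buf + 4);
}
```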
-
+
-
+
RTAS Event Return Format Fixed Part
-
+
The summary portion of the error return is designed to fit into a
single 32-bit integer. When used as a data return format in memory, an
optional Length field and Extended Error Log data may follow the summary.
The fixed part contains a “presence” flag which identifies
whether an extended report is present.
-
+
In the table below, the location of each field within the integer
is included in parentheses after its name. Numerical field values are
indicated in decimal unless noted otherwise.
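Each field can be extracted from the 32-bit word with a shift and a mask. This C sketch assumes big-endian bit numbering, in which bit 0 is the most significant bit of the word.

The bit positions used here are an assumption that matches the Linux kernel's `struct rtas_error_log`; confirm them against the table below before relying on them. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Decoded fields of the RTAS Event Return Format fixed part. */
struct rtas_fixed_part {
    uint8_t version;
    uint8_t severity;     /* 3 bits */
    uint8_t disposition;  /* 2 bits */
    uint8_t extended;     /* Optional Part Presence flag */
    uint8_t initiator;    /* 4 bits */
    uint8_t target;       /* 4 bits */
    uint8_t type;
};

/* Bit ranges in the comments use bit 0 = most significant bit. */
struct rtas_fixed_part rtas_decode_fixed(uint32_t w)
{
    struct rtas_fixed_part f;
    f.version     = (uint8_t)(w >> 24);           /* bits 0:7   */
    f.severity    = (uint8_t)((w >> 21) & 0x7);   /* bits 8:10  */
    f.disposition = (uint8_t)((w >> 19) & 0x3);   /* bits 11:12 */
    f.extended    = (uint8_t)((w >> 18) & 0x1);   /* bit 13     */
    f.initiator   = (uint8_t)((w >> 12) & 0xF);   /* bits 16:19 */
    f.target      = (uint8_t)((w >> 8) & 0xF);    /* bits 20:23 */
    f.type        = (uint8_t)(w & 0xFF);          /* bits 24:31 */
    return f;
}
```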
- RTAS Event Return Format (Fixed Part)
+ RTAS Event Return Format (Fixed Part)
(Continued)
@@ -326,7 +326,7 @@
- Description, Values (Described in
+ Description, Values (Described in
)
@@ -480,9 +480,9 @@
Length in bytes of Extended Event Log information which
- follows (see
+ follows (see
) OR the scope
- parameter to be input the
+ parameter to be passed to the
ibm,update-nodes RTAS call to retrieve the nodes
that were changed by selected “Hot Plug”
events.
@@ -491,7 +491,7 @@
-
+
Typically, most OSs care about, and have handlers for, only a few
specific errors. Since coding of an error is unique in the above scheme,
an OS can check for specific errors, then if nothing matches exactly,
@@ -500,17 +500,17 @@
that RTAS delivered to the OS. Platforms may provide more complete error
diagnosis and reporting in RTAS, combined with off-line diagnostics which
take advantage of the information reported from previous failures.
-
+
-
+
Version 6 Extensions of Event Log Format
RTAS General Extended Event Log Format, Version 6
-
+
The following section defines new extensions to the event log
format which are identified by a Version number 0x06 in the first byte in
the returned buffer (byte 0 of the fixed-part information). The following
@@ -520,23 +520,23 @@
sections of this chapter. This format is also intended to be usable as
residual error log data in NVRAM, so that the OS could alternatively
retrieve error data after an error event which caused a reboot.
-
+
Platforms indicate the maximum length of the error log buffer in
- the
+ the
“rtas-error-log-max” RTAS property in the
OF device tree, so that the OS can allocate a buffer large enough to hold
- the extended error log data when calling the RTAS
- event-scan or
+ the extended error log data when calling the RTAS
+ event-scan or
check-exception functions. If the allocated buffer is
not large enough to hold all the error log data, the data is truncated to
the size of the buffer.
-
- Requirement
- and
+
+ Requirement
+ and
require that four bytes of the
vendor-specific format contain a unique identifier for the company that
has defined the format. The description of the “name” string
- in
+ in
provides alternatives for
defining this identifier. Examples of these unique identifiers include
stock ticker labels and Organizationally Unique Identifiers (OUIs). Since
@@ -548,7 +548,7 @@
Monitoring (a fictional name for the purposes of this example) were to
create a vendor-specific log format 12, then bytes 12-15 of such a log
may contain “AIM<NULL>”.
-
+
This identifier is intended to apply to the company that defines
the specific format, and may be used by other companies that wish to be
compatible with that format. For example, if another company wanted to
@@ -556,33 +556,33 @@
AIM-specific error log format for logs generated on their own platform,
their log would have to contain an identifier of
“AIM<NULL>”.
-
+
-
+
- R1-R1--1.
Platforms which
support Version 6 of the Extended Event Log Format must do so by
including a 0x06 value in the first byte of the RTAS Event Return Format
- (Fixed Part) and using the formats described in
+ (Fixed Part) and using the formats described in
(and all subsections under that
section).
-
+
Software Implementation Note: OSs running on
platforms which support Version 6 of the Extended Event Log Format must
- ensure that the length parameter passed in the
+ ensure that the length parameter passed in the
event-scan RTAS call be at least 2 KB.
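The sketch below shows how an OS might size the event-scan buffer. It reads the limit from the “rtas-error-log-max” property and never goes below the 2 KB minimum for Version 6 platforms.

It assumes the property is a single 32-bit big-endian cell, as is usual for device-tree integer properties. The function name and the `platform_is_v6` parameter are hypothetical.

```c
#include <stdint.h>

/* Minimum length for Version 6 platforms, per the note above. */
#define V6_MIN_EVENT_SCAN_LEN 2048u

/* `prop` is the raw 4-byte "rtas-error-log-max" value (assumed big-endian). */
uint32_t event_scan_buffer_len(const uint8_t prop[4], int platform_is_v6)
{
    uint32_t max = ((uint32_t)prop[0] << 24) | ((uint32_t)prop[1] << 16) |
                   ((uint32_t)prop[2] << 8)  |  (uint32_t)prop[3];
    if (platform_is_v6 && max < V6_MIN_EVENT_SCAN_LEN)
        return V6_MIN_EVENT_SCAN_LEN;   /* never pass less than 2 KB */
    return max;                         /* room for the platform maximum */
}
```

Allocating at least the platform maximum avoids the truncation that occurs when the buffer is too small.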
-
+
- R1-R1--2.
- If the length parameter on the RTAS
+ If the length parameter on the RTAS
event-scan call for returning data using Version 6 of
the Extended Event Log Format is insufficient to return all the data the
platform would otherwise make available, the platform must truncate the
@@ -590,26 +590,26 @@
section.
-
+
- R1-R1--3.
All event logs returning a Version 6
Platform Event Log format must include the Main-A and Main-B Sections.
Other sections are optional depending on the specific event type as
- specified in Requirement
+ specified in Requirement
.
-
+
- R1-R1--4.
The following
sections must be provided as indicated:
-
+
For the Platform error Type, the Primary Service Reference Code
@@ -639,12 +639,12 @@
For the HOTPLUG Type, the Hotplug section must be provided.
-
+
-
+
-
+
Software Implementation Note: All fields in the
Platform Event Log marked “Platform specific information” or
@@ -652,7 +652,7 @@
information reserved for platform or platform Service Application use
only. That information is not defined in this document. Information in
these fields should be ignored by the OS.
-
+
Software Implementation and Architecture Note: All
fields currently marked “Reserved” are set to zero by RTAS
@@ -660,7 +660,7 @@
the Platform Event Log may be defined in the future in this architecture
document for platform specific usage without change to this
architecture.
-
+
RTAS General Extended Event Log Format, Version 6
@@ -839,7 +839,7 @@
Detail vendor specific log data. If byte 2, bits 4:7,
above, are a value of 14 (Platform Event Log) and bytes 12-15
- are “IBM ”, then see
+ are “IBM ”, then see
for the content of
this field.
@@ -847,16 +847,16 @@
-
+
-
+
Platform Event Log Format, Version 6
-
+
This format is used when byte 2, bits 4:7, of the RTAS General
Extended Event Log Version 6 are a value of 14 (Platform Event
Log).
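The checks that select this format, and a walk over its sections, can be sketched in C as follows. This sketch is modeled on the Linux kernel's `get_pseries_errorlog()` and makes several assumptions:

- `ext` points at byte 0 of the Version 6 General Extended Event Log, immediately after the 32-bit fixed part and the Extended Event Log Length word.
- The company identifier for this format is “IBM” followed by a NUL byte.
- Sections start at byte 16.
- Each section begins with a 2-byte ASCII ID followed by a 2-byte big-endian section length that includes the header.

The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint16_t be16_at(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Return the first section whose two-character ID matches `id`
 * (e.g. "PH", "UH", "LR"), or NULL. `ext_len` is the Extended Event
 * Log Length. */
const uint8_t *pel_find_section(const uint8_t *ext, size_t ext_len,
                                const char *id)
{
    if (ext_len < 16)
        return NULL;
    if ((ext[2] & 0x0F) != 14)              /* byte 2, bits 4:7: log format */
        return NULL;                        /* not a Platform Event Log */
    if (memcmp(ext + 12, "IBM\0", 4) != 0)  /* bytes 12-15 (assumed value) */
        return NULL;

    size_t off = 16;                        /* first section (Main-A) */
    while (off + 4 <= ext_len) {
        const uint8_t *sect = ext + off;
        uint16_t sect_len = be16_at(sect + 2);
        if (sect_len < 4 || sect_len > ext_len - off)
            return NULL;                    /* malformed or truncated log */
        if (sect[0] == (uint8_t)id[0] && sect[1] == (uint8_t)id[1])
            return sect;
        off += sect_len;
    }
    return NULL;
}

/* Every Version 6 Platform Event Log must carry Main-A and Main-B. */
int pel_has_required_sections(const uint8_t *ext, size_t ext_len)
{
    return pel_find_section(ext, ext_len, "PH") != NULL &&
           pel_find_section(ext, ext_len, "UH") != NULL;
}
```

Stopping at a section whose length would run past the end of the log also copes with a log that the platform truncated to fit a short buffer.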
-
+
Overview of Platform Event Log Format, Version
6
@@ -898,7 +898,7 @@
48
- Main-A section (ID = 'PH'). Required section. See
+ Main-A section (ID = 'PH'). Required section. See
for the
format.
@@ -912,7 +912,7 @@
Main-B section (ID = 'UH'). Required, always follow
- Main-A section. See
+ Main-A section. See
for the
format.
@@ -928,7 +928,7 @@
Logical Resource Identification section (ID = 'LR').
Optional, present only for Resource deallocation event
notification. If present, this section always follows Main-B
- section. See
+ section. See
for the
format.
@@ -945,7 +945,7 @@
Primary SRC section (ID = 'PS'). Required for
“Platform Error” event type, optional for other
event types. If present, this section always follows Main-B
- section. See
+ section. See
for the
format.
@@ -960,7 +960,7 @@
Dump Locator section (ID = 'DH') Optional, present only
for dump event notification. If present, this section follows
- Main-B or Primary SRC section. See
+ Main-B or Primary SRC section. See
for the
format.
@@ -975,7 +975,7 @@
EPOW section (ID = 'EP'). Optional, present only for
“EPOW” interrupt event notification. If present,
- this section follows Main-B section. See
+ this section follows Main-B section. See
for the
format.
@@ -990,7 +990,7 @@
IO Events section (ID = 'IE'). Optional, present only for
“ibm,io-events” interrupt event notification. If
- present, this section follows Main-B section. See
+ present, this section follows Main-B section. See
for the
format.
@@ -1005,7 +1005,7 @@
Failing Enclosure MTMS section (ID = 'MT'). Required for
errors only. If present, this section follows Main-B section or
- Primary SRC. See
+ Primary SRC. See
for the
format.
@@ -1034,7 +1034,7 @@
Machine Check Interrupt section (ID = 'MC'). Optional for
“Platform Error” event types with ERROR_SYNC
severity caused by a machine check interrupt. If present, this
- section follows the Main-B. See
+ section follows the Main-B section. See
.
@@ -1046,8 +1046,8 @@
???
- Hotplug Section (ID = “HP”). Optional, present only for
- Hotplug event notification. If present, this section follows
+ Hotplug section (ID = 'HP'). Optional, present only for
+ Hotplug event notification. If present, this section follows
Main-B section. See .
@@ -1066,12 +1066,12 @@
-
+
-
+
Platform Event Log Format, Main-A Section
-
+
Platform Event Log Format, Version 6, Main-A
Section
@@ -1278,12 +1278,12 @@
-
+
-
+
Platform Event Log Format, Main-B Section
-
+
Platform Event Log Format, Version 6, Main-B
Section
@@ -1547,20 +1547,20 @@
-
+
Error/Event Severity
-
+
This field indicates the severity of the error event and the impact
of the error to the platform (if applicable).
-
- Non-error or Informational Event:
+
+ Non-error or Informational Event:
This value indicates an event
that is a non-error event. Informational or user action event log entries
must use this value. The Event Type field provides additional event
information.
-
- Recovered Error, general:
+
+ Recovered Error, general:
This value indicates an error event that
has been automatically recovered or corrected by the platform hardware
and/or firmware, e.g. ECC, internal spare or redundancy, cache line
@@ -1569,9 +1569,9 @@
Flags has the value of “Hidden Error”. An event log with this
value is used primarily for error thresholding design and code debug or
as a record to indicate error frequency or trend.
-
+
- Recovered Error, spare capacity utilized:
+ Recovered Error, spare capacity utilized:
This value
indicates that an error on a resource has been recovered by utilizing
another resource not currently assigned for use (spare). The failing
@@ -1580,9 +1580,9 @@
a spare processor, continuing the operations of the faulty one. In this
case the failing component is considered permanently in an error
state.
-
+
- Recovered Error, loss of entitled capacity:
+ Recovered Error, loss of entitled capacity:
This value indicates that an error on a resource has been recovered by
utilizing another resource already in use by the system. The failing
component is to be considered permanently in an error state. This results
@@ -1594,7 +1594,7 @@
deallocation event notification” and the revised amount of entitled
capacity would be found in the Logical Resource Identification Section,
Entitled Capacity field.
-
+
Predictive Error, general:
This value indicates an event that has
been automatically recovered or corrected by the platform hardware and/or
@@ -1603,7 +1603,7 @@
action is required. The automatic platform recovery actions have no
impact to system performance (e.g. ECC, CRC, etc.), or the impact is
unknown.
-
+
Predictive Error, degraded performance:
This value indicates an
error event that has been automatically recovered or corrected by the
@@ -1612,8 +1612,8 @@
unrecoverable error. A deferred service or repair action is required. The
automatic platform recovery actions are impacting/degrading system
performance.
-
- Predictive Error, fault may be corrected after
+
+ Predictive Error, fault may be corrected after
platform re-boot:
This value indicates an error event that has been automatically recovered
or corrected by the platform hardware and/or firmware. However, the
@@ -1624,8 +1624,8 @@
after re-boot, then a part replacement is required. The automatic
platform recovery actions have no impact to system performance (e.g. ECC,
CRC, etc.), or the impact is unknown.
-
- Predictive Error, fault may be corrected
+
+ Predictive Error, fault may be corrected
after platform re-boot, degraded performance:
This value indicates an error event that has been
automatically recovered or corrected by the platform hardware and/or
@@ -1636,7 +1636,7 @@
fault cannot be corrected after re-boot, then a part replacement is
required. The automatic platform recovery actions are impacting/degrading
the system performance.
-
+
Predictive Error, loss of redundancy:
This value indicates an error
event that has been automatically recovered or corrected by the platform
@@ -1645,7 +1645,7 @@
subsystem may cause a platform unrecoverable error. A deferred service or
repair action is required to restore redundancy. The loss of redundancy
may or may not impact system performance.
-
+
Unrecoverable Error, general:
This value indicates an error event
that is unrecoverable or uncorrectable by the platform hardware and/or
@@ -1654,8 +1654,8 @@
platform may be able to re-boot successfully and resume. A service or
repair action is required as soon as possible to correct the
error.
-
- Unrecoverable Error, bypassed with degraded
+
+ Unrecoverable Error, bypassed with degraded
performance: This value
indicates an error event that is unrecoverable or uncorrectable by the
platform hardware and/or firmware. However, the hardware or platform
@@ -1664,8 +1664,8 @@
performance is degraded due to the deconfigured platform resource(s) e.g.
processor, cache, memory, etc. A deferred service or repair action is
required.
-
- Unrecoverable Error, bypassed with loss
+
+ Unrecoverable Error, bypassed with loss
of redundancy: This value
indicates an error event that is unrecoverable or uncorrectable by the
platform hardware and/or firmware. However, the hardware or platform
@@ -1674,8 +1674,8 @@
platform resource(s) resulted in loss of redundancy (e.g. Redundant FSP
with static fail-over) with no loss of system performance. A deferred
service or repair action is required.
-
- Unrecoverable Error, bypassed with loss
+
+ Unrecoverable Error, bypassed with loss
of redundancy + performance:
This value indicates an error event that is unrecoverable or
uncorrectable by the platform hardware and/or firmware. However, the
@@ -1684,8 +1684,8 @@
The deconfigured platform resource(s) resulted in loss of redundancy and
system performance. A deferred service or repair action is
required.
-
- Unrecoverable Error, bypassed with loss
+
+ Unrecoverable Error, bypassed with loss
of function: This value
indicates an error event that is unrecoverable or uncorrectable by the
platform hardware and/or firmware. However, the hardware or platform
@@ -1693,36 +1693,36 @@
can be IPLed or re-IPLed with the error bypassed. The deconfigured
platform resource(s) resulted in loss of platform or system function. A
deferred service or repair action is required.
-
+
Error on diagnostic test, general:
This value indicates an error
event that is detected during a diagnostic test. Impact to the system is
undefined or unknown.
-
- Error on diagnostic test, resource may
+
+ Error on diagnostic test, resource may
produce incorrect results:
This value indicates an error event that is detected during a diagnostic
test. The error may produce incorrect computational results (e.g.
processor floating point unit test error).
-
+
-
+
Event Sub-Type
-
+
This field provides additional information on the non-error event
type.
-
+
Not applicable:
This value is used when the event is associated
with an error. Error/Event Severity field and SRC section provide
additional error information.
-
+
Miscellaneous, Information Only:
This value is used when the event
is “for information only” or the event description doesn't
fit into any other defined values in this field.
-
+
Dump Notification:
This value is used by the hypervisor or
partition firmware as a “Dump Notification” event to the OS
@@ -1730,32 +1730,32 @@
value is used by the HMC as a “Dump Notification” event to
the Service Application to indicate a dump file is present for
transmission to the manufacturer.
-
- Previously reported error has been
+
+ Previously reported error has been
corrected by system: This value
is used by the platform firmware to indicate that the error event that
was previously reported has been corrected by the platform. On a
subsequent platform boot, this event type is logged to indicate that the
array was successfully repaired.
-
- System resources manually deconfigured
+
+ System resources manually deconfigured
by user: This value is used
by the platform firmware to indicate that a subset of platform
resource(s) was/were deconfigured at the user's request (e.g. via
platform ASM menu). The deconfigured resource(s) is/are not associated
with any error detected by the platform. The event is a reminder to the user
- that the platform is running with partial capacity.
+ that the platform is running with partial capacity.
Note: The platform
provides this user option for platform performance testing
purposes.
-
- System resources deconfigured by
+
+ System resources deconfigured by
system due to prior error event:
This value is used by the platform firmware to indicate that the platform
is IPLed with resource(s) deconfigured due to an error detected and reported
previously. The event is a reminder to the user that the platform
requires service.
-
+
Resource deconfiguration notification:
This value is used by
partition firmware as an “Event Notification” to the OS that
@@ -1763,90 +1763,90 @@
by the OS should be deallocated due to predictive error. A Logical
Resource Identification section is included in the event log to indicate
the Resource Type and ID.
-
- Customer environmental problem has
+
+ Customer environmental problem has
returned to normal: This value
is used by the platform firmware to indicate that a customer
environmental problem (e.g. utility power, room ambient temperature,
etc.) that was detected and reported previously has returned to normal.
-
+
Concurrent Maintenance:
This value is used by the platform firmware
to indicate any non-error event associated with concurrent maintenance
activity.
-
+
Capacity Upgrade Event:
This value is used by the platform firmware
to indicate any non-error event associated with capacity upgrade
activity.
-
+
Resource Sparing Event:
This value is used by the platform firmware
to indicate any non-error event associated with platform resource sparing
activity.
-
+
Dynamic Reconfiguration Event:
This value is used by the partition
firmware to indicate any significant but non-error event associated with
- dynamic reconfiguration activity.
+ dynamic reconfiguration activity.
Implementation Note: Due to limited
platform storage resources, non-error event logs associated with a logical
partition are reported to the OS but may not be stored in the
platform.
-
- Normal system/platform shutdown or
+
+ Normal system/platform shutdown or
powered off: This value is used
by the platform firmware to indicate any non-error event associated with
normal system/platform shutdown or powered off activity initiated by the
user.
-
- Platform powered off by user without
+
+ Platform powered off by user without
normal shutdown (abnormal powered off):
This value is used by the platform firmware to indicate
that the platform is abnormally powered off by the user.
-
+
-
+
Error Action Flags
-
+
The following are the definitions of the actions taken for the
various Error Action Flags.
-
+
Report Externally -
This flag instructs the service processor
(error logger component) to send the error to the service application
(e.g. service focal point(s) or FNM error analyzer). If this flag is set,
the SP always sends the error:
-
+
-
+
To the “managing HMC(s)” if one (or multiple)
exists.
-
+
-
+
And to the hypervisor (unless the “Don't report to
hypervisor” flag is also set).
-
+
-
+
Service Action Required -
This flag instructs the Service
application that some service action is required by either the customer
or by the manufacturer’s service personnel. This is equivalent to
saying Customer Notification is required. Contrast this flag with the
“Call Home Required” flag.
-
+
Call Home Required -
This flag indicates that the error requires
service and a Call Home Operation is to be performed. There are
additional policies used in combination with this flag: what subsystem
performs the Call Home, what is sent and where it is sent.
-
+
Hidden Error -
This flag allows errors to be placed in a
partition's OS error log, but still remain hidden from the customer. This
@@ -1855,7 +1855,7 @@
flag to the OS. Note that this flag has no impact on the SP reporting
errors to the HMC or hypervisor, or on the hypervisor reporting errors
to partitions.
-
+
Don't report Error to hypervisor -
While a partition is booting and
before it is functional (e.g. no OS error logging available), partition
@@ -1864,12 +1864,12 @@
this flag to indicate that they need not be sent back to the hypervisor.
This is because the error scope is limited to the failing partition and
the hypervisor has already taken the appropriate actions.
-
- Incomplete Information for Error
+
+ Incomplete Information for Error
Isolation - Some errors are not
confined to a single enclosure and require error isolation from an
entity with a broader system view or scope.
-
+
Software Error -
This flag is used by the partition error logger to
indicate that the error is most likely caused by software. When
@@ -1877,22 +1877,22 @@
by either software or hardware. The Software Error and Hardware Error
flags are used to trigger the manufacturer’s support system to
automatically download software or firmware fixes.
-
+
Hardware Error -
This flag is used by the partition error logger to
indicate that the error is most likely caused by hardware. The
Software Error and Hardware Error flags are used to trigger the
manufacturer’s support system to automatically download software or
firmware fixes.
-
+
-
+
-
+
Platform Event Log Format, Logical Resource
Identification section
-
+
Platform Event Log Format, Version 6, Logical
Resource Identification Section
@@ -2042,12 +2042,12 @@
-
+
-
+
Platform Event Log Format, Primary SRC Section
-
+
Platform Event Log Format, Version 6, Primary SRC
Section
@@ -2352,7 +2352,7 @@
-
+
Platform Event Log Format, Version 6, FRU Call-out
Structure
@@ -2582,68 +2582,68 @@
-
+
FRU Replacement or Maintenance Procedure
Priority
-
+
This field defines the service priority of the specific call-out,
i.e., replacing the FRU part number or performing the maintenance
procedure ID as given in the FRU/Procedure Identity substructure. Here
are the priority descriptions:
-
+
-
- 'H' = High priority and
+
+ 'H' = High priority and
mandatory call-out. Replacing the FRU (or
performing the maintenance procedure) is mandatory. If multiple call-outs
with 'H' priority are given, all must be replaced or performed as a
group.
-
+
-
- 'M' = Medium priority.
+
+ 'M' = Medium priority.
Replacing the FRU (or performing
maintenance procedure) with 'M' priority one at a time in the order given
after all call-outs prior to this one, if present, are performed.
-
+
-
- 'A' = Medium priority group A
+
+ 'A' = Medium priority group A
(1st group). Replacing all the FRUs
with 'A' priority as a group after all call-outs prior to this group, if
present, are performed.
-
+
-
- 'B' = Medium priority group B
+
+ 'B' = Medium priority group B
(2nd group). Replacing all the FRUs
with 'B' priority as a group after all call-outs prior to this group, if
present, are performed.
-
+
-
- 'C' = Medium priority group C
+
+ 'C' = Medium priority group C
(3rd group). Replacing all the FRUs
with 'C' priority as a group after all call-outs prior to this group, if
present, are performed.
-
+
-
- 'L' = Low priority. After
+
+ 'L' = Low priority. After
performed all the prior call-outs, if
present, and problem still persists, replacing the FRU with this priority
one at a time in the order given.
-
+
-
+
The list of FRU/Procedure call-outs in the “call-out”
subsection of the SRC structure must be in order as defined above, i.e.
High, Medium, Low. 'M' has the same medium priority level as 'A', 'B', or
@@ -2651,31 +2651,31 @@
'C'. A group call-out must be contiguous in the list. Within the medium
priority level, follow the call-out order in the list A list without High
or Medium priority is also valid.
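The ordering rules above can be checked mechanically. The Python sketch below is illustrative only. It represents a call-out list as a string of priority characters, which is a hypothetical representation because the real call-out substructure carries many more fields. It checks two things. First, priority levels never decrease: High, then Medium, then Low, with 'M', 'A', 'B', and 'C' sharing the medium level. Second, each group's call-outs ('A', 'B', or 'C') form one contiguous run. It does not check the relative order of groups A, B, and C.

```python
# Priority levels implied by the text: 'H' first, then the medium level
# ('M', 'A', 'B', 'C' share one level), then 'L'.
LEVEL = {"H": 0, "M": 1, "A": 1, "B": 1, "C": 1, "L": 2}
GROUPS = {"A", "B", "C"}


def callout_order_is_valid(priorities: str) -> bool:
    """Check a hypothetical string of call-out priorities, e.g. "HHMAAL"."""
    prev_level = -1
    prev = None
    closed_groups = set()  # groups whose contiguous run has already ended
    for p in priorities:
        if p not in LEVEL:
            return False  # unknown priority character
        if LEVEL[p] < prev_level:
            return False  # e.g. an 'H' after an 'M', or an 'M' after an 'L'
        if p in GROUPS and p != prev and p in closed_groups:
            return False  # a group reappears after being interrupted
        if prev in GROUPS and p != prev:
            closed_groups.add(prev)  # the previous group's run ends here
        prev_level = LEVEL[p]
        prev = p
    return True
```

For example, "HHMAAL" passes. "AAMA" fails because group A is split by an 'M'. "LM" fails because a medium-priority call-out follows a low one.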
-
+
-
+
Failing Component Type Description
-
+
-
+
Normal Hardware FRU:
Hardware FRU in the platform which the
platform firmware or code can positively identify, and its VPD contains
the part number and associated information.
-
+
-
+
Code FRU:
Some layer of platform firmware or OS code is
suspected. The procedure ID field provides additional information about
which code(s) is/are the potential problem.
-
+
-
+
Configuration error:
The problem may be related to how hardware
or code is configured. For example, an adapter is plugged in a slot that
@@ -2683,21 +2683,21 @@
reason to use one of these is if the analysis can provide more
information to the customer and service provider by giving a location
code.
-
+
-
+
Maintenance procedure required:
Further isolation of the problem
is required by performing the procedure as identified in the Procedure ID
field. Procedures are designed to help to isolate problems and guide the
service provider through identifying which FRUs to replace in which
order.
-
+
-
- Symbolic FRU: Used for a single
+
+ Symbolic FRU: Used for a single
FRU where the analysis code knows
exactly what the part is but there is no part number, or the part number
cannot be pulled from VPD, or when there is something special (like a
@@ -2705,41 +2705,41 @@
or FRUs without VPD (so a part number cannot be filled in). The term
“Symbolic” simply means “not an actual part
number”.
-
+
-
- External FRU: A failing part(s)
+
+ External FRU: A failing part(s)
which is/are not in the system,
e.g. attached storage sub-system, network hubs/switches, external drives
like CD/DVD boxes.
-
+
-
- External Code: Code not running
+
+ External Code: Code not running
in the platform but is the
potential source of the error. This could be something like storage
subsystem code or even another system in the same cluster.
-
+
-
- Tool FRU: This is a special
+
+ Tool FRU: This is a special
tool that will be required by one of
the FRUs in the list. Tools are only added as FRUs when they are not part
of the CE tool kit and therefore the repair action could be delayed if
the CE did not know to bring it. Examples are Optical Cleaning Kits for
fiber channel, and special tools for torque or reach or weight
considerations.
-
+
-
+
-
+
-
+
Platform Event Log Format, Dump Locator Section
@@ -2918,12 +2918,12 @@
-
+
-
+
Platform Event Log Format, EPOW Section
-
+
Platform Event Log Format, Version 6, EPOW
Section
@@ -3079,12 +3079,12 @@
-
+
-
+
Platform Event Log Format, IO Events Section
-
+
Platform Event Log Format, Version 6, IO Events
Section
@@ -3233,7 +3233,8 @@
0x04 = Node off-line
0x05 = platform-dump-max-size change
0x08 = Generic Notification
- 0x09 = NVDIMM status change
+ 0x09 = Platform protection of NVDIMM contents enabled
+ 0x0A = Platform protection of NVDIMM contents disabled
All other values = Reserved
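With this change, a reader of the I/O Events section must now distinguish two NVDIMM sub-types where there used to be a single "status change" value. The Python sketch below shows one possible decoder. It covers only the sub-type values visible in this hunk. The table entries that precede 0x04 do not appear in this diff, so they would have to be added before any other value is treated as Reserved.

```python
# Sub-type values as listed in this hunk (after the change).
# Assumption: values that precede 0x04 in the full table are not shown
# in this diff and must be added before relying on the "Reserved" fallback.
IO_EVENT_SUBTYPES = {
    0x04: "Node off-line",
    0x05: "platform-dump-max-size change",
    0x08: "Generic Notification",
    0x09: "Platform protection of NVDIMM contents enabled",
    0x0A: "Platform protection of NVDIMM contents disabled",
}


def decode_io_event_subtype(value: int) -> str:
    """Map an I/O-Event Sub-Type byte to its description."""
    return IO_EVENT_SUBTYPES.get(value, "Reserved")
```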
@@ -3263,7 +3264,7 @@
For the platform-dump-max-size change I/O-Event Sub-Type:
8 bytes for the new value of the platform-dump-max-size system
parameter (specifying the sum (in bytes) of the maximum size of
- each unique platform dump type that the
+ each unique platform dump type that the
ibm,platform-dump RTAS call could
return).
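The Python sketch below shows how the new value could be extracted from the sub-type-specific data. It assumes the 8-byte field is an unsigned big-endian integer, as is usual for RTAS-returned data; this hunk does not state the encoding. `parse_dump_max_size` is a hypothetical helper name.

```python
import struct


def parse_dump_max_size(payload: bytes) -> int:
    """Return the new platform-dump-max-size value: the sum, in bytes, of
    the maximum size of each unique platform dump type."""
    if len(payload) < 8:
        raise ValueError("platform-dump-max-size payload must be 8 bytes")
    # Assumption: unsigned 64-bit integer, big-endian byte order.
    (value,) = struct.unpack(">Q", payload[:8])
    return value
```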
@@ -3275,13 +3276,13 @@
-
+
-
+
Platform Event Log Format, Failing Enclosure
MTMS
-
+
Platform Event Log Format, Version 6, Failing
Enclosure MTMS
@@ -3395,25 +3396,25 @@
The Failing Enclosure Machine Type, Model, and Serial Number (MTMS)
that is associated with the error is important for service and
support.
-
+
The source of information for the MTMS fields varies according to
the following:
-
+
For CEC errors, it is the CEC enclosure MTMS.
-
+
For errors in I/O enclosures (drawers and towers) that have their
own MTMS and are sold as separate MTMS from the CEC, we use the I/O
Drawer MTMS.
-
+
For I/O enclosures that were sold as a feature, this section
contains the Feature Code and Serial Number of the I/O enclosure. When
the Feature Code is used, it is left justified in the Machine Type and
@@ -3421,12 +3422,12 @@
-
+
-
+
Platform Event Log Format, Impacted Partitions
-
+
Platform Event Log Format, Version 6, Impacted
Partitions
@@ -3568,17 +3569,17 @@
-
+
This section describes partitions that are impacted by an error.
When this section is supplied, the partitions in this list (and only
these partitions) are notified of the error.
-
+
-
+
Platform Event Log Format, Failing Memory
Address
-
+
Platform Error Event Log Format, Version 6, Failing
Memory Address
@@ -3716,7 +3717,7 @@
-
+
UE Error Information
@@ -4060,17 +4061,17 @@
-
+
For an error log that has the machine check interrupt section
filled out, the platform is not required to provide the date and time
stamp in the main-a section. The fields will be binary zeroes if the date
and time stamp is not provided.
-
+
Platform Event Log Format, Hotplug Section
-
+
Platform Error Event Log Format, Version 6, Hotplug Section
@@ -4100,7 +4101,7 @@
2
- Section ID: A two-ASCII character field which uniquely
+ Section ID: A two-ASCII character field which uniquely
identifies the type of section. Value = “HP”.
@@ -4112,7 +4113,7 @@
2
- Section length: Length in bytes of the section, including
+ Section length: Length in bytes of the section, including
the section ID.
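A parser can use the Section ID and Section length fields to identify the Hotplug section and skip over it. The Python sketch below assumes a two-byte Section ID at offset 0, followed by a two-byte big-endian Section length at offset 2. The offsets and the byte order are assumptions; the rows shown here do not state them. Because the length includes the section ID, it is measured from the start of the section.

```python
import struct


def parse_section_header(buf: bytes):
    """Return (section_id, section_length) from the start of a section."""
    if len(buf) < 4:
        raise ValueError("buffer too short for section ID and length")
    section_id = buf[0:2].decode("ascii")  # e.g. "HP" for the Hotplug section
    # Assumption: unsigned 16-bit length, big-endian, at offset 2.
    (length,) = struct.unpack(">H", buf[2:4])
    # The length covers the whole section, starting at the section ID.
    if length < 4 or length > len(buf):
        raise ValueError("section length out of range")
    return section_id, length
```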
@@ -4221,14 +4222,14 @@
1
- 0 = Transactional Request: When using “drc count”or “drc count indexed”as the Hotplug
- Identifier, the OS should take steps to verify the entirety of the request can be satisfied
- before proceeding with the hotplug / unplug operations. If only a partial count can be
- satisfied, the OS should ignore the entirety of the request. If the OS cannot determine
- this beforehand, it should satisfy the hotplug / unplug request for as many of the
+ 0 = Transactional Request: When using “drc count” or “drc count indexed” as the Hotplug
+ Identifier, the OS should take steps to verify that the entire request can be satisfied
+ before proceeding with the hotplug / unplug operations. If only a partial count can be
+ satisfied, the OS should ignore the entire request. If the OS cannot determine
+ this beforehand, it should satisfy the hotplug / unplug request for as many of the
requested resources as possible, and attempt to revert to the original OS / DRC state.
- 1 = Non-transactional Request: When using “drc count”or “drc count indexed”as the
- Hotplug Identifier, the OS should attempt to satisfy as much of the request as possible,
+ 1 = Non-transactional Request: When using “drc count” or “drc count indexed” as the
+ Hotplug Identifier, the OS should attempt to satisfy as much of the request as possible,
even if it cannot be satisfied for all the DRCs specified.
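The difference between the two request modes can be sketched as OS-side handling of a "drc count" request. Everything in this Python sketch is hypothetical. `apply_drc_count_request`, the candidate DRC list, and the `try_op`/`undo_op` callbacks stand in for whatever mechanism the OS actually uses to hotplug or unplug a single DRC.

```python
from typing import Callable, List


def apply_drc_count_request(
    drc_indexes: List[int],
    count: int,
    transactional: bool,
    try_op: Callable[[int], bool],
    undo_op: Callable[[int], None],
) -> List[int]:
    """Hypothetical handling of a "drc count" hotplug/unplug request.

    Returns the DRC indexes that remain changed after the request.
    """
    if transactional and len(drc_indexes) < count:
        # Known up front that the request cannot be fully satisfied: ignore it.
        return []
    done: List[int] = []
    for drc in drc_indexes:
        if len(done) == count:
            break
        if try_op(drc):
            done.append(drc)
    if transactional and len(done) < count:
        # Partial success on a transactional request: revert to the original state.
        for drc in reversed(done):
            undo_op(drc)
        return []
    # Non-transactional (or fully satisfied): keep whatever succeeded.
    return done
```

The up-front length check corresponds to verifying feasibility before proceeding. The revert loop covers the case where the OS could not determine feasibility beforehand.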
@@ -4240,7 +4241,7 @@
Reserved
-
+
0x0C
@@ -4251,11 +4252,11 @@
Hotplug Identifier
Variable length field depending on the Hotplug Identifier Type specified.
- For drc name, this field is a null-terminated ASCII character field containing
+ For drc name, this field is a null-terminated ASCII character field containing
the drc name of the resource to hotplug.
For drc index, this is 4 byte field with the drc index of the resource to hotplug.
For drc count, this is a 4 byte field with the number of resources to hotplug.
- For drc count indexed, this is two 4 byte fields the first being the number of resources
+ For drc count indexed, this is two 4-byte fields, the first being the number of resources
to hotplug and the second being the drc index at which to start.
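The Python sketch below decodes the variable-length Hotplug Identifier for each of the four identifier types described above. It keys on descriptive strings rather than on the numeric Hotplug Identifier Type codes, which this hunk does not show. It also assumes that the 4-byte fields are unsigned big-endian integers.

```python
import struct


def parse_hotplug_identifier(id_type: str, data: bytes) -> dict:
    """Decode the Hotplug Identifier field.

    id_type is a hypothetical label: "drc name", "drc index",
    "drc count", or "drc count indexed".
    """
    if id_type == "drc name":
        end = data.index(b"\x00")  # ValueError if not null-terminated
        return {"drc_name": data[:end].decode("ascii")}
    if id_type == "drc index":
        (index,) = struct.unpack(">I", data[:4])  # assumption: big-endian
        return {"drc_index": index}
    if id_type == "drc count":
        (count,) = struct.unpack(">I", data[:4])
        return {"count": count}
    if id_type == "drc count indexed":
        # First field: number of resources; second: starting drc index.
        count, start = struct.unpack(">II", data[:8])
        return {"count": count, "start_index": start}
    raise ValueError(f"unknown Hotplug Identifier Type: {id_type!r}")
```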
@@ -4269,9 +4270,9 @@
Hotplug Token
Present only if corresponding Hotplug Event Capability bit is set.
- Integer value that can be used in conjunction with other fields of the hotplug
- event structure (Hotplug Indentifier, Hotplug Type, etc.) to allow OS to associate
- hotplug event with the request which generated it for the purposes of providing
+ Integer value that can be used in conjunction with other fields of the hotplug
+ event structure (Hotplug Identifier, Hotplug Type, etc.) to allow the OS to associate
+ the hotplug event with the request which generated it for the purposes of providing
feedback to the requestor, such as debugging or error information.
@@ -4280,5 +4281,5 @@
-
+