RTAS Error/Event Return Format
This section describes in detail the return value retrieved by an
RTAS call to either the
event-scan or
check-exception function.
The return value consists of a fixed part and an optional Extended
Error Report, described in the next section, which contains full details
of the error. The fixed part reports the most common problems
compactly, which keeps error detection and recovery straightforward
for OSs that implement only a minimal error handling
strategy. At the same time, the mechanism can provide full
disclosure of the error syndrome information for OSs that have a more
complete error handling strategy.
RTAS can return at most one return code per invocation. If multiple
conditions exist, RTAS returns them in descending order of severity on
successive calls.
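As a minimal sketch of the one-report-per-call behavior, an OS could keep calling until a report with Severity NO_ERROR comes back. The `scan` callable is a hypothetical stand-in for whatever OS routine invokes event-scan and returns the 32-bit fixed part. The bit arithmetic assumes bit 0 is the most significant bit of the word, and the `limit` guard is an illustrative choice, not part of the architecture.

```python
from typing import Callable, List

NO_ERROR = 0  # Severity value meaning "no error present"


def severity_of(word: int) -> int:
    """Extract Severity (bits 8:10, bit 0 being the most significant bit)."""
    return (word >> 21) & 0x7


def drain_reports(scan: Callable[[], int], limit: int = 64) -> List[int]:
    """Call the hypothetical scan() until it reports NO_ERROR.

    Each call yields at most one report; when several conditions are
    pending they are expected in descending order of severity.
    """
    reports = []
    for _ in range(limit):  # guard against a source that never drains
        word = scan()
        if severity_of(word) == NO_ERROR:
            break
        reports.append(word)
    return reports


if __name__ == "__main__":
    pending = [5 << 21, 2 << 21, 0]  # FATAL, then WARNING, then nothing left
    for word in drain_reports(lambda: pending.pop(0)):
        print(severity_of(word))
```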
Reporting and Recovery Philosophy, and Description of Fields
All firmware implementations use a common error and event reporting
scheme, as described in detail below. It is not required that error
recovery be present in firmware implementations, nor is it required that
a high degree of error recovery or survival be undertaken by OSs. If such
behavior is desired, then specific platform-dependent handlers can be
loaded into the OS. However, this section defines return result codes and
a philosophy which can be used if aggressive error handling is
implemented in firmware. This section describes the fields of the Error
Report format, and the philosophy which should be applied in generating
return values from firmware or interpreting such return codes in an
OS.
In general, an OS would look at the Disposition field first to see
if an error has been corrected already by firmware. If not corrected to
the OS’s satisfaction, the OS would examine the Severity field.
Based on that value, and optionally on any information it can use from
the Type and other fields, the OS will make a determination of whether to
continue or to halt operations. In either case, it may choose to log
information regarding the error, using the remaining fields and optional
Extended Error Log.
The following sections describe the field values in
.
Version
This field is used only to distinguish among present and potential
future formats for the remainder of the error report. This value will be
incremented if extensions are made to the format described here. The
primary function of this field is for future OSs to identify whether an
error report may contain some (unknown at present) feature that was added
after the initial version of this specification.
Severity
This field represents firmware’s judgment of how seriously the OS
should treat the problem being reported.
Errors which are believed to represent a permanent hardware failure
affecting the entire system are considered “FATAL.” OSs would
not attempt to continue normal operation after receiving notice of such
an error. OSs may not even be able to perform an orderly shutdown in the
presence of a Fatal error, though they may make a policy decision to
try.
Errors that are less serious but still cause a loss of data or state are
considered “ERRORs.” In general, continuing after such an
error is questionable, since details of what has failed may not be
available, or if available, may not map nicely onto any ongoing activity
with which the OS can associate it. However, OSs may make a policy
decision (for example, based on the error Type, the Initiator, or the
Target) to continue operation after an Error.
There are some types of errors, such as parity errors in memory or
a parity error on a transfer between CPU and memory, which occur
synchronously with the current process execution context. Such errors are
sometimes fatal only to the current thread of execution; that is, they
affect only the current CPU state and possibly that of any memory
locations being currently referenced. If that context of execution is not
essential to the system’s operation (for example, if an error trap
mechanism is available in the OS and can be triggered to recover the OS
to a known state), recovery and continuation may be possible. At the
least, since the memory of the machine is undamaged, it may be possible
to bring the system down in an orderly fashion. Such errors
are reported as having Severity “ERROR_SYNC”. It is OS
dependent whether recovery is possible after such an error, or whether
the OS will treat it as a fatal problem.
The “WARNING” return value indicates that a
non-state-losing error, either fully recovered by firmware or not needing
recovery, has occurred. No OS action is required, and full operation is
expected to continue unhindered by the error. Examples include corrected
ECC errors or bus transfer failures which were re-tried
successfully.
The “EVENT” return value is the mechanism firmware uses
to communicate event information to the OS. The event may have been
detected by polling using
event-scan or on the occurrence of an interrupt by
calling
check-exception. In either case, the Error Return
value indicates the event which has occurred in the Type field. See the
Type description below for a description of specific events and their
expected handling.
The “NO_ERROR” return value indicates that no error was
present. In this case, the remainder of the Error Return fields are not
valid and should not be referenced.
RTAS Disposition
An aggressive firmware implementation may choose to attempt
recovery for some classes of error so an OS can continue operation in the
face of recoverable errors. If firmware detects an error for which it has
recovery code, it attempts such action before it returns a value to the
OS (that is, the mechanisms are linked in RTAS and cannot be separately
accessed). Note that Disposition is nearly independent from Severity.
Severity says how serious an error was, and Disposition says, regardless
of severity, whether or not the OS has to even look at it. In general, an
OS will first examine Disposition, then Severity.
A return value of “FULLY RECOVERED” means that RTAS was
able to completely recover the machine state after the error, and OS
operation can continue unhindered. The severity of the problem in this
case is irrelevant, though for consistency a “FATAL” error
can never be “FULLY RECOVERED.”
A return value of “LIMITED RECOVERY” means that RTAS
was able to recover the state of the machine, but that some feature of
the machine has been disabled or lost (for example, error checking), or
performance may suffer (for example, a failing cache has been disabled).
The RTAS implementation may return further information in the extended
error log format regarding what action was done or what corrective action
failed. In general, a conservative OS will treat this return the same as
“NOT RECOVERED,” and initiate shutdown. A less conservative
OS may choose to let the user decide whether to continue or to shut
down.
A value of “NOT RECOVERED” indicates that RTAS either
did not attempt recovery or attempted recovery but was
unsuccessful.
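The order of examination described above (Disposition first, then Severity) can be sketched as a small policy function. The numeric values come from the fixed-part table; the returned action names and the `conservative` switch are hypothetical, since the actual policy is left to each OS.

```python
from enum import IntEnum


class Severity(IntEnum):
    NO_ERROR = 0
    EVENT = 1
    WARNING = 2
    ERROR_SYNC = 3
    ERROR = 4
    FATAL = 5


class Disposition(IntEnum):
    FULLY_RECOVERED = 0
    LIMITED_RECOVERY = 1
    NOT_RECOVERED = 2


def triage(disposition: int, severity: int, conservative: bool = True) -> str:
    """Return an illustrative action name for one report."""
    if severity == Severity.NO_ERROR:
        return "ignore"              # remaining fields are not valid
    if severity == Severity.EVENT:
        return "handle-event"        # dispatch on the Type field
    if disposition == Disposition.FULLY_RECOVERED:
        return "log-and-continue"    # firmware restored the machine state
    if disposition == Disposition.LIMITED_RECOVERY and not conservative:
        return "ask-user"            # let the user decide whether to go on
    # NOT_RECOVERED, or LIMITED_RECOVERY under a conservative policy:
    if severity == Severity.FATAL:
        return "halt"
    if severity in (Severity.ERROR, Severity.ERROR_SYNC):
        return "shutdown"            # assumed policy; an OS may try to recover
    return "log-and-continue"        # WARNING and anything not handled above
```

A less conservative OS would pass `conservative=False` so that a LIMITED_RECOVERY report is referred to the user instead of forcing a shutdown.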
Optional Part Presence
This is a single flag, valid only if the 32-bit Error Return value
is located in memory, which indicates whether or not an Extended Error
Log Length field and the Extended Error Log follow it in memory. It will
be set on an in-memory return result from RTAS if and only if the RTAS
call indicated sufficient space to return the Extended Error Log, and the
RTAS implementation supports the Extended Error Log.
Initiator
This field indicates, to the best ability of RTAS to determine it,
the initiator of a failed transaction. (Note that in the
“Initiator” field of
, the value “I/O”
indicates one of the defined I/O buses or IOAs. This field contains
finer-grained details of which type of I/O bus failed, if known, and
“UNKNOWN” if RTAS cannot tell.)
In many of the newer LoPAR platforms, the platform error
notification and handling flow is asynchronous to the OS and software
execution flow, therefore the context of Initiator is not applicable to
the platform firmware. In those cases, the value of “(0) Unknown or
Not Applicable” is used for Initiator. In logs created with Version
6 or later, more detailed information about the error is provided in the
Platform Event log format.
Target
If RTAS can determine it, this field indicates the target of a
failed transaction.
In many of the newer LoPAR platforms, the platform error
notification and handling flow is asynchronous to the OS and software
execution flow, therefore the context of Target is not applicable to the
platform firmware. In those cases, the value of “(0) Unknown or Not
Applicable” is used for Target. In logs created with Version 6 or
later, more detailed information about the error is provided in the
Platform Event log format.
Type
This field identifies the general type of the error or event. In
some cases (for example, INTERN_DEV_FAIL), multiple possible events are
grouped together under a common return value. In such cases,
platform-aware software may use the Extended Error Log to distinguish
them. Non-platform-aware software will generally treat all errors of a
given type the same, so it generally will not need to access the Extended
Error Log information.
In the table, the EPOW values are associated with a Severity of
EVENT. All other values will be associated with Severity values of FATAL,
ERROR, ERROR_SYNC, or WARNING, and may or may not be corrected by
RTAS.
EPOW is an event type which indicates the potential loss of power
or environmental conditions outside the limits of safe operation of the
platform. See
for more information.
The “Platform Error (224)” type, introduced for Version 6,
indicates generically that the error was identified by the platform and
that the specific details are encoded in the Platform Event log format itself.
The “ibm,io-events (225)” type defines a set of event
notifications which require special handling by the OS. For this type of
event notification, the Platform Event Log contains the “IO
Events” section which identifies additional details associated with
the event.
The “Platform information event (226)” indicates that the
returned log should be recorded as an “Information Log”. These logs
indicate key platform events and can be used for reference
purposes.
The “Resource deallocation event (227)” indicates an
event notification to the OS that a specific hardware resource has
experienced recurring recoverable errors with a trend toward
unrecoverable. The OS should take action to deallocate the resource from
usage to prevent unrecoverable errors. For these types of event
notification, the Platform Event Log contains the “Logical Resource
Identification” section which identifies the “Logical
Entity” by Resource Type and Resource ID, associated with the
event.
The “Dump notification event (228)” indicates that a
dump file is present in the platform and is available for retrieval by
the OS. For this type of event notification, the Platform Event Log
contains the “Dump Locator” section which contains additional
event specific information.
Additional Type values will be added in future revisions of the
specification. If an OS does not recognize a particular event type, it
can examine the severity first, and then choose to ignore the event if it
is not serious.
Extended Event Log Length / Change Scope
This optional 32-bit field is present in memory following the
32-bit Event Return value if the Optional Part Presence flag is
“PRESENT”, and it indicates the length in bytes of the
Extended Event Log information which immediately follows it in memory.
The length does not include this field or the Event Return field, so it
may be zero. The field is also present for a resource change “Hot
Plug” event, such as a PRRN event, and then represents the scope of
a resource change.
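The in-memory layout described above (a 32-bit Event Return value, optionally followed by a 32-bit length and the log itself) can be split apart as in the following sketch. It assumes big-endian storage and bit 0 as the most significant bit. It deliberately ignores the “Hot Plug” case in which the second word carries a change scope rather than a length, and the function name is hypothetical.

```python
import struct
from typing import Optional, Tuple

# Optional_Part_Presence is bit 13 of the fixed part (bit 0 = MSB).
PRESENCE_MASK = 1 << (31 - 13)


def split_event_return(buf: bytes) -> Tuple[int, Optional[bytes]]:
    """Return (fixed part, extended log bytes or None).

    The length word does not count itself or the fixed part, so an
    empty log (length 0) is legal and yields b"".
    """
    (word,) = struct.unpack_from(">I", buf, 0)
    if not word & PRESENCE_MASK:
        return word, None
    (length,) = struct.unpack_from(">I", buf, 4)
    return word, buf[8:8 + length]


if __name__ == "__main__":
    fixed = (6 << 24) | PRESENCE_MASK | 224
    data = struct.pack(">II", fixed, 3) + b"abc"
    print(split_event_return(data))
```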
RTAS Event Return Format Fixed Part
The summary portion of the error return is designed to fit into a
single 32-bit integer. When used as a data return format in memory, an
optional Length field and Extended Error Log data may follow the summary.
The fixed part contains a “presence” flag which identifies
whether an extended report is present.
In the table below, the location of each field within the integer
is included in parentheses after its name. Numerical field values are
indicated in decimal unless noted otherwise.
RTAS Event Return Format (Fixed Part)
Bit Field Name (bit
number(s))
Description, Values (Described in
)
Version (0:7)
A distinct value used to identify the architectural
version of the message.
Severity (8:10)
Severity level of error/event being reported:
ALREADY_REPORTED (6)
FATAL (5)
ERROR (4)
ERROR_SYNC (3)
WARNING (2)
EVENT (1)
NO_ERROR (0)
reserved for future use (7)
RTAS Disposition (11:12)
Degree of recovery which RTAS has performed prior to
return after an error (value is FULLY_RECOVERED if no error is
being reported):
FULLY_RECOVERED(0)
Note: Cannot be used when Severity is
“FATAL”.
LIMITED_RECOVERY(1)
NOT_RECOVERED(2)
reserved for future use (3)
Optional_Part_Presence (13)
Indicates if an Extended Error Log Length and Extended
Error Log follow this 32-bit quantity in memory:
PRESENT (1): The optional Extended Error Log is
present.
NOT_PRESENT (0): The optional Extended Error Log is not
present.
Reserved (14:15)
Reserved for future use (0:3)
Initiator (16:19)
Abstract entity that initiated the event or the failed
operation:
UNKNOWN (0): Unknown or Not Applicable
CPU (1): A CPU failure (in an MP system, the specific CPU
is not differentiated here)
PCI (2): PCI host bridge or PCI IOA
Reserved -- do not reuse (3)
MEMORY (4): Memory subsystem, including any caches
Reserved -- do not reuse (5)
HOT PLUG (6)
Reserved for future use (7-15)
Target (20:23)
Abstract entity that was apparent target of failed
operation (UNKNOWN if Not Applicable): Same values as Initiator
field
Type (24:31)
General event or error type being reported:
Internal Errors:
RETRY (1): too many tries failed, and a retry count
expired
TCE_ERR (2): range or access type error in an access
through a TCE
INTERN_DEV_FAIL (3): some RTAS-abstracted device has
failed (for example, TOD clock)
TIMEOUT (4): intended target did not respond before a
time-out occurred
DATA_PARITY (5): Parity error on data
ADDR_PARITY (6): Parity error on address
CACHE_PARITY (7): Parity error on external cache
ADDR_INVALID (8): access to reserved or undefined address,
or access of an unacceptable type for an address
ECC_UNCORR (9): uncorrectable ECC error
ECC_CORR (10): corrected ECC error
RESERVED (11-63): Reserved for future use
Environmental and Power Warnings:
EPOW(64): See Extended Error Log for sensor value
RESERVED (65-95): Reserved for future use
Reserved -- do not reuse (96-159)
Platform Resource Reassignment (160) -- includes Change
Scope in bits 32-63
Reserved for future use (161-223)
Platform Error (224) (for Version 6 or later)
ibm,io-events (225) (for Version 6 or later)
Platform information event (226) (for Version 6 or
later)
Resource deallocation event (227) (for Version 6 or
later)
Dump notification event (228) (for Version 6 or
later)
Hot-plug-events (229) (for Version 6 or later)
Vendor-specific events(230-255): Non-architected
Other (0): none of the above
Extended Event Log Length / Change Scope
(32:63)
Length in bytes of Extended Event Log information which
follows (see
) OR the scope
parameter to be input to the
ibm,update-nodes RTAS call to retrieve the nodes
that were changed by selected “Hot Plug”
events.
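The bit positions in the table above translate directly into shifts and masks. The following is a minimal sketch, assuming bit 0 is the most significant bit of the 32-bit value, which is consistent with Version occupying the first byte. The `EventReturn` type and the helper names are illustrative only.

```python
from typing import NamedTuple


class EventReturn(NamedTuple):
    version: int
    severity: int
    disposition: int
    extended_present: bool
    initiator: int
    target: int
    type: int


def field(word: int, first_bit: int, last_bit: int) -> int:
    """Extract bits first_bit:last_bit, numbering bit 0 as the MSB."""
    width = last_bit - first_bit + 1
    return (word >> (31 - last_bit)) & ((1 << width) - 1)


def decode_event_return(word: int) -> EventReturn:
    return EventReturn(
        version=field(word, 0, 7),
        severity=field(word, 8, 10),
        disposition=field(word, 11, 12),
        extended_present=bool(field(word, 13, 13)),
        initiator=field(word, 16, 19),
        target=field(word, 20, 23),
        type=field(word, 24, 31),
    )


if __name__ == "__main__":
    # Version 6, ERROR, NOT_RECOVERED, extended log present,
    # MEMORY initiator and target, ECC_UNCORR.
    sample = (6 << 24) | (4 << 21) | (2 << 19) | (1 << 18) | (4 << 12) | (4 << 8) | 9
    print(decode_event_return(sample))
```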
Typically, most OSs care about, and have handlers for, only a few
specific errors. Since coding of an error is unique in the above scheme,
an OS can check for specific errors, then if nothing matches exactly,
look at more generic parts of the error message. This permits generic
error message generation for the user, providing the basic information
that RTAS delivered to the OS. Platforms may provide more complete error
diagnosis and reporting in RTAS, combined with off-line diagnostics which
take advantage of the information reported from previous failures.
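The specific-then-generic matching described above might look like the following sketch. Handlers are looked up by Type first; if none matches, a generic message is built from the names in the fixed-part table. Treating an unrecognized Type with Severity of WARNING or lower as ignorable is an assumed policy, and the `handle` function and its `handlers` mapping are hypothetical.

```python
from typing import Callable, Dict

SEVERITY_NAMES = {0: "NO_ERROR", 1: "EVENT", 2: "WARNING", 3: "ERROR_SYNC",
                  4: "ERROR", 5: "FATAL", 6: "ALREADY_REPORTED"}

TYPE_NAMES = {
    0: "Other", 1: "RETRY", 2: "TCE_ERR", 3: "INTERN_DEV_FAIL", 4: "TIMEOUT",
    5: "DATA_PARITY", 6: "ADDR_PARITY", 7: "CACHE_PARITY", 8: "ADDR_INVALID",
    9: "ECC_UNCORR", 10: "ECC_CORR", 64: "EPOW",
    160: "Platform Resource Reassignment", 224: "Platform Error",
    225: "ibm,io-events", 226: "Platform information event",
    227: "Resource deallocation event", 228: "Dump notification event",
    229: "Hot-plug-events",
}

WARNING = 2


def handle(severity: int, type_code: int,
           handlers: Dict[int, Callable[[int], str]]) -> str:
    """Run a specific handler if one exists, else build a generic message."""
    handler = handlers.get(type_code)
    if handler is not None:
        return handler(severity)
    sev = SEVERITY_NAMES.get(severity, "reserved severity")
    name = TYPE_NAMES.get(type_code)
    if name is None:
        # Unrecognized Type: judge by Severity alone (assumed threshold).
        if severity <= WARNING:
            return f"ignored unrecognized type {type_code} ({sev})"
        name = f"unrecognized type {type_code}"
    return f"{sev}: {name}"
```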
Version 6 Extensions of Event Log Format
RTAS General Extended Event Log Format, Version 6
The following section defines new extensions to the event log
format which are identified by a Version number 0x06 in the first byte in
the returned buffer (byte 0 of the fixed-part information). The following
tables define extended error log formats for Version 6, by which the RTAS
can optionally return detailed information to the software about a
hardware error condition. Other versions will be defined in following
sections of this chapter. This format is also intended to be usable as
residual error log data in NVRAM, so that the OS could alternatively
retrieve error data after an error event which caused a reboot.
Platforms indicate the maximum length of the error log buffer in
the
“rtas-error-log-max” RTAS property in the
OF device tree, so that the OS can allocate a buffer large enough to hold
the extended error log data when calling the RTAS
event-scan or
check-exception functions. If the allocated buffer is
not large enough to hold all the error log data, the data is truncated to
the size of the buffer.
Requirement
and
require that four bytes of the
vendor-specific format contain a unique identifier for the company that
has defined the format. The description of the “name” string
in
provides alternatives for
defining this identifier. Examples of these unique identifiers include
stock ticker labels and Organizationally Unique Identifiers (OUIs). Since
the different options in IEEE 1275 provide different guarantees of
uniqueness and different identifier lengths, the company should use its
best judgement in selecting a unique identifier that fits the four
character field. The length of this field is limited to 4 bytes to
conserve available log data space. As an example, if Allied Information
Monitoring (a fictional name for the purposes of this example) were to
create a vendor-specific log format 12, then bytes 12-15 of such a log
may contain “AIM<NULL>”.
This identifier is intended to apply to the company that defines
the specific format, and may be used by other companies that wish to be
compatible with that format. For example, if another company wanted to
take advantage of existing support in one of the OSs by using an
AIM-specific error log format for logs generated on their own platform,
their log would have to contain an identifier of
“AIM<NULL>”.
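As an illustration of the four-byte identifier, the sketch below pads a short ASCII identifier with NUL bytes and reads it back from bytes 12-15 of a log. Stripping trailing spaces as well as NULs when reading is an assumption made so that either padding style is tolerated. Both function names are hypothetical.

```python
def company_id_field(identifier: str) -> bytes:
    """Encode an identifier of 1 to 4 ASCII characters, NUL-padded to 4 bytes."""
    raw = identifier.encode("ascii")
    if not 1 <= len(raw) <= 4:
        raise ValueError("identifier must be 1 to 4 ASCII characters")
    return raw.ljust(4, b"\x00")


def read_company_id(log: bytes) -> str:
    """Read the identifier stored in bytes 12-15 of an extended log."""
    if len(log) < 16:
        raise ValueError("log too short to contain a company identifier")
    return log[12:16].rstrip(b"\x00 ").decode("ascii")


if __name__ == "__main__":
    log = bytes(12) + company_id_field("AIM")
    print(read_company_id(log))
```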
R1--1.
Platforms which
support Version 6 of the Extended Event Log Format must do so by
including a 0x06 value in the first byte of the RTAS Event Return Format
(Fixed Part) and using the formats described in
(and all subsections under that
section).
Software Implementation Note: OSs running on
platforms which support Version 6 of the Extended Event Log Format must
ensure that the length parameter passed in the
event-scan RTAS call be at least 2 KB.
R1--2.
If the length parameter on the RTAS
event-scan call for returning data using Version 6 of
the Extended Event Log Format is insufficient to return all the data the
platform would otherwise make available, the platform must truncate the
data by eliminating optional sections entirely rather than truncating a
section.
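The truncation rule in the requirement above (drop whole optional sections, never cut one short) can be sketched as follows. The choice to drop optional sections starting from the end of the log is an assumption, since the requirement does not say which optional sections go first. The `Section` tuple and `fit_sections` are hypothetical names, and `room` stands for the bytes available for sections after the log header.

```python
from typing import List, Tuple

# (section ID, encoded section bytes, required?)
Section = Tuple[str, bytes, bool]


def fit_sections(sections: List[Section], room: int) -> List[Section]:
    """Remove whole optional sections until the remainder fits in room bytes."""
    kept = list(sections)
    total = sum(len(body) for _, body, _ in kept)
    for i in range(len(kept) - 1, -1, -1):   # walk backwards from the end
        if total <= room:
            break
        _, body, required = kept[i]
        if not required:
            total -= len(body)
            del kept[i]
    if total > room:
        raise ValueError("required sections alone do not fit")
    return kept


if __name__ == "__main__":
    log = [("PH", bytes(48), True), ("UH", bytes(24), True),
           ("PS", bytes(80), False),   # treated as optional for this example
           ("MT", bytes(28), False)]
    print([sid for sid, _, _ in fit_sections(log, 160)])
```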
R1--3.
All event logs returning a Version 6
Platform Event Log format must include the Main-A and Main-B Sections.
Other sections are optional depending on the specific event type as
specified in Requirement
.
R1--4.
The following
sections must be provided as indicated:
For the Platform error Type, the Primary Service Reference Code
(SRC) section must be provided.
For the ibm,io-events Type, the IO Events section must be
provided.
For the Resource deallocation event Type, the Logical Resource
Identification section must be provided.
For the Dump notification event Type, the Dump Locator section
must be provided.
For the EPOW Type, the EPOW section must be provided.
For the HOTPLUG Type, the Hotplug section must be provided.
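The section requirements above can be checked mechanically. The sketch below pairs each Type value from the fixed-part table with the two-character section ID given in the Platform Event Log overview table, and reports which required sections are absent. The function and constant names are hypothetical.

```python
from typing import Iterable, List

ALWAYS_REQUIRED = ("PH", "UH")   # Main-A and Main-B

REQUIRED_BY_TYPE = {
    224: "PS",   # Platform Error: Primary SRC
    225: "IE",   # ibm,io-events: IO Events
    227: "LR",   # Resource deallocation: Logical Resource Identification
    228: "DH",   # Dump notification: Dump Locator
    64: "EP",    # EPOW
    229: "HP",   # Hot plug
}


def missing_sections(type_code: int, present_ids: Iterable[str]) -> List[str]:
    """List the required section IDs that are absent from present_ids."""
    present = set(present_ids)
    needed = list(ALWAYS_REQUIRED)
    extra = REQUIRED_BY_TYPE.get(type_code)
    if extra is not None:
        needed.append(extra)
    return [sid for sid in needed if sid not in present]
```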
Software Implementation Note: All fields in the
Platform Event Log marked “Platform specific information” or
“Other platform specific information sections” contain
information reserved for platform or platform Service Application use
only. That information is not defined in this document. Information in
these fields should be ignored by the OS.
Software Implementation and Architecture Note: All
fields currently marked “Reserved” are set to zero by RTAS
and are ignored by the OS. The reserved values in the defined fields in
the Platform Event Log may be defined in the future in this architecture
document for platform specific usage without change to this
architecture.
RTAS General Extended Event Log Format, Version 6
Byte      Bit    Description
0         0      1 = Log Valid
          1      1 = Unrecoverable Error
          2      1 = Recoverable (correctable or successfully retried) Error
          3      1 = Unrecoverable Error, Bypassed - Degraded operation
                 (e.g. CPU/memory taken off-line, bad cache bypassed, etc.)
          4      1 = Predictive Error - Error is recoverable, but indicates
                 a trend toward unrecoverable failure (e.g. correctable ECC
                 error threshold, etc.)
          5      1 = “New” Log (always 1 for data returned from RTAS)
          6      1 = Big-Endian
          7      Reserved
1         0:7    Reserved
2         0      Set to 1 - (Indicating log is in PowerPC format)
          1:3    Reserved
          4:7    Log format indicator, defined format used for bytes 12-2047:
                 0-13, 15: Reserved
                 14: Platform Event Log
3         0:7    Reserved
4-11             Reserved
12-15            Company identifier of the company that has defined the
                 format for this vendor specific log type.
16-2047          Detail vendor specific log data. If byte 2, bits 4:7, above,
                 are a value of 14 (Platform Event Log) and bytes 12-15
                 are “IBM ”, then see
                 for the content of this field.
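The first sixteen bytes of the table above can be decoded as in this sketch. It assumes that bit 0 of each byte is the most significant bit, matching the bit numbering used for the fixed part. The `V6Header` type and `decode_v6_header` are illustrative names.

```python
from typing import NamedTuple


class V6Header(NamedTuple):
    valid: bool
    unrecoverable: bool
    recoverable: bool
    bypassed: bool
    predictive: bool
    new_log: bool
    big_endian: bool
    powerpc_format: bool
    log_format: int
    company_id: bytes


def decode_v6_header(log: bytes) -> V6Header:
    """Decode bytes 0-15 of a Version 6 general extended event log."""
    if len(log) < 16:
        raise ValueError("need at least 16 bytes")
    b0, b2 = log[0], log[2]
    return V6Header(
        valid=bool(b0 & 0x80),           # byte 0, bit 0
        unrecoverable=bool(b0 & 0x40),   # bit 1
        recoverable=bool(b0 & 0x20),     # bit 2
        bypassed=bool(b0 & 0x10),        # bit 3
        predictive=bool(b0 & 0x08),      # bit 4
        new_log=bool(b0 & 0x04),         # bit 5
        big_endian=bool(b0 & 0x02),      # bit 6
        powerpc_format=bool(b2 & 0x80),  # byte 2, bit 0
        log_format=b2 & 0x0F,            # byte 2, bits 4:7
        company_id=log[12:16],
    )


def is_platform_event_log(header: V6Header) -> bool:
    """True when the log format indicator is 14 (Platform Event Log)."""
    return header.log_format == 14
```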
Platform Event Log Format, Version 6
This format is used when byte 2, bits 4:7, of the RTAS General
Extended Event Log Version 6 are a value of 14 (Platform Event
Log).
Overview of Platform Event Log Format, Version 6
Byte
Length in Bytes
Description
12-15
4
Contains ASCII characters
“IBM<NULL>”.
16-63
48
Main-A section (ID = 'PH'). Required section. See
for the
format.
64-87
24
Main-B section (ID = 'UH'). Required, always follows the
Main-A section. See
for the
format.
88-103
16
Logical Resource Identification section (ID = 'LR').
Optional, present only for Resource deallocation event
notification. If present, this section always follows Main-B
section. See
for the
format.
104-
80+
optional FRU call out sub-section
Primary SRC section (ID = 'PS'). Required for
“Platform Error” event type, optional for other
event types. If present, this section always follows Main-B
section. See
for the
format.
64
Dump Locator section (ID = 'DH'). Optional, present only
for dump event notification. If present, this section follows
Main-B or Primary SRC section. See
for the
format.
20
EPOW section (ID = 'EP'). Optional, present only for
“EPOW” interrupt event notification. If present,
this section follows Main-B section. See
for the
format.
Variable
IO Events section (ID = 'IE'). Optional, present only for
“ibm,io-events” interrupt event notification. If
present, this section follows Main-B section. See
for the
format.
28
Failing Enclosure MTMS section (ID = 'MT'). Required for
errors only. If present, this section follows Main-B section or
Primary SRC. See
for the
format.
28
Impacted partition description section (ID = 'LP').
Required for errors only. If present, this section follows
Main-B section or Primary SRC
40
Machine Check Interrupt section (ID = 'MC'). Optional for
“Platform Error” event types with ERROR_SYNC
severity caused by a machine check interrupt. If present, this
section follows the Main-B. See
.
???
Hotplug Section (ID = “HP”). Optional, present only for
Hotplug event notification. If present, this section follows
Main-B section. See .
...- 2047
Variable
Other platform specific information sections.
Optional.
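Because the sections vary in number and length, a reader would normally walk them instead of relying on fixed offsets. The sketch below assumes that every section begins with the two-byte ID and two-byte length shown for Main-A and Main-B, that multi-byte values are big-endian, and that the first section starts at byte 16. The function name is hypothetical.

```python
import struct
from typing import Iterator, Tuple

FIRST_SECTION_OFFSET = 16   # sections follow the 4-byte company identifier


def iter_sections(log: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (section ID, raw section bytes) for each section in the log.

    Stops at the first header whose length is implausible (shorter than
    the 4-byte ID/length prefix, or running past the end of the buffer).
    """
    offset = FIRST_SECTION_OFFSET
    while offset + 4 <= len(log):
        raw_id, length = struct.unpack_from(">2sH", log, offset)
        if length < 4 or offset + length > len(log):
            break
        yield raw_id.decode("ascii", errors="replace"), log[offset:offset + length]
        offset += length


if __name__ == "__main__":
    def section(sid: str, size: int) -> bytes:
        return sid.encode("ascii") + struct.pack(">H", size) + bytes(size - 4)

    sample = bytes(12) + b"IBM\x00" + section("PH", 48) + section("UH", 24)
    print([sid for sid, _ in iter_sections(sample)])
```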
Platform Event Log Format, Main-A Section
Platform Event Log Format, Version 6, Main-A Section
Offset
Length in Bytes
Description
0x00
2
Section ID: A two-ASCII character field which uniquely
identifies the type of section.
value = 'PH'
0x02
2
Section length: Length in bytes of the section, including
the section ID.
value = 48
0x04
1
Section Version
0x05
1
Section sub-type
0x06
2
Creator Component ID
0x08
4
Log creation date in BCD format: YYYYMMDD, where YYYY =
year, MM = month 01 - 12, DD = day 01 - 31.
0x0C
4
Log creation time in BCD format: HHMMSS00, where HH =
hour 00 - 23, MM = minutes 00 - 59, SS = seconds 00 - 59, 00 =
hundredths of a second 00 - 99.
0x10
8
Platform specific information
0x18
1
Creator ID -- subsystem creating the log entry
represented as a single ASCII character
'E' = Service Processor
'H' = Hypervisor
'W' = Power Control
'L' = Partition Firmware
0x19
2
Reserved
0x1B
1
Section count -- number of sections comprising log entry,
including this section
0x1C
4
Reserved
0x20
8
Platform specific information
0x28
4
Platform Log ID (PLID)
Unique identifier for a single event. Note that it is
possible for multiple log entries to be made for a single
error/event. The entries are linked to the same event by using
the same PLID.
0x2C
4
Platform specific information
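The Main-A offsets above can be read as in the following sketch, which decodes the BCD date and time, the Creator ID, the section count, and the PLID. It assumes big-endian storage for the PLID. The `MainA` type and the formatted timestamp string are illustrative choices.

```python
import struct
from typing import NamedTuple

CREATORS = {"E": "Service Processor", "H": "Hypervisor",
            "W": "Power Control", "L": "Partition Firmware"}


def bcd(byte: int) -> int:
    """Convert one packed-BCD byte (two decimal digits) to an integer."""
    return (byte >> 4) * 10 + (byte & 0x0F)


class MainA(NamedTuple):
    created: str        # "YYYY-MM-DD HH:MM:SS.hh"
    creator: str
    section_count: int
    plid: int


def parse_main_a(section: bytes) -> MainA:
    if len(section) < 48 or bytes(section[0:2]) != b"PH":
        raise ValueError("not a Main-A section")
    d = section[0x08:0x0C]              # YYYYMMDD
    t = section[0x0C:0x10]              # HHMMSS00
    year = bcd(d[0]) * 100 + bcd(d[1])
    created = (f"{year:04d}-{bcd(d[2]):02d}-{bcd(d[3]):02d} "
               f"{bcd(t[0]):02d}:{bcd(t[1]):02d}:{bcd(t[2]):02d}.{bcd(t[3]):02d}")
    creator_char = chr(section[0x18])
    (plid,) = struct.unpack_from(">I", section, 0x28)
    return MainA(created, CREATORS.get(creator_char, creator_char),
                 section[0x1B], plid)
```

Log entries that share a PLID describe the same event, so a collector could group parsed entries by the `plid` value.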
Platform Event Log Format, Main-B Section
Platform Event Log Format, Version 6, Main-B Section
Byte
Length in Bytes
Description
0x00
2
Section ID: A two-ASCII character field which uniquely
identifies the type of section.
value = 'UH'
0x02
2
Section length: Length in bytes of the section, including
the section ID.
value = 24
0x04
1
Section Version
0x05
1
Section subtype
0x06
2
Creator Component ID
0x08
1
Subsystem ID: For error events, this is the failing
subsystem. For non-error events, this is the subsystem
associated with the event.
0x10 - 0x1F = Processor subsystem including internal
cache
0x20 - 0x2F = Memory subsystem including external
cache
0x30 - 0x3F = I/O subsystem (hub, bridge, bus)
0x40 - 0x4F = I/O adapter, device and peripheral
0x50 - 0x5F = CEC hardware
0x60 - 0x6F = Power/Cooling subsystem
0x70 - 0x79 = Others subsystem
0x7A - 0x7F = Surveillance Error
0x80 - 0x8F = Platform Firmware
0x90 - 0x9F = Software
0xA0 - 0xAF = External environment
0xB0 - 0xFF = Reserved
0x09
1
Platform specific information
0x0A
1
Event/Error Severity (see additional description
following the table)
0x00 = Informational or non-error Event. This field must
be 0x00 for a non-error event. Use the Event Sub-Type field to
specify the unique event.
0x1X = Recovered Error
0x10 = Recovered Error, general
0x14 = Recovered Error, spare capacity utilized
0x15 = Recovered Error, loss of entitled capacity
0x2X = Predictive Error
0x20 = Predictive Error, general
0x21 = Predictive Error, degraded performance
0x22 = Predictive Error, fault may be corrected after
platform re-boot
0x23 = Predictive Error, fault may be corrected after
boot, degraded performance
0x24 = Predictive Error, loss of redundancy
0x4X = Unrecoverable Error
0x40 = Unrecoverable Error, general
0x41 = Unrecoverable Error, bypassed with degraded
performance
0x44 = Unrecoverable Error, bypassed with loss of
redundancy
0x45 = Unrecoverable Error, bypassed with loss of
redundancy and performance
0x48 = Unrecoverable Error, bypassed with loss of
function
0x6X = Error on diagnostic test
0x60 = Error on diagnostic test, general
0x61 = Error on diagnostic test, resource may produce
incorrect results
All other values = reserved
0x0B
1
Event Sub-Type (primarily used when Event Severity =
0x00, see additional description following the table)
0x00 = not applicable.
0x01 = Miscellaneous, Information Only
0x08 = Dump Notification (Dump may also be reported on
Error event)
0x10 = Previously reported error has been corrected by
system
0x20 = System resources manually deconfigured by
user
0x21 = System resources deconfigured by system due to
prior error event
0x22 = Resource deallocation event notification
0x30 = Customer environmental problem has returned to
normal
(e.g. input power restored, ambient temperature back
within limits)
0x40 = Concurrent Maintenance Event
0x60 = Capacity Upgrade Event
0x70 = Resource Sparing Event
0x80 = Dynamic Reconfiguration Event (generated by
RTAS)
0xD0 = Normal system/platform shutdown or powered
off
0xE0 = Platform powered off by user without normal
shutdown (abnormal power off)
All other values = reserved
0x0C
4
Platform specific information
0x10
2
Reserved
0x12
2
Error Action Flags (see additional description following
the table)
bit 0 (0x8000) = 1, Service Action (customer
notification) Required
bit 1 (0x4000) = 1, Hidden Error - exclusive with SA
Required (bit 0)
bit 2 (0x2000) = 1, Report Externally (send to HMC and
hypervisor)
bit 3 (0x1000) = 1, Don't report to hypervisor (only
report to HMC)
(only meaningful when (bit 2) Report Externally is
set)
bit 4 (0x0800) = 1, Call Home Required
(only valid if (bit 0) SA Required is set)
bit 5 (0x0400) = 1, Error Isolation Incomplete. Further
analysis required.
bit 6 (0x0200) = 1, Deprecated.
bit 7 (0x0100) = 1, Reserved
bit 8, 9 = Platform specific information
bit 10-15 = Reserved
0x14
4
Reserved
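The Main-B fields above lend themselves to table-driven decoding. The sketch below classifies the Subsystem ID by range, the severity by its high nibble, and expands the Error Action Flags. It assumes big-endian storage and does not verify that the low nibble of the severity is one of the defined values. All names are hypothetical.

```python
import struct
from typing import List, NamedTuple

SUBSYSTEM_RANGES = [
    (0x10, 0x1F, "Processor subsystem"), (0x20, 0x2F, "Memory subsystem"),
    (0x30, 0x3F, "I/O subsystem"),
    (0x40, 0x4F, "I/O adapter, device and peripheral"),
    (0x50, 0x5F, "CEC hardware"), (0x60, 0x6F, "Power/Cooling subsystem"),
    (0x70, 0x79, "Others subsystem"), (0x7A, 0x7F, "Surveillance Error"),
    (0x80, 0x8F, "Platform Firmware"), (0x90, 0x9F, "Software"),
    (0xA0, 0xAF, "External environment"),
]
SEVERITY_CLASSES = {0x1: "Recovered Error", 0x2: "Predictive Error",
                    0x4: "Unrecoverable Error", 0x6: "Error on diagnostic test"}
ACTION_FLAGS = [
    (0x8000, "Service Action Required"), (0x4000, "Hidden Error"),
    (0x2000, "Report Externally"), (0x1000, "Don't report to hypervisor"),
    (0x0800, "Call Home Required"), (0x0400, "Error Isolation Incomplete"),
]


def subsystem_name(code: int) -> str:
    for low, high, name in SUBSYSTEM_RANGES:
        if low <= code <= high:
            return name
    return "Reserved or undefined"


def severity_class(code: int) -> str:
    if code == 0x00:
        return "Informational or non-error Event"
    return SEVERITY_CLASSES.get(code >> 4, "Reserved")


class MainB(NamedTuple):
    subsystem: str
    severity: str
    sub_type: int
    actions: List[str]


def parse_main_b(section: bytes) -> MainB:
    if len(section) < 24 or bytes(section[0:2]) != b"UH":
        raise ValueError("not a Main-B section")
    (flags,) = struct.unpack_from(">H", section, 0x12)
    actions = [name for mask, name in ACTION_FLAGS if flags & mask]
    return MainB(subsystem_name(section[0x08]), severity_class(section[0x0A]),
                 section[0x0B], actions)
```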
Error/Event Severity
This field indicates the severity of the error event and the impact
of the error to the platform (if applicable).
Non-error or Informational Event:
This value indicates an event
that is a non-error event. Informational or user action event log entries
must use this value. The Event Sub-Type field provides additional event
information.
Recovered Error, general:
This value indicates an error event that
has been automatically recovered or corrected by the platform hardware
and/or firmware, e.g. ECC, internal spare or redundancy, cache line
delete, boot time array repair, etc. No service action is required for
this type of error. In general, when this value is used, the Error Action
Flags has the value of “Hidden Error”. An event log with this
value is used primarily for error thresholding design and code debug or
as a record to indicate error frequency or trend.
Recovered Error, spare capacity utilized:
This value
indicates that an error on a resource has been recovered by utilizing
another resource not currently assigned for use (spare). The failing
component is to be considered permanently in an error state. For example,
a faulty instruction on one processor may be checkpointed and loaded into
a spare processor, continuing the operations of the faulty one. In this
case the failing component is considered permanently in an error
state.
Recovered Error, loss of entitled capacity:
This value indicates that an error on a resource has been recovered by
utilizing another resource already in use by the system. The failing
component is to be considered permanently in an error state. This results
in a loss of capacity in the partition that receives the error. For
example, a processor already in use may take over the operations of a
faulty one. Loss of the faulty processor in the system then results in
less capacity being available to the partition receiving the error event.
Typically this event would have an event sub-type of “Resource
deallocation event notification” and the revised amount of entitled
capacity would be found in the Logical Resource Identification Section,
Entitled Capacity field.
Predictive Error, general:
This value indicates an event that has
been automatically recovered or corrected by the platform hardware and/or
firmware. However, the frequency of the errors indicates a trend toward
(or potential) platform unrecoverable error. A deferred service or repair
action is required. The automatic platform recovery actions have no
impact to system performance (e.g. ECC, CRC, etc.), or the impact is
unknown.
Predictive Error, degraded performance:
This value indicates an
error event that has been automatically recovered or corrected by the
platform hardware and/or firmware. However, the frequency of the errors
(i.e. over threshold) indicates a trend toward (or potential) platform
unrecoverable error. A deferred service or repair action is required. The
automatic platform recovery actions are impacting/degrading system
performance.
Predictive Error, fault may be corrected after
platform re-boot:
This value indicates an error event that has been automatically recovered
or corrected by the platform hardware and/or firmware. However, the
frequency of the errors (i.e. over threshold) indicates a trend toward
(or potential) platform unrecoverable error. A deferred service or repair
action is required. The hardware fault may be corrected after platform
re-boot as part of the repair action. If the fault cannot be corrected
after re-boot, then a part replacement is required. The automatic
platform recovery actions have no impact to system performance (e.g. ECC,
CRC, etc.), or the impact is unknown.
Predictive Error, fault may be corrected
after platform re-boot, degraded performance:
This value indicates an error event that has been
automatically recovered or corrected by the platform hardware and/or
firmware. However, the frequency of the errors (i.e. over threshold)
indicates a trend toward (or potential) platform unrecoverable error. A
deferred service or repair action is required. The hardware fault may be
corrected after platform re-boot as part of the repair action. If the
fault cannot be corrected after re-boot, then a part replacement is
required. The automatic platform recovery actions are impacting/degrading
the system performance.
Predictive Error, loss of redundancy:
This value indicates an error
event that has been automatically recovered or corrected by the platform
hardware and/or firmware. However, the frequency of the errors (i.e. over
threshold) caused a loss in hardware redundancy. A future error in this
subsystem may cause a platform unrecoverable error. A deferred service or
repair action is required to restore redundancy. The loss of redundancy
may or may not impact system performance.
Unrecoverable Error, general:
This value indicates an error event
that is unrecoverable or uncorrectable by the platform hardware and/or
firmware. The hardware or platform resource with the error cannot be
deconfigured from the system. If the error is intermittent or soft, the
platform may be able to re-boot successfully and resume. A service or
repair action is required as soon as possible to correct the
error.
Unrecoverable Error, bypassed with degraded
performance: This value
indicates an error event that is unrecoverable or uncorrectable by the
platform hardware and/or firmware. However, the hardware or platform
resource with the error has been deconfigured from the system. The
platform can be IPLed or re-IPLed with the error bypassed. System
performance is degraded due to the deconfigured platform resource(s) e.g.
processor, cache, memory, etc. A deferred service or repair action is
required.
Unrecoverable Error, bypassed with loss
of redundancy: This value
indicates an error event that is unrecoverable or uncorrectable by the
platform hardware and/or firmware. However, the hardware or platform
resource with the error can be deconfigured from the system. The platform
can be IPLed or re-IPLed with the error bypassed. The deconfigured
platform resource(s) resulted in loss of redundancy (e.g. Redundant FSP
with static fail-over) with no loss of system performance. A deferred
service or repair action is required.
Unrecoverable Error, bypassed with loss
of redundancy + performance:
This value indicates an error event that is unrecoverable or
uncorrectable by the platform hardware and/or firmware. However, the
hardware or platform resource with the error can be deconfigured from the
system. The platform can be IPLed or re-IPLed with the error bypassed.
The deconfigured platform resource(s) resulted in loss of redundancy and
system performance. A deferred service or repair action is
required.
Unrecoverable Error, bypassed with loss
of function: This value
indicates an error event that is unrecoverable or uncorrectable by the
platform hardware and/or firmware. However, the hardware or platform
resource with the error can be deconfigured from the system. The platform
can be IPLed or re-IPLed with the error bypassed. The deconfigured
platform resource(s) resulted in loss of platform or system function. A
deferred service or repair action is required.
Error on diagnostic test, general:
This value indicates an error
event that is detected during a diagnostic test. Impact to the system is
undefined or unknown.
Error on diagnostic test, resource may
produce incorrect results:
This value indicates an error event that is detected during a diagnostic
test. The error may produce incorrect computational results (e.g.
processor floating point unit test error).
Event Sub-Type
This field provides additional information on the non-error event
type.
Not applicable:
This value is used when the event is associated
with an error. The Error/Event Severity field and SRC section provide
additional error information.
Miscellaneous, Information Only:
This value is used when the event
is “for information only” or the event description doesn't
fit into any other defined values in this field.
Dump Notification:
This value is used by the hypervisor or
partition firmware as a “Dump Notification” event to the OS
that a dump file is present in the platform for retrieval by the OS. This
value is used by the HMC as a “Dump Notification” event to
the Service Application to indicate a dump file is present for
transmission to the manufacturer.
Previously reported error has been
corrected by system: This value
is used by the platform firmware to indicate that the error event that
was previously reported has been corrected by the platform. On a
subsequent platform boot, this event type is logged to indicate that the
array was successfully repaired.
System resources manually deconfigured
by user: This value is used
by the platform firmware to indicate that a subset of platform
resource(s) was/were deconfigured at the user's request (e.g. via
platform ASM menu). The deconfigured resource(s) is/are not associated
with an error detected by the platform. The event is a reminder to the user
that the platform is running with partial capacity.
Note: The platform
provides this user option for platform performance testing
purposes.
System resources deconfigured by
system due to prior error event:
This value is used by the platform firmware to indicate that the platform
is IPLed with resource(s) deconfigured due to an error detected and reported
previously. The event is a reminder to the user that the platform
requires service.
Resource deconfiguration notification:
This value is used by
partition firmware as an “Event Notification” to the OS that
a specified resource (e.g. processor, memory page, etc.) currently used
by the OS should be deallocated due to predictive error. A Logical
Resource Identification section is included in the event log to indicate
the Resource Type and ID.
Customer environmental problem has
returned to normal: This value
is used by the platform firmware to indicate that a customer
environmental problem (e.g. utility power, room ambient temperature,
etc.) detected and reported previously, has returned to normal.
Concurrent Maintenance:
This value is used by the platform firmware
to indicate any non-error event associated with concurrent maintenance
activity.
Capacity Upgrade Event:
This value is used by the platform firmware
to indicate any non-error event associated with capacity upgrade
activity.
Resource Sparing Event:
This value is used by the platform firmware
to indicate any non-error event associated with platform resource sparing
activity.
Dynamic Reconfiguration Event:
This value is used by the partition
firmware to indicate any significant but non-error event associated with
dynamic reconfiguration activity.
Implementation Note: Due to limited
platform storage resource, non-error event log associated with a logical
partition will be reported to the OS but may not be stored in the
platform.
Normal system/platform shutdown or
powered off: This value is used
by the platform firmware to indicate any non-error event associated with
normal system/platform shutdown or powered off activity initiated by the
user.
Platform powered off by user without
normal shutdown (abnormal powered off):
This value is used by the platform firmware to indicate
that the platform is abnormally powered off by the user.
Error Action Flags
The following are the definitions of the actions taken for the
various Error Action Flags.
Report Externally -
This flag instructs the service processor
(error logger component) to send the error to the service application
(e.g. service focal point(s) or FNM error analyzer). If this flag is set,
the SP always sends the error:
To the “managing HMC(s)” if one (or multiple)
exists.
And to the hypervisor (unless the “Don't report to
hypervisor” flag is also set).
Service Action Required -
This flag instructs the Service
application that some service action is required by either the customer
or by the manufacturer’s service personnel. This is equivalent to
saying Customer Notification is required. Contrast this flag with the
“Call Home Required” flag.
Call Home Required -
This flag indicates that the error requires
service and a Call Home Operation is to be performed. There are
additional policies used in combination with this flag: what subsystem
performs the Call Home, what is sent and where it is sent.
Hidden Error -
This flag allows errors to be placed in a
partition's OS error log, but still remain hidden from the customer. This
is a legacy function; the partition firmware must filter errors marked
“Hidden” and not forward them to the OS. Note that this flag has no
impact on the SP reporting errors to either the HMC or the hypervisor, or
on the hypervisor reporting errors to partitions.
Don't report Error to hypervisor -
While a partition is booting and
before it is functional (e.g. no OS error logging available), partition
errors may be sent through the hypervisor to the Service Processor.
These partition errors (and only partition errors) may be marked with
this flag to indicate that they need not be sent back to the hypervisor,
because the error scope is limited to the failing partition and
the hypervisor has already taken the appropriate actions.
Incomplete Information for Error
Isolation - Some errors are not
contained to a single enclosure and require error isolation from an
entity with broader system view / scope.
Software Error -
This flag is used by the partition error logger to
indicate that the error is most likely to be caused by the software. When
both Software Error and Hardware Error flags are set, the error is caused
by either software or hardware. The Software Error and Hardware Error
flags are used to trigger the manufacturer’s support system to
automatically download software or firmware fixes.
Hardware Error -
This flag is used by the partition error logger to
indicate that the error is most likely to be caused by the hardware. The
Software Error and Hardware Error flags are used to trigger the
manufacturer’s support system to automatically download software or
firmware fixes.
Platform Event Log Format, Logical Resource
Identification section
Platform Event Log Format, Version 6, Logical
Resource Identification Section
Offset
Length in Bytes
Description
0x00
2
Section ID: A two-ASCII character field which uniquely
identifies the type of section.
value = 'LR'
0x02
2
Section length: Length in bytes of the section, including
the section ID.
value = 20
0x04
1
Section Version
0x05
1
Section subtype
0x06
2
Creator Component ID
0x08
1
Resource Type
0x10: Processor
0x11: Shared processor
0x40: Memory page
0x41: Memory LMB
All other values = reserved
0x09
1
Reserved
0x0A
2
Entitled Capacity: Hundredths of a CPU (only used for
Resource Type = Shared processor, value = 0x0000 for
others)
0x0C
4
Logical CPU ID: for resource type = processor
DRC Index, for resource type = memory LMB
Memory Logical Address (bit 0-31), for resource type =
memory page
0x10
4
Memory Logical Address (bit 32-63), for resource type =
memory page
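The layout above can be exercised with a short decoder. This is a minimal sketch, not a reference implementation: the function name `parse_lr_section` and the sample values are hypothetical, the fields are assumed to be big-endian, and bits 0-31 of the Memory Logical Address are assumed to be the high-order half.

```python
import struct

RESOURCE_TYPES = {0x10: "processor", 0x11: "shared processor",
                  0x40: "memory page", 0x41: "memory LMB"}

def parse_lr_section(buf: bytes) -> dict:
    """Decode the 20-byte Logical Resource Identification ('LR') section."""
    if len(buf) < 20:
        raise ValueError("LR section needs 20 bytes")
    (sec_id, length, version, subtype, creator, res_type, _reserved,
     capacity, word_0c, word_10) = struct.unpack(">2sHBBHBBHII", buf[:20])
    if sec_id != b"LR":
        raise ValueError("not an LR section")
    out = {"length": length, "version": version,
           "resource_type": RESOURCE_TYPES.get(res_type, "reserved")}
    if res_type == 0x10:
        out["logical_cpu_id"] = word_0c
    elif res_type == 0x11:
        out["entitled_capacity_cpus"] = capacity / 100  # hundredths of a CPU
    elif res_type == 0x41:
        out["drc_index"] = word_0c
    elif res_type == 0x40:
        # Assumption: offset 0x0C holds the high-order 32 bits of the address.
        out["logical_address"] = (word_0c << 32) | word_10
    return out

# Hypothetical sample: a memory page at logical address 0x1_0000_2000.
lr_example = struct.pack(">2sHBBHBBHII", b"LR", 20, 1, 0, 0, 0x40, 0, 0, 0x1, 0x2000)
lr_info = parse_lr_section(lr_example)
```

Unknown Resource Type values are reported as "reserved" rather than rejected, so that a log from a newer platform can still be stored.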
Platform Event Log Format, Primary SRC Section
Platform Event Log Format, Version 6, Primary SRC
Section
Offset
Length in Bytes
Description
0x00
2
Section ID: A two-ASCII character field which uniquely
identifies the type of section.
value = 'PS'
0x02
2
Section length: Length in bytes of the section, including
the section ID.
value = 80 + optional FRU call out sub section
0x04
1
Section Version
0x05
1
Section Subtype
0x06
2
Creator component ID
0x08
1
SRC Version
0x09
1
SRC Flags
bit 0:6 = Platform specific information
bit 7 = 1: Additional/Optional sub-sections
present
0x0A
6
Platform specific information
0x10
4
Extended Reference Code hex data word 2 (required)
0x14
4
Extended Reference Code hex data word 3 (optional)
0x18
4
Extended Reference Code hex data word 4 (optional)
0x1C
4
Extended Reference Code hex data word 5 (optional)
0x20
4
Extended Reference Code hex data word 6 (optional)
0x24
4
Extended Reference Code hex data word 7 (optional)
0x28
4
Extended Reference Code hex data word 8 (optional)
0x2C
4
Extended Reference Code hex data word 9 (optional)
0x30
32
Primary Reference Code: 32 byte ASCII character
(required)
Additional/Optional Sub section for FRU call out (present
only for “Platform Error” event type)
0x00
1
Sub section ID = C0 for FRU call out
0x01
1
Platform specific information
0x02
2
Length of sub section: expressed in # of words (4 bytes),
from Sub section ID field
FRU call out structure length
FRU call out 1 (see FRU call out structure format
below)
FRU call out 2 (call out 2-10 are optional)
...
FRU call out 10 (maximum)
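The fixed 80-byte part of the Primary SRC section, and the location of the optional FRU call-out sub section, can be sketched as follows. The function name `parse_primary_src` and the sample reference code are hypothetical. The sketch assumes big-endian fields, that bit 7 of SRC Flags is the least significant bit (0x01), that the sub section ID "C0" is the byte 0xC0, and that the Primary Reference Code may be padded with spaces or NULLs.

```python
import struct

def parse_primary_src(buf: bytes) -> dict:
    """Decode the fixed part of a Primary SRC ('PS') section."""
    if len(buf) < 0x50:
        raise ValueError("Primary SRC section needs at least 80 bytes")
    sec_id, length, version, subtype, creator, src_version, src_flags = \
        struct.unpack_from(">2sHBBHBB", buf, 0)
    if sec_id != b"PS":
        raise ValueError("not a Primary SRC section")
    words = struct.unpack_from(">8I", buf, 0x10)  # hex data words 2..9
    out = {
        "length": length,
        "src_version": src_version,
        "ext_ref_words": dict(zip(range(2, 10), words)),
        "primary_ref_code": buf[0x30:0x50].decode("ascii", "replace").rstrip("\x00 "),
        "fru_callouts": b"",
    }
    # Assumption: SRC Flags bit 7 is mask 0x01 (bit 0 is the most significant bit).
    if src_flags & 0x01 and len(buf) >= 0x54:
        sub_id, _platform, sub_words = struct.unpack_from(">BBH", buf, 0x50)
        if sub_id == 0xC0:
            # The length is in 4-byte words counted from the sub section ID field.
            out["fru_callouts"] = buf[0x54:0x50 + 4 * sub_words]
    return out

# Hypothetical sample with no optional sub section.
src_example = struct.pack(">2sHBBHBB6x8I32s", b"PS", 80, 1, 0, 0, 2, 0,
                          *range(2, 10), b"B7001111".ljust(32))
src_info = parse_primary_src(src_example)
```

The raw call-out bytes are returned undecoded so that the individual FRU call-out structures can be walked separately using their own length fields.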
Platform Event Log Format, Version 6, FRU Call-out
Structure
Offset
Length in Bytes
Description
0x00
1
Call-out Structure length, in bytes including all fields,
including this one.
0x01
1
Call-out Type / Flags
bits 0-3: Call-out structure type
0b0010 = this structure
bit 4 = 1 FRU Identity (ID) Substructure field included
in this FRU Call-out structure
bit 5 = 1 Other platform-only use substructure field
present following FRU ID substructure
bit 6-7 = 0b11: Other platform-only use substructure
field present following FRU ID substructure
0x02
1
FRU Replacement or Maintenance Procedure Priority
(expressed as an ASCII character, see additional description
following the table)
'H' = High priority and mandatory call-out.
'M' = Medium priority.
'A' = Medium priority group A (1st group).
'B' = Medium priority group B (2nd group).
'C' = Medium priority group C (3rd group).
'L' = Low priority.
0x03
1
Length of Location Code field - must be a multiple of
4.
0x04
variable
max=80
Location Code
NULL terminated ASCII string. May be up to 80 characters
including the NULL. Padded with extra NULLs to 4-byte
boundary.
FRU Identity Substructure follows:
0x00
2
Substructure Type (2 ASCII Characters)
'ID' = FRU Identity Substructure
0x02
1
Substructure length (variable, several optional fields -
see flags below)
0x03
1
Flags
bits 0-3: Failing component Type (see additional
description following the table)
0b0000: reserved
0b0001: “normal” hardware FRU
0b0010: code FRU
0b0011: configuration error, configuration procedure
required
0b0100: Maintenance Procedure required
0b1001: External FRU
0b1010: External code FRU
0b1011: Tool FRU
0b1100: Symbolic FRU
0b1111: Reserved for expansion
all other values reserved
bit 4 (0x08) = 0b1: FRU Stocking Part Number supplied
(mutually exclusive with bit 6)
bit 5 (0x04) = 0b1: CCIN supplied (only valid if bit 4 =
0b1)
bit 6 (0x02) = 0b1: Maintenance procedure call out
supplied (mutually exclusive with FRU p/n)
bit 7 (0x01) = 0b1: FRU Serial Number supplied (only
valid if bit 4 = 0b1)
0x04
8, if present
FRU Stocking Part Number (VPD FN keyword) or Procedure
ID
This field is present if Flags bit 4 = 0b1 or Flags bit
6 = 0b1.
It contains a NULL-terminated ASCII character
string.
If Flags bit 4 = 0b1, this field contains a 7 ASCII
character part number
If Flags bit 6 = 0b1, this field contains a 5 ASCII
character procedure ID
0x0C
4, if present
CCIN (VPD CC keyword) (optional, only supplied if Part
Number also supplied)
This field is present if Flags bit 5 = 0b1. It contains
the CCIN of the failing FRU (VPD CC keyword), represented as 4
ASCII characters (not a NULL-terminated string).
12, if present
FRU Serial Number (VPD SE Keyword) (optional)
This field is present if Flags bit 7 = 0b1. It contains
the serial number of the failing FRU (VPD SE keyword),
represented as 12 ASCII characters (not a NULL-terminated
string).
End of FRU Identity Substructure
variable
Other platform-only use substructure field
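A decoder for one FRU call-out structure and its FRU Identity substructure might look like the sketch below. The function names and all sample values (location code, part number, CCIN, serial number) are hypothetical. The masks for the Identity Flags follow the values given in the table (0x08, 0x04, 0x02, 0x01). Applying the same bit numbering to the Call-out Type / Flags byte (bit 4 = 0x08) is an assumption.

```python
import struct

COMPONENT_TYPES = {
    0b0001: "normal hardware FRU", 0b0010: "code FRU",
    0b0011: "configuration error", 0b0100: "maintenance procedure required",
    0b1001: "external FRU", 0b1010: "external code FRU",
    0b1011: "tool FRU", 0b1100: "symbolic FRU",
}

def _text(raw: bytes) -> str:
    """Return the ASCII text up to the first NULL."""
    return raw.split(b"\x00", 1)[0].decode("ascii")

def parse_fru_identity(buf: bytes, off: int) -> dict:
    sub_type, sub_len, flags = struct.unpack_from(">2sBB", buf, off)
    if sub_type != b"ID":
        raise ValueError("not a FRU Identity substructure")
    out = {"length": sub_len,
           "component_type": COMPONENT_TYPES.get(flags >> 4, "reserved")}
    pos = off + 4
    if flags & 0x0A:  # bit 4 (0x08) part number or bit 6 (0x02) procedure ID
        key = "part_number" if flags & 0x08 else "procedure_id"
        out[key] = _text(buf[pos:pos + 8])
        pos += 8
    if flags & 0x04:  # bit 5: CCIN, 4 characters, not NULL-terminated
        out["ccin"] = buf[pos:pos + 4].decode("ascii")
        pos += 4
    if flags & 0x01:  # bit 7: serial number, 12 characters
        out["serial_number"] = buf[pos:pos + 12].decode("ascii")
    return out

def parse_fru_callout(buf: bytes, off: int = 0) -> dict:
    length, type_flags, priority, loc_len = struct.unpack_from(">BBcB", buf, off)
    out = {"length": length,
           "priority": priority.decode("ascii"),
           "location_code": _text(buf[off + 4:off + 4 + loc_len])}
    if type_flags & 0x08:  # assumption: bit 4 of Call-out Type / Flags is 0x08
        out["identity"] = parse_fru_identity(buf, off + 4 + loc_len)
    return out

# Hypothetical sample: high-priority hardware FRU with part number, CCIN and serial.
identity = b"ID" + bytes([28, 0x1D]) + b"00E1234\x00" + b"2B1C" + b"YL1234567890"
location = b"U78A0.001-P1".ljust(16, b"\x00")
callout_example = bytes([4 + 16 + 28, 0x28]) + b"H" + bytes([16]) + location + identity
callout = parse_fru_callout(callout_example)
```

Because every call-out carries its own length in the first byte, a caller can step through the sub section by adding `callout["length"]` to the current offset.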
FRU Replacement or Maintenance Procedure
Priority
This field defines the service priority of the specific call-out,
i.e., replacing the FRU part number or performing the maintenance
procedure ID as given in the FRU/Procedure Identity substructure. Here
are the priority descriptions:
'H' = High priority and
mandatory call-out. Replacing the FRU (or
performing the maintenance procedure) is mandatory. If multiple call-outs
with 'H' priority are given, all must be replaced or performed as a
group.
'M' = Medium priority.
Replace the FRU (or perform the
maintenance procedure) with 'M' priority one at a time, in the order given,
after all call-outs prior to this one, if present, have been performed.
'A' = Medium priority group A
(1st group). Replace all the FRUs
with 'A' priority as a group after all call-outs prior to this group, if
present, have been performed.
'B' = Medium priority group B
(2nd group). Replace all the FRUs
with 'B' priority as a group after all call-outs prior to this group, if
present, have been performed.
'C' = Medium priority group C
(3rd group). Replace all the FRUs
with 'C' priority as a group after all call-outs prior to this group, if
present, have been performed.
'L' = Low priority. If the
problem still persists after all the prior call-outs, if
present, have been performed, replace the FRUs with this priority
one at a time in the order given.
The list of FRU/Procedure call-outs in the “call-out”
subsection of the SRC structure must be in order as defined above, i.e.
High, Medium, Low. 'M' has the same medium priority level as 'A', 'B', or
'C' and a call out with 'M' priority can precede or follow 'A', 'B' or
'C'. A group call-out must be contiguous in the list. Within the medium
priority level, follow the call-out order in the list. A list without High
or Medium priority is also valid.
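The ordering rules above can be checked mechanically. The following is a minimal sketch under one reading of those rules: priorities must not step back from Low to Medium or from Medium to High, and each of the groups 'A', 'B' and 'C' must appear as one contiguous run. The function name `callout_order_is_valid` is hypothetical.

```python
def callout_order_is_valid(priorities: str) -> bool:
    """Check a sequence of call-out priority characters, e.g. "HMAABL"."""
    rank = {"H": 0, "M": 1, "A": 1, "B": 1, "C": 1, "L": 2}
    last_rank = 0
    finished_groups = set()
    prev = None
    for p in priorities:
        if p not in rank:
            return False              # unknown priority character
        if rank[p] < last_rank:
            return False              # e.g. High after Medium, Medium after Low
        last_rank = rank[p]
        if p != prev:
            if p in finished_groups:
                return False          # group resumed after an interruption
            if prev in ("A", "B", "C"):
                finished_groups.add(prev)
        prev = p
    return True
```

For example, `callout_order_is_valid("HMAABL")` accepts a list with 'M' preceding group 'A', while `callout_order_is_valid("HABAL")` rejects a list in which group 'A' is split by a 'B' call-out.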
Failing Component Type Description
Normal Hardware FRU:
Hardware FRU in the platform which the
platform firmware or code can positively identify, and its VPD contains
the part number and associated information.
Code FRU:
Some layer of platform firmware or OS code is
suspected. The procedure ID field provides additional information about
which code(s) is/are the potential problem.
Configuration error:
The problem may be related to how hardware
or code is configured. For example, an adapter is plugged in a slot that
cannot support it. The FRU could be a procedure or a symbolic FRU. The
reason to use one of these is if the analysis can provide more
information to the customer and service provider by giving a location
code.
Maintenance procedure required:
Further isolation of the problem
is required by performing the procedure as identified in the Procedure ID
field. Procedures are designed to help to isolate problems and guide the
service provider through identifying which FRUs to replace in which
order.
Symbolic FRU: Used for a single
FRU where the analysis code knows
exactly what the part is but there is no part number, or the part number
cannot be pulled from VPD, or when there is something special (like a
procedure) for acquiring the FRU or working with it. Examples are cables,
or FRUs without VPD (so a part number cannot be filled in). The term
“Symbolic” simply means “not an actual part
number”.
External FRU: A failing part(s)
which is/are not in the system,
e.g. attached storage sub-system, network hubs/switches, external drives
like CD/DVD boxes.
External Code: Code not running
in the platform but is the
potential source of the error. This could be something like storage
subsystem code or even another system in the same cluster.
Tool FRU: This is a special
tool that will be required by one of
the FRUs in the list. Tools are only added as FRUs when they are not part
of the CE tool kit and therefore the repair action could be delayed if
the CE did not know to bring it. Examples are Optical Cleaning Kits for
fiber channel, and special tools for torque or reach or weight
considerations.
Platform Event Log Format, Dump Locator Section
Platform Event Log Format, Version 6, Dump Locator
Section
Offset
Length in Bytes
Description
0x00
2
Section ID: A two-ASCII character field which uniquely
identifies the type of section.
value = 'DH'
0x02
2
Section length: Length in bytes of the section, including
the section ID.
value = 64
0x04
1
Section Version
0x05
1
Section Sub-Type
0x00 = Log truncated, complete log received by another
service entity.
0x01 = FSP Dump
0x02 = Platform System Dump
0x03 = Reserved
0x04 = Power Subsystem Dump
0x05 = Platform Event Log Entry Dump (when distinguishing
between dump types, the term “Log Dump” is
typically used)
0x06 = Partition-initiated resource dump
0x07 = Platform-initiated resource dump
All other values reserved
0x06
2
Creator component ID
0x08
4
Dump ID
0x0C
1
Flags
bit 0 (0x80) = 0, Dump sent to partition
bit 0 = 1, Dump sent to HMC
bit 1 (0x40) = 0, File name in ASCII
bit 1 = 1, Dump file name is hex
bit 2 (0x20) = 1, Dump size field valid
0x0D
2
Reserved
0x0F
1
Length of OS assigned Dump ID field in bytes, must be
multiple of 4.
May be 0.
0x10
8
Dump Size
0x18
40
OS-Assigned Dump ID
As the flag field indicates, this field may either be an
ASCII string or a hex number.
When an ASCII string (AIX, Linux, HMC), this is a NULL
terminated ASCII string representing the dump file name (leaf
name only, does not include path).
Field may be up to 40 characters including the
NULL.
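The Dump Locator layout can be decoded as sketched below. The function name `parse_dump_locator` and the sample values are hypothetical. Fields are assumed to be big-endian, and a "hex" dump file name is assumed to mean that the raw bytes of the OS-Assigned Dump ID are to be shown in hexadecimal.

```python
import struct

DUMP_SUBTYPES = {
    0x00: "log truncated", 0x01: "FSP dump", 0x02: "platform system dump",
    0x04: "power subsystem dump", 0x05: "platform event log entry dump",
    0x06: "partition-initiated resource dump",
    0x07: "platform-initiated resource dump",
}

def parse_dump_locator(buf: bytes) -> dict:
    """Decode a 64-byte Dump Locator ('DH') section."""
    if len(buf) < 64:
        raise ValueError("Dump Locator section needs 64 bytes")
    (sec_id, length, version, subtype, creator, dump_id, flags,
     _reserved, id_len, dump_size) = struct.unpack_from(">2sHBBHIBHBQ", buf, 0)
    if sec_id != b"DH":
        raise ValueError("not a Dump Locator section")
    raw_id = buf[0x18:0x18 + id_len]
    if flags & 0x40:   # bit 1 = 1: dump file name is hex (assumed: raw bytes)
        os_dump_id = raw_id.hex()
    else:              # bit 1 = 0: NULL-terminated ASCII leaf name
        os_dump_id = raw_id.split(b"\x00", 1)[0].decode("ascii")
    return {
        "dump_type": DUMP_SUBTYPES.get(subtype, "reserved"),
        "dump_id": dump_id,
        "sent_to": "HMC" if flags & 0x80 else "partition",
        "dump_size": dump_size if flags & 0x20 else None,  # valid only if bit 2 set
        "os_dump_id": os_dump_id,
    }

# Hypothetical sample: platform system dump sent to the partition, size valid.
dump_example = struct.pack(">2sHBBHIBHBQ40s", b"DH", 64, 1, 0x02, 0, 0x1234,
                           0x20, 0, 12, 4096, b"dump.0001")
dump_info = parse_dump_locator(dump_example)
```

Returning `None` for the size when bit 2 is clear keeps a zero-filled field from being mistaken for an empty dump.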
Platform Event Log Format, EPOW Section
Platform Event Log Format, Version 6, EPOW
Section
Offset
Length in Bytes
Description
0x00
2
Section ID: A two-ASCII character field which uniquely
identifies the type of section.
value = 'EP'
0x02
2
Section length: Length in bytes of the section, including
the section ID.
0x04
1
Section Version
0x05
1
Section subtype
0x06
2
Creator Component ID
0x08
1
EPOW Sensor Value (low-order 4 bits contain the action
code).
0x09
1
EPOW Event Modifier
(low-order 4 bits contain the event modifier
value)
0x00 = Not applicable
For EPOW sensor value = 3
0x01 = Normal system shutdown with no additional
delay
0x02 = Loss of utility power, system is running on
UPS/Battery
0x03 = Loss of system critical functions, system should
be shutdown
0x04 = Ambient temperature too high
All other values = reserved
0x0A
1
Extended Modifier for Section Version 2 and higher
For EPOW Sensor Value = 3
0x00 System wide shutdown
0x01 Partition specific shutdown
0x02 - 0xFF Reserved
All other situations Reserved = 0x00
0x0B
1
Reserved
0x0C
8
Platform specific reason code
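As an illustration of the EPOW layout, the sketch below extracts the action code and, for action code 3, the event modifier and extended modifier. The function name `parse_epow` and the sample values are hypothetical. It assumes that "EPOW sensor value = 3" refers to the action code in the low-order 4 bits, and that the section is exactly the 20 bytes listed in the table.

```python
import struct

EVENT_MODIFIERS = {
    0x00: "not applicable",
    0x01: "normal system shutdown with no additional delay",
    0x02: "loss of utility power, system is running on UPS/battery",
    0x03: "loss of system critical functions, system should be shut down",
    0x04: "ambient temperature too high",
}
SHUTDOWN_SCOPES = {0x00: "system wide shutdown", 0x01: "partition specific shutdown"}

def parse_epow(buf: bytes) -> dict:
    """Decode an EPOW ('EP') section."""
    if len(buf) < 20:
        raise ValueError("EPOW section needs 20 bytes")
    (sec_id, length, version, subtype, creator, sensor, modifier,
     ext_modifier, _reserved, reason) = struct.unpack_from(">2sHBBHBBBB8s", buf, 0)
    if sec_id != b"EP":
        raise ValueError("not an EPOW section")
    action_code = sensor & 0x0F            # low-order 4 bits
    out = {"action_code": action_code, "reason_code": reason.hex()}
    if action_code == 3:
        out["event_modifier"] = EVENT_MODIFIERS.get(modifier & 0x0F, "reserved")
        if version >= 2:                   # extended modifier: version 2 and higher
            out["scope"] = SHUTDOWN_SCOPES.get(ext_modifier, "reserved")
    return out

# Hypothetical sample: version 2 section, partition shutdown while on battery.
epow_example = struct.pack(">2sHBBHBBBB8s", b"EP", 20, 2, 0, 0,
                           0x03, 0x02, 0x01, 0, bytes(8))
epow_info = parse_epow(epow_example)
```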
Platform Event Log Format, IO Events Section
Platform Event Log Format, Version 6, IO Events
Section
Offset
Length in Bytes
Description
0x00
2
Section ID: A two-ASCII character field which uniquely
identifies the type of section.
value = 'IE'
0x02
2
Section length: Length in bytes of the section, including
the section ID.
0x04
1
Section Version
0x05
1
Section subtype
0x06
2
Creator Component ID
0x08
1
IO-Event Type:
0x01 = Error Detected
0x02 = Error Recovered
0x03 = Event
0x04 = RPC Pass Through
All other values = Reserved
0x09
1
Offset 0x10 Field Length:
For IO Event Type of RPC Pass Through, this field
specifies the length of the data field which begins at offset
0x10, otherwise the value in this field is 0. Must be a
multiple of 4 to maintain 4-byte alignment.
0x0A
1
Error/Event Scope:
0x00 = Not Applicable (use for IO-Event type 0x02, 0x03,
0x04)
0x36 = Reserved
0x37 = Reserved
0x38 = PHB
0x39 = Reserved
0x3A = Reserved
0x3B = Reserved
0x51 = Service Processor
All other values = Reserved
0x0B
1
I/O-Event Sub-Type:
0x00 = Not Applicable (use for IO-Event type 0x01, 0x02,
0x04)
0x01 = Rebalance request
0x03 = Node online
0x04 = Node off-line
0x05 = platform-dump-max-size change
0x08 = Generic Notification
0x09 = Platform protection of NVDIMM contents enabled
0x0A = Platform protection of NVDIMM contents disabled
All other values = Reserved
0x0C
4
DRC Index
0x10
0-216
For the RPC Pass Through IO Event Type: RPC data.
Variable length data. Must be padded to 4 bytes
alignment.
For the platform-dump-max-size change I/O-Event Sub-Type:
8 bytes for the new value of the platform-dump-max-size system
parameter (specifying the sum (in bytes) of the maximum size of
each unique platform dump type that the
ibm,platform-dump RTAS call could
return).
For Generic Notification I/O Event Sub-Type: Scoped Data
Generic Notification Event Section. Must be padded to 4 bytes
alignment.
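The IO Events layout, including the two variable payloads at offset 0x10 that have a fixed interpretation, can be sketched as follows. The function name `parse_io_event` and the sample values are hypothetical, and fields are assumed to be big-endian. The Generic Notification payload is not decoded because its format is not given here.

```python
import struct

IO_EVENT_TYPES = {0x01: "error detected", 0x02: "error recovered",
                  0x03: "event", 0x04: "RPC pass through"}
IO_EVENT_SUBTYPES = {
    0x00: "not applicable", 0x01: "rebalance request", 0x03: "node online",
    0x04: "node off-line", 0x05: "platform-dump-max-size change",
    0x08: "generic notification",
    0x09: "platform protection of NVDIMM contents enabled",
    0x0A: "platform protection of NVDIMM contents disabled",
}

def parse_io_event(buf: bytes) -> dict:
    """Decode an IO Events ('IE') section."""
    if len(buf) < 0x10:
        raise ValueError("IO Events section needs at least 16 bytes")
    (sec_id, length, version, subtype, creator, ev_type, data_len,
     scope, sub_type, drc_index) = struct.unpack_from(">2sHBBHBBBBI", buf, 0)
    if sec_id != b"IE":
        raise ValueError("not an IO Events section")
    out = {"event_type": IO_EVENT_TYPES.get(ev_type, "reserved"),
           "sub_type": IO_EVENT_SUBTYPES.get(sub_type, "reserved"),
           "scope": scope, "drc_index": drc_index}
    if ev_type == 0x04:
        # Only RPC Pass Through uses the length byte at offset 0x09.
        out["rpc_data"] = buf[0x10:0x10 + data_len]
    elif sub_type == 0x05:
        # 8-byte new value of the platform-dump-max-size system parameter.
        out["platform_dump_max_size"] = struct.unpack_from(">Q", buf, 0x10)[0]
    return out

# Hypothetical sample: platform-dump-max-size changed to 1 GiB.
io_example = struct.pack(">2sHBBHBBBBIQ", b"IE", 24, 1, 0, 0,
                         0x03, 0, 0x00, 0x05, 0, 1 << 30)
io_info = parse_io_event(io_example)
```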
Platform Event Log Format, Failing Enclosure
MTMS
Platform Event Log Format, Version 6, Failing
Enclosure MTMS
Offset
Length in Bytes
Description
0x00
2
Section ID: A two-ASCII character field which uniquely
identifies the type of section.
value = 'MT'
0x02
2
Section length: Length in bytes of the section, including
the section ID.
value = 28
0x04
1
Section Version
0x05
1
Section subtype
0x06
2
Creator Component ID
0x08
8
Machine Type and Model: 8 ASCII characters, in the form
“tttt-mmm”,
where tttt = Machine Type and mmm = Model Number
0x10
12
Serial Number:
12 ASCII characters (If fewer than 12 characters are used,
string is left justified (stored in the field starting with the
lowest address) and padded with NULLs.)
The Failing Enclosure Machine Type, Model, and Serial Number (MTMS)
that is associated with the error is important for service and
support.
The source of information for the MTMS fields varies according to
the following:
For CEC errors, it is the CEC enclosure MTMS.
For errors in I/O enclosures (drawers and towers) that have their
own MTMS and are sold as separate MTMS from the CEC, it is the I/O
Drawer MTMS.
For I/O enclosures that were sold as a feature, this section
contains the Feature Code and Serial Number of the I/O enclosure. When
the Feature Code is used, it is left justified in the Machine Type and
Model field.
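The MTMS fields can be split as in the sketch below. The function name `parse_mtms` and the sample values are hypothetical. Treating a Machine Type and Model field without a hyphen as a Feature Code is an assumption made for illustration; the text above does not say how the two cases are distinguished. Padding of a left-justified Feature Code is assumed to be NULLs or spaces.

```python
import struct

def parse_mtms(buf: bytes) -> dict:
    """Decode a 28-byte Failing Enclosure MTMS ('MT') section."""
    if len(buf) < 28:
        raise ValueError("MTMS section needs 28 bytes")
    sec_id, length, version, subtype, creator, mtm, serial = \
        struct.unpack_from(">2sHBBH8s12s", buf, 0)
    if sec_id != b"MT":
        raise ValueError("not an MTMS section")
    mtm_text = mtm.decode("ascii").rstrip("\x00 ")
    # The serial number is left justified and padded with NULLs.
    out = {"serial_number": serial.decode("ascii").rstrip("\x00")}
    machine_type, sep, model = mtm_text.partition("-")
    if sep:
        out["machine_type"] = machine_type   # "tttt"
        out["model"] = model                 # "mmm"
    else:
        out["feature_code"] = mtm_text       # assumption: no hyphen means Feature Code
    return out

# Hypothetical sample values.
mtms_example = struct.pack(">2sHBBH8s12s", b"MT", 28, 1, 0, 0, b"9117-MMA", b"10ABCDE")
mtms_info = parse_mtms(mtms_example)
```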
Platform Event Log Format, Impacted Partitions
Platform Event Log Format, Version 6, Impacted
Partitions
Offset
Length
Byte 0
Byte 1
Byte 2
Byte 3
0
8
Section Header
0x10
4
Primary Partition ID
Length of LP name
(must be a multiple of 4)
Target LP Count
0x14
4
Logical Partition ID
0x18
variable
Primary Partition (LP) Name
Null terminated ASCII string, padded to 4-Byte
boundary
variable
Target LP 1
Target LP 2
and so on
(padded to a 4-Byte boundary)
This section describes partitions that are impacted by an error.
When this section is supplied, the partitions in this list (and only
these partitions) are notified of the error.
Platform Event Log Format, Failing Memory
Address
Platform Error Event Log Format, Version 6, Failing
Memory Address
Offset
Length in Bytes
Description
0x00
2
Section ID: A two-ASCII character field which uniquely
identifies the type of section.
value = 'MC'
0x02
2
Section length: Length in bytes of the section, including
the section ID
value = 32
0x04
1
Section Version
0x05
1
Section Subtype
0x06
2
Creator Component ID
0x08
4
FRU ID -- Identifies the FRU on which the machine check
interrupt occurred
0x0C
4
Processor ID -- identifies the physical CPU on which the
machine check occurred
0x10
1
Type of machine check interrupt
0x00 = Uncorrectable Memory Error (UE)
0x01 = SLB error
0x02 = ERAT Error
0x04 = TLB error
0x05 = D-Cache error
0x07 = I-Cache error
0x11
23
Information specific to machine check interrupt type.
This section is binary zeroes if the platform does not provide
specific information for the type of interrupt.
UE Error Information
Offset
Length in Bytes
Description
0x11
1
Type of UE
Bit 0 = 0 Permanent UE. The UE may be cleared with a DCBZ
instruction.
Bit 0 = 1 Transient UE. The UE cannot be cleared with a
DCBZ instruction. The contents of the entire logical page are
not accessible for this type of UE
64 bit effective address is provided
Bit 1 = 0 64 bit effective address is not provided by the
log
Bit 1 = 1 64 bit effective address is provided by the
log. Offset 0x18 provides the effective address if this bit is
1
64 bit logical address is provided
Bit 2 = 0 64 bit logical address of logical page is not
provided by the log
Bit 2 = 1 64 bit logical address of logical page is
provided by the log. Offset 0x20 provides the logical address
of the page if this bit is 1
Bit 3-4 Reserved
Bit 5-7 Type of UE machine check interrupt. The value of
the field is 0b000 for a permanent UE
0b000 = Platform cannot determine the processor unit that
detected the error
0b001 = Error detected by instruction fetch unit of the
processor
0b010 = Error during page table search for instruction
fetch
0b011 = Error detected by load/store unit of the
processor
0b100 = Error detected during page table search for
load/store type of instruction
All other values are reserved.
0x12
6
Reserved
0x18
8
64 bit effective address
0x20
8
64 bit logical address
SLB Error Information
Offset
Length in Bytes
Description
0x11
1
64 bit effective address is provided
Bit 0 = 0 64 bit effective address not provided by the
log
Bit 0 = 1 64 bit effective address provided by the log.
Offset 0x18 provides the effective address if bit 0 is 1
Bit 1-5 Reserved
Bit 6-7 Type of SLB error
0b00 = Parity error in the SLB array or on the access
path to the SLB
0b01 = Multiple hit error. There are two or more entries
in the SLB that translate the same effective address
0b10 = Multiple hit error or parity error. Platform does
not have enough information to disambiguate between the two
cases.
All other values are reserved.
0x12
6
Reserved
0x18
8
64 bit effective address
0x20
8
Reserved
ERAT Error Information
Offset
Length in Bytes
Description
0x11
1
64 bit effective address is provided
Bit 0 = 0 64 bit effective address not provided by the
log
Bit 0 = 1 64 bit effective address provided by the log.
Offset 0x18 provides the effective address if bit 0 is 1
Bit 1-5 Reserved
Bit 6-7 Type of ERAT error
0b01 = Parity error in the ERAT array
0b10 = Multiple hit error. There are two or more entries
in the ERAT array that translate the same effective
address
0b11 = Multiple hit error or parity error in the ERAT
array. Platform does not have enough information to
disambiguate between the two cases.
All other values are reserved.
0x12
6
Reserved
0x18
8
64 bit effective address
0x20
8
Reserved
TLB Error Information
Offset
Length in Bytes
Description
0x11
1
64 bit effective address is provided
Bit 0 = 0 64 bit effective address not provided by the
log
Bit 0 = 1 64 bit effective address provided by the log.
Offset 0x18 provides the effective address if bit 0 is 1
Bit 1-5 Reserved
Bit 6-7 Type of TLB error
0b01 = Parity error in the TLB array
0b10 = Multiple hit error. There are two or more entries
in the TLB that translate the same effective address
0b11 = Multiple hit error or parity error in the TLB
array. Platform does not have enough information to
disambiguate between the two cases.
All other values are reserved.
0x12
6
Reserved
0x18
8
64 bit effective address
0x20
8
Reserved
For an error log that has the machine check interrupt section
filled out, the platform is not required to provide the date and time
stamp in the main-a section. The fields will be binary zeroes if the date
and time stamp is not provided.
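The machine check tables above can be combined into one decoder, sketched below. The function name `parse_machine_check` and the sample values are hypothetical. The sketch assumes big-endian fields and that bit 0 of the byte at offset 0x11 is the most significant bit, which is the numbering the FRU Identity and Dump Locator flag tables spell out. It reads fields up to offset 0x28 as listed in the tables.

```python
import struct

MC_TYPES = {0x00: "UE", 0x01: "SLB", 0x02: "ERAT",
            0x04: "TLB", 0x05: "D-Cache", 0x07: "I-Cache"}
UE_DETECTORS = {0: "unit not determined", 1: "instruction fetch unit",
                2: "page table search for instruction fetch",
                3: "load/store unit", 4: "page table search for load/store"}
TRANSLATION_ERRORS = {
    0x01: {0: "parity", 1: "multiple hit", 2: "multiple hit or parity"},  # SLB
    0x02: {1: "parity", 2: "multiple hit", 3: "multiple hit or parity"},  # ERAT
    0x04: {1: "parity", 2: "multiple hit", 3: "multiple hit or parity"},  # TLB
}

def parse_machine_check(buf: bytes) -> dict:
    """Decode a Failing Memory Address ('MC') section."""
    if len(buf) < 0x28:
        raise ValueError("fields extend to offset 0x28")
    (sec_id, length, version, subtype, creator, fru_id, proc_id,
     mc_type, detail, ea, la) = struct.unpack_from(">2sHBBHIIBB6xQQ", buf, 0)
    if sec_id != b"MC":
        raise ValueError("not a Failing Memory Address section")
    out = {"type": MC_TYPES.get(mc_type, "unknown"),
           "fru_id": fru_id, "processor_id": proc_id}
    if mc_type == 0x00:
        out["transient"] = bool(detail & 0x80)                     # bit 0
        out["detected_by"] = UE_DETECTORS.get(detail & 0x07, "reserved")  # bits 5-7
        out["effective_address"] = ea if detail & 0x40 else None   # bit 1
        out["logical_address"] = la if detail & 0x20 else None     # bit 2
    elif mc_type in TRANSLATION_ERRORS:
        out["error"] = TRANSLATION_ERRORS[mc_type].get(detail & 0x03, "reserved")
        out["effective_address"] = ea if detail & 0x80 else None   # bit 0
    return out

# Hypothetical sample: transient UE seen by the load/store unit, both addresses given.
# The section length value 32 is copied from the table above.
mc_example = struct.pack(">2sHBBHIIBB6xQQ", b"MC", 32, 1, 0, 0, 7, 3,
                         0x00, 0xE3, 0xC000000012345000, 0x12345000)
mc_info = parse_machine_check(mc_example)
```

Addresses are returned as `None` when the corresponding "provided" bit is clear, since the field contents are not meaningful in that case.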
Platform Event Log Format, Hotplug Section
Platform Error Event Log Format, Version 6, Hotplug Section
Offset
Length in Bytes
Description
0x00
2
Section ID: A two-ASCII character field which uniquely
identifies the type of section. Value = “HP”.
0x02
2
Section length: Length in bytes of the section, including
the section ID.
0x04
1
Section Version
0x05
1
Section subType
0x06
2
Creator Component ID
0x08
1
Hotplug Resource Type.
0x01 = CPU
0x02 = Memory
0x03 = SLOT
0x04 = PHB
0x05 = PCI
0x09
1
Hotplug Action
0x01 = Add
0x02 = Remove
0x0A
1
Hotplug Identifier Type
0x01 = drc name, resource is identified by drc name
0x02 = drc index, resource is identified by drc index
0x03 = drc count, number of resources to act upon
0x04 = drc count indexed, number of resources to act upon beginning at the specified drc index
0x0B
1
Hotplug Event Capability
Bit 0: 1 = Hotplug Token Present
Bit 1: 0 = Transactional Request: When using “drc count” or “drc count indexed” as the Hotplug
Identifier, the OS should take steps to verify the entirety of the request can be satisfied
before proceeding with the hotplug / unplug operations. If only a partial count can be
satisfied, the OS should ignore the entirety of the request. If the OS cannot determine
this beforehand, it should satisfy the hotplug / unplug request for as many of the
requested resources as possible, and attempt to revert to the original OS / DRC state.
Bit 1: 1 = Non-transactional Request: When using “drc count” or “drc count indexed” as the
Hotplug Identifier, the OS should attempt to satisfy as much of the request as possible,
even if it cannot be satisfied for all the DRCs specified.
Bit 2:7: Reserved
0x0C
Variable
Hotplug Identifier
Variable length field depending on the Hotplug Identifier Type specified.
For drc name, this field is a null-terminated ASCII character field containing
the drc name of the resource to hotplug.
For drc index, this is 4 byte field with the drc index of the resource to hotplug.
For drc count, this is a 4 byte field with the number of resources to hotplug.
For drc count indexed, this is two 4 byte fields, the first being the number of resources
to hotplug and the second being the drc index at which to start.
(Section Length - 4)
4
Hotplug Token
Present only if corresponding Hotplug Event Capability bit is set.
Integer value that can be used in conjunction with other fields of the hotplug
event structure (Hotplug Identifier, Hotplug Type, etc.) to allow OS to associate
hotplug event with the request which generated it for the purposes of providing
feedback to the requestor, such as debugging or error information.
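A decoder for the Hotplug section, including the variable-length identifier and the optional token, might look like the sketch below. The function name `parse_hotplug` and the sample values are hypothetical. It assumes big-endian fields and that capability bit 0 is the most significant bit (0x80) and bit 1 is 0x40.

```python
import struct

HP_RESOURCES = {0x01: "CPU", 0x02: "memory", 0x03: "slot", 0x04: "PHB", 0x05: "PCI"}
HP_ACTIONS = {0x01: "add", 0x02: "remove"}

def parse_hotplug(buf: bytes) -> dict:
    """Decode a Hotplug ('HP') section."""
    if len(buf) < 0x0C:
        raise ValueError("Hotplug section needs at least 12 bytes")
    (sec_id, length, version, subtype, creator,
     resource, action, id_type, caps) = struct.unpack_from(">2sHBBHBBBB", buf, 0)
    if sec_id != b"HP":
        raise ValueError("not a Hotplug section")
    has_token = bool(caps & 0x80)          # assumption: bit 0 is mask 0x80
    end = length - 4 if has_token else length
    ident = buf[0x0C:end]
    out = {"resource": HP_RESOURCES.get(resource, "unknown"),
           "action": HP_ACTIONS.get(action, "unknown"),
           "transactional": not (caps & 0x40)}  # bit 1 = 0: transactional
    if id_type == 0x01:
        out["drc_name"] = ident.split(b"\x00", 1)[0].decode("ascii")
    elif id_type == 0x02:
        out["drc_index"] = struct.unpack(">I", ident[:4])[0]
    elif id_type == 0x03:
        out["drc_count"] = struct.unpack(">I", ident[:4])[0]
    elif id_type == 0x04:
        out["drc_count"], out["drc_index"] = struct.unpack(">II", ident[:8])
    if has_token:
        # The token occupies the last 4 bytes: offset (Section Length - 4).
        out["token"] = struct.unpack_from(">I", buf, length - 4)[0]
    return out

# Hypothetical sample: add 2 CPUs starting at a drc index, token present.
hp_example = struct.pack(">2sHBBHBBBBIII", b"HP", 24, 1, 0, 0,
                         0x01, 0x01, 0x04, 0x80, 2, 0x10000000, 0xABCD)
hp_info = parse_hotplug(hp_example)
```

An OS handling a transactional request would check `hp_info["transactional"]` before acting on `drc_count`, so that it can refuse the whole request when only part of it can be satisfied.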