I/O Bridges and Topologies

There will be at least one bridge in a platform which interfaces to the system interconnect on the processor side and to the Peripheral Component Interconnect (PCI) bus on the other. This bridge is called the PCI Host Bridge (PHB).

The architectural requirements on the PHB, as well as other aspects of the I/O structures, PCI bridges, and PCI Express switches, are defined in this chapter.
I/O Topologies and Endpoint Partitioning

As systems get more sophisticated, partitioning of various components of the system will be used, in order to obtain greater Reliability, Availability, and Serviceability (RAS). For example, Dynamic Reconfiguration (DR) allows the removal, addition, and replacement of components from an OS’s pool of resources, without having to stop the operation of that OS. In addition, Logical Partitioning (LPAR) allows the isolation of resources used by one OS from those used by another. This section will discuss aspects of the partitioning of the I/O subsystem. Further information on DR and LPAR can be found in and .

To be useful, the granularity of assignment of I/O resources to an OS
needs to be fairly fine-grained. For example, it is not generally acceptable to
require assignment of all I/O under the same PCI Host Bridge (PHB) to the same
partition in an LPARed system, as that restricts configurability of the system,
including the capability to dynamically move resources between
partitions. (Dynamic LPAR, or DLPAR, is defined by the Logical Resource Dynamic Reconfiguration (LRDR) option. See for more information. Assignment of all IOAs under the same PHB to one partition may be acceptable if that I/O is shared via the Virtual I/O (VIO) capability defined in .)

Being able to partition I/O adapters (IOAs), groups of IOAs, or portions of IOAs for DR, or to assign them to different OSs for LPAR, will generally require some extra functionality in the platform
(for example, I/O bridges and firmware) in order to be able to partition the
resources of these groups, or endpoints, while at the same time preventing any
of these endpoints from affecting another endpoint or getting access to another
endpoint’s resources. These endpoints (that is, I/O subtrees) that can
be treated as a unit for the purposes of partitioning and error recovery will
be called Partitionable Endpoints (PEs), and this concept will be called Endpoint Partitioning. (A “Partitionable Endpoint” in this architecture is not to be confused with what PCI Express defines as an “endpoint.” PCI Express defines an endpoint as “a device with a Type 0x00 Configuration Space header.” That means PCI Express defines any entity with a unique Bus/Dev/Func # as an endpoint. In most implementations, a PE will not exactly correspond to this unit.)

A PE is defined by its Enhanced I/O Error Handling (EEH) domain and
associated resources. The resources that need to be partitioned and not overlap
with other PE domains include:

- The Memory Mapped I/O (MMIO) Load and Store address space which is available to the PE. This is accomplished by using the processor’s Page Table mechanism (through control of the contents of the Page Table Entries) and not having any part of two separate PEs’ MMIO address space overlap into the same 4 KB system page. Additionally, for LPAR environments, the Page Table Entries are controlled by the hypervisor.

- The DMA I/O bus address space which is available to the PE. This is accomplished by a hardware mechanism (in a bridge in the platform) which enforces the correct DMA addresses, and for LPAR, this hardware enforcement is set up by the hypervisor. It is also important that a mechanism be provided for LPAR such that the I/O bus addresses can further be limited at the system level to not intersect, so that one PE cannot get access to a partition’s memory to which it should not have access. The Translation Control Entry (TCE) mechanism, when controlled by the firmware (for example, a hypervisor), is such a mechanism (a sketch of such an enforcement check follows this list). See for more information on the TCE mechanism.

- The configuration address space of the PE, as it is made available to the device driver. This is accomplished through controlling access to a PE’s configuration spaces through Run Time Abstraction Services (RTAS) calls, and for LPAR, these accesses are controlled by the hypervisor.

- The interrupts which are accessible to the PE. An interrupt cannot be shared between two PEs. For LPAR environments, the interrupt presentation and management is via the hypervisor.

- The error domains of the PE; that is, the error containment must be such that a PE error cannot affect another PE or, for LPAR, another partition or OS image to which the PE is not given access. This is accomplished through the use of the Enhanced I/O Error Handling (EEH) option of this architecture. For LPAR environments, the control of EEH is through the hypervisor via several RTAS calls.

- The reset domain: A reset domain contains all the components of a PE. The reset is provided programmatically and is intended to be implemented via an architected (non-implementation-dependent) method (for example, through a Standard Hot Plug Controller in a bridge, or through the Secondary Bus Reset bit in the Bridge Control register of a PCI bridge or switch). Resetting a component is sometimes necessary in order to be able to recover from some types of errors. A PE will equate to a reset domain, such that the entire PE can be reset by the ibm,set-slot-reset RTAS call. For LPAR, the control of the reset from the RTAS call is through the hypervisor.
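As an illustration of the DMA address-space partitioning described in the list above, the following sketch shows the kind of check a hypervisor-controlled TCE mechanism could make before mapping a DMA page for a PE. It is a minimal sketch only; the structure and function names (pe_window, partition_memory, tce_mapping_allowed) are hypothetical and not defined by this architecture.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping structures; real firmware tracks this differently. */
struct pe_window {
    uint32_t owning_partition;   /* partition to which the PE is assigned       */
    uint64_t dma_base;           /* base of the PE's I/O bus DMA address window  */
    uint64_t dma_size;           /* size of that window                          */
};

struct partition_memory {
    uint64_t base;               /* System Memory owned by the partition         */
    uint64_t size;
};

/*
 * Check performed before the firmware writes a TCE on behalf of a partition:
 * the I/O bus address must fall inside the PE's own DMA window, and the
 * target System Memory page must belong to the calling partition.  This is
 * the non-intersecting DMA address space enforcement described above.
 */
static bool tce_mapping_allowed(const struct pe_window *pe,
                                const struct partition_memory *mem,
                                uint32_t caller_partition,
                                uint64_t io_bus_addr,
                                uint64_t system_mem_addr)
{
    if (pe->owning_partition != caller_partition)
        return false;                       /* PE not owned by the caller  */
    if (io_bus_addr < pe->dma_base ||
        io_bus_addr >= pe->dma_base + pe->dma_size)
        return false;                       /* outside the PE's DMA window */
    if (system_mem_addr < mem->base ||
        system_mem_addr >= mem->base + mem->size)
        return false;                       /* not the caller's memory     */
    return true;                            /* safe to write the TCE       */
}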
In addition to the above PE requirements, there may be other requirements on the power domains. Specifically, if a PE is going to participate in DR, including DLPAR, then either the PE is required to be in a power domain which is separate from other PEs (that is, power domain, reset domain, and PE domain all the same), or else the control of that power domain and of PCI Hot Plug (when implemented) of the contained PEs will be via the platform or a trusted platform agent. (To prevent data from being transferred from one partition to another via data remaining in an IOA’s memory, most implementations of DLPAR will require the power cycling of the PE after removal from one partition and prior to assigning it to another partition.) When the control of power for PCI Hot Plug is via the OS, then for LPAR environments, the control is also supervised via the hypervisor.

It is possible to allow several cooperating device drivers to share a
PE. Sharing of a PE between device drivers within one OS image is supported by
the constructs in this architecture. Sharing between device drivers in
different partitions is beyond the scope of the current architecture.

A PE domain is defined by its top-most (closest to the PHB) PCI
configuration address (in the terms of the RTAS calls, the
PHB_Unit_ID_Hi, PHB_Unit_ID_Low, and config_addr ), which will be
called the PE configuration address in this architecture,
and encompasses everything below that in the I/O tree. The top-most PCI bus of
the PE will be called the PE primary bus. Determination
of the PE configuration address is made as described in .

A summary of PE support can be found in . This architecture assumes that there is a single level of bridge within a PE if the PE is heterogeneous (some Conventional PCI and some PCI Express), and these cases are shown by the shaded cells in the table.

PE Support Summary (Conventional PCI and PCI Express)

Function: PE determination (is EEH supported for the IOA?)
  IOA Type: All
  Use the ibm,read-slot-reset-state2 RTAS call.

Function: PE reset
  IOA Type: All
  PE reset is required for all PEs and is activated/deactivated via the ibm,set-slot-reset RTAS call. The PCI configuration address used in this call is the PE configuration address (the reset domain is the same as the PE domain).

Function: ibm,get-config-addr-info2 RTAS call
  IOA Type: PCI Express
  Required to be implemented.

Function: Top of PE domain determination (how to obtain the PE configuration address)
  IOA Type: PCI Express
  The PE configuration address is used as input to the RTAS calls which are used for PE control, namely ibm,set-slot-reset, ibm,set-eeh-option, ibm,slot-error-detail, and ibm,configure-bridge. Use the ibm,get-config-addr-info2 RTAS call to obtain the PE configuration address.

Function: Shared PE determination (is there more than one IOA per PE?)
  IOA Type: PCI Express
  Use the ibm,get-config-addr-info2 RTAS call. If the device driver is written for the shared PE environment, then this may be a don’t care.

Function: PEs per PCI Hot Plug domain and PCI Hot Plug control point
  IOA Type: PCI Express
  May have more than one PE per PCI Hot Plug DR entity, but a PE will be entirely encompassed by the PCI Hot Plug power domain. If more than one PE per DR entity, then PCI Hot Plug control is via the platform or some trusted platform agent.
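The following sketch illustrates how an OS might obtain the PE configuration address with the ibm,get-config-addr-info2 RTAS call. It assumes a Linux-style rtas_token()/rtas_call() interface and the input parameters named in this section (config_addr, PHB_Unit_ID_Hi, PHB_Unit_ID_Low, plus a function number); the function numbers and argument ordering shown are assumptions for illustration only, and the RTAS call definition is authoritative.

#include <stdint.h>

/* Assumed OS-provided primitives (Linux supplies rtas_token()/rtas_call()). */
extern int rtas_token(const char *service);
extern int rtas_call(int token, int nargs, int nret, int *outputs, ...);

/*
 * Illustrative sketch: ask firmware whether the IOA at config_addr under the
 * PHB identified by buid_hi/buid_lo belongs to a PE, and if so obtain the PE
 * configuration address used as input to the PE-control RTAS calls
 * (ibm,set-slot-reset, ibm,set-eeh-option, ibm,slot-error-detail, ...).
 */
static int get_pe_config_addr(uint32_t config_addr,
                              uint32_t buid_hi, uint32_t buid_lo,
                              uint32_t *pe_config_addr)
{
    int token = rtas_token("ibm,get-config-addr-info2");
    int info, rc;

    if (token < 0)
        return -1;                     /* call not implemented             */

    /* Function 1 (assumed): query whether the IOA is part of a PE.        */
    rc = rtas_call(token, 4, 2, &info, config_addr, buid_hi, buid_lo, 1);
    if (rc != 0 || info == 0)
        return -1;                     /* not part of a PE / no EEH        */

    /* Function 0 (assumed): retrieve the PE configuration address.        */
    rc = rtas_call(token, 4, 2, &info, config_addr, buid_hi, buid_lo, 0);
    if (rc != 0)
        return -1;
    *pe_config_addr = (uint32_t)info;
    return 0;
}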
R1--1. All platforms must implement the ibm,get-config-addr-info2 RTAS call.

R1--2. All platforms must implement the ibm,read-slot-reset-state2 RTAS call.

R1--3. For the EEH option: The resources of one PE must not overlap the resources of another PE, including:

- Error domains
- MMIO address ranges
- I/O bus DMA address ranges (when PEs are below the same PHB)
- Configuration space
- Interrupts

R1--4. For the EEH option: All the following must be true relative to a PE:

- An IOA function must be totally encompassed by a PE.
- All PEs must be independently resettable by a reset domain.

Architecture Note: The partitioning of PEs
down to a single IOA function within a multi-function IOA requires a way to
reset an individual IOA function within a multi-function IOA. For PCI, the only
mechanism defined to do this is the optional PCI Express Function Level Reset
(FLR). A platform supports FLR if it supports PCI Express and the partitioning
of PEs down to a single IOA function within a multi-function IOA. When FLR is
supported, if the ibm,set-slot-reset RTAS call uses FLR
for the Function 1/Function 0
(activate/deactivate reset) sequence for an IOA function, then the platform
provides the “ibm,pe-reset-is-flr” property
in the function’s node of the OF device tree. See for more information.

R1--5. The platform must own (be
responsible for) any error recovery for errors that occur outside of all PEs
(for example, in switches and bridges above defined PEs).

Implementation Note: As part of the error recovery of Requirement , the platform
may, as part of the error handling of those errors, establish an equivalent EEH
error state in the EEH domains of all PEs below the error point, in order to
recover the hardware above those EEH domains from its error state. The platform
also returns a PE Reset State of 5 (PE is unavailable)
with a PE Unavailable Info non-zero (temporarily
unavailable) while a recovery is in progress.

R1--6. The platform must own (be responsible for)
fault isolation for all errors that occur in the I/O fabric (that is, down to
the IOA; including errors that occur on that part of the I/O fabric which is
within a PE’s domain).

R1--7. For the EEH option with the PCI Hot Plug option: All of the following must be true:

- If PCI Hot Plug operations are to be controlled by the OS to which the PE is assigned, then the PE domain for the PCI Hot Plug entity and the PCI Hot Plug power domain must be the same.
- All PE domains must be totally encompassed by their respective PCI Hot Plug power domain, regardless of the entity that controls the PCI Hot Plug operation.

R1--8. All platforms that implement the
EEH option must enable that option by default for all PEs.

Implementation Notes:

- See and for EEH requirements relative to LPAR.
- Defaulting to EEH enabled, as required by Requirement , does not imply that the platform has no responsibility in assuring that all device drivers are EEH enabled or EEH safe before allowing the Bus Master, Memory Space, or I/O Space bits in the PCI configuration Command register of their IOA to be set to a 1. Furthermore, even though a platform defaults its EEH option to enabled, as required by Requirement , this does not imply that the platform cannot disable EEH for a PE. See Requirement for more information.

The following two figures show some examples of the concept of Endpoint Partitioning. See also for more information on the EEH option.

PCI Host Bridge (PHB) Architecture

The PHB architecture places certain requirements on PHBs. There
should be no conflict between this document and the PCI specifications, but if
there is, the PCI documentation takes precedence. The intent of this
architecture is to provide a base architectural level which supports the PCI
architecture and to provide optional constructs which allow for use of 32-bit
PCI IOAs in platforms with greater than 4 GB of system addressability. R1--1.All PHBs that implement
conventional PCI must be compliant with the most recent version of the
at the time of their design, including any
approved Engineering Change Requests (ECRs) against that document. R1--2.All PHBs that
implement PCI-X must be compliant with the most
recent version of the at the time of
their design, including any approved Engineering Change Requests (ECRs) against
that document. R1--3.All PHBs that
implement PCI Express must be compliant with the
most recent version of the at the
time of their design, including any approved Engineering Change Requests (ECRs)
against that document.R1--4.All requirements
defined in for HBs must
be implemented by all PHBs in the platform.

PHB Implementation Options

There are a few implementation options when it comes to
implementing a PHB. Some of these become requirements, depending on the
characteristics of the system for which the PHB is being designed. The options
affecting PHBs include the following:

- The Enhanced I/O Error Handling (EEH) option enhances RAS characteristics of the I/O and allows for smaller granularities of I/O assignments to partitions in an LPAR environment.
- The Error Injection (ERRINJCT) option enhances the testing of the I/O error recovery code. This option is required of bridges which implement the EEH option.

R1--1. All PHBs for use in platforms which implement LPAR must support EEH, in support of the virtualization requirements in and .

R1--2. All PCI HBs designed for use in platforms which will support PCI Express must support the PCI extended configuration address space and the MSI option.

PCI Data Buffering and Instruction Queuing

Some PHB
implementations may include buffers or queues for DMA,
Load, and Store operations. These buffers are required to be transparent to the
software with only certain exceptions, as noted in this section. Most
processor accesses to System Memory go through the processor data cache. When
sharing System Memory with IOAs, hardware must maintain consistency with the
processor data cache and the System Memory, as defined by the requirements in
.R1--1.PHB implementations which
include buffers or queues for DMA, Load, and
Store operations must make sure that these are transparent to the
software, with a few exceptions which are allowed by the PCI architecture, by
, and by .

R1--2. PHBs must accept MMIO Loads of up to 128 bytes, and must do so without compromising performance of other operations.

PCI Load and Store Ordering

For the platform Load and
Store ordering requirements, see
and the
appropriate PCI specifications (per Requirements
,
, and
). Those requirements will, for most
implementations, require strong ordering (single threading) of all
Load and Store operations through the PHB,
regardless of the address space on the PCI bus to which they are targeted.
Single threading through the PHB means that processing a
Load requires that the PHB wait on the Load
response data of a Load issued on the PCI bus prior to
issuing the next Load or Store on
the PCI bus.

PCI DMA Ordering

For the platform DMA ordering requirements, see the requirements in
this section, in , and the appropriate
PCI specifications (per Requirements
,
, and
).In general, the ordering for DMA path operations from the I/O bus
to the processor side of the PHB is independent from the
Load and Store path, with the exception stated
in Requirement . Note that in the
requirement, below, a read request is the initial request
to the PHB and the read completion is the data phase of
the transaction (that is, the data is returned).

R1--1. (Requirement Number Reserved For Compatibility)

R1--2. (Requirement Number Reserved For Compatibility)

R1--3. (Requirement Number Reserved For Compatibility)

R1--4. The hardware must make sure that
a DMA read request from an IOA that specifies any byte address that has been
written by a previous DMA write operation (as defined by the untranslated PCI
address) does not complete before the DMA write from the previous DMA write is
in the coherency domain.

R1--5. (Requirement Number Reserved For Compatibility)

R1--6. The hardware must make sure that
all DMA write data buffered from an IOA, which is destined for system memory,
is in the platform’s coherency domain prior to delivering data from a
Load operation through the same PHB which has come after
the DMA write operation(s).

R1--7. The hardware must make sure that
all DMA write data buffered from an IOA, which is destined for system memory,
is in the platform’s coherency domain prior to delivering an MSI from
that same IOA which has come after the DMA write operation(s).

Architecture Notes:

Requirement clarifies (and may tighten up) the PCI architecture requirement that the read be to the “just-written” data.

The address comparison for determining whether the address of the data being read and the address of the data being written are in the same cache line is based on the PCI address and not a TCE-translated address. This
says that the System Memory cache line address will be the same also, since the
requirement is directed towards operations under the same PHB. However, use of
a DMA Read Request and DMA Write Request that use different PCI addresses (even
if they hit the same System Memory address) are not required to be kept in
order (see Requirement ). So, for
example, Requirement says that split
PHBs that share the same data buffers at the system end do not have to keep DMA
Read Request following a DMA Write Request in order when they do not traverse
the same PHB PCI bus (even if they get translated to the same system address)
or when they originate on the same PCI bus but have different PCI bus addresses
(even if they get translated to the same system address).

Requirement is the only case where the Load and Store paths are coupled to the DMA data path. This requirement guarantees that the software has a method for forcing DMA write data out of any buffers in the path during the servicing of a completion interrupt from the IOA. Note that the IOA can perform the flush prior to the completion interrupt, via Requirement . That is, the IOA can issue a read request to the last word written and wait for the read completion data to return. When the read is complete, the data will have arrived at the destination. In addition, the use of MSIs, instead of LSIs, allows for a programming model for IOAs where the interrupt signalling itself pushes the last DMA write to System Memory, prior to the signalling of the interrupt to the system (see Requirement ).
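For illustration, the flushing Load described in the note above looks roughly as follows in a device driver servicing an LSI completion interrupt. The accessor and register names are illustrative only; with MSIs the explicit Load is unnecessary because the interrupt itself pushes the preceding DMA writes.

#include <stdint.h>

/* Illustrative MMIO accessor; a real OS supplies its own (for example, readl()). */
static inline uint32_t mmio_load32(const volatile uint32_t *reg)
{
    return *reg;                      /* a Load through the PHB                  */
}

struct my_ioa {
    volatile uint32_t *status_reg;    /* any MMIO register of the IOA            */
    volatile uint32_t *dma_buffer;    /* System Memory written by the IOA's DMA  */
};

/*
 * Sketch of an LSI completion handler: per the DMA-flush requirement above,
 * a Load from the IOA through the same PHB does not complete until the IOA's
 * previously buffered DMA write data is in the coherency domain, so the
 * DMA'd data may be consumed safely after the Load returns.
 */
static uint32_t completion_handler(struct my_ioa *ioa)
{
    (void)mmio_load32(ioa->status_reg);   /* flushing Load                        */
    return ioa->dma_buffer[0];            /* now safe to consume the DMA'd data   */
}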
A DMA read operation is allowed to be processed prior to the completion of a previous DMA read operation, but is not required to be.

PCI DMA Operations and Coherence

The I/O is not aware of the setting of the coherence required bit when performing operations to System Memory, and so the PHB needs to assume that coherency is required.

R1--1. I/O transactions to System Memory through a PHB must be made with coherency required.

Byte Ordering Conventions

LoPAR platforms operate with either Big-Endian (BE) or
Little-Endian addressing. In Big-Endian systems, the address of a word in
memory is the address of the most significant byte (the “big”
end) of the word. Increasing memory addresses will approach the least
significant byte of the word. In Little-Endian (LE) addressing, the address of
a word in memory is the address of the least significant byte (the
“little” end) of the word. See also
.

The PCI bus itself can be thought of as not inherently having an endianness associated with it (although its numbering convention indicates LE). It is the IOAs on the PCI bus that can be thought of as having endianness associated with them. Some PCI IOAs will contain a mode bit to allow them to appear as either a BE or LE IOA. Some IOAs will even have multiple mode bits; one for each data path (Load and Store versus DMA). In addition, some IOAs may have multiple concurrent apertures, or address ranges, where the IOA can be accessed as an LE IOA in one aperture and as a BE IOA in another.

R1--1. When the processor is operating in the Big-Endian mode, the platform design must produce the results indicated in while issuing Load and Store operations to various entities with various endianness.

R1--2. When performing DMA operations through a
PHB, the platform must not modify the data during the transfer process; the
lowest addressed byte in System Memory being transferred to the lowest
addressed byte on the PCI bus, the second byte in System Memory being
transferred as the second byte on the PCI bus, and so on.
Big-Endian Mode Load and Store Programming Considerations

Destination: BE scalar entity (for example, a TCE or a BE register in a PCI IOA)
  Transfer Operation: Load or Store

Destination: LE scalar entity (for example, an LE register in a PCI IOA or the PCI Configuration Registers)
  Transfer Operation: Load or Store Reverse
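As a minimal illustration of the table above, the sketch below shows the two access styles a Big-Endian processor uses: plain Loads and Stores for BE entities, and byte-reversed accesses (on Power, for example, the lwbrx/stwbrx Load/Store Reverse instructions) for LE entities such as an LE IOA register or the PCI configuration registers. The accessor names are illustrative only.

#include <stdint.h>

/* Plain access: used for BE scalar entities (for example, a TCE or a BE register). */
static inline uint32_t mmio_load32_be(const volatile uint32_t *reg)
{
    return *reg;
}

/*
 * Byte-reversed access: used for LE scalar entities (for example, an LE register
 * or the PCI configuration registers) when the processor runs Big-Endian.  On
 * Power this corresponds to the Load/Store Reverse instructions; here the byte
 * swap is written out explicitly.
 */
static inline uint32_t mmio_load32_le(const volatile uint32_t *reg)
{
    uint32_t v = *reg;
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}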
PCI Bus Protocols

This section details the items from the
,
, and
documents where there is
variability allowed, and therefore further specifications, requirements, or
explanations are needed. Specifically, details
specific PCI Express options and the requirements for usage of such in LoPAR
platforms. These requirements will drive the design of PHB implementations. See
the for more information.
PCI Express Optional Feature Usage in LoPAR Platforms

Usage Legend: NS = Not Supported; O = Optional (see also Description); OR = Optional but Recommended; R = Required; SD = See Description. Usage is listed as Base / IBM Server.

Peripheral I/O Space (SD / SD): Required if the platform is going to support any Legacy I/O devices, as defined by the , otherwise support not required. The expectation is that Legacy I/O device support by PHBs will end soon, so platform designers should not rely on this being there when choosing I/O devices.

64-bit DMA addresses (O / OR SD): Implementation is optional, but is expected to be needed in some platforms, especially those with more complex PCI Express fabrics. Although the “ibm,dma-window” property can implement 64-bit addresses, some OSs and Device Drivers may not be able to handle values in the “ibm,dma-window” property that are greater than or equal to 4 GB. Therefore, it is recommended that 64-bit DMA addresses be implemented through the Dynamic DMA Window option (see ).

Advanced Error Reporting (AER) (O / R SD): This has implications in the IOAs selected for use in the platform, as well as the PHB and firmware implementation. See the .

PCIe Relaxed Ordering (RO) and ID-Based Ordering (IDO) (NS / NS): Enabling either of these options could allow DMA transactions that should be dropped by an EEH Stopped State to get to the system before the EEH Stopped State is set, and therefore these options are not to be enabled. Specifically, either of these could allow DMA transactions that follow a DMA transaction in error to bypass the PCI Express error message signalling an error on a previous packet. Platform Implementation Note: It is permissible for the platform (for example, the PHB or the nest) to re-order DMA transactions that it knows can be re-ordered -- such as DMA transactions that come from different Requester IDs or come into different PHBs -- as long as the ordering with respect to error signalling is met.

5.0 GT/s signalling (Gen 2) (O / OR)

8 GT/s signalling (Gen 3) (O / OR)

TLP Processing Hints (O / O): If implemented, it must be transparent to OSs.

Atomic Operations (32 and 64 bit) (O / OR SD): May be required if the IOAs being supported require it. May specifically be needed for certain classes of IOAs such as accelerators.

Atomic Operations (128 bit) (O / OR SD): When 128 bit Atomic Operations are supported, 32 and 64 bit Atomic Operations must also be supported.

Resizable BAR (O / O)

Dynamic Power Allocation (DPA) (NS / NS): No support currently defined in LoPAR.

Latency Tolerance Reporting (LTR) (NS / NS): No support currently defined in LoPAR.

Optimized Buffer Flush/Fill (OBFF) (NS / NS): No support currently defined in LoPAR.

PCIe Multicast (NS / NS): No support currently defined in LoPAR.

Alternative Routing ID Interpretation (ARI) (O / SD): Required when the platform will support PCI Express IOV IOAs.

Access Control Services (ACS) (SD / SD): It is required that peer to peer operation between IOAs be blocked when LPAR is implemented and those IOAs are assigned to different LPAR partitions. For switches below a PHB, when the IOA functions below the switch may be assigned to different partitions, this blocking is provided by ACS in the switch. This is required even in Base platforms, if the above conditions apply.

Function Level Reset (FLR) (SD / SD): Required when a PE consists of something other than a full IOA; for example, if each function of a multi-function IOA is in its own PE. An SR-IOV Virtual Function (VF) may be one such example.

End-to-End CRC (ECRC) (O / R SD): This has implications in the IOAs selected for use in the platform, as well as the PHB and firmware implementation. See the .

Internal Error Reporting (OR SD / OR SD): Implement where appropriate. Platforms need to consider this for platform switches, also. PHBs may report internal errors to firmware using a different mechanism outside of this architecture.

Address Translation Services (ATS) (NS / NS): LoPAR does not support ATS, because the invalidation and modification of the Address Translation and Protection Table (ATPT) -- called the TCEs in LoPAR -- is a synchronous operation, whereas the ATS invalidation requires a more asynchronous operation.

Page Request Interface (PRI) (NS / NS): Requires ATS, which is not supported by LoPAR.

Single Root I/O Virtualization (SR-IOV) (O / OR): It is likely that most server platforms will need to be enabled to use SR-IOV IOAs.

Multi-Root I/O Virtualization (MR-IOV) (SD / SD): Depending on how this is implemented, an MR-IOV device is likely to look like an SR-IOV device to an OS (with the platform hiding the Multi-root aspects). PHBs may be MR enabled or the MR support may be through switches external to the PHBs.
Programming Model

Normal memory mapped Load and Store instructions are used to access
a PHB’s facilities or PCI IOAs on the I/O side of the PHB.
defines the addressing model.
Addresses of IOAs are passed by OF via the OF device tree.

R1--1. If a PHB defines any registers that are
outside of the PCI Configuration space, then the address of those registers
must be in the Peripheral Memory Space or Peripheral I/O Space for that PHB, or
must be in the System Control Area.

PCI master DMA transfers refer to data transfers between a PCI master IOA and
another PCI IOA, or System Memory, where the PCI master IOA supplies the
addresses and controls all aspects of the data transfer. Transfers from a PCI
master to the PCI I/O Space are essentially ignored by a PHB (except for
address parity checking). Transfers from a PCI master to PCI Memory Space are
either directed at PCI Memory Space (for peer to peer operations) or need to be
directed to the host side of the PHB. DMA transfers directed to the host side
of a PHB may be to System Memory or may be to another IOA via the Peripheral
Memory Space of another HB. Transfers that are directed to the Peripheral I/O
Space of another HB are considered to be an addressing error (see
). For information about decoding these address spaces
and the address transforms necessary, see
.

Peer-to-Peer Across Multiple PHBs

This architecture does not architect peer-to-peer traffic between
two PCI IOAs when the operation traverses multiple PHBs.

R1--1. The platform must prevent Peer-to-Peer operations that would cross multiple PHBs.

Dynamic Reconfiguration of I/O

Disconnecting or connecting an I/O subsystem while the system is
operational and then having the new configuration be operational, including any newly added subsystems, is a subset of Dynamic Reconfiguration (DR).

Some platforms may also support plugging/unplugging of PCI IOAs
while the system is operational. This is another subset of DR.

DR is an option and as such, is not required by this architecture.
Attempts to change the hardware configuration on a platform that does not
enable configuration change, whose OS does not support that configuration
change, or without the appropriate user configuration change actions may
produce unpredictable results (for example, the system may crash).

PHBs in platforms that support the PCI Hot Plug Dynamic
Reconfiguration (DR) option may have some unique design considerations. For
information about the DR options, see .

Split Bridge Implementations

In some platforms the PHB may be split into two pieces, separated
by a cable or fiber optics. The piece that is connected to the system bus (or
switch) and which generates the interconnect is called the Hub. There are
several implications of such implementations and several requirements to go
along with these.

Coherency Considerations with IOA to IOA Communications via System Memory

Bridges which are split across multiple chips may introduce a large
enough latency between the time DMA write data is accepted by the PHB and the
time that previously cached copies of the same System Memory locations are
invalidated, and this latency needs to be taken into consideration in designs,
as it can introduce the problems described below. This is not a problem if the
same PCI address is used under a single PHB by the same or multiple IOAs, but
can be a problem under any of the following conditions:

- The same PCI address is used by different IOAs under different PHBs.
- Different PCI addresses are used which access the same System Memory coherency block, regardless of whether the IOA(s) are under the same PHB or not; for example, two different TCEs accessing the same System Memory coherency block.

An example scenario where this could be a problem is as follows:

1. Device 1 does a DMA read from System Memory address x using PCI address y.
2. Device 2 (under the same PHB as Device 1 -- the devices could even be different functions in the same IOA) does a DMA write to System Memory address x using PCI address z.
3. Device 2 attempts to read back System Memory address x before the time that its previous DMA write is globally coherent (that is, before the DMA write gets to the Hub and an invalidate operation on the cache line containing that data gets back down to the PHB), and gets the data read by Device 1 rather than what it just wrote.

Another example scenario is as follows:

1. Device 1 under PHB 1 does a DMA read from System Memory location x.
2. Device 2 under PHB 2 does a DMA write to System Memory location x and signals an interrupt to the system.
3. The interrupt bypasses the written data which is on its way to the coherency domain.
4. The device driver for Device 2 services the interrupt and signals Device 1 via a Store that the data is there at location x.
5. Device 1 sees the Store before the invalidate operation on the cache line containing the data propagates down to invalidate the previous cached copy of x, and does a DMA read of location x using the same address as in step (1), getting the old copy of x instead of the new copy.

This last example is a little far-fetched since the propagation
times should not be longer than the interrupt service latency time, but it is
possible. In this example, the device driver should do a Load to Device 2 during the servicing of the interrupt and wait for the Load results before trying to signal Device 1, just as it would if the consumer of the written data were a program rather than another IOA. Note that this scenario can also be avoided if the IOA uses a PCI Message Signalled Interrupt (MSI) rather than the PCI interrupt signal pins to signal the interrupt (in which case the Load operation is avoided).

R1--1. A DMA read to a PCI address
which is different than a PCI address used by a previous DMA write or which is
performed under a different PHB must not presume that a previous DMA write is
complete, even if the DMA write is to the same System Memory address, unless
one of the following is true:

- The IOA doing the DMA write has followed that write by a DMA read to the address of the last byte of DMA write data to be flushed (the DMA read request must encompass the address of the last byte written, but does not need to be limited to just that byte) and has waited for the results to come back before an IOA is signaled (via peer-to-peer operations or via software) to perform a DMA read to the same System Memory address.
- The device driver for the IOA doing the DMA write has followed that write by a Load to that IOA and has waited for the results to come back before a DMA read to the same System Memory address with a different PCI address is attempted.
- The IOA doing the DMA write has followed the write with a PCI Message Signalled Interrupt (MSI) as a way to interrupt the device driver, and the MSI message has been received by the interrupt controller.

I/O Bus to I/O Bus Bridges

The PCI bus architecture was designed to allow for bridging to other
slower speed I/O buses or to another PCI bus. The requirements when bridging
from one I/O bus to another I/O bus in the platform are defined below.

R1--1. All bridges must comply with the bus specification(s) of the buses to which they are attached.

What Must Talk to What

Platforms are not required to support peer to peer operations
between IOAs. IOAs on the same shared bus segment will generally be able to do
peer to peer operations between themselves. Peer to peer operations in an LPAR
environment, when the operations are between IOAs that are not in the same
partition, is specifically prohibited (see Requirement
).

PCI to PCI Bridges

This architecture allows the use of PCI to PCI bridges and
PCI Express switches in the platform. TCEs are used with the IOAs attached to
the other side of the PCI to PCI bridge or PCI Express switch when those IOAs
are accessing something on the processor side of the PHB. After configuration,
PCI to PCI bridges and PCI Express switches are basically transparent to the
software as far as addressing is concerned (the exception is error handling).
For more information, see the appropriate PCI Express switch
specification.R1--1.Conventional PCI to PCI bridges used on
the base platform and plug-in cards must be compliant with the most recent
version of the at the time of the
platform design, including any approved Engineering Change Requests (ECRs)
against that document. PCI-X to PCI-X bridges used on the base platform and
plug-in cards must be compliant with the most recent version of the
at the time of the platform design,
including any approved Engineering Change Requests (ECRs) against that
document.R1--2.PCI Express to PCI/PCI-X and PCI/PCI-X to
PCI Express bridges used on the base platform and plug-in cards must be
compliant with the most recent version of the
at the time of the platform design,
including any approved Engineering Change Requests (ECRs) against that
document. R1--3.PCI Express switches used on the base
platform and plug-in cards must be compliant with the most recent version of
the at the time of the platform
design, including any approved Engineering Change Requests (ECRs) against that
document.R1--4.Bridges
and switches used in platforms which will support PCI
Express IOAs beneath them must support pass-through of PCI configuration cycles
which access the PCI extended configuration space.

Software and Platform Implementation Notes:

- Bridges used on plug-in cards that do not follow Requirement
will presumably allow for the
operation of their IOAs on the plug-in card, even though not supporting the PCI
extended configuration address space, because the card was designed with the
bridges and IOAs in mind.

- Determination of support of the PCI configuration address space is via the “ibm,pci-config-space-type” property in the IOA's node.

R1--5. Bridges and switches used in platforms
which will support PCI Express IOAs beneath them must support 64-bit
addressing.

Bridge Extensions

Enhanced I/O Error Handling (EEH) Option

The EEH option uses the following terminology.

PE: A Partitionable Endpoint. This refers to the granule that is treated as one for purposes of EEH recovery and for assignment to an OS image (for example, in an LPAR environment). Note that the PE granularity supported by the hardware may be finer than is supported by the firmware. See also . A PE may be any one of the following: a single-function or multi-function IOA, or a set of IOAs and some piece of I/O fabric above the IOAs that consists of one or more bridges or switches.

EEH Stopped state: The state of a PE being in both the MMIO Stopped state and DMA Stopped state.

MMIO Stopped state: The state of the PE which will discard any MMIO Stores to that PE, and will return all-1's data for Loads to that PE. If the PE is in the MMIO Stopped state and EEH is disabled, then a Load will also return a machine check to the processor that issued the Load, for the Load that had the initial error and while the PE remains in the MMIO Stopped state.

DMA Stopped state: The state of the PE which will block any further DMA requests from that PE (DMA completions that occur after the DMA Stopped state is entered that correspond to DMA requests that occurred before the DMA Stopped state is entered may be completed).

Failure: A detected error between the PE and the system (for example, processor or memory); errors internal to the PE are not considered failures unless the PE signals the error via a normal I/O fabric error signalling protocol (for example, SERR or ERR_FATAL).

The Enhanced I/O Error Handling (EEH) option is defined primarily
to enhance the system recoverability from failures that occur during
Load and Store operations. In addition,
certain failures that are normally non-recoverable during DMA are prevented
from causing a catastrophic failure to the system (for example, a conventional
PCI address parity error).

The basic concept behind the EEH option is to turn all failures that cannot be reported to the IOA into something that looks like a conventional PCI Master Abort (MA) error on a Load or Store operation to the PE during and after the failure; responding with all-1’s data and no error indication on a Load instruction and ignoring Store instructions. (A conventional PCI MA error is where the conventional PCI IOA does not respond as a target with a device select indication; that is, the IOA does not respond by activating the DEVSEL signal back to the master. For PCI Express, the corresponding error is Unsupported Request (UR).) The MA error should be handled by a device driver, so this approach should just be an extension to what should be the
error handling without this option implemented.

The following is the general idea behind the EEH option:

- On a failure that occurs in an operation between the PHB and PE:
  - Put the PE into the MMIO Stopped and DMA Stopped states (also known as the EEH Stopped state). This is defined as a state where the PE is prevented from doing any further operations that could corrupt the system, which for the most part means blocking DMA from the PE and preventing load and store completions to the PE.
  - While the PE is in the MMIO Stopped state, if a Load or Store is targeted to that PE, then return all-1’s data with no error indication on a Load and discard all Stores to that PE. That is, essentially treat the Load or Store the same way as if an MA error was received on that operation.
- The device driver and OS recover a PE by removing it from the MMIO Stopped state (keeping it in the DMA Stopped state) and doing any necessary Loads to the PE to capture PE state, and then either doing the necessary Stores to the PE to set the appropriate state before removing the PE from the DMA Stopped state and continuing operations, or doing a reset of the PE and then re-initializing and restarting the PE. (Most device drivers will implement a reset and restart in order to assure a clean restart of operations.)
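The reset-and-restart style of recovery can be sketched from the OS side roughly as follows, assuming a Linux-style rtas_token()/rtas_call() interface and a hypothetical driver_restart() callback. The ibm,set-slot-reset function numbers used (1 to activate the reset, 0 to deactivate it) are those referenced in this chapter, but the argument layout shown is an assumption, not a normative definition; the alternative of releasing the MMIO and DMA Stopped states without a reset uses ibm,set-eeh-option, as described in the requirements that follow.

#include <stdint.h>
#include <stddef.h>

extern int rtas_token(const char *service);
extern int rtas_call(int token, int nargs, int nret, int *outputs, ...);
extern void msleep(unsigned int ms);              /* OS-provided delay     */

/* Hypothetical driver callback that re-initializes and restarts the IOA(s). */
extern int driver_restart(uint32_t pe_config_addr);

/*
 * Rough sketch of recovering one PE after it enters the EEH Stopped state:
 * activate the PE reset, deactivate it (which also releases the MMIO and
 * DMA Stopped states), then re-initialize and restart the IOA(s).
 */
static int eeh_reset_and_restart(uint32_t pe_config_addr,
                                 uint32_t buid_hi, uint32_t buid_lo)
{
    int set_reset = rtas_token("ibm,set-slot-reset");
    int rc;

    /* Function 1: activate the reset signal to the PE. */
    rc = rtas_call(set_reset, 4, 1, NULL, pe_config_addr, buid_hi, buid_lo, 1);
    if (rc)
        return rc;
    msleep(100);                       /* hold reset as the IOA requires    */

    /* Function 0: deactivate the reset signal to the PE. */
    rc = rtas_call(set_reset, 4, 1, NULL, pe_config_addr, buid_hi, buid_lo, 0);
    if (rc)
        return rc;

    /* The PE is now out of the MMIO/DMA Stopped states; restart the IOA(s). */
    return driver_restart(pe_config_addr);
}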
In order to make sure that there are no interactions necessary between device drivers during recovery operations, each PE will have the capability of being removed from its MMIO Stopped and DMA Stopped states independent from any other PE which is in the MMIO Stopped or DMA Stopped state. In order to take into account device drivers which do not correctly implement MA recovery, make sure that the EEH option can be enabled and disabled independently for each PE. LPAR implementations limit the capability of running with EEH disabled (see Requirement and Requirement ).

EEH, as defined, only extends to operations between the processor
and a PE and between a PE and System Memory. It does not extend to direct IOA
to IOA peer to peer operations.

Hardware changes for this option are detailed in the next section. RTAS changes required are detailed in .

EEH Option Requirements

Although the EEH option architecture may be extended to other I/O
topologies in the future, for now this recovery architecture will be limited to
PCI. In order to be able to test the additional device driver code for the EEH-enabled case, the EEH option also requires that the Error Injection option be implemented concurrently.

The additional requirements on the hardware for this option are as follows. For the RTAS requirements for this option, see .

R1--1. For the EEH option:
A platform must implement the Error Injection option concurrently
with the EEH option, with an error injection granularity to the PE level.R1--2.For the EEH option:
If a platform is going to implement the EEH option, then the I/O
topology implementing EEH must only consist of PCI components.R1--3.For the EEH option:
The hardware must provide a way to independently enable and disable
the EEH option for each PE with normal processor Load
and Store
instructions, and must provide the capability of doing this while not
disturbing operations to other PEs in the platform.R1--4.For the EEH option: The hardware
fault isolation register bits must be set the same way on errors when the EEH
option is enabled as they were when the EEH option is not implemented or when
it is implemented but disabled.R1--5.For the EEH option: Any
detected failure to/from a PE must set both the MMIO Stopped and DMA Stopped
states for the PE, unless the error that caused the failure can be reported to
the IOA in a way that the IOA will report the error to its device driver in a
way that will avoid any data corruption.R1--6.For the EEH
option: If an
I/O fabric consists of a hierarchy of components, then when a failure is
detected in the fabric, all PEs that are downstream of the failure must enter
the MMIO Stopped and DMA Stopped states if they may be affected by the
failure.R1--7.For the EEH
option: While a PE has its EEH option enabled, if a failure occurs,
the platform must not propagate it to the system as any type of error (for
example, as an SERR for a PE which is a conventional PCI-to-PCI bridge).R1--8.For the EEH option:
From the time that the MMIO Stopped state is entered for a PE, the
PE must be prevented from responding to Load and Store operations including the
operation that caused the PE to enter the MMIO Stopped state; a Load operation
must return all-1’s with no error indication and a Store operation must
be discarded (that is, Load and Store
operations being treated like they received a conventional PCI
Master Abort error), until one of the following is true:

- The ibm,set-eeh-option RTAS call is called with function 2 (Release PE for MMIO Load/Store operations).
- The ibm,set-slot-reset RTAS call is called with function 0 (Deactivate the reset signal to the PE).
- The power is cycled (off then on) to the PE.
- The partition or system is rebooted.

R1--9. For the EEH option: From the time that the DMA Stopped state is entered for a PE, the PE must be prevented from initiating a new DMA request or completing a DMA request that caused the PE to enter the DMA Stopped state (DMA requests that were started before the DMA Stopped State is entered may be completed), including MSI DMA operations, until one of the following is true:

- The ibm,set-eeh-option RTAS call is called with function 3 (Release PE for DMA operations).
- The ibm,set-slot-reset RTAS call is called with function 0 (Deactivate the reset signal to the PE).
- The power is cycled (off then on) to the PE.
- The partition or system is rebooted.

R1--10. For the EEH option:
The hardware must provide the capability to the firmware to
determine, on a per-PE basis, that a failure has occurred which has caused the
PE to be put into the MMIO Stopped and DMA Stopped states and to read the
actual state information (MMIO Stopped state and DMA Stopped state).R1--11.For the EEH option:
The hardware must provide the capability of separately enabling and
resetting the DMA Stopped and MMIO Stopped states for a PE without disturbing
other PEs on the platform. The hardware must provide this capability without
requiring a PE reset and must do so through normal processor
Store instructions.R1--12.For the EEH option: The hardware
must provide the capability to the firmware to deactivate the reset to each PE,
independent of other PEs, and the hardware must provide the proper controls on
the reset transitions in order to prevent failures from being introduced into
the platform by the changing of the reset.R1--13.For the EEH option: The hardware
must provide the capability to the firmware to activate the reset to each PE,
independent of other PEs, and the hardware must provide the proper controls on
the reset transitions in order to prevent failures from being introduced into
the platform by the changing of the reset.R1--14.For the EEH option:
The hardware must provide the capability to the firmware to read
the state of the reset signal to each PE.R1--15.For the EEH option: When a PE is
put into the MMIO Stopped and DMA Stopped states, it must be done in such a way
to not introduce failures that may corrupt other parts of the platform.R1--16.For the EEH option:
The hardware must allow firmware access to internal bridge and I/O
fabric control registers when any or all of the PEs are in the MMIO Stopped
state.

Platform Implementation Note: It is expected that bridge and fabric control registers will have their own PE state separate from the PEs for IOAs.

R1--17. For the EEH option: A PE that supports the EEH option must not share an interrupt with another PE in the platform.

Hardware Implementation Notes:

Requirement means that
the hardware must always update the standard PCI error/status registers in the
bus’ configuration space as defined by the bus architecture, even when
the EEH option is enabled.

The type of error information trapped by the hardware when a PE is placed into the MMIO Stopped and DMA Stopped states is implementation dependent. It is expected that the system software will do a check-exception or ibm,slot-error-detail RTAS call to gather the error information when a failure is detected.

A DMA operation (Read or Write) that was initiated before a Load,
Store, or DMA error, does not
necessarily need to be blocked, as it was not a result of the
Load, Store, or DMA that failed. The normal
PCI Express ordering rules require that an ERR_FATAL or ERR_NONFATAL from a
failed Store or DMA error, or a
Load Completion with error status, will reach the PHB prior to any
DMA that might have been kicked-off in error as a result of a failed
Load or Store or a Load
or Store that follows a failed Load
or Store. This means that as long as the PHB processes
an ERR_FATAL, ERR_NONFATAL, or Load Completion which
indicates a failure, prior to processing any more DMA operations or
Load Completions, and puts the PE into the MMIO Stopped and DMA
Stopped states, implementations should be able to block DMA operations that
were kicked-off after a failing DMA operation and allow DMA operations that
were kicked off before a failing DMA operation without violating the normal PCI
Express ordering rules.

In reference to Requirements
, and
, PCI Express implementations may choose to
enter the MMIO Stopped and DMA Stopped states even if an error can be reported
back to the IOA.

R1--18. For the EEH option: If the device driver(s) for any IOA(s) in a PE in the platform are EEH unaware (that is, may produce data integrity exposures due to an MMIO Stopped or DMA Stopped state), then the firmware must prevent the IOA(s) in such a PE from being enabled for operations (that is, not allow the Bus Master, Memory Space, or I/O Space bits in the PCI configuration Command register to be set to a 1) while EEH is enabled for that PE. Instead of preventing the PE from being enabled, the firmware may instead turn off EEH when such an enable is attempted without a prior attempt by the device driver to enable EEH (by ibm,set-eeh-option), providing such EEH disablement does not violate any other requirement for EEH enablement (for example, Requirement or ).

Software Implementation Note: To be EEH aware, a device driver does not need to be able to recover from an MMIO Stopped or DMA Stopped state, only recognize the all-1's condition and not use data from operations that may have occurred since the last all-1's checkpoint. In addition, the device driver under such failure circumstances needs to turn off interrupts (using the ibm,set-int-off RTAS call or by resetting the PE and keeping it reset with ibm,set-slot-reset or ibm,slot-error-detail) in order to make sure that any (unserviceable) interrupts from the PE do not affect the system. Note that this is the same device driver support needed to protect against an IOA dying or against a no-DEVSEL type error (which may or may not be the result of an IOA that has died).
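A hedged sketch of the all-1’s check an EEH-aware device driver performs is shown below: after reads whose values could legitimately be all-1’s, the driver reads a register that can never be all-1’s in normal operation before trusting data gathered since its last checkpoint. The structure, register choice, and helper names are illustrative only.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative MMIO accessor; a real OS supplies its own. */
static inline uint32_t mmio_load32(const volatile uint32_t *reg)
{
    return *reg;
}

struct my_ioa {
    volatile uint32_t *vendor_id_reg;   /* a register never all-1's normally */
};

/*
 * Returns true if the PE is likely in the MMIO Stopped state (or the IOA has
 * otherwise died): Loads return all-1's with no error indication.  An
 * EEH-aware driver calls this before using data read since its last all-1's
 * checkpoint; on failure it discards that data, disables the IOA's
 * interrupts, and reports the PE for recovery.
 */
static bool ioa_loads_returning_all_ones(struct my_ioa *ioa)
{
    return mmio_load32(ioa->vendor_id_reg) == 0xFFFFFFFFu;
}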
Slot Level EEH Event Interrupt Option

Some platform implementations may allow asynchronous notification of EEH events via an external interrupt. This is called the Slot Level EEH
Event Interrupt option. When implemented, the platform will implement the
“ibm,io-events-capable” property in the
nodes where the EEH control resides, and the ibm,set-eeh-option
RTAS call will implement function 4 to enable the
EEH interrupt for each of these nodes and function 5 to disable the EEH
interrupt for each of these nodes (individual control by node). Calling the
ibm,set-eeh-option RTAS call with function 4 or function
5 when the node specified does not implement this capability will return a -3,
indicating invalid parameters.

The interrupt source specified in the ibm,io-events child must be enabled (in addition to any individual node enables) via the ibm,int-on RTAS call, and the priority for that interrupt, as set in the XIVE by the ibm,set-xive RTAS call, must be something other than 0xFF, in order for the external interrupt to be presented to the system.

The “ibm,io-events-capable” property, when it exists, contains 0 to N interrupt specifiers (per the definition of interrupt specifiers for the node's interrupt parent). When no interrupt specifiers are specified by the “ibm,io-events-capable” property, then the interrupt, if enabled, is signaled via the interrupt specifier given in the ibm,io-events child node of the /events node.
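For illustration only, an enablement sequence for the Slot Level EEH Event interrupt might look roughly like the following, again assuming a Linux-style rtas_call() interface; the argument layouts are assumptions based on the parameters named in this section and are not normative.

#include <stdint.h>
#include <stddef.h>

extern int rtas_token(const char *service);
extern int rtas_call(int token, int nargs, int nret, int *outputs, ...);

/*
 * Sketch: enable the Slot Level EEH Event interrupt for one PE and make sure
 * the io-events interrupt source can actually be presented.  'irq' is the
 * interrupt specifier taken from the "ibm,io-events-capable" property (or
 * from the ibm,io-events child of the /events node when that property lists
 * no specifiers).
 */
static int enable_eeh_event_irq(uint32_t pe_config_addr,
                                uint32_t buid_hi, uint32_t buid_lo,
                                uint32_t irq, uint32_t server)
{
    int set_eeh  = rtas_token("ibm,set-eeh-option");
    int set_xive = rtas_token("ibm,set-xive");
    int int_on   = rtas_token("ibm,int-on");
    int rc;

    /* Function 4: enable the EEH event interrupt for this node/PE. */
    rc = rtas_call(set_eeh, 4, 1, NULL, pe_config_addr, buid_hi, buid_lo, 4);
    if (rc)
        return rc;

    /* The priority must be something other than 0xFF for delivery. */
    rc = rtas_call(set_xive, 3, 1, NULL, irq, server, 0x05 /* example */);
    if (rc)
        return rc;

    /* Finally enable the interrupt source itself. */
    return rtas_call(int_on, 1, 1, NULL, irq);
}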
R1--1. For the Slot Level EEH Event Interrupt option: All of the following must be true:

- The platform must implement the “ibm,io-events-capable” property in all device tree nodes which represent a bridge where EEH is implemented and for which the EEH io-event interrupt is to be signaled.
- The platform must implement functions 4 and 5 of the ibm,set-eeh-option RTAS call for all PEs under nodes that contain the “ibm,io-events-capable” property.

Error Injection (ERRINJCT) Option

The Error Injection (ERRINJCT) option is defined primarily to test
enhanced error recovery software. As implemented in the I/O bridge, this option
is used to test the software which implements the recovery which is enabled by
the EEH option in that bridge. Specifically, the ioa-bus-error and
ioa-bus-error-64 functions
of the ibm,errinjct RTAS call are used to inject errors
onto each PE primary bus, which in turn will cause certain actions on the bus
and certain actions by the PE, the EEH logic, and by the error recovery
software.

ERRINJCT Option Hardware Requirements

Although the ioa-bus-error and
ioa-bus-error-64 functions of the
ibm,errinjct RTAS call may be extended to other I/O buses and PEs in
the future, for now this architecture will be limited to PCI buses. The type of errors, and the injection qualifiers, place the
following additional requirements on the hardware for this option.

R1--1. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: If a platform is going to implement either of these functions of this option, then the I/O topology must be PCI.

R1--2. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The hardware must provide a way to inject the required errors for each PE primary bus, and the errors must be injectable independently, without affecting the operations on the other buses in the platform.

R1--3. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The hardware must provide a way to set up for the injection of the required errors without disturbing operations to other buses outside the PE.

R1--4. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The hardware must provide a way to the firmware to set up the following information for the error injection operation by normal processor Load and Store instructions:

- Address at which to inject the error
- Address mask to mask off any combination of the least significant 24 (64 for the ioa-bus-error-64 function) bits of the address
- PE primary bus number which is to receive the error
- Type of error to be injected

R1--5. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option:
The platform must have the capability of selecting the errors
specified in when the bus directly
below the bridge injecting the error is a Conventional PCI or PCI-X Bus, and
the errors specified in when the bus
directly below the bridge injecting the error is a PCI Express link, and when
that error is appropriate for the platform configuration, and the platform must
limit the injection of errors which are inappropriate for the given platform
configuration.

Platform Implementation Note: As an example
of inappropriate errors to inject in Requirement
, consider the configuration where there is
an I/O bridge or switch below the bridge with the injector and that bridge
generates multiple PEs and when those PEs are assigned to different LPAR
partitions. In that case, injection of some real errors may cause the switches
or bridges to react and generate an error that affects multiple partitions,
which would be inappropriate. Therefore, to comply with Requirement
, the platform may either emulate some
errors in some configurations instead of injecting real errors on the link or
bus, or else the platform may not support injection at all to those PEs.
Another example where a particular error may be inappropriate is when there is
a heterogeneous network between the PHB and the PE (for example, a PCI Express
bridge that converts from a PCI Express PHB and a PCI-X PE).
Supported Errors for Conventional PCI, PCI-X Mode 1, or PCI-X Mode 2 Error Injectors

ECC note: All PCI-X adapters operating in Mode 2, and some operating in Mode 1, utilize a double bit detecting, single bit correcting Error Correction Code (ECC). In these cases, ensure that at least two bits are modified to detect this error.

Operation: Load
  PCI Address Space(s): Memory, I/O, Config
  Error(s): Data Parity Error (see ECC note), Address Parity Error

Operation: Store
  PCI Address Space(s): Memory, I/O, Config
  Error(s): Data Parity Error, Address Parity Error

Operation: DMA read
  PCI Address Space(s): Memory
  Error(s): Data Parity Error (see ECC note), Address Parity Error, Master Abort, Target Abort

Operation: DMA write
  PCI Address Space(s): Memory
  Error(s): Data Parity Error (see ECC note), Address Parity Error, Master Abort, Target Abort
Supported Errors for PCI Express Error Injectors

ECRC note: The TLP ECRC covers the address and data bits of a TLP. Therefore, one cannot determine if the integrity error resides in the address or data portion of a TLP.

Operation: Load
  PCI Address Space(s): Memory, I/O, Config
  Error(s): TLP ECRC Error (see ECRC note)

Operation: Store
  PCI Address Space(s): Memory, I/O, Config
  Error(s): TLP ECRC Error

Operation: DMA read
  PCI Address Space(s): Memory
  Error(s): TLP ECRC Error (see ECRC note); Completer Abort or Unsupported Request (inject the error that is injected on a TCE Page Fault)

Operation: DMA write
  PCI Address Space(s): Memory
  Error(s): TLP ECRC Error (see ECRC note)
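The overall injection flow these requirements support can be sketched as follows. The wrapper functions are hypothetical stand-ins for the ibm,open-errinjct, ibm,errinjct, and ibm,close-errinjct RTAS calls, and the work-area structure and its 4 KB size are only an assumed illustration of the parameters listed in Requirement R1--4 above (address, address mask, PE configuration address, and error type); the authoritative layout is given by the ioa-bus-error and ioa-bus-error-64 definitions.

#include <stdint.h>
#include <string.h>

/* Hypothetical wrappers around the ibm,open-errinjct, ibm,errinjct and
 * ibm,close-errinjct RTAS calls; a real OS or library provides its own. */
extern int rtas_open_errinjct(int *open_token);
extern int rtas_errinjct(int errinjct_token, int open_token, void *work_area);
extern int rtas_close_errinjct(int open_token);

/* Assumed, illustrative layout of the error-injection parameters. */
struct ioa_bus_error_args {
    uint32_t addr;            /* PCI address at which to inject the error  */
    uint32_t addr_mask;       /* mask of low-order address bits            */
    uint32_t pe_config_addr;  /* PE primary bus / configuration address    */
    uint32_t buid_hi;
    uint32_t buid_lo;
    uint32_t function;        /* error type, e.g. a DMA-read Target Abort  */
};

/* Sketch: inject one (non-persistent) error onto a PE primary bus. */
static int inject_ioa_bus_error(int errinjct_token,
                                const struct ioa_bus_error_args *args)
{
    unsigned char work_area[4096];       /* assumed work-area size         */
    int open_token, rc;

    memset(work_area, 0, sizeof(work_area));
    memcpy(work_area, args, sizeof(*args));

    rc = rtas_open_errinjct(&open_token);    /* serialize users            */
    if (rc)
        return rc;
    rc = rtas_errinjct(errinjct_token, open_token, work_area);
    rtas_close_errinjct(open_token);
    return rc;
}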
R1--6. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The hardware must provide a way to inject the errors in in a non-persistent manner (that is, at most one injection for each invocation of the ibm,errinjct RTAS call).

ERRINJCT Option OF Requirements

The Error Injection option will be disabled for all IOAs prior to the OS getting control.

R1--1. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The OF must disable the ERRINJCT option for all PEs and all empty slots on all bridges which implement this option prior to passing control to the OS.

Hardware and Firmware Implementation Note:
The platform only needs the capability to set up the injection of one error at a time, and therefore injection facilities can be shared. The ibm,open-errinjct and ibm,close-errinjct RTAS calls are used to make sure that only one user is using the injection facilities at a time.

Bridged-I/O EEH Support Option

If a platform requires multi-function I/O cards which are
constructed by placing multiple IOAs beneath a PCI to PCI bridge, then extra
support is needed to support such cards in an EEH-enabled environment. If this
option is implemented, then the ibm,configure-bridge RTAS
call will be implemented and therefore the
“ibm,configure-bridge” property will exist in the
rtas device node.

R1--1. For the Bridged-I/O EEH Support option: The platform must support the ibm,configure-bridge RTAS call.

R1--2. For the Bridged-I/O EEH Support option: The OS must provide the correct EEH coordination between device drivers that control multiple IOAs that are in the same PE.