I/O Bridges and Topologies

There will be at least one bridge in a platform which interfaces to the system interconnect on the processor side and to the Peripheral Component Interconnect (PCI) bus on the other. This bridge is called the PCI Host Bridge (PHB). The architectural requirements on the PHB, as well as other aspects of the I/O structures, PCI bridges, and PCI Express switches, are defined in this chapter.
I/O Topologies and Endpoint Partitioning

As systems get more sophisticated, partitioning of various components of the system will be used in order to obtain greater Reliability, Availability, and Serviceability (RAS). For example, Dynamic Reconfiguration (DR) allows the removal, addition, and replacement of components from an OS's pool of resources without having to stop the operation of that OS. In addition, Logical Partitioning (LPAR) allows the isolation of resources used by one OS from those used by another. This section discusses aspects of the partitioning of the I/O subsystem. Further information on DR and LPAR can be found in and .

To be useful, the granularity of assignment of I/O resources to an OS needs to be fairly fine-grained. For example, it is not generally acceptable to require assignment of all I/O under the same PCI Host Bridge (PHB) to the same partition in an LPARed system, as that restricts configurability of the system, including the capability to dynamically move resources between partitions. (Dynamic LPAR, or DLPAR, is defined by the Logical Resource Dynamic Reconfiguration (LRDR) option; see for more information. Assignment of all IOAs under the same PHB to one partition may be acceptable if that I/O is shared via the Virtual I/O (VIO) capability defined in .)

Partitioning I/O adapters (IOAs), groups of IOAs, or portions of IOAs for DR, or assigning them to different OSs for LPAR, will generally require some extra functionality in the platform (for example, I/O bridges and firmware) in order to partition the resources of these groups, or endpoints, while at the same time preventing any of these endpoints from affecting another endpoint or getting access to another endpoint's resources. These endpoints (that is, I/O subtrees) that can be treated as a unit for the purposes of partitioning and error recovery will be called Partitionable Endpoints (PEs), and this concept will be called Endpoint Partitioning. (A “Partitionable Endpoint” in this architecture is not to be confused with what PCI Express defines as an “endpoint.” PCI Express defines an endpoint as “a device with a Type 0x00 Configuration Space header”; that is, PCI Express defines any entity with a unique Bus/Dev/Func # as an endpoint. In most implementations, a PE will not exactly correspond to this unit.) A PE is defined by its Enhanced I/O Error Handling (EEH) domain and associated resources. The resources that need to be partitioned, and that must not overlap with other PE domains, include:

The Memory Mapped I/O (MMIO) Load and Store address space which is available to the PE. This is accomplished by using the processor's Page Table mechanism (through control of the contents of the Page Table Entries) and not having any part of two separate PEs' MMIO address space overlap into the same 4 KB system page. Additionally, for LPAR environments, the Page Table Entries are controlled by the hypervisor.

The DMA I/O bus address space which is available to the PE. This is accomplished by a hardware mechanism (in a bridge in the platform) which enforces the correct DMA addresses, and for LPAR, this hardware enforcement is set up by the hypervisor. It is also important that a mechanism be provided for LPAR such that the I/O bus addresses can further be limited at the system level to not intersect, so that one PE cannot get access to a partition's memory to which it should not have access. The Translation Control Entry (TCE) mechanism, when controlled by the firmware (for example, a hypervisor), is such a mechanism.
See for more information on the TCE mechanism.

The configuration address space of the PE, as it is made available to the device driver. This is accomplished by controlling access to a PE's configuration spaces through Run Time Abstraction Services (RTAS) calls, and for LPAR, these accesses are controlled by the hypervisor.

The interrupts which are accessible to the PE. An interrupt cannot be shared between two PEs. For LPAR environments, interrupt presentation and management is via the hypervisor.

The error domains of the PE; that is, the error containment must be such that a PE error cannot affect another PE or, for LPAR, another partition or OS image to which the PE is not given access. This is accomplished through the use of the Enhanced I/O Error Handling (EEH) option of this architecture. For LPAR environments, the control of EEH is through the hypervisor via several RTAS calls.

The reset domain: a reset domain contains all the components of a PE. The reset is provided programmatically and is intended to be implemented via an architected (non implementation dependent) method (for example, through a Standard Hot Plug Controller in a bridge, or through the Secondary Bus Reset bit in the Bridge Control register of a PCI bridge or switch). Resetting a component is sometimes necessary in order to be able to recover from some types of errors. A PE will equate to a reset domain, such that the entire PE can be reset by the ibm,set-slot-reset RTAS call. For LPAR, the control of the reset from the RTAS call is through the hypervisor.

In addition to the above PE requirements, there may be other requirements on the power domains. Specifically, if a PE is going to participate in DR, including DLPAR (to prevent data from being transferred from one partition to another via data remaining in an IOA's memory, most implementations of DLPAR will require power cycling of the PE after removal from one partition and prior to assigning it to another partition), then either the PE is required to be in a power domain which is separate from other PEs (that is, power domain, reset domain, and PE domain all the same), or else the control of that power domain and PCI Hot Plug (when implemented) of the contained PEs will be via the platform or a trusted platform agent. When the control of power for PCI Hot Plug is via the OS, then for LPAR environments, the control is also supervised via the hypervisor.

It is possible to allow several cooperating device drivers to share a PE. Sharing of a PE between device drivers within one OS image is supported by the constructs in this architecture. Sharing between device drivers in different partitions is beyond the scope of the current architecture.

A PE domain is defined by its top-most (closest to the PHB) PCI configuration address (in terms of the RTAS calls, the PHB_Unit_ID_Hi, PHB_Unit_ID_Low, and config_addr), which will be called the PE configuration address in this architecture, and encompasses everything below that point in the I/O tree. The top-most PCI bus of the PE will be called the PE primary bus. Determination of the PE configuration address is made as described in . A summary of PE support can be found in . This architecture assumes that there is a single level of bridge within a PE if the PE is heterogeneous (for example, part Conventional PCI and part PCI Express), and these cases are shown by the shaded cells in the table.
Conventional PCI Express PE Support Summary

Function | IOA Type | PE Primary Bus: PCI Express

PE determination (is EEH supported for the IOA?) | All | Use the ibm,read-slot-reset-state2 RTAS call.

PE reset | All | PE reset is required for all PEs and is activated/deactivated via the ibm,set-slot-reset RTAS call. The PCI configuration address used in this call is the PE configuration address (the reset domain is the same as the PE domain).

ibm,get-config-addr-info2 RTAS call | PCI Express | Required to be implemented.

Top of PE domain determination (how to obtain the PE configuration address; the PE configuration address is used as input to the RTAS calls which are used for PE control, namely: ibm,set-slot-reset, ibm,set-eeh-option, ibm,slot-error-detail, ibm,configure-bridge) | PCI Express | Use the ibm,get-config-addr-info2 RTAS call to obtain the PE configuration address.

Shared PE determination (is there more than one IOA per PE? If the device driver is written for the shared PE environment, then this may be a don't care.) | PCI Express | Use the ibm,get-config-addr-info2 RTAS call.

PEs per PCI Hot Plug domain and PCI Hot Plug control point | PCI Express | May have more than one PE per PCI Hot Plug DR entity, but a PE will be entirely encompassed by the PCI Hot Plug power domain. If more than one PE per DR entity, then PCI Hot Plug control is via the platform or some trusted platform agent.
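The table above names the ibm,get-config-addr-info2 RTAS call as the way an OS discovers the PE configuration address for an IOA (and whether the IOA shares its PE). The sketch below illustrates the shape of such a query; the rtas_token()/rtas_call() wrappers and the function-number values are assumptions for illustration, not a definitive binding of the RTAS interface.

```c
/* Hedged sketch: obtaining the PE configuration address for an IOA.
 * rtas_token()/rtas_call() are assumed OS-provided wrappers around RTAS;
 * the function-number values below are illustrative only. */
#include <stdint.h>

extern int rtas_token(const char *service);                 /* assumed helper */
extern int rtas_call(int token, int nargs, int nret,
                     uint32_t *outputs, ...);                /* assumed helper */

#define GET_PE_ADDR 0   /* illustrative: return the PE configuration address */

/* Returns the PE configuration address for the IOA at config_addr under the
 * PHB identified by (buid_hi, buid_lo), or a negative value on error. */
static int64_t pe_config_addr(uint32_t config_addr,
                              uint32_t buid_hi, uint32_t buid_lo)
{
    int token = rtas_token("ibm,get-config-addr-info2");
    uint32_t info[2] = { 0, 0 };

    if (token < 0)
        return -1;   /* call not implemented by this platform */

    if (rtas_call(token, 4, 2, info,
                  config_addr, buid_hi, buid_lo, GET_PE_ADDR))
        return -1;   /* RTAS returned an error status */

    return info[0];  /* top-most (PE primary bus) config address of the PE */
}
```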
R1--1. All platforms must implement the ibm,get-config-addr-info2 RTAS call.

R1--2. All platforms must implement the ibm,read-slot-reset-state2 RTAS call.

R1--3. For the EEH option: The resources of one PE must not overlap the resources of another PE, including:
Error domains
MMIO address ranges
I/O bus DMA address ranges (when PEs are below the same PHB)
Configuration space
Interrupts

R1--4. For the EEH option: All of the following must be true relative to a PE:
An IOA function must be totally encompassed by a PE.
All PEs must be independently resettable by a reset domain.

Architecture Note: The partitioning of PEs down to a single IOA function within a multi-function IOA requires a way to reset an individual IOA function within a multi-function IOA. For PCI, the only mechanism defined to do this is the optional PCI Express Function Level Reset (FLR). A platform supports FLR if it supports PCI Express and the partitioning of PEs down to a single IOA function within a multi-function IOA. When FLR is supported, if the ibm,set-slot-reset RTAS call uses FLR for the Function 1/Function 0 (activate/deactivate reset) sequence for an IOA function, then the platform provides the “ibm,pe-reset-is-flr” property in the function's node of the OF device tree. See for more information.

R1--5. The platform must own (be responsible for) any error recovery for errors that occur outside of all PEs (for example, in switches and bridges above defined PEs).

Implementation Note: As part of the error recovery of Requirement , the platform may, as part of the error handling of those errors, establish an equivalent EEH error state in the EEH domains of all PEs below the error point, in order to recover the hardware above those EEH domains from its error state. The platform also returns a PE Reset State of 5 (PE is unavailable) with a non-zero PE Unavailable Info (temporarily unavailable) while a recovery is in progress.

R1--6. The platform must own (be responsible for) fault isolation for all errors that occur in the I/O fabric (that is, down to the IOA, including errors that occur on that part of the I/O fabric which is within a PE's domain).

R1--7. For the EEH option with the PCI Hot Plug option: All of the following must be true:
If PCI Hot Plug operations are to be controlled by the OS to which the PE is assigned, then the PE domain for the PCI Hot Plug entity and the PCI Hot Plug power domain must be the same.
All PE domains must be totally encompassed by their respective PCI Hot Plug power domain, regardless of the entity that controls the PCI Hot Plug operation.

R1--8. All platforms that implement the EEH option must enable that option by default for all PEs.

Implementation Notes: See and for EEH requirements with LPAR. Defaulting to EEH enabled, as required by Requirement , does not imply that the platform has no responsibility in assuring that all device drivers are EEH enabled or EEH safe before allowing the Bus Master, Memory Space, or I/O Space bits in the PCI configuration Command register of their IOA to be set to a 1. Furthermore, the fact that a platform defaults its EEH option to enabled, as required by Requirement , does not imply that the platform cannot disable EEH for a PE. See Requirement for more information.

The following two figures show some examples of the concept of Endpoint Partitioning. See also for more information on the EEH option.
PE and DR Partitioning Examples for Conventional PCI and PCI-X HBs
PE and DR Partitioning Examples for PCI Express HBs
PCI Host Bridge (PHB) Architecture

The PHB architecture places certain requirements on PHBs. There should be no conflict between this document and the PCI specifications, but if there is, the PCI documentation takes precedence. The intent of this architecture is to provide a base architectural level which supports the PCI architecture and to provide optional constructs which allow for use of 32-bit PCI IOAs in platforms with greater than 4 GB of system addressability.

R1--1. All PHBs that implement conventional PCI must be compliant with the most recent version of the at the time of their design, including any approved Engineering Change Requests (ECRs) against that document.

R1--2. All PHBs that implement PCI-X must be compliant with the most recent version of the at the time of their design, including any approved Engineering Change Requests (ECRs) against that document.

R1--3. All PHBs that implement PCI Express must be compliant with the most recent version of the at the time of their design, including any approved Engineering Change Requests (ECRs) against that document.

R1--4. All requirements defined in for HBs must be implemented by all PHBs in the platform.
PHB Implementation Options

There are a few implementation options when it comes to implementing a PHB. Some of these become requirements, depending on the characteristics of the system for which the PHB is being designed. The options affecting PHBs include the following:

The Enhanced I/O Error Handling (EEH) option enhances RAS characteristics of the I/O and allows for smaller granularities of I/O assignments to partitions in an LPAR environment.

The Error Injection (ERRINJCT) option enhances the testing of the I/O error recovery code. This option is required of bridges which implement the EEH option.

R1--1. All PHBs for use in platforms which implement LPAR must support EEH, in support of the virtualization requirements in and .

R1--2. All PCI HBs designed for use in platforms which will support PCI Express must support the PCI extended configuration address space and the MSI option.
PCI Data Buffering and Instruction Queuing

Some PHB implementations may include buffers or queues for DMA, Load, and Store operations. These buffers are required to be transparent to the software, with only certain exceptions, as noted in this section. Most processor accesses to System Memory go through the processor data cache. When sharing System Memory with IOAs, hardware must maintain consistency between the processor data cache and System Memory, as defined by the requirements in .

R1--1. PHB implementations which include buffers or queues for DMA, Load, and Store operations must make sure that these are transparent to the software, with a few exceptions which are allowed by the PCI architecture, by , and by .

R1--2. PHBs must accept MMIO Loads of up to 128 bytes, and must do so without compromising performance of other operations.
PCI Load and Store Ordering

For the platform Load and Store ordering requirements, see and the appropriate PCI specifications (per Requirements , , and ). Those requirements will, for most implementations, require strong ordering (single threading) of all Load and Store operations through the PHB, regardless of the address space on the PCI bus to which they are targeted. Single threading through the PHB means that the PHB must wait for the response data of a Load issued on the PCI bus before issuing the next Load or Store on the PCI bus.
PCI DMA Ordering

For the platform DMA ordering requirements, see the requirements in this section, in , and the appropriate PCI specifications (per Requirements , , and ). In general, the ordering for DMA path operations from the I/O bus to the processor side of the PHB is independent from the Load and Store path, with the exception stated in Requirement . Note that in the requirements below, a read request is the initial request to the PHB and the read completion is the data phase of the transaction (that is, the data is returned).

R1--1. (Requirement Number Reserved For Compatibility)

R1--2. (Requirement Number Reserved For Compatibility)

R1--3. (Requirement Number Reserved For Compatibility)

R1--4. The hardware must make sure that a DMA read request from an IOA that specifies any byte address that has been written by a previous DMA write operation (as defined by the untranslated PCI address) does not complete before the data from that previous DMA write is in the coherency domain.

R1--5. (Requirement Number Reserved For Compatibility)

R1--6. The hardware must make sure that all DMA write data buffered from an IOA, which is destined for system memory, is in the platform's coherency domain prior to delivering data from a Load operation through the same PHB which has come after the DMA write operation(s).

R1--7. The hardware must make sure that all DMA write data buffered from an IOA, which is destined for system memory, is in the platform's coherency domain prior to delivering an MSI from that same IOA which has come after the DMA write operation(s).

Architecture Notes:

Requirement clarifies (and may tighten up) the PCI architecture requirement that the read be to the “just-written” data. The comparison that determines whether the address of the data being read and the address of the data being written fall in the same cache line is based on the PCI address and not a TCE-translated address. This means the System Memory cache line address will be the same also, since the requirement is directed towards operations under the same PHB. However, a DMA Read Request and DMA Write Request that use different PCI addresses (even if they hit the same System Memory address) are not required to be kept in order (see Requirement ). So, for example, Requirement says that split PHBs that share the same data buffers at the system end do not have to keep a DMA Read Request following a DMA Write Request in order when they do not traverse the same PHB PCI bus (even if they get translated to the same system address) or when they originate on the same PCI bus but have different PCI bus addresses (even if they get translated to the same system address).

Requirement is the only case where the Load and Store paths are coupled to the DMA data path. This requirement guarantees that the software has a method for forcing DMA write data out of any buffers in the path during the servicing of a completion interrupt from the IOA. Note that the IOA can perform the flush prior to the completion interrupt, via Requirement . That is, the IOA can issue a read request to the last word written and wait for the read completion data to return. When the read is complete, the data will have arrived at the destination. In addition, the use of MSIs, instead of LSIs, allows for a programming model for IOAs where the interrupt signalling itself pushes the last DMA write to System Memory, prior to the signalling of the interrupt to the system (see Requirement ).
A DMA read operation is allowed to be processed prior to the completion of a previous DMA read operation, but is not required to be.
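The architecture notes above describe the programming model a device driver can rely on to ensure buffered DMA write data has reached the coherency domain before it is consumed: the IOA can read back the last byte written, the driver can issue a Load to the IOA through the same PHB, or the IOA can signal completion with an MSI. The fragment below is a minimal sketch of the driver-side Load technique; the register offset, the consume() routine, and the readl() accessor are assumptions for illustration.

```c
/* Hedged sketch: flushing buffered DMA writes with a Load through the same
 * PHB before consuming the data (per Requirement R1--6 above).
 * 'ioa_regs' and the 0x10 status-register offset are hypothetical; readl()
 * is assumed to be the usual MMIO load accessor provided by the OS. */
#include <stdint.h>

extern uint32_t readl(const volatile void *addr);   /* assumed MMIO accessor */
extern void consume(uint8_t *data);                 /* driver-specific       */

struct completion_ring {
    volatile uint32_t *status;   /* System Memory written by the IOA via DMA */
    uint8_t *data;               /* DMA buffer written by the IOA            */
};

static void service_completion_interrupt(volatile void *ioa_regs,
                                         struct completion_ring *ring)
{
    /* For a level-signalled (LSI) interrupt, a Load to the IOA through the
     * same PHB forces any DMA write data buffered ahead of it into the
     * coherency domain before the Load data is returned. With MSIs this
     * Load is unnecessary: the MSI itself pushes prior DMA writes. */
    (void)readl((const volatile uint8_t *)ioa_regs + 0x10 /* status reg */);

    /* It is now safe to look at the DMA-written completion status/data. */
    if (*ring->status)
        consume(ring->data);
}
```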
PCI DMA Operations and Coherence

The I/O is not aware of the setting of the coherence required bit when performing operations to System Memory, and so the PHB needs to assume that coherency is required.

R1--1. I/O transactions to System Memory through a PHB must be made with coherency required.
Byte Ordering Conventions

LoPAR platforms operate with either Big-Endian (BE) or Little-Endian (LE) addressing. In Big-Endian addressing, the address of a word in memory is the address of the most significant byte (the “big” end) of the word; increasing memory addresses approach the least significant byte of the word. In Little-Endian addressing, the address of a word in memory is the address of the least significant byte (the “little” end) of the word. See also .

The PCI bus itself can be thought of as not inherently having an endianess associated with it (although its numbering convention indicates LE). It is the IOAs on the PCI bus that can be thought of as having endianess associated with them. Some PCI IOAs will contain a mode bit to allow them to appear as either a BE or LE IOA. Some IOAs will even have multiple mode bits; one for each data path (Load and Store versus DMA). In addition, some IOAs may have multiple concurrent apertures, or address ranges, where the IOA can be accessed as an LE IOA in one aperture and as a BE IOA in another.

R1--1. When the processor is operating in the Big-Endian mode, the platform design must produce the results indicated in while issuing Load and Store operations to various entities with various endianess.

R1--2. When performing DMA operations through a PHB, the platform must not modify the data during the transfer process; the lowest addressed byte in System Memory is transferred to the lowest addressed byte on the PCI bus, the second byte in System Memory is transferred as the second byte on the PCI bus, and so on.

Big-Endian Mode Load and Store Programming Considerations

Destination | Transfer Operation
BE scalar entity (for example, a TCE or a BE register in a PCI IOA) | Load or Store
LE scalar entity (for example, an LE register in a PCI IOA or PCI Configuration Registers) | Load Reverse or Store Reverse
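To make the table concrete, the fragment below sketches how a program running in Big-Endian mode might access a Big-Endian entity (such as a TCE) with a plain load and a Little-Endian IOA register with a byte-reversed load. The in_be32()/in_le32() accessor names are assumptions for illustration; on PowerPC the byte-reversed form typically maps to an lwbrx-style instruction.

```c
/* Hedged sketch: Big-Endian mode accesses per the table above.
 * in_be32()/in_le32() are assumed accessors: a normal word Load and a
 * byte-reversed (Load Reverse) word Load, respectively. */
#include <stdint.h>

extern uint32_t in_be32(const volatile uint32_t *addr); /* normal Load        */
extern uint32_t in_le32(const volatile uint32_t *addr); /* byte-reversed Load */

static uint32_t read_be_entity(const volatile uint32_t *tce_or_be_reg)
{
    /* BE scalar entity (for example, a TCE or a BE register): plain Load. */
    return in_be32(tce_or_be_reg);
}

static uint32_t read_le_register(const volatile uint32_t *le_reg)
{
    /* LE scalar entity (for example, an LE IOA register or a PCI config
     * register): Load Reverse, so the bytes land in the natural order for
     * the Big-Endian program. */
    return in_le32(le_reg);
}
```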
PCI Bus Protocols

This section details the items from the , , and documents where there is variability allowed, and therefore further specifications, requirements, or explanations are needed. Specifically, details specific PCI Express options and the requirements for usage of such in LoPAR platforms. These requirements will drive the design of PHB implementations. See the for more information.

PCI Express Optional Feature Usage in LoPAR Platforms
Usage Legend: NS = Not Supported; O = Optional (see also Description); OR = Optional but Recommended; R = Required; SD = See Description
Columns: PCI Express Option Name | Usage (Base) | Usage (IBM Server) | Description

Peripheral I/O Space | SD | SD | Required if the platform is going to support any Legacy I/O devices, as defined by the , otherwise support not required. The expectation is that Legacy I/O device support by PHBs will end soon, so platform designers should not rely on this being there when choosing I/O devices.

64-bit DMA addresses | O | OR SD | Implementation is optional, but is expected to be needed in some platforms, especially those with more complex PCI Express fabrics. Although the “ibm,dma-window” property can implement 64-bit addresses, some OSs and Device Drivers may not be able to handle values in the “ibm,dma-window” property that are greater than or equal to 4 GB. Therefore, it is recommended that 64-bit DMA addresses be implemented through the Dynamic DMA Window option (see ).

Advanced Error Reporting (AER) | O | R SD | This has implications in the IOAs selected for use in the platform, as well as the PHB and firmware implementation. See the .

PCIe Relaxed Ordering (RO) and ID-Based Ordering (IDO) | NS | NS | Enabling either of these options could allow DMA transactions that should be dropped by an EEH Stopped State to get to the system before the EEH Stopped State is set, and therefore these options are not to be enabled. Specifically, either of these could allow DMA transactions that follow a DMA transaction in error to bypass the PCI Express error message signalling an error on a previous packet. Platform Implementation Note: It is permissible for the platform (for example, the PHB or the nest) to re-order DMA transactions that it knows can be re-ordered -- such as DMA transactions that come from different Requester IDs or come into different PHBs -- as long as the ordering with respect to error signalling is met.

5.0 GT/s signalling (Gen 2) | O | OR |

8.0 GT/s signalling (Gen 3) | O | OR |

TLP Processing Hints | O | O | If implemented, it must be transparent to OSs.

Atomic Operations (32 and 64 bit) | O | OR SD | May be required if the IOAs being supported require it. May specifically be needed for certain classes of IOAs such as accelerators.

Atomic Operations (128 bit) | O | OR SD | When 128 bit Atomic Operations are supported, 32 and 64 bit Atomic Operations must also be supported.

Resizable BAR | O | O |

Dynamic Power Allocation (DPA) | NS | NS | No support currently defined in LoPAR.

Latency Tolerance Reporting (LTR) | NS | NS | No support currently defined in LoPAR.

Optimized Buffer Flush/Fill (OBFF) | NS | NS | No support currently defined in LoPAR.

PCIe Multicast | NS | NS | No support currently defined in LoPAR.

Alternative Routing ID Interpretation (ARI) | O | SD | Required when the platform will support PCI Express IOV IOAs.

Access Control Services (ACS) | SD | SD | It is required that peer to peer operation between IOAs be blocked when LPAR is implemented and those IOAs are assigned to different LPAR partitions. For switches below a PHB, when the IOA functions below the switch may be assigned to different partitions, this blocking is provided by ACS in the switch. This is required even in Base platforms, if the above conditions apply.

Function Level Reset (FLR) | SD | SD | Required when a PE consists of something other than a full IOA; for example, if each function of a multi-function IOA is in its own PE. An SR-IOV Virtual Function (VF) may be one such example.

End-to-End CRC (ECRC) | O | R SD | This has implications in the IOAs selected for use in the platform, as well as the PHB and firmware implementation. See the .

Internal Error Reporting | OR SD | OR SD | Implement where appropriate. Platforms need to consider this for platform switches, also. PHBs may report internal errors to firmware using a different mechanism outside of this architecture.

Address Translation Services (ATS) | NS | NS | LoPAR does not support ATS, because the invalidation and modification of the Address Translation and Protection Table (ATPT) -- called the TCEs in LoPAR -- is a synchronous operation, whereas the ATS invalidation requires a more asynchronous operation.

Page Request Interface (PRI) | NS | NS | Requires ATS, which is not supported by LoPAR.

Single Root I/O Virtualization (SR-IOV) | O | OR | It is likely that most server platforms will need to be enabled to use SR-IOV IOAs.

Multi-Root I/O Virtualization (MR-IOV) | SD | SD | Depending on how this is implemented, an MR-IOV device is likely to look like an SR-IOV device to an OS (with the platform hiding the Multi-root aspects). PHBs may be MR enabled or the MR support may be through switches external to the PHBs.
Programming Model

Normal memory mapped Load and Store instructions are used to access a PHB's facilities or PCI IOAs on the I/O side of the PHB. defines the addressing model. Addresses of IOAs are passed by OF via the OF device tree.

R1--1. If a PHB defines any registers that are outside of the PCI Configuration space, then the address of those registers must be in the Peripheral Memory Space or Peripheral I/O Space for that PHB, or must be in the System Control Area.

PCI master DMA transfers refer to data transfers between a PCI master IOA and another PCI IOA, or System Memory, where the PCI master IOA supplies the addresses and controls all aspects of the data transfer. Transfers from a PCI master to the PCI I/O Space are essentially ignored by a PHB (except for address parity checking). Transfers from a PCI master to PCI Memory Space are either directed at PCI Memory Space (for peer to peer operations) or need to be directed to the host side of the PHB. DMA transfers directed to the host side of a PHB may be to System Memory or may be to another IOA via the Peripheral Memory Space of another HB. Transfers that are directed to the Peripheral I/O Space of another HB are considered to be an addressing error (see ). For information about decoding these address spaces and the address transforms necessary, see .
Peer-to-Peer Across Multiple PHBs

This architecture does not architect peer-to-peer traffic between two PCI IOAs when the operation traverses multiple PHBs.

R1--1. The platform must prevent Peer-to-Peer operations that would cross multiple PHBs.
Dynamic Reconfiguration of I/O

Disconnecting or connecting an I/O subsystem while the system is operational, and then having the new configuration be operational (including any newly added subsystems), is a subset of Dynamic Reconfiguration (DR). Some platforms may also support plugging and unplugging of PCI IOAs while the system is operational; this is another subset of DR. DR is an option and, as such, is not required by this architecture. Attempts to change the hardware configuration on a platform that does not enable configuration change, whose OS does not support that configuration change, or without the appropriate user configuration change actions, may produce unpredictable results (for example, the system may crash). PHBs in platforms that support the PCI Hot Plug Dynamic Reconfiguration (DR) option may have some unique design considerations. For information about the DR options, see .
Split Bridge Implementations

In some platforms the PHB may be split into two pieces, separated by a cable or fiber optics. The piece that is connected to the system bus (or switch) and which generates the interconnect is called the Hub. There are several implications of such implementations and several requirements to go along with these.
Coherency Considerations with IOA to IOA Communications via System Memory

Bridges which are split across multiple chips may introduce a large latency between the time DMA write data is accepted by the PHB and the time that previously cached copies of the same System Memory locations are invalidated. This latency needs to be taken into consideration in designs, as it can introduce the problems described below. This is not a problem if the same PCI address is used under a single PHB by the same or multiple IOAs, but it can be a problem under any of the following conditions:

The same PCI address is used by different IOAs under different PHBs.

Different PCI addresses are used which access the same System Memory coherency block, regardless of whether the IOA(s) are under the same PHB or not; for example, two different TCEs accessing the same System Memory coherency block.

An example scenario where this could be a problem is as follows:

1. Device 1 does a DMA read from System Memory address x using PCI address y.

2. Device 2 (under the same PHB as Device 1 -- the devices could even be different functions in the same IOA) does a DMA write to System Memory address x using PCI address z.

3. Device 2 attempts to read back System Memory address x before the time that its previous DMA write is globally coherent (that is, before the DMA write gets to the Hub and an invalidate operation on the cache line containing that data gets back down to the PHB), and gets the data read by Device 1 rather than what it just wrote.

Another example scenario is as follows:

1. Device 1 under PHB 1 does a DMA read from System Memory location x.

2. Device 2 under PHB 2 does a DMA write to System Memory location x and signals an interrupt to the system. The interrupt bypasses the written data, which is on its way to the coherency domain.

3. The device driver for Device 2 services the interrupt and signals Device 1, via a Store to Device 1, that the data is there at location x.

4. Device 1 sees the Store before the invalidate operation on the cache line containing the data propagates down to invalidate the previous cached copy of x, and does a DMA read of location x using the same address as in step (1), getting the old copy of x instead of the new copy.

This last example is a little far-fetched since the propagation times should not be longer than the interrupt service latency time, but it is possible. In this example, the device driver should do a Load to Device 2 during the servicing of the interrupt and wait for the Load results before trying to signal Device 1, just as this device driver would do a Load if it were a program which was going to use the data written instead of another IOA. Note that this scenario can also be avoided if the IOA uses a PCI Message Signalled Interrupt (MSI) rather than the PCI interrupt signal pins to signal the interrupt (in which case the Load operation is avoided).

R1--1.
A DMA read to a PCI address which is different than a PCI address used by a previous DMA write, or which is performed under a different PHB, must not presume that a previous DMA write is complete, even if the DMA write is to the same System Memory address, unless one of the following is true:

The IOA doing the DMA write has followed that write by a DMA read to the address of the last byte of DMA write data to be flushed (the DMA read request must encompass the address of the last byte written, but does not need to be limited to just that byte) and has waited for the results to come back before an IOA is signaled (via peer-to-peer operations or via software) to perform a DMA read to the same System Memory address.

The device driver for the IOA doing the DMA write has followed that write by a Load to that IOA and has waited for the results to come back before a DMA read to the same System Memory address with a different PCI address is attempted.

The IOA doing the DMA write has followed the write with a PCI Message Signalled Interrupt (MSI) as a way to interrupt the device driver, and the MSI message has been received by the interrupt controller.
I/O Bus to I/O Bus Bridges

The PCI bus architecture was designed to allow for bridging to other slower speed I/O buses or to another PCI bus. The requirements when bridging from one I/O bus to another I/O bus in the platform are defined below.

R1--1. All bridges must comply with the bus specification(s) of the buses to which they are attached.
What Must Talk to What

Platforms are not required to support peer to peer operations between IOAs. IOAs on the same shared bus segment will generally be able to do peer to peer operations between themselves. Peer to peer operations in an LPAR environment, when the operations are between IOAs that are not in the same partition, are specifically prohibited (see Requirement ).
PCI to PCI Bridges

This architecture allows the use of PCI to PCI bridges and PCI Express switches in the platform. TCEs are used with the IOAs attached to the other side of the PCI to PCI bridge or PCI Express switch when those IOAs are accessing something on the processor side of the PHB. After configuration, PCI to PCI bridges and PCI Express switches are basically transparent to the software as far as addressing is concerned (the exception is error handling). For more information, see the appropriate PCI Express switch specification.

R1--1. Conventional PCI to PCI bridges used on the base platform and plug-in cards must be compliant with the most recent version of the at the time of the platform design, including any approved Engineering Change Requests (ECRs) against that document. PCI-X to PCI-X bridges used on the base platform and plug-in cards must be compliant with the most recent version of the at the time of the platform design, including any approved Engineering Change Requests (ECRs) against that document.

R1--2. PCI Express to PCI/PCI-X and PCI/PCI-X to PCI Express bridges used on the base platform and plug-in cards must be compliant with the most recent version of the at the time of the platform design, including any approved Engineering Change Requests (ECRs) against that document.

R1--3. PCI Express switches used on the base platform and plug-in cards must be compliant with the most recent version of the at the time of the platform design, including any approved Engineering Change Requests (ECRs) against that document.

R1--4. Bridges and switches used in platforms which will support PCI Express IOAs beneath them must support pass-through of PCI configuration cycles which access the PCI extended configuration space.

Software and Platform Implementation Notes: Bridges used on plug-in cards that do not follow Requirement will presumably allow for the operation of the IOAs on the plug-in card, even though not supporting the PCI extended configuration address space, because the card was designed with those bridges and IOAs in mind. Determination of support for the PCI extended configuration address space is via the “ibm,pci-config-space-type” property in the IOA's node.

R1--5. Bridges and switches used in platforms which will support PCI Express IOAs beneath them must support 64-bit addressing.
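As the note above says, support for the PCI extended configuration address space is advertised through the “ibm,pci-config-space-type” property in the IOA's device tree node. A minimal sketch of how an OS might check this follows; the of_get_property()-style lookup and the interpretation of a property value of 1 as "extended configuration space supported" are assumptions for illustration.

```c
/* Hedged sketch: deciding whether extended config space accesses are safe
 * for a device, based on the "ibm,pci-config-space-type" property.
 * device_node / of_get_property() mirror common flattened-device-tree APIs
 * but are treated here as assumed helpers. */
#include <stdint.h>
#include <stddef.h>

struct device_node;                                           /* opaque */
extern const void *of_get_property(const struct device_node *node,
                                   const char *name, int *lenp);

static int supports_extended_config_space(const struct device_node *ioa)
{
    int len = 0;
    const uint32_t *type =
        of_get_property(ioa, "ibm,pci-config-space-type", &len);

    /* Assumed encoding: property value 1 means the node (and the fabric
     * above it) supports the PCI extended configuration space. */
    return type && len >= (int)sizeof(*type) && *type == 1;
}
```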
Bridge Extensions
Enhanced I/O Error Handling (EEH) Option

The EEH option uses the following terminology.

PE: A Partitionable Endpoint. This refers to the granule that is treated as one for purposes of EEH recovery and for assignment to an OS image (for example, in an LPAR environment). Note that the PE granularity supported by the hardware may be finer than is supported by the firmware. See also . A PE may be any one of the following:
A single-function or multi-function IOA
A set of IOAs and some piece of I/O fabric above the IOAs that consists of one or more bridges or switches.

EEH Stopped state: The state of a PE being in both the MMIO Stopped state and DMA Stopped state.

MMIO Stopped state: The state of the PE in which any MMIO Stores to that PE are discarded, and Loads to that PE return all-1's data. If the PE is in the MMIO Stopped state and EEH is disabled, then a Load will also return a machine check to the processor that issued the Load, for the Load that had the initial error and while the PE remains in the MMIO Stopped state.

DMA Stopped state: The state of the PE in which any further DMA requests from that PE are blocked (DMA completions that occur after the DMA Stopped state is entered that correspond to DMA requests that occurred before the DMA Stopped state is entered may be completed).

Failure: A detected error between the PE and the system (for example, processor or memory); errors internal to the PE are not considered failures unless the PE signals the error via a normal I/O fabric error signalling protocol (for example, SERR or ERR_FATAL).

The Enhanced I/O Error Handling (EEH) option is defined primarily to enhance the system recoverability from failures that occur during Load and Store operations. In addition, certain failures that are normally non-recoverable during DMA are prevented from causing a catastrophic failure to the system (for example, a conventional PCI address parity error). The basic concept behind the EEH option is to turn all failures that cannot be reported to the IOA into something that looks like a conventional PCI Master Abort (MA) error on a Load or Store operation to the PE during and after the failure; responding with all-1's data and no error indication on a Load instruction and ignoring Store instructions. (A conventional PCI MA error is where the conventional PCI IOA does not respond as a target with a device select indication; that is, the IOA does not respond by activating the DEVSEL signal back to the master. For PCI Express, the corresponding error is Unsupported Request (UR).) The MA error should be handled by a device driver, so this approach is just an extension of what the error handling should be without this option implemented.

The following is the general idea behind the EEH option:

On a failure that occurs in an operation between the PHB and PE, put the PE into the MMIO Stopped and DMA Stopped states (also known as the EEH Stopped state). This is defined as a state where the PE is prevented from doing any further operations that could corrupt the system, which for the most part means blocking DMA from the PE and preventing Load and Store completions to the PE.

While the PE is in the MMIO Stopped state, if a Load or Store is targeted to that PE, then return all-1's data with no error indication on a Load and discard all Stores to that PE. That is, essentially treat the Load or Store the same way as if an MA error was received on that operation.
The device driver and OS recover a PE by removing it from the MMIO Stopped state (keeping it in the DMA Stopped state) and doing any necessary Loads to the PE to capture PE state, and then either doing the necessary Stores to the PE to set the appropriate state before removing the PE from the DMA Stopped state and continuing operations, or doing a reset of the PE and then re-initializing and restarting the PE. (Most device drivers will implement a reset and restart in order to assure a clean restart of operations.)

In order to make sure that there are no interactions necessary between device drivers during recovery operations, each PE has the capability of being removed from its MMIO Stopped and DMA Stopped states independent from any other PE which is in the MMIO Stopped or DMA Stopped state.

In order to take into account device drivers which do not correctly implement MA recovery, the EEH option can be enabled and disabled independently for each PE. (LPAR implementations limit the capability of running with EEH disabled; see Requirement and Requirement .)

EEH, as defined, only extends to operations between the processor and a PE and between a PE and System Memory. It does not extend to direct IOA to IOA peer to peer operations.

Hardware changes for this option are detailed in the next section. RTAS changes required are detailed in .
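Putting the above flow together, the sketch below shows the shape of an EEH-aware recovery path: the driver notices an all-1's Load result, confirms the PE state with ibm,read-slot-reset-state2, and then resets the whole PE with ibm,set-slot-reset before re-initializing the IOA. The rtas_call() wrapper, token handling, and the specific state and function values are illustrative assumptions; only the RTAS call names come from this chapter.

```c
/* Hedged sketch of device-driver EEH recovery, assuming OS-provided
 * rtas_token()/rtas_call() wrappers and msleep(). Function and state
 * values are illustrative, not a definitive encoding. */
#include <stdint.h>

extern int rtas_token(const char *service);
extern int rtas_call(int token, int nargs, int nret, uint32_t *outputs, ...);
extern void msleep(unsigned int ms);
extern void reinit_ioa(void *drv);                   /* driver-specific */

#define RESET_DEACTIVATE 0   /* illustrative ibm,set-slot-reset functions */
#define RESET_ACTIVATE   1

static int eeh_recover_pe(void *drv, uint32_t pe_addr,
                          uint32_t buid_hi, uint32_t buid_lo)
{
    uint32_t state[4];
    int read_tok  = rtas_token("ibm,read-slot-reset-state2");
    int reset_tok = rtas_token("ibm,set-slot-reset");

    /* 1. Confirm the PE is actually in the EEH (MMIO/DMA) Stopped state,
     *    rather than the all-1's data being a legitimate register value. */
    if (rtas_call(read_tok, 3, 4, state, pe_addr, buid_hi, buid_lo))
        return -1;
    if (state[0] == 0)          /* illustrative: 0 = not stopped */
        return 0;               /* false alarm, nothing to recover */

    /* 2. Reset the whole PE (reset domain == PE domain), then release it. */
    rtas_call(reset_tok, 4, 1, NULL, pe_addr, buid_hi, buid_lo, RESET_ACTIVATE);
    msleep(100);                /* settle time is platform dependent */
    rtas_call(reset_tok, 4, 1, NULL, pe_addr, buid_hi, buid_lo,
              RESET_DEACTIVATE);

    /* 3. Re-initialize and restart the IOA from a clean state. */
    reinit_ioa(drv);
    return 1;
}
```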
EEH Option Requirements

Although the EEH option architecture may be extended to other I/O topologies in the future, for now this recovery architecture is limited to PCI. In order to be able to test the additional device driver code for the EEH-enabled case, the EEH option also requires that the Error Injection option be implemented concurrently. The additional requirements on the hardware for this option are as follows. For the RTAS requirements for this option, see .

R1--1. For the EEH option: A platform must implement the Error Injection option concurrently with the EEH option, with an error injection granularity to the PE level.

R1--2. For the EEH option: If a platform is going to implement the EEH option, then the I/O topology implementing EEH must only consist of PCI components.

R1--3. For the EEH option: The hardware must provide a way to independently enable and disable the EEH option for each PE with normal processor Load and Store instructions, and must provide the capability of doing this while not disturbing operations to other PEs in the platform.

R1--4. For the EEH option: The hardware fault isolation register bits must be set the same way on errors when the EEH option is enabled as they are when the EEH option is not implemented, or when it is implemented but disabled.

R1--5. For the EEH option: Any detected failure to/from a PE must set both the MMIO Stopped and DMA Stopped states for the PE, unless the error that caused the failure can be reported to the IOA in a way that the IOA will report the error to its device driver in a way that will avoid any data corruption.

R1--6. For the EEH option: If an I/O fabric consists of a hierarchy of components, then when a failure is detected in the fabric, all PEs that are downstream of the failure must enter the MMIO Stopped and DMA Stopped states if they may be affected by the failure.

R1--7. For the EEH option: While a PE has its EEH option enabled, if a failure occurs, the platform must not propagate it to the system as any type of error (for example, as an SERR for a PE which is a conventional PCI-to-PCI bridge).

R1--8. For the EEH option: From the time that the MMIO Stopped state is entered for a PE, the PE must be prevented from responding to Load and Store operations, including the operation that caused the PE to enter the MMIO Stopped state; a Load operation must return all-1's with no error indication and a Store operation must be discarded (that is, Load and Store operations are treated like they received a conventional PCI Master Abort error), until one of the following is true:
The ibm,set-eeh-option RTAS call is called with function 2 (Release PE for MMIO Load/Store operations).
The ibm,set-slot-reset RTAS call is called with function 0 (Deactivate the reset signal to the PE).
The power is cycled (off then on) to the PE.
The partition or system is rebooted.

R1--9. For the EEH option: From the time that the DMA Stopped state is entered for a PE, the PE must be prevented from initiating a new DMA request or completing the DMA request that caused the PE to enter the DMA Stopped state (DMA requests that were started before the DMA Stopped state is entered may be completed), including MSI DMA operations, until one of the following is true:
The ibm,set-eeh-option RTAS call is called with function 3 (Release PE for DMA operations).
The ibm,set-slot-reset RTAS call is called with function 0 (Deactivate the reset signal to the PE).
The power is cycled (off then on) to the PE.
The partition or system is rebooted.

R1--10.
For the EEH option: The hardware must provide the capability to the firmware to determine, on a per-PE basis, that a failure has occurred which has caused the PE to be put into the MMIO Stopped and DMA Stopped states, and to read the actual state information (MMIO Stopped state and DMA Stopped state).

R1--11. For the EEH option: The hardware must provide the capability of separately enabling and resetting the DMA Stopped and MMIO Stopped states for a PE without disturbing other PEs on the platform. The hardware must provide this capability without requiring a PE reset and must do so through normal processor Store instructions.

R1--12. For the EEH option: The hardware must provide the capability to the firmware to deactivate the reset to each PE, independent of other PEs, and the hardware must provide the proper controls on the reset transitions in order to prevent failures from being introduced into the platform by the changing of the reset.

R1--13. For the EEH option: The hardware must provide the capability to the firmware to activate the reset to each PE, independent of other PEs, and the hardware must provide the proper controls on the reset transitions in order to prevent failures from being introduced into the platform by the changing of the reset.

R1--14. For the EEH option: The hardware must provide the capability to the firmware to read the state of the reset signal to each PE.

R1--15. For the EEH option: When a PE is put into the MMIO Stopped and DMA Stopped states, it must be done in such a way as to not introduce failures that may corrupt other parts of the platform.

R1--16. For the EEH option: The hardware must allow firmware access to internal bridge and I/O fabric control registers when any or all of the PEs are in the MMIO Stopped state.

Platform Implementation Note: It is expected that bridge and fabric control registers will have their own PE state separate from the PEs for IOAs.

R1--17. For the EEH option: A PE that supports the EEH option must not share an interrupt with another PE in the platform.

Hardware Implementation Notes:

Requirement means that the hardware must always update the standard PCI error/status registers in the bus' configuration space as defined by the bus architecture, even when the EEH option is enabled.

The type of error information trapped by the hardware when a PE is placed into the MMIO Stopped and DMA Stopped states is implementation dependent. It is expected that the system software will do a check-exception or ibm,slot-error-detail RTAS call to gather the error information when a failure is detected.

A DMA operation (Read or Write) that was initiated before a Load, Store, or DMA error does not necessarily need to be blocked, as it was not a result of the Load, Store, or DMA that failed. The normal PCI Express ordering rules require that an ERR_FATAL or ERR_NONFATAL from a failed Store or DMA error, or a Load Completion with error status, will reach the PHB prior to any DMA that might have been kicked-off in error as a result of a failed Load or Store, or of a Load or Store that follows a failed Load or Store.
This means that as long as the PHB processes an ERR_FATAL, ERR_NONFATAL, or Load Completion which indicates a failure, prior to processing any more DMA operations or Load Completions, and puts the PE into the MMIO Stopped and DMA Stopped states, implementations should be able to block DMA operations that were kicked-off after a failing DMA operation and allow DMA operations that were kicked-off before a failing DMA operation without violating the normal PCI Express ordering rules.

In reference to Requirements , and , PCI Express implementations may choose to enter the MMIO Stopped and DMA Stopped states even if an error can be reported back to the IOA.

R1--18. For the EEH option: If the device driver(s) for any IOA(s) in a PE in the platform are EEH unaware (that is, may produce data integrity exposures due to an MMIO Stopped or DMA Stopped state), then the firmware must prevent the IOA(s) in such a PE from being enabled for operations (that is, must not allow the Bus Master, Memory Space, or I/O Space bits in the PCI configuration Command register to be set to a 1) while EEH is enabled for that PE. Instead of preventing the PE from being enabled, the firmware may instead turn off EEH when such an enable is attempted without a prior attempt by the device driver to enable EEH (by the ibm,set-eeh-option RTAS call), providing such EEH disablement does not violate any other requirement for EEH enablement (for example, Requirement or ).

Software Implementation Note: To be EEH aware, a device driver does not need to be able to recover from an MMIO Stopped or DMA Stopped state, only recognize the all-1's condition and not use data from operations that may have occurred since the last all-1's checkpoint. In addition, the device driver under such failure circumstances needs to turn off interrupts (using the ibm,set-int-off RTAS call, or by resetting the PE and keeping it reset with ibm,set-slot-reset or ibm,slot-error-detail) in order to make sure that any (unserviceable) interrupts from the PE do not affect the system. Note that this is the same device driver support needed to protect against an IOA dying or against a no-DEVSEL type error (which may or may not be the result of an IOA that has died).
Slot Level EEH Event Interrupt Option

Some platform implementations may allow asynchronous notification of EEH events via an external interrupt. This is called the Slot Level EEH Event Interrupt option. When implemented, the platform implements the “ibm,io-events-capable” property in the nodes where the EEH control resides, and the ibm,set-eeh-option RTAS call implements function 4 to enable the EEH interrupt for each of these nodes and function 5 to disable the EEH interrupt for each of these nodes (individual control by node). Calling the ibm,set-eeh-option RTAS call with function 4 or function 5 when the specified node does not implement this capability returns a -3, indicating invalid parameters. The interrupt source specified in the ibm,io-events child node must be enabled (in addition to any individual node enables) via the ibm,int-on RTAS call, and the priority for that interrupt, as set in the XIVE by the ibm,set-xive RTAS call, must be something other than 0xFF, in order for the external interrupt to be presented to the system. The “ibm,io-events-capable” property, when it exists, contains 0 to N interrupt specifiers (per the definition of interrupt specifiers for the node's interrupt parent). When no interrupt specifiers are specified by the “ibm,io-events-capable” property, then the interrupt, if enabled, is signaled via the interrupt specifier given in the ibm,io-events child node of the /events node.

R1--1. For the Slot Level EEH Event Interrupt option: All of the following must be true:
The platform must implement the “ibm,io-events-capable” property in all device tree nodes which represent bridges where EEH is implemented and for which the EEH io-event interrupt is to be signaled.
The platform must implement functions 4 and 5 of the ibm,set-eeh-option RTAS call for all PEs under nodes that contain the “ibm,io-events-capable” property.
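A short sketch of enabling the slot-level EEH event interrupt described above: the OS calls ibm,set-eeh-option with function 4 for the node, and separately enables the interrupt source with ibm,int-on and a non-0xFF priority via ibm,set-xive. The rtas wrappers, the argument ordering, and the server/priority values are assumptions for illustration.

```c
/* Hedged sketch: enabling the Slot Level EEH Event Interrupt for one node.
 * rtas_token()/rtas_call() are assumed OS wrappers; argument ordering and
 * the server/priority values are illustrative. */
#include <stdint.h>

extern int rtas_token(const char *service);
extern int rtas_call(int token, int nargs, int nret, uint32_t *outputs, ...);

#define EEH_ENABLE_IO_EVENT 4   /* function 4: enable the EEH event interrupt */

static int enable_eeh_io_event(uint32_t config_addr,
                               uint32_t buid_hi, uint32_t buid_lo,
                               uint32_t irq_source)
{
    int rc;

    /* Per-node enable; returns -3 if the node lacks
     * "ibm,io-events-capable". */
    rc = rtas_call(rtas_token("ibm,set-eeh-option"), 4, 1, NULL,
                   config_addr, buid_hi, buid_lo, EEH_ENABLE_IO_EVENT);
    if (rc)
        return rc;

    /* The interrupt source itself must also be enabled, and its priority
     * must not be left at 0xFF, or the external interrupt is never
     * presented. Server 0 and priority 0x05 are arbitrary illustrative
     * values. */
    rtas_call(rtas_token("ibm,set-xive"), 3, 1, NULL,
              irq_source, 0 /* server */, 0x05 /* priority != 0xFF */);
    return rtas_call(rtas_token("ibm,int-on"), 1, 1, NULL, irq_source);
}
```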
Error Injection (ERRINJCT) Option

The Error Injection (ERRINJCT) option is defined primarily to test enhanced error recovery software. As implemented in the I/O bridge, this option is used to test the software which implements the recovery which is enabled by the EEH option in that bridge. Specifically, the ioa-bus-error and ioa-bus-error-64 functions of the ibm,errinjct RTAS call are used to inject errors onto each PE primary bus, which in turn will cause certain actions on the bus and certain actions by the PE, the EEH logic, and by the error recovery software.
ERRINJCT Option Hardware Requirements

Although the ioa-bus-error and ioa-bus-error-64 functions of the ibm,errinjct RTAS call may be extended to other I/O buses and PEs in the future, for now this architecture is limited to PCI buses. The types of errors, and the injection qualifiers, place the following additional requirements on the hardware for this option.

R1--1. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: If a platform is going to implement either of these functions of this option, then the I/O topology must be PCI.

R1--2. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The hardware must provide a way to inject the required errors for each PE primary bus, and the errors must be injectable independently, without affecting the operations on the other buses in the platform.

R1--3. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The hardware must provide a way to set up for the injection of the required errors without disturbing operations to other buses outside the PE.

R1--4. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The hardware must provide a way for the firmware to set up the following information for the error injection operation by normal processor Load and Store instructions:
Address at which to inject the error
Address mask to mask off any combination of the least significant 24 (64 for the ioa-bus-error-64 function) bits of the address
PE primary bus number which is to receive the error
Type of error to be injected

R1--5. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The platform must have the capability of selecting the errors specified in when the bus directly below the bridge injecting the error is a Conventional PCI or PCI-X bus, and the errors specified in when the bus directly below the bridge injecting the error is a PCI Express link, when that error is appropriate for the platform configuration, and the platform must limit the injection of errors which are inappropriate for the given platform configuration.

Platform Implementation Note: As an example of inappropriate errors to inject in Requirement , consider the configuration where there is an I/O bridge or switch below the bridge with the injector, that bridge generates multiple PEs, and those PEs are assigned to different LPAR partitions. In that case, injection of some real errors may cause the switches or bridges to react and generate an error that affects multiple partitions, which would be inappropriate. Therefore, to comply with Requirement , the platform may either emulate some errors in some configurations instead of injecting real errors on the link or bus, or else the platform may not support injection at all to those PEs. Another example where a particular error may be inappropriate is when there is a heterogeneous network between the PHB and the PE (for example, a PCI Express bridge that converts from a PCI Express PHB to a PCI-X PE).

Supported Errors for Conventional PCI, PCI-X Mode 1, or PCI-X Mode 2 Error Injectors

Operation | PCI Address Space(s) | Error(s) | Other Requirements

Load | Memory, I/O, Config | Data Parity Error | All PCI-X adapters operating in Mode 2 and some operating in Mode 1 utilize a double bit detecting, single bit correcting Error Correction Code (ECC). In these cases, ensure that at least two bits are modified to detect this error.
Load | Memory, I/O, Config | Address Parity Error |

Store | Memory, I/O, Config | Data Parity Error |
Store | Memory, I/O, Config | Address Parity Error |

DMA read | Memory | Data Parity Error | All PCI-X adapters operating in Mode 2 and some operating in Mode 1 utilize a double bit detecting, single bit correcting Error Correction Code (ECC). In these cases, ensure that at least two bits are modified to detect this error.
DMA read | Memory | Address Parity Error |
DMA read | Memory | Master Abort |
DMA read | Memory | Target Abort |

DMA write | Memory | Data Parity Error | All PCI-X adapters operating in Mode 2 and some operating in Mode 1 utilize a double bit detecting, single bit correcting Error Correction Code (ECC). In these cases, ensure that at least two bits are modified to detect this error.
DMA write | Memory | Address Parity Error |
DMA write | Memory | Master Abort |
DMA write | Memory | Target Abort |
Supported Errors for PCI Express Error Injectors

Operation | PCI Address Space(s) | Error(s) | Other Requirements

Load | Memory, I/O, Config | TLP ECRC Error | The TLP ECRC covers the address and data bits of a TLP. Therefore, one cannot determine if the integrity error resides in the address or data portion of a TLP.

Store | Memory, I/O, Config | TLP ECRC Error |

DMA read | Memory | TLP ECRC Error | The TLP ECRC covers the address and data bits of a TLP. Therefore, one cannot determine if the integrity error resides in the address or data portion of a TLP.
DMA read | Memory | Completer Abort or Unsupported Request | Inject the error that is injected on a TCE Page Fault.

DMA write | Memory | TLP ECRC Error | The TLP ECRC covers the address and data bits of a TLP. Therefore, one cannot determine if the integrity error resides in the address or data portion of a TLP.
R1--6. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The hardware must provide a way to inject the errors in in a non-persistent manner (that is, at most one injection for each invocation of the ibm,errinjct RTAS call).
ERRINJCT Option OF Requirements

The Error Injection option will be disabled for all IOAs prior to the OS getting control.

R1--1. For the ioa-bus-error and ioa-bus-error-64 functions of the Error Injection option: The OF must disable the ERRINJCT option for all PEs and all empty slots on all bridges which implement this option prior to passing control to the OS.

Hardware and Firmware Implementation Note: The platform only needs the capability to set up the injection of one error at a time, and therefore injection facilities can be shared. The ibm,open-errinjct and ibm,close-errinjct RTAS calls are used to make sure that only one user is using the injection facilities at a time.
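As the note above implies, error injection at run time is always a bracketed sequence: the injection facility is opened with ibm,open-errinjct, a single (non-persistent) error is injected with ibm,errinjct, and the facility is released with ibm,close-errinjct. The sketch below shows that shape; the rtas wrappers, the argument layout, and the work-area contents are illustrative assumptions, and only the RTAS service names come from the architecture.

```c
/* Hedged sketch: injecting one non-persistent ioa-bus-error on a PE
 * primary bus. rtas wrappers and the work-area layout are assumed. */
#include <stdint.h>

extern int rtas_token(const char *service);
extern int rtas_call(int token, int nargs, int nret, uint32_t *outputs, ...);

static int inject_ioa_bus_error(int errinjct_type_token, /* ioa-bus-error   */
                                uint32_t *workarea       /* RTAS work area  */)
{
    uint32_t open_out[2];
    int rc;
    int open_tok  = rtas_token("ibm,open-errinjct");
    int inj_tok   = rtas_token("ibm,errinjct");
    int close_tok = rtas_token("ibm,close-errinjct");

    /* Only one user of the (shared) injection facility at a time. */
    if (rtas_call(open_tok, 0, 2, open_out))
        return -1;

    /* workarea is assumed to have been filled with: the address at which to
     * inject, the address mask, the PE primary bus / config address, and
     * the error type, per Requirement R1--4 above. */
    rc = rtas_call(inj_tok, 3, 1, NULL,
                   errinjct_type_token, open_out[0] /* open token */,
                   (uint32_t)(uintptr_t)workarea);

    rtas_call(close_tok, 1, 1, NULL, open_out[0]);
    return rc ? -1 : 0;
}
```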
Bridged-I/O EEH Support Option

If a platform requires multi-function I/O cards which are constructed by placing multiple IOAs beneath a PCI to PCI bridge, then extra support is needed to support such cards in an EEH-enabled environment. If this option is implemented, then the ibm,configure-bridge RTAS call will be implemented and therefore the “ibm,configure-bridge” property will exist in the rtas device node.

R1--1. For the Bridged-I/O EEH Support option: The platform must support the ibm,configure-bridge RTAS call.

R1--2. For the Bridged-I/O EEH Support option: The OS must provide the correct EEH coordination between device drivers that control multiple IOAs that are in the same PE.
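When this option is present, an OS that has reset a PE containing a bridge will typically need the platform to re-establish the bridge configuration before restarting the IOAs below it, which is the role of ibm,configure-bridge. A minimal sketch follows, assuming the same rtas_call() wrapper as the earlier examples; the retry convention and status interpretation are illustrative, not the defined return codes.

```c
/* Hedged sketch: asking firmware to re-configure the bridge(s) within a PE
 * after a PE reset, using ibm,configure-bridge. Status handling below
 * (0 = done, negative = hard error, positive = busy) is illustrative. */
#include <stdint.h>

extern int rtas_token(const char *service);
extern int rtas_call(int token, int nargs, int nret, uint32_t *outputs, ...);
extern void msleep(unsigned int ms);

static int configure_pe_bridges(uint32_t pe_config_addr,
                                uint32_t buid_hi, uint32_t buid_lo)
{
    int token = rtas_token("ibm,configure-bridge");
    int rc, tries;

    if (token < 0)
        return -1;            /* option not implemented on this platform */

    for (tries = 0; tries < 10; tries++) {
        rc = rtas_call(token, 3, 1, NULL, pe_config_addr, buid_hi, buid_lo);
        if (rc == 0)
            return 0;         /* bridges under the PE reconfigured */
        if (rc < 0)
            return rc;        /* hard error */
        msleep(50);           /* illustrative backoff on a busy status */
    }
    return -1;
}
```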