Processor and Memory

The purpose of this chapter is to specify the processor and memory requirements of this architecture. The processor architecture section addresses differences between the processors in the PA family as well as their interface variations and features of note. The memory architecture section addresses coherency, minimum system memory requirements, memory controller requirements, and cache requirements.
Processor Architecture

The Processor Architecture (PA) governs software compatibility at an instruction set and environment level. However, each processor implementation has unique characteristics which are described in its user’s manual. To facilitate shrink-wrapped software, this architecture places some limitations on the variability in processor implementations. Nonetheless, evolution of the PA and its implementations creates a need for both software and hardware developers to stay current with its progress. The following material highlights areas deserving special attention and provides pointers to the latest information.
Processor Architecture Compliance

The PA is defined in .

R1--1. Platforms must incorporate only processors which comply fully with .
R1--2. For the Symmetric Multiprocessor option: Multiprocessing platforms must use only processors which implement the processor identification register.
R1--3. Platforms must incorporate only processors which implement tlbie and tlbsync, and slbie and slbia for 64-bit implementations.
R1--4. Except where specifically noted otherwise in , platforms must support all functions specified by the PA.

Hardware and Software Implementation Note: The PA and this architecture view tlbia as an optional performance enhancement. Processors need not implement tlbia. Software that needs to purge the TLB should provide a sequence of instructions that is functionally equivalent to tlbia and use the content of the OF device tree to choose the software implementation or the hardware instruction. See for details.
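Software Implementation Example: The following minimal C sketch illustrates one way an OS might choose between the hardware tlbia instruction and a functionally equivalent software sequence, based on the contents of the OF device tree. The property name "tlbia" and the helper routines shown are assumptions for illustration only; the actual property names and device tree access interfaces are those defined elsewhere in this architecture and by the OS.

    #include <stdbool.h>

    /* Hypothetical helper: returns true if firmware exposed the named
     * property on this processor's cpu node in the OF device tree. */
    extern bool of_cpu_has_property(const char *name);

    /* Processor-specific routines supplied by the OS port (assumed here). */
    extern void purge_tlb_with_tlbia(void);       /* issues the optional tlbia instruction */
    extern void purge_tlb_with_tlbie_loop(void);  /* tlbie over all congruence classes, then tlbsync */

    /* Purge the entire TLB using whichever mechanism the platform reports. */
    void purge_entire_tlb(void)
    {
        if (of_cpu_has_property("tlbia"))
            purge_tlb_with_tlbia();       /* hardware instruction is implemented */
        else
            purge_tlb_with_tlbie_loop();  /* functionally equivalent software sequence */
    }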
PA Processor Differences

A complete understanding of processor differences may be obtained by studying and the user’s manuals for the various processors. The creators of this architecture cooperate with processor designers to maintain a list of supported differences, to be used by the OS instead of the processor version number (PVN), enabling execution on future processors. OF communicates these differences via properties of the cpu node of the OF device tree. Examples of OF device tree properties which support these differences include “64-bit” and “performance-monitor”. See for a complete listing and more details.

R1--1. The OS must use the properties of the cpu node of the OF device tree to determine the programming model of the processor implementation.
R1--2. The OS must provide an execution path which uses the properties of the cpu node of the OF device tree. The PVN is available to the platform-aware OS for exceptional cases such as performance optimization and errata handling.
R1--3. The OS must support the 64-bit page table formats defined by .
R1--4. Processors which exhibit the “64-bit” property of the cpu node of the OF device tree must also implement the “bridge architecture,” an option in .
R1--5. Platforms must restrict their choice of processors to those whose programming models may be described by the properties defined for the cpu node of the OF device tree in .
R1--6. Platform firmware must initialize the second and third pages above Base correctly for the processor in the platform prior to giving control to the OS.
R1--7. OS and application software must not alter the state of the second and third pages above Base.
R1--8. Platforms must implement the “ibm,platform-hardware-notification” property (see ) and include all PVRs that the platform may contain.
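Software Implementation Example: As a minimal sketch of Requirements R1--1 and R1--2 above, the fragment below derives the processor programming model from cpu node properties rather than from the PVN. The accessor cpu_node_has_property() is a hypothetical, OS-specific interface to the device tree; the property names “64-bit” and “performance-monitor” are those cited above.

    #include <stdbool.h>

    /* Hypothetical accessor into the OF device tree cpu node; the real
     * interface is OS specific. */
    extern bool cpu_node_has_property(const char *name);

    struct cpu_features {
        bool has_64bit;         /* “64-bit” property present             */
        bool has_perf_monitor;  /* “performance-monitor” property present */
    };

    /* Derive the programming model from cpu node properties rather than
     * from the processor version number. */
    void detect_cpu_features(struct cpu_features *f)
    {
        f->has_64bit        = cpu_node_has_property("64-bit");
        f->has_perf_monitor = cpu_node_has_property("performance-monitor");
    }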
64-bit Implementations

Some 64-bit processor implementations will not support the full virtual address allowed by . As a result, this architecture adds a 64-bit virtual address subset to the PA and the corresponding cpu node property “64-bit-virtual-address” to OF. In order for an OS to make use of the increased addressability of 64-bit processor implementations:

The memory subsystem must support the addressing of memory located at or beyond 4 GB, and
Any system memory located at or beyond 4 GB must be reported via the OF device tree.

At an abstract level, the effort to support 64-bit architecture in platforms is modest. The requirements follow.

R1--1. The OS must support the 64-bit virtual address subset, but may defer support of the full 80-bit virtual address until such time as it is required.
R1--2. Firmware must report the “64-bit-virtual-address” property for processors which implement the 64-bit virtual address subset.
R1--3. RTAS must be capable of being instantiated in either a 32-bit or 64-bit mode on a platform with addressable memory above 4 GB.

Software Implementation Note: A 64-bit OS need not require 64-bit client interface services in order to boot. Because of the problems that might be introduced by dynamically switching between 32-bit and 64-bit modes in OF, the configuration variable 64-bit-mode? is provided so that OF can statically configure itself to the needs of the OS.
Processor Interface Variations

Individual processor interface implementations are described in their respective user’s manuals.
PA Features Deserving Comment

Some PA features are optional, and need not be implemented in a platform. Usage of others may be discouraged due to their potential for poor performance. The following sections elaborate on the disposition of these features in regard to compliance with the PA.
Multiple Scalar Operations

The PA supports the multiple scalar operations Load and Store String and Load and Store Multiple. Due to the long-term performance disadvantage associated with multiple scalar operations, their use by software is not recommended.
External Control Instructions (Optional)

The external control instructions (eciwx and ecowx) are not supported by this architecture.
<emphasis role="bold"><literal>cpu</literal></emphasis> Node <emphasis role="bold"><literal>“Status”</literal></emphasis> Property See for the values of the “status” property of the cpu node.
Multi-Threading Processor Option

Power processors may optionally support multi-threading.

R1--1. For the Multi-threading Processor option: The platform must supply one entry in the “ibm,ppc-interrupt-server#s” property associated with the processor for each thread that the processor supports. Refer to for the definition of the “ibm,ppc-interrupt-server#s” property.
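Software Implementation Example: A minimal sketch of how an OS might consume this property follows. It assumes the property value has already been retrieved from the device tree as an array of 32-bit cells (one interrupt server number per hardware thread) and converted to host byte order; the structure and helper names are illustrative only.

    #include <stddef.h>
    #include <stdint.h>

    /* The “ibm,ppc-interrupt-server#s” value: one 32-bit cell per thread,
     * each cell holding that thread's interrupt server number. */
    struct interrupt_servers {
        const uint32_t *cells;  /* property value, already in host byte order */
        size_t          len;    /* property length in bytes                   */
    };

    /* Number of hardware threads on this processor = number of entries. */
    static size_t thread_count(const struct interrupt_servers *p)
    {
        return p->len / sizeof(uint32_t);
    }

    /* Interrupt server number for a given thread index; returns UINT32_MAX
     * for an out-of-range index. */
    static uint32_t server_for_thread(const struct interrupt_servers *p, size_t thread)
    {
        return (thread < thread_count(p)) ? p->cells[thread] : UINT32_MAX;
    }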
Memory Architecture

The Memory Architecture of an LoPAR implementation is defined by and , which define what platform elements are accessed by each real (physical) system address, as well as by the sections which follow.

The PA allows implementations to incorporate such performance enhancing features as write-back caching, non-coherent instruction caches, pipelining, and out-of-order and speculative execution. These features introduce the concepts of coherency (the apparent order of storage operations to a single memory location as observed by other processors and DMA) and consistency (the order of storage accesses among multiple locations). In most cases, these features are transparent to software. However, in certain circumstances, OS software explicitly manages the order and buffering of storage operations. By selectively eliminating ordering options, either via storage access mode bits or the introduction of storage barrier instructions, software can force increasingly restrictive ordering semantics upon its storage operations. Refer to for further details.

PA processor designs usually allow, under certain conditions, for caching, buffering, combining, and reordering in the platform’s memory and I/O subsystems. The platform’s memory subsystem, system interconnect, and processors, which cooperate through a platform implementation specific protocol to meet the PA specified memory coherence, consistency, and caching rules, are said to be within the platform’s coherency domain. The figure below shows an example system. The shaded portion is the PA coherency domain. Buses 1 through 3 lie outside this domain. The figure shows two I/O subsystems, each interfacing with the host system via a Host Bridge. Notice that the domain includes portions of the Host Bridges. This symbolizes the role of the bridge to apply PA semantics to reference streams as they enter or leave the coherency domain, while implementing the ordering rules of the I/O bus architecture.

Memory, other than System Memory, is not required to be coherent. Such memory may include memory in IOAs.
Example System Diagram Showing the PA Coherency Domain
Hardware Implementation Note: Components of the platform within the coherency domain (memory controllers and in-line caches, for example) collectively implement the PA memory model, including the ordering of operations. Special care should be given to configurations for which multiple paths exist between a component that accesses memory and the memory itself, if accesses for which ordering is required are permitted to use different paths.
System Memory

System Memory normally consists of dynamic read/write random access memory which is used for the temporary storage of programs and data being operated on by the processor(s). A platform usually provides for the expansion of System Memory via plug-in memory modules and/or memory boards.

R1--1. Platforms must provide at least 128 MB of System Memory. (Also see for other requirements which apply to memory within the first 32 MB of System Memory.)
R1--2. Platforms must support the expansion of System Memory to 2 GB or more.

Hardware Implementation Note: These requirements are minimum requirements. Each OS has its own recommended configuration which may be greater.

Software Implementation Note: System Memory will be described by the properties of the memory node(s) of the OF device tree.
Memory Mapped I/O (MMIO) and DMA Operations

Storage operations which cross the coherency domain boundary are referred to as Memory Mapped I/O (MMIO) operations if they are initiated within the coherency domain, and DMA operations if they are initiated outside the coherency domain and target storage within it. Accesses with targets outside the coherency domain are assumed to be made to IOAs. These accesses are considered performed (or complete) when they complete at the IOA’s I/O bus interface. Bus bridges translate between bus operations on the initiator and target buses. In some cases, there may not be a one-to-one correspondence between initiator and target bus transactions. In these cases, the bridge selects one or a sequence of transactions which most closely matches the meaning of the transaction on the source bus. See also for more details and the appropriate PCI specifications.

For MMIO Load and Store instructions, the software needs to set up the WIMG bits appropriately to control Load and Store caching, Store combining, and speculative Load execution to I/O addresses. This architecture does not require platform support of caching of MMIO Load and Store instructions. See the PA for more information.

R1--1. For MMIO Load and Store instructions, the hardware outside of the processor must not introduce any reordering of the MMIO instructions for a processor or processor thread which would not be allowed by the PA for the instruction stream executed by the processor or processor thread.

Hardware Implementation Note: Requirement may imply that hardware outside of the processor cannot reorder MMIO instructions from the same processor or processor thread, but this depends on the processor implementation. For example, some processor implementations will not allow multiple Loads to be issued when those Loads are to Cache Inhibited and Guarded space (as are MMIO Loads) or allow multiple Stores to be issued when those Stores are to Cache Inhibited and Guarded space (as are MMIO Stores). In this example, hardware external to the processors could re-order Load instructions with respect to other Load instructions or re-order Store instructions with respect to other Store instructions since they would not be from the same processor or thread. However, hardware outside of the processor must still take care not to re-order Loads with respect to Stores or vice versa, unless the hardware has access to the entire instruction stream to see explicit ordering instructions, like eieio. Hardware outside of the processor includes, but is not limited to, buses, interconnects, bridges, and switches, and includes hardware inside and outside of the coherency domain.

R1--2. (Requirement Number Reserved For Compatibility)

Apart from the ordering disciplines stated in Requirements and , for PCI, the ordering of MMIO Load data return versus buffered DMA data, as defined by Requirement , no other ordering discipline is guaranteed by the system hardware for Load and Store instructions performed by a processor to locations outside the PA coherency domain. Any other ordering discipline, if necessary, must be enforced by software via programming means. The elements of a system outside its coherency domain are not expected to issue explicit PA ordering operations. System hardware must therefore take appropriate action to impose ordering disciplines on storage accesses entering the coherency domain.
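Software Implementation Example: The following C sketch (GCC-style inline assembly for a Power processor) illustrates the kind of explicit ordering software may need to impose on MMIO accesses beyond what the hardware guarantees above. It is an illustration, not a required implementation; the accessor names are hypothetical, the registers are assumed to be mapped Caching Inhibited and Guarded, and byte-swapping for the device's endianness is ignored.

    #include <stdint.h>

    /* Store a 32-bit value to an MMIO register, then issue eieio so this
     * Store is ordered ahead of subsequent MMIO Loads and Stores issued by
     * this processor. */
    static inline void mmio_write32(volatile uint32_t *reg, uint32_t val)
    {
        *reg = val;
        __asm__ __volatile__("eieio" : : : "memory");
    }

    /* Load a 32-bit MMIO register; the trailing sync keeps later storage
     * accesses from being performed until the Load data has returned. */
    static inline uint32_t mmio_read32(volatile uint32_t *reg)
    {
        uint32_t val = *reg;
        __asm__ __volatile__("sync" : : : "memory");
        return val;
    }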
In general, a strong-ordering rule is enforced on an IOA’s accesses to the same location, and write operations from the same source are completed in a sequentially consistent manner. The exception to this rule is for the special protocol ordering modifiers that may exist in certain I/O bus protocols. An example of such a protocol ordering modifier is the PCI Relaxed Ordering bit (an optional implementation, from both the IOA and platform perspective), as indicated in the requirements below.

R1--3. Platforms must guarantee that accesses entering the PA coherency domain that are from the same IOA and to the same location are completed in a sequentially consistent manner, except transactions from PCI-X and PCI Express masters may be reordered when the Relaxed Ordering bit in the transaction is set, as specified in the and .
R1--4. Platforms must guarantee that multiple write operations entering the PA coherency domain that are issued by the same IOA are completed in a sequentially consistent manner, except transactions from PCI-X and PCI Express masters may be reordered when the Relaxed Ordering bit in the transaction is set, as specified in the and .
R1--5. Platforms must be designed to present I/O DMA writes to the coherency domain in the order required by , except transactions from PCI-X and PCI Express masters may be reordered when the Relaxed Ordering bit in the transaction is set, as specified in the and .
Storage Ordering and I/O Interrupts

The conclusion of I/O operations is often communicated to processors via interrupts. For example, at the end of a DMA operation that deposits data in System Memory, the IOA performing the operation might send an interrupt to the processor. Arrival of the interrupt, however, is no guarantee that all the data has actually been deposited; some might be on its way. The receiving program must not attempt to read the data from memory before ensuring that all the data has indeed been deposited. There may be system and I/O subsystem specific methods for guaranteeing this. See .
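Software Implementation Example: One common approach, sketched below, is for the interrupt handler to perform an MMIO Load from the interrupting IOA before touching the DMA buffer, relying on the platform's ordering of MMIO Load data return versus buffered DMA data described in the previous section. The register layout, status bit, and helper routines here are hypothetical; mmio_read32 is the illustrative accessor from the earlier sketch.

    #include <stdint.h>

    /* Hypothetical device register block and DMA buffer layout. */
    struct dev_regs   { volatile uint32_t dma_status; };
    struct dma_buffer { uint32_t length; uint8_t data[4096]; };

    extern uint32_t mmio_read32(volatile uint32_t *reg);            /* earlier sketch */
    extern void     consume(const uint8_t *data, uint32_t length);  /* assumed consumer */

    /* Interrupt handler fragment: the MMIO Load of dma_status does not
     * return until the IOA's earlier DMA writes have been pushed into
     * System Memory, so the buffer may safely be consumed afterwards. */
    void dma_complete_isr(struct dev_regs *regs, struct dma_buffer *buf)
    {
        uint32_t status = mmio_read32(&regs->dma_status);

        if (status & 0x1)  /* hypothetical "transfer complete" bit */
            consume(buf->data, buf->length);
    }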
Atomic Update Model

An update of a memory location by a processor, involving a Load followed by a Store, can be considered “atomic” if there are no intervening Stores to that location from another processor or mechanism. The PA provides primitives in the form of Load And Reserve and Store Conditional instructions which can be used to determine whether the update was indeed atomic. These primitives can be used to emulate operations such as “atomic read-modify-write” and “atomic fetch-and-add.” Operation of the atomic update primitives is based on the concept of “Reservation” (see Books I and II of ), which is supported in an LoPAR system via the coherence mechanism.

R1--1. Load And Reserve and Store Conditional instructions must not be assumed to be supported for Write-Through storage.

Software Implementation Note: To emulate an atomic read-modify-write operation, the instruction pair must access the same storage location, and the location must have the Memory Coherence Required attribute.

Hardware Implementation Note: The reservation protocol is defined in Book II of the for atomic updates to locations in the same coherency domain.

R1--2. The Load And Reserve and Store Conditional instructions must not be assumed to be supported for Caching-Inhibited storage.
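Software Implementation Example: The routine below is a minimal sketch of the atomic fetch-and-add emulation mentioned above, written as GCC-style inline assembly for a 32-bit word in Memory Coherence Required storage. It is illustrative only; production code would add whatever memory barriers its particular synchronization protocol requires.

    #include <stdint.h>

    /* Atomically add 'inc' to the word at 'addr' and return the previous
     * value, using the Load And Reserve / Store Conditional primitives.
     * The stwcx. fails (and the loop retries) if the reservation was lost
     * to an intervening store from another processor or mechanism. */
    static inline uint32_t atomic_fetch_add32(volatile uint32_t *addr, uint32_t inc)
    {
        uint32_t old, tmp;

        __asm__ __volatile__(
            "1: lwarx   %0,0,%3\n"   /* load word and set reservation   */
            "   add     %1,%0,%4\n"  /* compute the new value           */
            "   stwcx.  %1,0,%3\n"   /* store only if reservation held  */
            "   bne-    1b\n"        /* reservation lost: try again     */
            : "=&r" (old), "=&r" (tmp), "+m" (*addr)
            : "r" (addr), "r" (inc)
            : "cc", "memory");

        return old;
    }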
Memory Controllers

A Memory Controller responds to the real (physical) addresses produced by a processor or a host bridge for accesses to System Memory. It is responsible for handling the translation from these addresses to the physical memory modules within its configured domain of control.

R1--1. Memory controller(s) must support the accessing of System Memory as defined in .
R1--2. Memory controller(s) must be fully initialized and set to full power mode prior to the transfer of control to the OS.
R1--3. All allocations of System Memory space among memory controllers must have been done prior to the transfer of control to the OS.

Software Implementation Note: Memory controller(s) are described by properties of the memory-controller node(s) of the OF device tree.
Cache Memory

All of the PA processors include some amount of on-chip or internal cache memory. This architecture allows for cache memory which is external to the processor chip, and this external cache memory forms an extension to internal cache memory.

R1--1. If a platform implementation elects not to cache portions of the address map in all external levels of the cache hierarchy, the result of not doing so must be transparent to the operation of the software, other than as a difference in performance.
R1--2. All caches must be fully initialized and enabled, and they must have accurate state bits prior to the transfer of control to the OS.
R1--3. If an in-line external cache is used, it must support one reservation as defined for the Load And Reserve and Store Conditional instructions.
R1--4. For the Symmetric Multiprocessor option: Platforms must implement their cache hierarchy such that all caches at a given level in the cache hierarchy can be flushed and disabled before any caches at the next level which may cache the same data are flushed and disabled (that is, L1 first, then L2, and so on).
R1--5. For the Symmetric Multiprocessor option: If a cache implements snarfing, then the cache must be capable of disabling the snarfing during flushing in order to implement the RTAS stop-self function in an atomic way.
R1--6. Software must not depend on being able to change a cache from copy-back to write-through.

Software Implementation Notes:

Each first level cache will be defined via properties of the cpu node(s) of the OF device tree. Each higher level cache will be defined via properties of the l2-cache node(s) of the OF device tree. See for more details.

To ensure proper operation, cache(s) at the same level in the cache hierarchy should be flushed and disabled before cache(s) at the next level (that is, L1 first, then L2, and so on).
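Software Implementation Example: A minimal sketch of the innermost-first flush-and-disable ordering called out in Requirement R1--4 and the note above follows. The per-level routines are assumed to be supplied by platform-specific code and are named here only for illustration.

    /* Hypothetical per-level cache operations supplied by platform code. */
    extern int  num_cache_levels(void);         /* e.g., 2 for L1 + L2     */
    extern void flush_cache_level(int level);   /* write back dirty lines  */
    extern void disable_cache_level(int level);

    /* Flush and disable the hierarchy innermost-first: all caches at level
     * N are flushed and disabled before any cache at level N+1 that may
     * hold the same data. */
    void quiesce_cache_hierarchy(void)
    {
        for (int level = 1; level <= num_cache_levels(); level++) {
            flush_cache_level(level);
            disable_cache_level(level);
        }
    }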
Memory Status Information

New OF properties are defined to support the identification of, and to contain status information on, good and bad System Memory.

R1--1. Firmware must implement all of the properties for memory modules, as specified by , and any other properties defined by this document which apply to memory modules.
Reserved Memory

Sections of System Memory may be reserved for usage by OS extensions, with the restrictions detailed below. Memory nodes marked with the special value of the “status” property of “reserved” are not to be used or altered by the base OS. Several different ranges of memory may be marked as “reserved”. If DLPAR of memory is to be supported and growth is expected, then an address range must be left unused between these areas in order to allow growth of these areas.

Each area has its own DRC Type (starting at 0, MEM, MEM-1, MEM-2, and so on). Each area has a current and a maximum size, with the current size being the sum of the sizes of the populated DRCs for the area and the maximum being the sum of the sizes of all the DRCs for that area. The logical address space allocated is the sum of all the areas’ maximum sizes. Starting with logical real address 0, the address areas are allocated in the following order: OS, DLPAR growth space for the OS (if DLPAR is supported), reserved area (if any) followed by the DLPAR growth space for that reserved area (if DLPAR is supported), followed by the next reserved area (if any), and so on. The current memory allocation for each area is allocated contiguously from the beginning of the area. A sketch of this layout calculation is shown after the requirements below.

On a boot or reboot, including a hypervisor reboot, if there is any data to be preserved (that is, the “ibm,preserved-storage” property exists in the RTAS node), then the starting logical real address of each LMB is maintained through the reboot. The memory in each region can be independently increased or decreased using DLPAR memory functions, when DLPAR is supported. Changes to the current memory allocation for an area result in the addition or removal of memory at the end of the existing memory allocation.

Implementation Note: If the shared memory regions are not accessed by the programs, and are just used for DMA most of the time, then the same HPFT hit rate could be achieved with a far lower ratio of HPFT entries to logical storage space.

R1--1. For the Reserved Memory option: Memory nodes marked with the special value of the “status” property of “reserved” must not be used or altered by the base OS.

Implementation Note: How areas get chosen to be marked as reserved is beyond the scope of this architecture.

R1--2. For the Reserved Memory option with the LRDR option: Each unique memory area that is to be changed independently via DLPAR must have different DRC Types (for example, MEM, MEM-1, and so on).
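Software Implementation Example: The following C program illustrates the logical real address layout described above; each area's starting address is the running sum of the maximum sizes of the areas that precede it, and the current allocation occupies the beginning of each area. The area list, sizes, and field names are hypothetical and serve only to show the arithmetic.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical description of one logical memory area (OS or reserved),
     * with max_size covering both the current allocation and DLPAR growth. */
    struct mem_area {
        const char *name;
        uint64_t    current_size;  /* sum of populated DRC sizes              */
        uint64_t    max_size;      /* sum of all DRC sizes (current + growth) */
    };

    int main(void)
    {
        /* Illustrative layout: OS area followed by one reserved area. */
        struct mem_area areas[] = {
            { "OS",       4ULL << 30, 8ULL << 30 },  /* 4 GB now, 8 GB max */
            { "reserved", 1ULL << 30, 2ULL << 30 },  /* 1 GB now, 2 GB max */
        };

        uint64_t base = 0;  /* layout starts at logical real address 0 */
        for (unsigned i = 0; i < sizeof(areas) / sizeof(areas[0]); i++) {
            printf("%-9s base 0x%llx, populated to 0x%llx, reserved through 0x%llx\n",
                   areas[i].name,
                   (unsigned long long)base,
                   (unsigned long long)(base + areas[i].current_size),
                   (unsigned long long)(base + areas[i].max_size));
            base += areas[i].max_size;  /* next area starts after this area's maximum */
        }
        return 0;
    }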
Persistent Memory

Selected regions of storage (LMBs) may be optionally preserved across client program boot cycles. See and .