Non Uniform Memory Access (NUMA) Option
Summary of Extensions to Support NUMA
To a first-level approximation, a NUMA platform is simply a large-scale Symmetric Multi-Processor. However, to tune system performance and to aid in platform maintenance, the OS needs additional information and mechanisms. These include:
Associativity -- to determine the platform resource groupings.
Relative Performance Distances -- to determine the performance between resources within different groupings.
Performance Monitor -- to provide usage data on the NUMA fabric.
Dynamic Reconfiguration -- to handle changes due to such causes as platform upgrade, reallocation of resources, or repair of a failure.
There are two NUMA support options: the “NUMA” option and its proper subset, the “Associativity Information” option.
NUMA Resource Associativity
Associativity Codes represent the groupings of the various platform resources into domains of substantially similar mean performance relative to resources outside of that domain. Resource subsets of a given domain that exhibit better performance relative to each other than relative to other resource subsets are represented as being members of a sub-grouping domain. Such sub-domain grouping is represented to any level deemed significant by the platform design. The following figure presents a simple system configuration with one possible decomposition into associativity domains. From the decomposition provided, the “ibm,associativity” value string for each resource is enumerated.
Example NUMA configuration with domains and corresponding “ibm,associativity” values
The OF Device Tree node for each allocable resource (processor, memory region, and IO slot) conveys information about the resources statically assigned to the client program, and contains the “ibm,associativity” property (see ). This property allows the client program to determine the associativity between any two of its resources. The greater the associativity, the greater the expected performance when using those two resources in a given operation.
The legal form of the “ibm,associativity” property depends upon the setting of the “ibm,architecture-vec-5” property byte 5 bit 0. A bit value of zero allows the “ibm,associativity” property string to be sequenced in priority order. This form is being deprecated for new implementations in favor of the form indicated by a bit value of one, in which the “ibm,associativity” property string represents the strict physical hierarchy of the platform.
When the LPAR option is also implemented, the partition virtual resources may be mapped onto physical resources in a very dynamic manner. If the resource mapping to the associativity domain is substantially consistent, the client program can use the associativity information to optimize performance on average. If the resource mapping to the associativity domain is substantially inconsistent, then associativity information for the resources is not provided, to prevent erroneous operation. If the long term mapping changes, the client program can be made aware of the new associativity information using the ibm,update-properties RTAS call (See ).
R1--1. For the NUMA or Associativity Information option: The platform must include the “ibm,associativity” property in the OF device tree memory node and the nodes of each processor, memory region, and PCI bridge onto which IOAs may be plugged if the component is dedicated to the partition.
(The device tree node for a component that the platform intends to virtualize should include an “ibm,associativity” property if the associativity domain information is substantially accurate.)
R1--2. For the NUMA option and SPLPAR option: In the case that both the NUMA and SPLPAR options are implemented, Requirement is modified to remove processors from the list of system elements that must include the respective properties or interfaces described by that requirement. (The platform is encouraged to provide processor associativity information if it is substantially accurate.)
The “ibm,associativity” property contains one or more lists of numbers representing the resource’s platform grouping domains. Each list starts with a number representing the domain number of the highest level grouping within which the platform is capable of supporting direct access. This highest level may be a NUMA collective or possibly a cluster of machines with direct DMA access. Successive numbers represent sub-divisions of the previous higher level within which the expected mean value of the performance relative to outside resources is substantially similar. Implementations determine the number of levels that they report, subject to Requirements and . The lowest level is always that of the allocable resource itself. The user of this information is cautioned not to infer any specific physical/logical significance of the various intermediate levels.
R1--3. For the NUMA or Associativity Information option: Differing levels of resource grouping represented in the “ibm,associativity” property must reflect statistically repeatable differences in the expected mean of measured performance.
R1--4. For the NUMA or Associativity Information option: The expected mean performance of any resource of a given type within the same grouping domain represented in the “ibm,associativity” property relative to resources outside of that grouping domain must be substantially similar.
The “ibm,associativity” property may contain multiple associativity lists because a resource may be multiply connected into the platform. Such a resource has different associativity characteristics relative to each of its connections. To determine the associativity between any two resources, the OS scans down the two resources’ associativity lists in all pairwise combinations, counting how many domains are the same until the first domain where the two lists do not agree. The highest such count is the associativity between the two resources.
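The pairwise scan described above can be sketched as follows. This is a minimal illustration, not platform code: it assumes the “ibm,associativity” lists have already been parsed into Python lists of integers, and the function names and domain numbers are hypothetical.

```python
from itertools import product


def common_prefix_length(list_a, list_b):
    """Count matching domains from the top level down to the first mismatch."""
    count = 0
    for domain_a, domain_b in zip(list_a, list_b):
        if domain_a != domain_b:
            break
        count += 1
    return count


def associativity(resource_a, resource_b):
    """Highest prefix match over all pairwise combinations of the two
    resources' associativity lists (each resource is a list of lists)."""
    return max(
        common_prefix_length(a, b) for a, b in product(resource_a, resource_b)
    )


# Hypothetical, already-parsed property values (not taken from a real platform).
cpu = [[0, 1, 4, 16]]
memory = [[0, 1, 4, 20]]
io_slot = [[0, 1, 5, 30], [0, 2, 7, 31]]  # multiply connected resource

cpu_memory = associativity(cpu, memory)
cpu_io = associativity(cpu, io_slot)
```

With these sample values the processor and memory region share three leading domains, while the multiply connected IO slot shares at most two with the processor, so the processor is expected to perform better with that memory region than with that slot.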
Relative Performance Distance
An OS applies its NUMA tuning techniques based upon associativity and relative performance distance attributes. As a guide to relative performance distance, RISC Platforms provide the “ibm,associativity-reference-points” property. The information in this property represents a first order approximation to points having associativity and relative performance distance characteristics deemed to be of significant interest for optimizing client program performance.
The contents of the “ibm,associativity-reference-points” property depend upon the setting of the “ibm,architecture-vec-5” property byte 5 bit 0. A bit value of zero allows the “ibm,associativity-reference-points” property string to indicate logical structure points. This form is being deprecated for new implementations in favor of the form indicated by a bit value of one, in which the “ibm,associativity-reference-points” property string represents boundaries between associativity domains presented by the “ibm,associativity” property containing “near” and “far” resources.
R1--1. For the NUMA or Associativity Information option: The RTAS OF device tree node must contain the “ibm,associativity-reference-points” property.
Form 0
When the “ibm,architecture-vec-5” property byte 5 bit 0 has the value of zero, the “ibm,associativity-reference-points” property defines reference points in the “ibm,associativity” property (see ) which roughly correspond to traditional notions of platform topology constructs. It is important for the user to realize that these reference points are not exact and that their characteristics vary among implementations.
The first integer in the “ibm,associativity-reference-points” property gives the 1-based ordinal, in the associativity lists of the platform’s “ibm,associativity” property, of the level associated with the traditional notion of a symmetric multi-processor within a NUMA platform. That is the level that represents building blocks of processors and memory that have the following characteristics:
An OS is likely to view all members as having roughly uniform access characteristics.
The level represents the highest level before an OS is likely to notice major non-uniformity of access.
The second integer in the “ibm,associativity-reference-points” property gives the 1-based ordinal, in the associativity lists of the platform’s “ibm,associativity” property, of the level associated with the traditional notion of a processor group, which is sometimes packaged in a multi-chip module. A processor group has similar characteristics to an SMP; however, several processor groups get packaged densely within the same physical enclosure, forming an SMP. While the difference between intra processor group accesses and inter processor group accesses is measurable, it is a second order effect.
Subsequent “ibm,associativity-reference-points” entries are reserved.
Form 1
When the “ibm,architecture-vec-5” property byte 5 bit 0 has the value of one, the “ibm,associativity-reference-points” property indicates boundaries between associativity domains presented by the “ibm,associativity” property containing “near” and “far” resources. The first such boundary in the list represents the 1-based ordinal in the associativity lists of the most significant boundary, with subsequent entries indicating progressively less significant boundaries.
Note: Platforms are encouraged to report boundaries of actual significance. Thus, if a platform has only a single significant boundary to report, the preferred form of the “ibm,associativity-reference-points” property would contain a single entry. However, providing two or more entries that reference the same associativity domains provides equivalent information and is a legal representation.
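One possible way for a client program to apply Form 1 reference points is sketched below. It is an interpretation under stated assumptions, not a method defined by this text: two resources are treated as “near” at a boundary when their associativity lists agree on every domain up to and including that 1-based ordinal, which presumes the strict physical hierarchy of Form 1. The function names and sample values are hypothetical.

```python
def near_at_boundary(list_a, list_b, ordinal):
    """True if both lists reach the 1-based ordinal and agree on every
    domain from the top level through that ordinal (assumed rule)."""
    if len(list_a) < ordinal or len(list_b) < ordinal:
        return False
    return list_a[:ordinal] == list_b[:ordinal]


def classify_boundaries(list_a, list_b, reference_points):
    """Classify a pair of resources at each reported boundary, keeping the
    property's order (most significant boundary first)."""
    return [
        (ordinal, "near" if near_at_boundary(list_a, list_b, ordinal) else "far")
        for ordinal in reference_points
    ]


# Hypothetical parsed values: one associativity list per resource and a
# two-entry "ibm,associativity-reference-points" property.
cpu_list = [0, 1, 4, 16]
memory_list = [0, 1, 5, 17]
reference_points = [3, 2]

result = classify_boundaries(cpu_list, memory_list, reference_points)
```

Here the two resources fall on opposite sides of the most significant boundary (ordinal 3) but on the same side of the less significant one (ordinal 2). A platform that reports the same ordinal twice yields the same classification twice, consistent with the note that such a representation carries equivalent information.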
Dynamic Reconfiguration with Cross CEC I/O Drawers
Should the configuration change in such a way that the associativity between an OS image’s resources changes, the platform notifies the OS via an event scan log. See .
R1--1. For the NUMA or Associativity Information option: If the platform configuration changes in such a way that the associativity between an OS image’s resources might have changed, the platform must notify the OS via an event scan or check exception log.
Maximum and Current Associativity Domains
Since the number of associativity domains that a platform may exhibit is not apparent from the associativity properties presented at boot time, the platform provides the “ibm,max-associativity-domains” and the “ibm,current-associativity-domains” properties in the /rtas node of the device tree (see ).
R1--1. For the NUMA or Associativity Information option: The platform must provide the “ibm,max-associativity-domains” and the “ibm,current-associativity-domains” properties in the /rtas node of the device tree.
Platform Resource Reassignment Notification Option (PRRN)
LoPAR platforms that implement the LPAR option are allowed to transparently reassign the platform resources that are used by a partition. For instance, if a processor or memory unit is predicted to fail, the platform may transparently move the processing to an equivalent unused processor or the memory state to an equivalent unused memory unit. However, reassigning resources across NUMA boundaries may alter the performance of the partition. When such reassignment is necessary, the PRRN option provides mechanisms that inform the supporting OS of changes to the affinity among its platform resources. Handling such notifications is expected to involve significant OS processing. Therefore, changing affinity should be avoided, and when it is necessary to change the affinity of several of the resources owned by a partition, a single notification after all such changes have occurred is preferred.
The OS and platform firmware negotiate their mutual support of the PRRN option via the ibm,client-architecture-support interface (See ). Should a partition be migrated from a platform that did not support the PRRN option, the target platform does not notify the partition’s OS of any PRRN events and, when possible, avoids changing the affinity among the partition’s resources. Partitions that are about to be migrated complete or abort any in-process affinity change processing prior to the migration, and if the target platform does not support the PRRN option, the partition simply sees no more PRRN events.
A PRRN event is signaled via the RTAS event-scan mechanism, which returns a Hot Plug Event message “fixed part” (See ) indicating “Platform Resource Reassignment”. In response to the Hot Plug Event message, the OS may call ibm,update-nodes to determine which resources were reassigned, and then ibm,update-properties to obtain the new affinity information about those resources.
The PRRN event-scan RTAS message contains only the “fixed part” with the “Type” field set to the value 160 and no Extended Event Log. Since no Extended Event Log message is included, the four (4) byte Extended Event Log Length field is repurposed to pass the “scope” parameter that causes the ibm,update-nodes RTAS call to return the nodes affected by the specific resource reassignment.
Requirements:
R1--1. For the PRRN Option: The platform must support the negotiation of the Associativity Information Option Control Platform Resource Reassignment Notification (Affinity Change) flag via the ibm,client-architecture-support interface.
R1--2. For the PRRN Option: If the client code did not claim support for the PRRN option via the ibm,client-architecture-support interface, the platform must not present PRRN events per section .
R1--3. For the combination of the PRRN and Partition Suspension Options: To avoid firmware function conflicts, the client code must complete or abort any PRRN processing prior to exercising the Partition Suspension option.
R1--4. For the PRRN Option: The platform must inform the client code of platform resource reassignments via the event-scan RTAS mechanism with a “fixed part” only event return message as presented in the table below.
R1--5. For the PRRN Option: The platform must support the Platform Resource Reassignment scope (negative of the value contained in bits 32:63 of the RTAS Event Return Format (Fixed Part) for PRRN events) as an input parameter to the ibm,update-nodes RTAS call.
RTAS Event Return Format (Fixed Part) for PRRN events
Bit Field Name (bit number(s)) | Description, Values (Described in )
Version (0:7) | A distinct value used to identify the architectural version of the message
Severity (8:10) | EVENT (1)
RTAS Disposition (11:12) | FULLY_RECOVERED (0)
Optional_Part_Presence (13) | NOT_PRESENT (0): The optional Extended Error Log is not present.
Reserved (14:15) | 0b00
Initiator (16:19) | HOT PLUG (6)
Target (20:23) | UNKNOWN (0): Not Applicable
Type (24:31) | Platform Resource Reassignment (160) – includes Change Scope in bits 32:63
Extended Event Log Length / Change Scope (32:63) | The scope parameter to be input to the ibm,update-nodes RTAS call to retrieve the nodes that were changed by selected “Hot Plug” events.
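The bit layout of the PRRN fixed part can be illustrated with a small decoder. This is a sketch under assumptions that the text above does not state explicitly: the 8-byte fixed part is taken to be big-endian, and bit 0 is taken to be the most significant bit of the first 32-bit word. The function name, dictionary keys, and the sample Version value are hypothetical; only the field positions and the values 1, 6, and 160 come from the table.

```python
import struct

PRRN_TYPE = 160          # "Platform Resource Reassignment" per the table
HOT_PLUG_INITIATOR = 6   # "HOT PLUG" per the table


def decode_fixed_part(data):
    """Split the 8-byte fixed part into its named bit fields."""
    if len(data) < 8:
        raise ValueError("fixed part needs at least 8 bytes")
    # Assumption: big-endian words, bit 0 is the most significant bit.
    word0, word1 = struct.unpack(">II", data[:8])

    def field(first_bit, last_bit):
        width = last_bit - first_bit + 1
        return (word0 >> (31 - last_bit)) & ((1 << width) - 1)

    return {
        "version": field(0, 7),
        "severity": field(8, 10),
        "disposition": field(11, 12),
        "optional_part_present": field(13, 13),
        "initiator": field(16, 19),
        "target": field(20, 23),
        "type": field(24, 31),
        "change_scope": word1,  # bits 32:63
    }


def is_prrn_event(fields):
    """True if the decoded fields describe a PRRN Hot Plug event."""
    return (fields["type"] == PRRN_TYPE
            and fields["initiator"] == HOT_PLUG_INITIATOR
            and fields["optional_part_present"] == 0)


# Hypothetical sample: Version 6, Severity EVENT (1), Initiator 6, Type 160,
# Change Scope 1.
sample = struct.pack(">II", (6 << 24) | (1 << 21) | (6 << 12) | 160, 1)
fields = decode_fixed_part(sample)
# The requirement above describes the ibm,update-nodes scope as the negative
# of the value in bits 32:63.
update_nodes_scope = -fields["change_scope"]
```

Keeping the raw field and the negated scope separate makes it easier to check the decoder against the table before the value is passed on to ibm,update-nodes.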