Non Uniform Memory Access (NUMA) Option

Summary of Extensions to Support NUMA

NUMA platforms are, to a first order approximation, simply large
scale Symmetric Multi-Processors. However, to tune system performance and to aid
in platform maintenance, the OS needs additional information and mechanisms.
These include:

Associativity -- to determine the platform resource groupings.
Relative Performance Distances -- to determine the performance between resources within different groupings.
Performance Monitor -- to provide usage data on the NUMA fabric.
Dynamic Reconfiguration -- due to such causes as platform upgrade, reallocation of resources, or a repair of a failure.

There are two NUMA support options: the “NUMA” option
and its proper subset, the “Associativity Information”
option.

NUMA Resource Associativity

Associativity Codes represent the groupings of the various platform
resources into domains of substantially similar mean performance relative to
resources outside of that domain. Resource subsets of a given domain that
exhibit better performance relative to each other than relative to other
resource subsets are represented as being members of a sub-grouping domain.
Such sub-domain grouping is represented to any level deemed significant by the
platform design. presents a simple
system configuration with one possible decomposition into associativity
domains. From the decomposition provided, the
“ibm,associativity” value string for each resource is
enumerated.

The OF Device Tree node for each allocable resource
(processor, memory region, and IO slot) conveys information about the resources
statically assigned to the client program, and contains the
“ibm,associativity”
property (see ). This property allows the client
program to determine the associativity between any two of its
resources. The greater the associativity, the greater the expected performance
when using those two resources in a given operation.

The legal form of the “ibm,associativity”
property depends upon the setting of the
“ibm,architecture-vec-5”
property byte 5 bit 0. A bit value of zero allows the
“ibm,associativity” property string to be sequenced in
priority order. This form is being deprecated for new implementations in favor
of the form selected by a bit value of one, in which the
“ibm,associativity” property string represents the
strict physical hierarchy of the platform.

When the LPAR option is also implemented, the partition’s virtual
resources may be mapped onto physical resources in a very dynamic manner.
Provided that the resource mapping to the associativity domain is substantially
consistent, the client program can use the associativity information to
optimize performance on average. If the resource mapping to the
associativity domain is substantially inconsistent, then associativity
information for the resources is not provided, to prevent erroneous operation.
If the long-term mapping changes, the client program can be made aware of the
new associativity information using the ibm,update-properties RTAS call (see
).

R1--1. For the NUMA or Associativity
Information option: The platform must include the
“ibm,associativity” property in the OF device tree
memory node and the nodes of each processor, memory
region, and PCI bridge onto which IOAs may be plugged if the component is
dedicated to the partition. (The device tree node for a component that the
platform intends to virtualize should include an
“ibm,associativity” property if the associativity
domain information is substantially accurate.)

R1--2. For the NUMA option and SPLPAR
option: In the case that both the NUMA and SPLPAR options are
implemented, Requirement is modified
to remove processors from the list of system elements that must include the
respective properties or interfaces described by that requirement. (The
platform is encouraged to provide processor associativity information if it is
substantially accurate.)

The “ibm,associativity”
property contains one or more lists of numbers representing the
resource’s platform grouping domains. Each list starts with a number
representing the domain number of the highest level grouping within which the
platform is capable of supporting direct access. This highest level may be a
NUMA collective or possibly a cluster of machines with direct DMA access.
Successive numbers represent sub-divisions of the previous higher level within
which the expected mean value of the performance relative to outside resources
is substantially similar. Implementations determine the number of levels that
they report, subject to Requirements
and . The lowest level is always
that of the allocable resource itself. The user of this information is
cautioned not to infer any specific physical/logical significance from the
various intermediate levels.

R1--3. For the NUMA or Associativity
Information option: Differing levels of resource grouping represented in the
“ibm,associativity” property
must reflect statistically repeatable differences in the expected mean of
measured performance.

R1--4. For the NUMA or Associativity
Information option: The expected mean performance of any resource
of a given type within the same grouping domain represented in the
“ibm,associativity” property relative to
resources outside of that grouping domain must be substantially similar.

The reason that the “ibm,associativity”
property may contain
multiple associativity lists is that a resource may be multiply connected into
the platform. Such a resource then has different associativity characteristics
relative to its multiple connections. To determine the associativity between
any two resources, the OS scans down the two resources’ associativity lists in
all pairwise combinations, counting how many domains are the same until the
first domain where the two lists do not agree. The highest such count is the
associativity between the two resources.

Relative Performance Distance

An OS applies its NUMA tuning techniques based upon associativity and
relative performance distance attributes. As a guide to relative performance
distance, RISC Platforms provide the “ibm,associativity-reference-points”
property. The information in this property represents a first order approximation to points
having associativity and relative performance distance characteristics deemed
to be of significant interest to optimizing client program performance.

The contents of the “ibm,associativity-reference-points”
property depend upon the setting of the
“ibm,architecture-vec-5”
property byte 5 bit 0. A bit value of zero allows the
“ibm,associativity-reference-points” property string
to indicate logical structure points. This form is being deprecated for new
implementations in favor of the form selected by a bit value of one, in which the
“ibm,associativity-reference-points” property string
represents boundaries between associativity domains presented by the
“ibm,associativity” property containing
“near” and “far” resources.

R1--1. For the NUMA or Associativity
Information option: The RTAS OF device tree node must contain the
“ibm,associativity-reference-points” property.

Form 0

When the “ibm,architecture-vec-5”
property byte 5 bit 0 has the value of zero, the
“ibm,associativity-reference-points” property defines
reference points in the “ibm,associativity”
property (see ) which roughly correspond to
traditional notions of platform topology constructs. It is important for the
user to realize that these reference points are not exact and their
characteristics vary among implementations.

The first integer in the “ibm,associativity-reference-points”
property gives the 1-based ordinal, in the associativity lists of the platform’s
“ibm,associativity” property, of the level associated
with the traditional notion of a symmetric multi-processor (SMP) within a NUMA
platform. That is, the level that represents building blocks of processors and
memory that have the following characteristics:

An OS is likely to view all members as having roughly uniform access characteristics.
It represents the highest level before an OS is likely to notice major non-uniformity of access.

The second integer in the “ibm,associativity-reference-points”
property gives the 1-based ordinal, in the associativity lists of the platform’s
“ibm,associativity” property, of the level associated
with the traditional notion of a processor group, which is sometimes packaged in
a multi-chip module. A processor group has characteristics similar to an SMP;
however, several processor groups are packaged densely within the same physical
enclosure, forming an SMP. While the performance of intra-processor-group accesses
differs measurably from that of inter-processor-group accesses, the difference is
a second-order effect. Subsequent “ibm,associativity-reference-points” entries are reserved.

Form 1

When the “ibm,architecture-vec-5”
property byte 5 bit 0 has the value of one, the
“ibm,associativity-reference-points” property
indicates boundaries between associativity domains presented by the
“ibm,associativity” property containing
“near” and “far” resources. The first such boundary
in the list gives the 1-based ordinal in the associativity lists of the
most significant boundary, with subsequent entries indicating progressively
less significant boundaries.

Note: Platforms are encouraged to report
boundaries of actual significance. Thus, if a platform has only a single
significant boundary to report, the preferred form of the
“ibm,associativity-reference-points” property would
contain a single entry. However, providing two or more entries that reference
the same associativity domains provides equivalent information and is a legal
representation.

Dynamic Reconfiguration with Cross CEC I/O Drawers

Should the configuration change in such a way that the associativity
between an OS image’s resources changes, the platform notifies the OS
via an event scan log. See .

R1--1. For the NUMA or Associativity
Information option: If the platform configuration changes in such a
way that the associativity between an OS image’s resources might have
changed, the platform must notify the OS via an event scan or check exception
log.

Maximum Associativity Domains

Since the number of associativity domains that a platform may exhibit
is not apparent from the associativity properties presented at boot time, the
platform provides the “ibm,max-associativity-domains”
property in the /rtas node of the device tree (see
).

R1--1. For the NUMA or Associativity
Information option: The platform must provide the
“ibm,max-associativity-domains” property in
the /rtas node of the device tree.

Platform Resource Reassignment Notification Option (PRRN)

LoPAR platforms that implement the LPAR option are allowed to
transparently reassign the platform resources that are used by a partition. For
instance, if a processor or memory unit is predicted to fail, the platform may
transparently move the processing to an equivalent unused processor or the
memory state to an equivalent unused memory unit. However, reassigning
resources across NUMA boundaries may alter the performance of the partition.
When such reassignment is necessary, the PRRN option provides mechanisms that
inform the supporting OS of changes to the affinity among its platform
resources. Handling such notifications is expected to involve
significant OS processing; therefore, changing affinity should be avoided, and
when it is necessary to change the affinity of several of the resources owned
by a partition, a single notification after all such changes have occurred is
preferred.

The OS and platform firmware negotiate their mutual support of the
PRRN option via the ibm,client-architecture-support
interface (see ). Should a partition be
migrated from a platform that did not support the PRRN option, the target
platform does not notify the partition’s OS of any PRRN events and, when
possible, avoids changing the affinity among the partition’s resources.
Partitions that are about to be migrated complete or abort any in-process affinity
change processing prior to the migration; if the target platform does not
support the PRRN option, the partition simply sees no further PRRN
events.

A PRRN event is signaled via the RTAS event-scan
mechanism, which returns a Hot Plug Event message “fixed
part” (see ) indicating “Platform
Resource Reassignment”. In response to the Hot Plug Event message, the
OS may call ibm,update-nodes to determine which resources
were reassigned, and then ibm,update-properties to obtain
the new affinity information about those resources.

The PRRN event-scan RTAS message contains only
the “fixed part” with the “Type” field set to the
value 160 and no Extended Event Log. Because no Extended Event Log message is
included, the four-byte Extended Event Log Length field is repurposed to
pass the “scope” parameter that causes the
ibm,update-nodes call to return the nodes affected by the specific
resource reassignment.

Requirements:

R1--1. For the PRRN Option:
The platform must support the negotiation of the Associativity Information
Option Control Platform Resource Reassignment Notification (Affinity Change)
flag via the ibm,client-architecture-support
interface.

R1--2. For the PRRN Option:
If the client code did not claim support for the PRRN option via the
ibm,client-architecture-support interface, the platform must not
present PRRN events per section .

R1--3. For the combination of the PRRN
and Partition Suspension Options: To avoid firmware function
conflicts, the client code must complete or abort any PRRN processing prior to
exercising the Partition Suspension option.

R1--4. For the PRRN Option:
The platform must inform the client code of platform resource reassignments via
the event-scan RTAS mechanism with a “fixed
part” only event return message as presented in the table
“RTAS Event Return Format (Fixed Part) for PRRN events”.

R1--5. For the PRRN Option:
The platform must support the Platform Resource Reassignment scope (negative of
the value contained in bits 32:63 of the RTAS Event Return Format (Fixed Part)
for PRRN events) input parameter to the
ibm,update-nodes RTAS call.
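
As an illustration of R1--4 and R1--5, the following Python sketch decodes the eight-byte fixed part of a PRRN event using the bit positions listed in the table “RTAS Event Return Format (Fixed Part) for PRRN events”, and derives the scope argument for ibm,update-nodes. It assumes that the fixed part is available as big-endian bytes and that bit 0 is the most significant bit. The helper names (field, decode_prrn_fixed_part, update_nodes_scope) are hypothetical, and the sketch does not show how the RTAS calls themselves are issued.

```python
import struct

PRRN_TYPE = 160  # "Platform Resource Reassignment" value of the Type field


def field(word: int, first: int, last: int, width: int = 64) -> int:
    """Extract bits first..last, numbering bit 0 as the most significant bit
    (an assumption of this sketch)."""
    nbits = last - first + 1
    shift = width - 1 - last
    return (word >> shift) & ((1 << nbits) - 1)


def decode_prrn_fixed_part(raw: bytes) -> dict:
    """Split the first 8 bytes of an event return message into its fields."""
    if len(raw) < 8:
        raise ValueError("the fixed part needs 8 bytes")
    (word,) = struct.unpack(">Q", raw[:8])  # big-endian layout is assumed
    return {
        "version": field(word, 0, 7),
        "severity": field(word, 8, 10),
        "disposition": field(word, 11, 12),
        "optional_part": field(word, 13, 13),
        "initiator": field(word, 16, 19),
        "target": field(word, 20, 23),
        "type": field(word, 24, 31),
        "change_scope": field(word, 32, 63),
    }


def update_nodes_scope(event: dict) -> int:
    """Return the scope for ibm,update-nodes: the negative of bits 32:63."""
    if event["type"] != PRRN_TYPE or event["optional_part"] != 0:
        raise ValueError("not a fixed-part-only PRRN event")
    return -event["change_scope"]
```

For a made-up message whose Type field is 160 and whose Change Scope is 5, update_nodes_scope returns -5. How that value is encoded into an RTAS argument cell is outside the scope of this sketch.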
RTAS Event Return Format (Fixed Part) for PRRN events

Bit Field Name (bit number(s))                     Description, Values (Described in )
Version (0:7)                                      A distinct value used to identify the architectural version of message
Severity (8:10)                                    EVENT (1)
RTAS Disposition (11:12)                           FULLY_RECOVERED (0)
Optional_Part_Presence (13)                        NOT_PRESENT (0): The optional Extended Error Log is not present.
Reserved (14:15)                                   0b00
Initiator (16:19)                                  HOT PLUG (6)
Target (20:23)                                     UNKNOWN (0): Not Applicable
Type (24:31)                                       Platform Resource Reassignment (160); includes Change Scope in bits 32:63
Extended Event Log Length / Change Scope (32:63)   The scope parameter to be input to the ibm,update-nodes RTAS call to retrieve the nodes that were changed by selected “Hot Plug” events.
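
Returning to the comparison procedure described under NUMA Resource Associativity, the following Python sketch shows one way an OS might compute the associativity between two resources: compare every pair of associativity lists, count the leading domains that match, and keep the highest count. It assumes the “ibm,associativity” property has already been parsed into lists of integers; the function names and the domain numbers used in the usage note are hypothetical.

```python
from itertools import product
from typing import Sequence


def matching_domains(a: Sequence[int], b: Sequence[int]) -> int:
    """Count leading domains that agree, stopping at the first mismatch."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def associativity(lists_a: Sequence[Sequence[int]],
                  lists_b: Sequence[Sequence[int]]) -> int:
    """Highest matching-domain count over all pairwise list combinations."""
    return max(
        (matching_domains(a, b) for a, b in product(lists_a, lists_b)),
        default=0,  # a resource with no lists has no known associativity
    )
```

With made-up lists such as [[0, 1, 4, 10]] for a processor and [[0, 1, 5, 20]] for a memory region, the result is 2. A multiply connected resource with lists [[0, 1, 5, 30], [0, 2, 7, 31]] compared against the same memory region yields 3, because the best-matching connection determines the result.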