Logical Partitioning Option

Overview

The Logical PARtitioning option (LPAR) simultaneously runs one or
more copies of a single OS or multiple heterogeneous LoPAR compliant OSs
on a single LoPAR platform. A partition, within which an OS image runs, is
assigned a non-overlapping sub-set of the platform’s resources. These
platform-allocatable resources include one or more architecturally distinct
processors with their interrupt management area, regions of system memory,
and I/O adapter bus slots. Partition firmware loaded into each partition
generates an OF device tree to represent the resources of the partition to
the OS image. Allocatable resources are directly controlled by an OS. This
architecture restricts the sharing of allocatable resources between
partitions; to do so requires the use of optional facilities presented in
. Platform resources, other than
allocatable resources mentioned above, that are represented by OF nodes in
the device tree of more than one partition (for example, memory controllers
and processor host bridges) are marked ‘used-by-rtas’.

Since one of the main purposes of partitioning is isolation of the
OSs, the ability to manage real system resources that are common to all
partitions is modified for the LPAR option. This means that partition use
of RTAS functions that ostensibly use real system resources, such as power
and time-of-day clocks, is buffered from actual manipulation of those
resources. The RTAS is modified for LPAR, and has hypervisor support, to
virtualize real resources for the partitions. Operational management of the
platform moves to a Hardware Management Console (HMC) which is an
application, either local or remote, that manages platform resources with
messages to the hypervisor rather than being under direct control of a
partition’s OS.

Platforms supporting LPAR contain Power PC processors that support
the hypervisor addressing mode, in which the physical address is equal to
the effective address and all processor resources are available. The
“Real Mode” addressing mode, in these processors, is redefined
to translate and limit the physical addresses that the processor can access
and to restrict access to certain address translation controlling processor
resources. The virtual addressing mode is unchanged. See the
(level 2.0 and beyond) for the
architecture extensions required for the processor.

The I/O subsystems of these platforms contain I/O bridges that
restrict the bus addresses that I/O adapters can access. These restricted
bus addresses are subsequently translated through the Translation Control
Entry (TCE) mechanism to restrict Direct Memory Accesses (DMAs) by I/O
devices. This restriction is to system memory allocated to a partition and
managed by the OS image owning the device. The interrupt subsystem of these
platforms is enhanced with multiple (one per partition) global interrupt
queues to direct interrupts to any processor assigned to the I/O
adapter’s owning OS image.

Logical Partitioning platforms employ a unique firmware component
called the hypervisor (that runs in hypervisor mode) to manage the address
mapping and common platform hardware facilities, thereby ensuring isolation
between partitions. The OS uses new hypervisor interfaces to manage the
partition’s page frame and TCE tables. The platform firmware utilizes
implementation dependent interfaces to platform hardware common to all
partitions. Thus, a system with LPAR has different OS support than a system
without LPAR.

In addition to generating per-partition device trees, the OF
component of a logically partitioned platform manages the initial booting
and any subsequent booting of the specific OS image associated with each
partition.

The NVRAM of a platform contains configuration variables, policy
options, and working storage that is protected from accesses that might
adversely affect one or more partitions and their OS images. The hypervisor
firmware component restricts an OS image’s access to NVRAM to a
region assigned to its partition. This may restrict the number of
partitions.

Most system management on systems without LPAR is performed by
OS-based applications that are given access to modify the platform’s
configuration variables, policy options and firmware flash. For various
Reliability, Availability, and Serviceability (RAS) reasons, LoPAR Logical
Partitioning platforms do not restrict platform operational management
functions to applications running on a preferred partition or OS image.
Access to these Operational Management facilities is provided via a Support
Processor communication port that is connected to an HMC and/or through a
communications port that is connected through a PCI adapter in a partition.
The HMC is a set of applications running in a separate stand-alone platform
or in one of the platform’s partitions. These HMC applications are
required to establish the platform’s LPAR configuration; however, the
configuration is stored in the platform and, therefore, the HMC is not
required to boot or operate the platform in a pre-configured non-error
condition.

Real Mode Accesses

When the OS controlling an LPAR runs with address translation
turned off (MSR[DR] or MSR[IR] bit(s) = 0) (real mode), the LPAR hardware
translates the memory addresses to an LPAR unique area known as the Real
Mode Area (RMA). When control is initially passed to the OS from the
platform, the RMA starts at the LPAR's logical address 0 and is the first
logical memory block reported in the LPAR’s device tree. In
general, the RMA is a subset of the LPAR's logical address space.
Attempting a non-relocated access beyond the bounds of the RMA results in
a storage interrupt (ISI/DSI depending upon instruction or data
reference). The RMA hardware translation scheme is platform dependent.
The options are given below.

Offset and Limit Registers

The Offset RMA architecture checks the LPAR effective address
against the contents of an RMOL register. If the effective address is
less, the access is allowed after an LPAR specific offset is added to form
the real address; otherwise, a protection exception is signaled.

Reserved Virtual Addresses

The platform may map the RMA through the hashed page table via a
reserved range of virtual addresses. This mapping from the effective
address is done by setting the high order virtual address bits
corresponding to the VSID to the 0b00 || 0x001FFFFFF 1 TB VSID value.
This virtual address is then translated as other virtual addresses. If
the effective address is outside the bounds of the RMA, the storage
interrupt signals a PTEG miss. The platform firmware prepopulates the
LPAR's page frame table with “bolted” entries representing
the real storage blocks that make up the RMA. Note, this method allows
for the RMA to be discontiguous in real address space. The Virtualized
Real Mode Area (VRMA) option gives the OS the ability to dynamically
relocate, expand, and shrink the RMA. See
for more details.

General LPAR Reservations and Conventions

This section documents general LPAR reserved facilities and
conventions. Other sections document reserved facilities and conventions
specific to the function they describe.

R1--1. For the LPAR option: To avoid conflict with the
platform’s reserved addresses, the OS must not use the 1 TB (SLB
and PTE B field equal to one) 0b00 || 0x001FFFFFF VSID for purposes other
than virtualizing the RMA.

R1--2. For the LPAR option: In order to avoid a storage
exception, the OS must not remove PTEs marked with the
“bolted” indicator (PTE bit 59 = 1) unless the virtual
address space can be referenced by another PTE or the OS does not intend
to access the virtual address space.

R1--3. For the LPAR option: To avoid conflict with the
platform’s hypervisor, the OS must be prepared to share use of
SPRG2 as the interrupt scratch register whenever an hcall() is made, or a
machine check or reset interrupt is taken.

R1--4. For the LPAR option: If the platform virtualizes the
RMA, prior to transferring control to the OS, the platform must select a
page size for the RMA such that the platform uses only one page table
entry per page table entry group to virtualize the RMA.

R1--5. For the LPAR option: If the platform virtualizes the
RMA, prior to transferring control to the OS, the platform must use only
the last page table entry of a page table entry group to virtualize the
RMA.

Processor Requirements

R1--1. For the LPAR option: The platform processors must
support the Logical Partitioning (LPAR) facilities as defined in
(Version 2.0 or later).

I/O Sub-System Requirements

The platform divides the I/O subsystem up into Partitionable
Endpoints (PEs). See
for more information on PEs. Each PE has
its own (separate) error, addressing, and interrupt domains, which allows
the assignment of separate PEs to different LPAR partitions.

The following are the requirements for I/O subsystems when the
platform implements LPAR.

R1--1. For the LPAR option: The platform must provide methods
and mechanisms to isolate IOA and I/O bus errors from one PE from affecting
another PE, from affecting a partition to which the PE is not given access
authority by the platform, and from affecting system resources such as the
service processor which are shared between partitions, and must do so with
the EEH option programming model.

R1--2. For the LPAR option:
The platform must enable the EEH option for all PEs by default.

Software and Firmware Implementation Notes: For the
platform (versus the OS or device driver) to enable EEH, there must be some
assurance that the device drivers are EEH aware, if not EEH enabled. For
example, the device driver or OS may signal its awareness by using the
ibm,set-eeh-option RTAS call to enable EEH prior to a configuration cycle
via the ibm,write-pci-config RTAS call that enables the Memory Space or
IO Space enable bits in the PCI Command register, and firmware can ignore
an ibm,write-pci-config RTAS call that enables the Memory Space or IO
Space enable bits for an IOA if EEH for that IOA has not been enabled
first. To be EEH aware, a device driver does not need to be able to
recover from an MMIO Stopped and DMA Stopped state, only recognize the
all-1's condition (from a
Load from its IOA or on a PCI configuration read from
its IOA) and not use data from operations that may have occurred since the
last all-1's checkpoint. In addition, the device driver under such failure
circumstances needs to turn off interrupts (using the
ibm,set-int-off RTAS call, or for conventional PCI and
PCI-X infrastructures only: by resetting the IOA and keeping it reset with
ibm,set-slot-reset or
ibm,slot-error-detail) to make sure that any
(unserviceable) interrupts from the IOA do not affect the system (MSIs are
blocked by the EEH DMA Stopped State, but LSIs are not). Note that if
all-1’s data may be valid, the
ibm,read-slot-reset-state2 RTAS call should be used to
discover the true EEH state of the device.

R1--3. For the LPAR option: The platform must assign a PE to
one and only one partition at a time.

R1--4. For the LPAR option:
The platform must limit the DMA addresses accessible to a PE to the address ranges assigned
to the partition to which the PE is allocated, and, if the PE is used to
implement a VIO device, then also to any allowed redirected DMA address
ranges.

Architecture and Implementation Notes:

Platforms which do not implement either Requirement
or Requirement
require PE granularity of
everything below the PHB, resulting in poor LPAR partition I/O assignment
granularity.

Requirement
has implications in preventing access to both I/O address
ranges and system memory address ranges. That is, Requirement
requires prevention of peer to peer operations from one IOA to
another IOA when those IOA addresses are not owned by the same partition,
as well as providing an access protection mechanism to protect system
memory. Note that relative to peer to peer operations, some bridges or
switches may not provide the capabilities to limit peer to peer, and the
use of such bridges or switches requires the limitation that all IOAs under
such bridges or switches be assigned to the same partition.

R1--5. For the LPAR option: The platform must
provide a PE the capability of accessing all of the System Memory addresses
assigned to the partition to which the PE is allocated.

R1--6. For the LPAR option: If TCEs are used to satisfy
Requirements
, and
, then the platform must provide
the capability to map simultaneously and at all times at least 256 MB for
each PE.

R1--7. For the LPAR option: If TCEs are used to satisfy
Requirements
, and
, then the platform must prevent
any DMA operations to System Memory addresses which are not translated by
TCEs.

R1--8. For the LPAR option: The DMA address range accessible
to a PCI IOA on its I/O bus must be defined by the
“ibm,dma-window” property in its
parent’s OF device tree node.

Platform Implementation Note: To maximize the ability to migrate
memory pages underneath active DMA operations, whenever possible, a bridge
should create a bus for a single IOA and either its representing bridge
node should include the
“ibm,dma-window” property specific for the
IOA for conventional PCI or PCI-X IOAs or the IOA function nodes should
contain the
“ibm,my-dma-window” property specific for
the IOA function for PCI Express IOAs. When the configuration of a bus
precludes memory migration, the platform may combine the DMA address for
multiple IOAs that share a bus into a single
“ibm,dma-window” property housed in the
bridge node representing the bridge that creates the shared bus.

Interrupt Sub-System Requirements

R1--1. For the LPAR option: The platform must
not assign the same interrupt (LSI or MSI) or same interrupt source number
to different PEs (interrupts cannot be shared between partitions).

R1--2. For the LPAR option: The interrupt presentation layer
must support at least one global interrupt queue per platform supported
partition.

R1--3. For the LPAR option: The interrupt presentation layer
must separate the per processor interrupt management areas into separate
4 K pages per processor so that they can each be individually protected by
the PTE mechanism and assigned to their respective
partitions.

R1--4. For the LPAR option: The platform must restrict the
processors that service a global queue to those assigned to a single
partition.

R1--5. For the LPAR option: If the interrupt source layer
supports message signaled interrupts, the platform must isolate the PCI
Message interrupt Input Port (PMIP) in its own 4 K page of the
platform’s address space.

R1--6. For the LPAR option: If the interrupt source layer
supports message signaled interrupts, the hardware must ignore all writes
to the PMIP’s 4 K page except those to the PMIP itself.

R1--7. For the LPAR option: If the interrupt source layer
supports message signaled interrupts, the hardware must return all ones on
reads of the PMIP’s 4 K page except those to the PMIP itself.
Signalling a machine check interrupt to the affected processor on a read
that returns all 1s as above is optional.

R1--8. For the LPAR option: The interrupt source layer must
support a means in addition to the inter-processor interrupt mechanism for
the hypervisor to signal an interrupt to any processor assigned to a
partition.

Software Note: While firmware takes all reasonable
steps to prevent it, it may be possible, on some hardware implementations,
for an OS to erroneously direct an individual IOA’s interrupt to
another partition’s processor. An OS supporting LPAR should ignore
such “Phantom” interrupts.

Hypervisor Requirements

The purpose of the hypervisor is to virtualize the facilities of
platforms, with LPAR, such that multiple copies of a LoPAR compliant OS
may simultaneously run protected from each other in different logical
partitions of the platform. That is, the various OS images may, without
explicit knowledge of each other, boot, run applications, handle exceptions
and events and terminate without affecting each other.

The hypervisor is entered by way of three interrupts: the System
Reset Interrupt, the Machine Check Interrupt and System (hypervisor) Call
Interrupt. These use hypervisor interrupt vectors 0x0100, 0x0200, and
0x0C00 respectively. In addition, a processor implementation dependent
interrupt, at its assigned address, may cause the hypervisor to be entered.
The return from the hypervisor to the OS is via the rfid (Return from
Interrupt Doubleword) instruction. The target of the rfid (instruction at
the address contained in SRR0) is either a firmware glue routine (in the
case of System Reset or Machine Check) or the instruction immediately
following the invoking hypervisor call. The reason for the firmware glue
routines is that the OS must do its own processing because of the
asynchronous nature of System Reset or Machine Check interruptions. The
firmware glue routine calls an OS registered recovery routine for the
System Reset or Machine Check condition; for further details see (reference
to recoverable machine check ACR material to be added when available). The
glue routines are registered by the partition’s OS through RTAS.
Until the glue routines are registered, the OS does not receive direct
reports of either System Reset or Machine Check interrupts but is simply
re-booted by the hypervisor. The glue routines contain a register buffer
area that the hypervisor fills with register values that the glue routine
must pass to the OS when calling the interrupt handler. The last element in
this buffer is a lock word. The lock word is set with the value of the
using processor, and reset by the glue routine just before calling the OS
interrupt handler. This way only one buffer is needed per partition rather
than one per processor.

At the invocation of the hypervisor, footprint records are generated
for recovery conditions. Machine Check and Check Stop conditions are, in
some cases, isolatable to the affected partition(s). In these cases, the
hypervisor can then prove that it was not executing changes to the global
system tables on the offending processor when the error occurred. If this
cannot be proven, the global state of the complex is in doubt and the error
cannot be contained. It is anticipated that check stops that only corrupt
the internal state of the affected processor, stop that processor only.
When the service processor subsequently notices the stopped processor it
notifies one of the other processors in the partition through a simulated
recoverable machine check. The hypervisor running on the notified processor
then takes appropriate action to log out and restart the partition, or if
there is an alternate CPU capability, then continue execution with a
substitute for the stopped processor.

The following table presents the functions supplied by the
hypervisor.

Architecture Note: Some functions performed by partition firmware (OF
and RTAS) require hypervisor assist, but those firmware implementation
dependent interfaces do not appear in this document.

Architected hcall()s

Function Name/Section | Comments
/ | Removes a PTE from the partition’s node Page Frame Table
/ | Removes up to four (4) PTEs from the partition’s node Page Frame Table
/ | Removes PTEs of a naturally aligned block of Virtual addresses from the partition’s Page Frame Table
/ | Inserts a PTE into the partition’s node Page Frame Table
/ | Reads the specified PTE from the partition’s node Page Frame Table
/ | Clears the Modified bit in the specified PTE in the partition’s node Page Frame Table
/ | Clears the Referenced bit in the specified PTE in the partition’s node Page Frame Table
/ | Sets the Page Protection and Storage Key bits in the specified PTE in the partition’s node Page Frame Table
/ | Prepares for resizing the partition's HPT
/ | Changes the partition's HPT to a new size
/ | Returns the value of the specified DMA Translation Control Entry
/ | Inserts the specified value into the specified DMA Translation Control Entry
/ | Inserts the specified value into multiple DMA Translation Control Entries
/ | Inserts a list of values into the specified range of DMA Translation Control Entries
/ | SPRG0 is architecturally a hypervisor resource. This call allows the OS to write the register.
/ | DABR is architecturally a hypervisor resource. This call allows the OS to write the register.
/ | Initializes pages in real mode either to zero or to the copied contents of another page.
/ | Manage the Extended DABR facility.
/ | Adjust implementation dependent tuning values
/ | Set implementation dependent tuning switches
/ | Returns the value contained in a cache inhibited logical address
/ | Stores a value into a cache inhibited logical address
/ | Returns up to 16 bytes of virtualized console terminal data.
/ | Sends up to 16 bytes of data to a virtualized console terminal.
/ | Gets a list of possible client Vterm IOA connections.
/ | Associates server Vterm IOA to client Vterm IOA.
/ | Breaks association between server Vterm IOA and client Vterm IOA.
/ | Returns internal hypervisor work areas for code maintenance.
/ | Clears the hash page table for a partition in preparation for a restart
/ | Generates an End Of Interrupt
/ | Sets the Processor’s Current Interrupt Priority
/ | Generates an Inter-processor Interrupt
/ | Polls for pending interrupt
/ | Accepts pending interrupt
/ | Get the ESB addresses for a LISN
/ | Assign a target and priority to a LISN
/ | Get the target and priority assigned to a LISN
/ | Get the notification management page for a LISN
/ | Set/Reset an EQ for a target and priority
/ | Get the EQ set for a target and priority
/ | Set the OS reporting cache line pair for a target
/ | Get the OS reporting cache line pair for a target
/ | Load or store operation on the ESB page for a LISN
/ | Issue the requested sync
/ | Reset interrupt state to the initial state
/ | Open a terminal session with a Vterm IOA
/ | Get data from a Vterm session
/ | Put data to a Vterm session
/ | Close an existing session with a Vterm IOA
/ | Migrates the page underneath an active DMA operation.
/ | Manages the performance monitor facility.
/ | Registers the virtual processor area that contains the virtual processor dispatch count
/ | Makes processor virtual processor cycles available for other uses (called when an OS image is idle)
/ | Causes a virtual processor’s cycles to be transferred to a specified processor. (Called by a blocked OS image to allow a lock holder to use virtual processor cycles rather than waiting for the block to clear.)
/ | Awakens a virtual processor that has ceded its cycles.
/ | Returns the partition’s virtual processor performance parameters.
/ | Sets the partition’s virtual processor performance parameters (within constraints).
/ | Returns the value of the virtual processor utilization register.
/ | Polls the hypervisor for the existence of pending work to dispatch on the calling processor.
/ | Returns the summation of the physical processor pool’s idle cycles.
/ | Register Command/Response Queue
/ | Frees the memory associated with the Command/Response Queue
/ | Controls the virtual interrupt signaling of virtual IOAs
/ | Sends a message on the Command/Response Queue
/ | Loads a Redirected Remote DMA Remote Translation Control Entry
/ | Loads a list of Redirected Remote DMA Remote Translation Control Entries
/ | Unmaps a redirected TCE that was previously built with H_PUT_RTCE or H_PUT_RTCE_INDIRECT
/ | Allows modification of LIOBN Attributes.
/ | Copies data between partitions as if by TCE mapped DMA.
/ | Write parameter data to remote DMA buffer.
/ | Read data from remote DMA buffer to return registers.
/ | Registers the partition’s logical LAN control structures with the hypervisor
/ | Releases the partition’s logical LAN control structures
/ | Adds receive buffers to the logical LAN receive buffer pool
/ | Sends a logical LAN message
/ | Controls the reception and filtering of non-broadcast multicast packets.
/ | Changes the MAC address for an ILLAN virtual IOA.
/ | Allows modifications of ILLAN Attributes.
/ | Constructs a cookie, specific to the intended client, representing a shared resource.
/ | Invalidates a cookie representing a shared resource.
/ | Maps a shared resource into the client’s logical address space
/ | Removes a shared resource from a client’s logical address space.
/ | Removes receive buffers of specified size from the logical LAN receive buffer pool.
/ | Allows the partition to manipulate or query certain virtual IOA behaviors.
/ | Join active threads and return H_CONTINUE to final calling thread
/ | Use the calling processor to perform platform operations.
/ | Transition VASI operation stream state.
/ | Return the VASI operation stream state.
/ | Reactivate a suspended CRQ.
/ | Change the page mapping characteristics of the Virtualized Real Mode Area.
/ | Returns Virtual Partition Memory pool statistics
/ | Set Memory Performance Parameters
/ | Get Memory Performance Parameters
/ | Determine Memory Overcommit Performance
/ | Register a Sub-CRQ.
/ | Free a Sub-CRQ.
/ | Send a message to a Sub-CRQ.
/ | Send a list of messages to a Sub-CRQ.
/ | Report the home node associativity for a given virtual processor
/ | Get the partition energy management parameters
/ | Returns hints for activating/releasing resource instances to achieve the best energy efficiency.
/ | Registers Subvention Notification Structure
/ | Get a random number
/ | Initiate a co-processor operation
/ | Stop a co-processor operation
/ | Get Extended Memory Performance Parameters
/ | Set Processing resource mode
/ | Search TCE table for entries within a specified range
/ | Configure Memory Usage Instrumentation
/ | Reset page age and affinity log, and/or PUT/HBA
/ | Return the page usage information for a logical address range of pages
/ | Return multiple HBA
/ | Invalidate the specified process segment from all segment lookaside buffers in the system.
/ | Invalidate the specified process table entry.
/ | Manage the virtual address translation mode including registration of a process table.
/ | Attach a process to a coherent platform function.
/ | Detach a process from a coherent platform function.
/ | Control a coherent platform function.
/ | Collect interrupt information for a coherent platform function.
/ | Control faults for a coherent platform function.
/ | Download an application to a coherent platform function.
/ | Download an application to a coherent platform facility.
/ | Control a coherent platform facility.

System Reset Interrupt

Hypervisor code saves all processor state by saving the contents of
one register in SPRG2 (SPRG1 if
ibm,nmi-register-2 was used) (multiplexing the use of
this resource with the OS). The processor’s stack and data area are
found by processing the Processor Identification Register.

R1--1. For the LPAR option: The platform must support
signalling system reset interrupts to all processors assigned to a
partition.

R1--2. For the LPAR option: The platform must support
signalling system reset interrupts individually as well as collectively
to all supported partitions.

R1--3. For the LPAR option: The system reset interrupts
signaled to one partition must not affect operations of another
partition.

R1--4. For the LPAR option: The hypervisor must intercept
all system reset interrupts.

R1--5. For the LPAR option: The platform must implement the
FWNMI option.

R1--6. For the LPAR option: The hypervisor must maintain a
count, reset when the partition’s OS, through RTAS, registers for
system reset interrupt notification, of system reset interrupts signaled
to a partition’s processor.

R1--7. For the LPAR option: Once the partition’s OS
has registered for system reset interrupt notification, the hypervisor
must forward the first and second system reset interrupts signaled to a
partition’s processor.

R1--8. For the LPAR option: The hypervisor must on the third
and all subsequent system reset interrupts signaled to a
partition’s processor invoke OF to initiate the partition’s
reboot policy.

R1--9. For the LPAR option:
The hypervisor must have the capability to receive and handle the system reset
interrupts simultaneously on multiple processors in the same or different
partitions up to the number of processors in the system.

Machine Check Interrupt

Hypervisor code saves all processor state by saving the contents of
one register in SPRG2 (SPRG1 if
ibm,nmi-register-2 was used) (multiplexing the use of
this resource with the OS). The processor’s stack and data area are
found by processing the Processor Identification Register.

The hypervisor investigates the cause of the machine check. The
cause is either a recoverable event on the current processor, or a
non-recoverable event either on the current processor or one of the other
processors in the logical partition. Also the hypervisor must determine
if the machine check may have corrupted its own internal state (by
looking at the footprints, if any, that were left in the per processor
data area of the errant processor).

R1--1. For the LPAR option: The hypervisor must have the
capability to receive and handle the machine check interrupts
simultaneously on multiple processors in the same or different partitions
up to the number of processors in the system.

Hypervisor Call Interrupt

The hypervisor call (hcall) interrupt is a special variety of the
system call instruction. The parameters to the hcall() are passed in
registers using the PA ABI definitions (Reg 3-12 for parameters). In
contrast to the PA ABI, pass by reference parameters are avoided to or
from hcall(). This minimizes the address translation problem pointer
parameters cause. Some input parameters are indexes. Output parameters,
when generated, are passed in registers 4 through 12 and require special
in-line assembler code on the part of the caller. The first parameter to
hcall() is the function token.
specifies the valid hcall()
function names and token values. Some of the hcall() functions are
optional. To indicate whether the platform is in LPAR mode, and which
functions are available on a given platform, the OF property
“ibm,hypertas-functions” is provided in
the /rtas node of the partition’s device tree. The
property is present if the platform is in LPAR mode, while its value
specifies which function sets are implemented by a given implementation.
If a platform implements any hcall() of a function set, it implements the
entire function set. Additionally, certain values of the
“ibm,hypertas-functions” property
indicate that the platform supports a given architecture extension to a
standard hcall().

The floating point registers along with the FPSCR are in general
preserved across hcall() functions, unless the “Maintain
FPRs” field of the VPA =0, see
. The general purpose registers
r0 and r3-r12, the CTR and XER registers are volatile along with the
condition register fields 0 and 1 plus 5-7. Specific hcall()s may specify
a more restricted “kill set”, refer to the specific hcall()
specification below.

R1--1. For the LPAR option: The platform’s
/rtas node must contain an
“ibm,hypertas-functions” property as
defined below.

R1--2. For the LPAR option: If a platform reports in its
“ibm,hypertas-functions” property (see
) that it supports a function set, then it
must support all hcall()s of that function set as defined in
.
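
The function set rule above lends itself to a simple presence check before an OS relies on any hcall(). The following is a minimal sketch only. It assumes the property value is encoded as a run of NUL-terminated ASCII strings (this section does not define the encoding), the helper names parse_function_sets and supports are hypothetical, and the sample property contents are illustrative rather than taken from a real platform.

```python
from typing import FrozenSet, Optional


def parse_function_sets(prop_value: Optional[bytes]) -> Optional[FrozenSet[str]]:
    """Return the function sets named in "ibm,hypertas-functions".

    None means the property is absent, i.e. the platform is not in LPAR mode.
    Assumption: the value is a run of NUL-terminated ASCII strings.
    """
    if prop_value is None:
        return None
    names = (chunk.decode("ascii") for chunk in prop_value.split(b"\0"))
    # Drop the empty string left behind by the trailing NUL terminator.
    return frozenset(name for name in names if name)


def supports(function_sets: Optional[FrozenSet[str]], name: str) -> bool:
    """True only in LPAR mode and when the named function set is reported."""
    return function_sets is not None and name in function_sets


# Illustrative property contents (hypothetical, not from a real platform).
sample = b"hcall-pft\0hcall-tce\0hcall-sprg0\0hcall-term\0"
reported = parse_function_sets(sample)
```

Because a platform that reports a function set must support every hcall() of that set, one membership test such as supports(reported, "hcall-tce") is enough before using any hcall() in the set.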
Hypervisor Call Function Table

Hypervisor Call Function Name/Section | Hypervisor Call Function Token | Hypervisor Call Performance Class | Function Mandatory? | Function Set
UNUSED | 0x0 | | |
/ | 0x4 | Critical | Yes | hcall-pft
/ | 0x8 | Critical | Yes | hcall-pft
/ | 0xC | Critical | Yes | hcall-pft
/ | 0x10 | Critical | Yes | hcall-pft
/ | 0x14 | Critical | Yes | hcall-pft
/ | 0x18 | Critical | Yes | hcall-pft
/ | 0x1C | Critical | Yes | hcall-tce
/ | 0x20 | Critical | Yes | hcall-tce
/ | 0x24 | Critical | Yes | hcall-sprg0
/ | 0x28 | Critical | Yes - if DABR exists | hcall-dabr
/ | 0x2C | Critical | Yes | hcall-copy
/ | 0x3C | Normal | Yes | hcall-debug
/ | 0x40 | Normal | Yes | hcall-debug
/ | 0x54 | Critical | Yes | hcall-term
/ | 0x58 | Critical | Yes | hcall-term
/ | 0x60 | Normal | Yes if enabled by HMC (default disabled) | hcall-dump
/ | 0x64 | Critical | Yes | hcall-interrupt
/ | 0x68 | Critical | Yes | hcall-interrupt
/ | 0x6C | Critical | Yes | hcall-interrupt
/ | 0x70 | Critical | Yes | hcall-interrupt
H_XIRR / | 0x74 | Critical | Yes | hcall-interrupt
H_XIRR-X / | 0x2FC | Critical | Yes | hcall-interrupt
/ | 0x78 | Normal | If LRDR option is implemented | hcall-migrate
/ | 0x7C | Normal | If performance monitor is implemented | hcall-perfmon
Reserved | 0x80 - 0xD8 | | |
/ | 0xDC | Normal | If SPLPAR or SLB Shadow Buffer option is implemented | hcall-splpar, SLB-Buffer
/ | 0xE0 | Critical | If SPLPAR option is implemented | hcall-splpar
/ | 0xE4 | Critical | If SPLPAR option is implemented | hcall-splpar
/ | 0xE8 | Critical | If SPLPAR option is implemented | hcall-splpar
/ | 0xEC | Normal | If SPLPAR option is implemented | hcall-splpar
/ | 0xF0 | Normal | If SPLPAR option is implemented | hcall-splpar
/ | 0xF4 | Critical | If SPLPAR option is implemented | hcall-splpar
/ | 0xF8 | Normal | If SPLPAR option is implemented | hcall-pic
/ | 0xFC | Normal | If VSCSI option is implemented | hcall-crq
/ | 0x100 | Normal | If VSCSI option is implemented | hcall-crq
/ | 0x104 | Critical | If either the VSCSI or logical LAN option is implemented | hcall-vio
/ | 0x108 | Critical | If VSCSI option is implemented | hcall-crq
/ | 0x10C | Critical | If VSCSI option is implemented | hcall-rdma
/ | 0x110 | Critical | If VSCSI option is implemented | hcall-rdma
/ | 0x114 | Normal | If logical LAN option is implemented | hcall-lLAN
/ | 0x118 | Normal | If logical LAN option is implemented | hcall-lLAN
/ | 0x11C | Critical | If logical LAN option is implemented | hcall-lLAN
/ | 0x120 | Critical | If logical LAN option is implemented | hcall-lLAN
/ | 0x124 | Critical | New designs as of 01/01/2003 | hcall-bulk
/ | 0x128 | Critical | If VSCSI option is implemented | hcall-rdma
/ | 0x12C | Critical | If VSCSI option is implemented | hcall-rdma
/ | 0x130 | Critical | If logical LAN option is implemented | hcall-lLAN
/ | 0x134 | Normal | If Extended DABR option is implemented | hcall-xdabr
/ | 0x138 | Critical | | hcall-multi-tce
/ | 0x13C | Critical | | hcall-multi-tce
/ | 0x140 | Critical | | hcall-multi-tce
Reserved | 0x144 | | |
Reserved | 0x148 | | |
/ | 0x14C | Normal | If Logical LAN option is implemented | hcall-ILAN
/ | 0x150 | Normal | If the Server Vterm option is implemented | hcall-vty
/ | 0x154 | Normal | If the Server Vterm option is implemented | hcall-vty
/ | 0x158 | Normal | If the Server Vterm option is implemented | hcall-vty
/ | 0x1C4 | Normal | If Shared Logical Resource option is implemented | hcall-slr
/ | 0x1C8 | Normal | If Shared Logical Resource option is implemented | hcall-slr
/ | 0x1CC | Normal | If Shared Logical Resource option is implemented | hcall-slr
/ | 0x1D0 | Normal | If Shared Logical Resource option is implemented | hcall-slr
/ | 0x1D4 | Critical | If logical LAN option is implemented | hcall-lLAN
/ | 0x1D8 | Critical | If SPLPAR option is implemented | hcall-poll-pending
Reserved | 0x1DC - 0x1E0 | Varies | |
Reserved | 0x1E8 - 0x1EC | | |
Reserved | 0x1F0 - 0x23C | Varies | |
/ | 0x240 | Normal | If LIOBN Attributes are implemented | hcall-liobn-attributes
/ | 0x244 | Normal | If ILLAN Checksum Offload Support is implemented; if ILLAN Backup Trunk Adapter option is implemented | hcall-illan-options
Reserved | 0x248 | | |
/ | 0x24C | Critical | If H_PUT_RTCE is implemented | hcall-rdma
Reserved | 0x27C | | |
Reserved | 0x280 | | |
Reserved | 0x28C-0x294 | | |
/ | 0x298 | Normal | If Thread Join option is implemented | hcall-join
/ | 0x29C | Normal | If VASI option is implemented | hcall-vasi
/ | 0x2A0 | Normal | If VASI option is implemented | hcall-vasi
/ | 0x2A4 | Normal | If VASI option is implemented | hcall-vasi
/ | 0x2A8 | Normal | If any virtual I/O options are implemented | hcall-vioctl
/ | 0x2AC | Normal | If the VRMA option is implemented | hcall-vrma
/ | 0x2B0 | Continued | If partition suspension option is implemented | hcall-suspend
Reserved | 0x2B4 | | |
/ | 0x2B8 | Normal | If the Partition Energy Management Option is implemented | hcall-get-emparm
/ | 0x2BC | Normal | If the Cooperative Memory Over-commitment Option is implemented | hcall-cmo
/ | 0x2D0 | Normal | If the Cooperative Memory Over-commitment Option is implemented | hcall-cmo
/ | 0x2D4 | Normal | If the Cooperative Memory Over-commitment Option is implemented | hcall-cmo
/ | 0x2D8 | Normal | If the Cooperative Memory Over-commitment Option is implemented and the calling partition is authorized | hcall-cmo
/ | 0x2DC | Normal | If the Subordinate CRQ Option is implemented | hcall-sub-crq
/ | 0x2E0 | Normal | If the Subordinate CRQ Option is implemented | hcall-sub-crq
/ | 0x2E4 | Normal | If the Subordinate CRQ Option is implemented | hcall-sub-crq
/ | 0x2E8 | Normal | If the Subordinate CRQ Option is implemented | hcall-sub-crq
/ | 0x2EC | Normal | If the VPHN Option is implemented | hcall-vphn
Reserved | 0x2F0 | | |
/ | 0x2F4 | Normal | If the Partition Energy Management Option is implemented | hcall-best-energy-1<list> (see the note on <list> below the table)
/ | 0x2F8 | Normal | If the Expropriation Subvention Notification Option is implemented | hcall-esn
H_XIRR-X / | 0x2FC | Critical | Yes | hcall-interrupt
/ | 0x300 | Normal | If a random number generator Platform Facilities Option is implemented | hcall-random
Reserved | 0x310 | | |
/ | 0x304 | Normal | If one or more Coprocessor Platform Facilities Options are implemented | hcall-cop
/ | 0x308 | Normal | If one or more Coprocessor Platform Facilities Options are implemented | hcall-cop
/ | 0x314 | Normal | If the Extended Cooperative Memory Overcommitment Option is implemented | hcall-cmo-x
/ | 0x31C | Normal | If the platform supports POWER ISA version 2.07 or higher | hcall-set-mode
Reserved | 0x320 | | |
/ | 0x324 | Normal | If the platform implements the LRDR option at LoPAR Version 2.7 or higher | hcall-xlates-limited
/ | 0x328 | Critical | If the platform supports the "block invalidate" option | hcall-block-remove
/ | 0x32C | Normal | If the Memory Usage Instrumentation Option is implemented | hcall-mui
/ | 0x330 | Normal | If the Memory Usage Instrumentation Option is implemented | hcall-mui
/ | 0x334 | Normal | If the Memory Usage Instrumentation Option is implemented | hcall-mui
/ | 0x338 | Normal | If the Memory Usage Instrumentation Option is implemented | hcall-mui
/ | 0x33C | Normal | If the platform implements the LRDR option at LoPAR Version 2.8 or higher | hcall-implementation-dependent-tuning
/ | 0x340 | Normal | If the platform implements the LRDR option at LoPAR Version 2.8 or higher | hcall-implementation-dependent-tuning
/ | 0x344 | Normal | If one or more Coherent Platform Facilities Options are implemented | hcall-ca
/ | 0x348 | Normal | If one or more Coherent Platform Facilities Options are implemented | hcall-ca
/ | 0x34C | Normal | If one or more Coherent Platform Facilities Options are implemented | hcall-ca
/ | 0x350 | Normal | If one or more Coherent Platform Facilities Options are implemented | hcall-ca
/ | 0x354 | Normal | If one or more Coherent Platform Facilities Options are implemented | hcall-ca
/ | 0x358 | Terminal | For LoPAR Version 2.8 and higher | hcall-clr-hpt
/ | 0x35C | Normal | If one or more Coherent Platform Facilities Options are implemented | hcall-ca
Reserved | 0x360 | | |
/ | 0x364 | Normal | If one or more Coherent Platform Facilities Options are implemented | hcall-ca
/ | 0x368 | Normal | If one or more Coherent Platform Facilities Options are implemented | hcall-ca
/ | 0x36C | Normal | If the platform supports the Hash Page Table Resize Option | hcall-hpt-resize
/ | 0x370 | Normal | If the platform supports the Hash Page Table Resize Option | hcall-hpt-resize
/ | 0x374 | Normal | If the platform supports the In-Memory Table Translation Option | hcall-imtt
/ | 0x378 | Normal | If the platform supports the In-Memory Table Translation Option | hcall-imtt
/ | 0x37C | Normal | If the platform supports the In-Memory Table Translation Option | hcall-imtt
/ | 0x3A8 | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x3AC | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x3B0 | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x3B4 | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x3B8 | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x3BC | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x3C0 | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x3C4 | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x3C8 | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x3CC | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x3D0 | Normal | If the OS enabled XIVE exploitation | hcall-int-exploitation
/ | 0x408 | Normal | If VSM is implemented | hcall-vsm
/ | 0x40C | Critical | If VSM is implemented | hcall-vsm
/ | 0x410 | Critical | If VSM is implemented | hcall-vsm
/ | 0x414 | Normal | If VSM is implemented | hcall-vsm
hcalls to support an Ultravisor | 0xEF00 - 0xEF80 | | |
Reserved for platform-dependent hcall()s / | 0xF000 - 0xFFFC | | |
ILLEGAL | Any token value having a one in either of the low order two bits | | |
Reserved | 0x380 - 0xEEFF and 0x10000 - 0xFFFFFFFF-FFFFFFFC | | |

The <list> suffix for hcall-best-energy indicates an optional
dash-delimited series (may be null) of supported resource codes encoded as
ASCII decimal values, in addition to the minimal support value of 1 for
processors; other values are defined in
.

RTAS implementations may assign values in these ranges to their own
internal interfaces, as long as they are prepared for the
growth of architected functions into this range.
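The ILLEGAL row of the table, together with the single-mask validation mentioned in the firmware implementation note, can be illustrated with a minimal sketch. The branch-table span `TABLE_LIMIT` is an assumption chosen for illustration (it must be a power of two for the mask to work), and the helper names are hypothetical; nothing here is prescribed by this architecture.

```python
# Sketch only: TABLE_LIMIT is an assumed branch-table span, not an architected value.
TABLE_LIMIT = 0x1000              # hypothetical: table covers tokens 0x0 .. 0xFFC
VALID_BITS = TABLE_LIMIT - 4      # 0xFFC: the only bits a dispatchable token may use
REJECT_MASK = ~VALID_BITS         # any bit outside VALID_BITS rejects the token


def token_is_illegal(token: int) -> bool:
    """ILLEGAL per the table: a one in either of the low order two bits."""
    return (token & 0x3) != 0


def branch_table_index(token: int):
    """Return the branch-table slot for a token, or None if the mask test fails.

    One AND rejects both misaligned tokens and tokens beyond the table.
    """
    if token & REJECT_MASK:
        return None
    return token >> 2             # tokens are multiples of 4, so slots are dense
```

A slot returned by `branch_table_index` may still belong to an unimplemented code point; as the note says, the branch-table entry itself is expected to handle that case.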
Firmware Implementation Note: The assignment of function tokens is
designed such that a single mask operation can validate that the value is
within the range of a reasonable size branch table. Entries within the
branch table can handle unimplemented code points.

The hypervisor routines are optimized for execution speed. In some
rare cases, locks are taken, and specific hardware designs require short
wait loops. However, if a needed resource is truly busy, or processing is
required by an agent, the hypervisor returns to the caller, either to
have the function retried or continued at a later time. The Performance
Class establishes specific performance requirements against each specific
hcall() function as defined below.

Hypervisor Call Performance Classes:

Critical: Must make continuous forward progress. Encountering any
busy resource must cause the function to back out and return with a
“hardware busy” return code. When subsequently called, the
operation begins again. Short loops for lwarx and stwcx to acquire an
apparently unheld lock are allowed. These functions may not include wait
loops for slow hardware access.

Normal: Similar to critical; however, wait loops for slow hardware
access are allowed. These functions may not include wait loops for an
agent such as an external micro-processor or message transmission
device.

Continued: This class of functions is expected to serialize on the
use of external agents. If the external agent is busy, the function
returns “hardware busy”. If the interface to the external
agent is not busy, the interface is marked busy and used to start the
function. The function returns one of the “function in
progress” return codes. Later, the caller may check on the
completion of the function by issuing the “check” Hcall
function with the “function in progress” parameter code. If
the function completed properly, the hypervisor maintains no status and
the “check” Hcall returns success. If the operation is still
in process, the same “function in progress” code is returned.
If the function completed in error, the completion error code is
returned. The hypervisor maintains room for at least one outstanding
error status per external agent interface per processor. If there is no
room to record the error status, the hypervisor returns “hardware
busy” and does not start the function.

Terminal: This class of functions is used to manage a partition when
the OS is not in regular operation. These events include postmortems and
extensive recoveries.

The hypervisor performance classes are ordered in decreasing
restriction.

R1--3. For the LPAR option: The caller must perform properly
given that the hypervisor meets the performance class specified.

R1--4. For the LPAR option: The hypervisor implementation
must meet the specified performance class or higher.

R1--5. For the LPAR option: Platform hardware designs must
take the allowable performance classes into account when choosing the
hardware access technology for the various facilities.

R1--6. For the LPAR option: The hypervisor must have the
capability to receive and handle the hypervisor call interrupts
simultaneously on multiple processors in the same or different partitions
up to the number of processors in the system.

R1--7. For the LPAR option: The hypervisor must check the
state of the MSR register bits that are not set to a specific value by
the processor hardware during the invoking interrupt per
.
MSR State on Entrance to Hypervisor

MSR Bit | Required State | Error-Code
HV - Hypervisor | 1 | None
Bits 2,4:46, 57, and 60 Reserved | Set to 0 by Hardware | None
ILE - Interrupt Little Endian | As last set by the hypervisor | None
ME - Machine check Enable | As last set by the hypervisor | None
LE - Little-Endian Mode | 0 forced by ILE | None
R1--8. For the LPAR option: The Hcall() flags field must
meet the definition in:
; the hypervisor may safely
ignore flag field values not explicitly defined by the specific hcall()
semantic.

R1--9. For the LPAR option: The platform must ensure that
flag field values not defined for a specific hcall() do not compromise
partitioning integrity.

R1--10. For the LPAR option: Implementations that normally
choose to ignore invalid flag field values must provide a “debug
mode” that does check for invalid flag field values and returns
H_Parameter when any are found.

Architecture Note: The method for invocation of a
platform’s “debug mode” is beyond the scope of this
architecture.
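The interaction between ignoring undefined flag values and the required “debug mode” can be sketched as follows. This is an illustration under assumptions: `defined_mask` (the set of flag bits a given hcall() defines) and the `debug_mode` switch are hypothetical names, and the flags field is assumed to be a 64-bit value. The H_Parameter value of -4 is taken from the return code table.

```python
H_SUCCESS = 0
H_PARAMETER = -4                  # H_Parameter, per the return code table
FLAGS_WIDTH_MASK = (1 << 64) - 1  # assumption: flags is a 64-bit field


def check_flags(flags: int, defined_mask: int, debug_mode: bool) -> int:
    """Hypothetical flag check for a single hcall().

    Normal mode: undefined flag bits are ignored.
    Debug mode: any undefined flag bit yields H_Parameter.
    """
    undefined = flags & ~defined_mask & FLAGS_WIDTH_MASK
    if debug_mode and undefined:
        return H_PARAMETER
    return H_SUCCESS
```

Ignoring a bit in normal mode is only acceptable when that bit cannot compromise partitioning integrity, which is a property of the platform and not something this check establishes.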
Page Frame Table Access flags field definition

Bit | Function
0-15 | NUMA CEC Cookie
16-23 | Subfunction Codes
24 | Exact
25 | R-XLATE
26 | READ-4
27 | Reserved
28-31 | CMO Option Flags
32 | AVPN
33 | andcond
34-39 | Reserved
40 | I-Cache-Invalidate
41 | I-Cache-Synchronize
42 | CC (Coalesce Candidate)
43-46 | MUI Options
47 | Reserved
48 | Zero Page
49 | Copy Page
50-54 | key0-key4
55 | pp0
56 | Compression
57 | Checksum
58-60 | Reserved
61 | N
62 | pp1
63 | pp2

Bits 50-54 (key0 - key4) shall be treated as reserved
on platforms that either do not contain an
“ibm,processor-storage-keys” property,
or contain an
“ibm,processor-storage-keys” property
with the value of zero in both cells.

Bit 55 (pp0) shall be treated as reserved on platforms
that do not have the “Support for the “110”
value of the Page Protection (PP) bits” bit set to a
value of 1 in the
“ibm,pa-features” property.

Bits 43-46 (MUI Options) detail is provided in
.
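Because the flags table numbers bits from the high-order end (the NUMA CEC Cookie in bits 0-15 occupies the high order 16 bits of the flags parameter), converting a table bit number into a mask is a common source of error. A minimal sketch follows, assuming a 64-bit flags value; the helper and constant names are illustrative and not defined by this architecture.

```python
def flag_mask(bit: int) -> int:
    """Mask for one flag bit, where bit 0 is the high-order bit of 64."""
    if not 0 <= bit <= 63:
        raise ValueError("flag bit must be in 0..63")
    return 1 << (63 - bit)


def field_mask(first: int, last: int) -> int:
    """Mask for an inclusive bit range such as 0-15 or 50-54."""
    if not 0 <= first <= last <= 63:
        raise ValueError("bit range must satisfy 0 <= first <= last <= 63")
    width = last - first + 1
    return ((1 << width) - 1) << (63 - last)


# Illustrative constants derived from the table (names are hypothetical).
FLAG_EXACT = flag_mask(24)
FLAG_R_XLATE = flag_mask(25)
FLAG_AVPN = flag_mask(32)
FLAG_ANDCOND = flag_mask(33)
NUMA_COOKIE_FIELD = field_mask(0, 15)


def with_numa_cookie(flags: int, cookie: int) -> int:
    """Place a 16-bit NUMA CEC Cookie in bits 0-15 of the flags value."""
    if not 0 <= cookie <= 0xFFFF:
        raise ValueError("cookie must fit in 16 bits")
    return (flags & ~NUMA_COOKIE_FIELD) | (cookie << 48)
```

For example, `with_numa_cookie(FLAG_AVPN, 0x12)` sets the AVPN flag and selects cookie 0x12 in a single flags value.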
R1--11. For the LPAR option: The caller of Hcall must be in
privileged mode (MSR[PR] = 0) or the hypervisor immediately returns an
H_Privilege return code. See
for this and other architected
return codes.

R1--12. For the LPAR option: The caller of hcall() must be
prepared for a return code of H_Hardware from all functions.

R1--13. For the LPAR option:
In order for the platform to return H_Hardware, the error must not have resulted in an
undetectable state/data corruption nor will continued operation propagate
an undetectable state/data corruption as a result of the original
error.

Notes:

A detectable corruption, when accessed, results in either a
H_Hardware return code, machine check or check stop per platform
policy.

Among other implications of Requirement
are: the effective state of the
partition appears to not change due to the failed hcall() -- (any partial
changes to persistent state/data are backed out); and the recovery of
platform resources that held lost state/data does not hide the state/data
loss to subsequent users of that state/data.

The operating system is not expected to log a serviceable event
due to an H_Hardware return code from an hcall(), and treats the hcall()
as failing due to nonspecific hardware reasons. Any logging of a
serviceable event in response to the underlying cause is handled by
separate platform initiated operations.
Hypervisor Call Return Code Table

Hypervisor Call Return Code Values (R3) | Meaning
0x0100000 - 0x0FFFFFFF | Function in Progress
9905 | H_LongBusyOrder100sec - Similar to LongBusyOrder1msec, but the hint is 100 second wait this time.
9904 | H_LongBusyOrder10sec - Similar to LongBusyOrder1msec, but the hint is 10 second wait this time.
9903 | H_LongBusyOrder1Sec - Similar to LongBusyOrder1msec, but the hint is 1 second wait this time.
9902 | H_LongBusyOrder100mSec - Similar to LongBusyOrder1msec, but the hint is 100mSec wait this time.
9901 | H_LongBusyOrder10mSec - Similar to LongBusyOrder1msec, but the hint is 10mSec wait this time.
9900 | H_LongBusyOrder1msec - This return code is identical to H_Busy, but with the addition of a hint to the partition OS. If the partition OS can delay for 1 millisecond, the hcall will likely succeed on a new hcall with no further busy return codes. If the partition OS cannot handle a delay, it is free to retry immediately.
18 | H_CONTINUE
17 | H_PENDING
16 | H_PARTIAL_STORE
15 | H_PAGE_REGISTERED
14 | H_IN_PROGRESS
13 | Sensor value >= Critical high
12 | Sensor value >= Warning high
11 | Sensor value normal
10 | Sensor value <= Warning low
9 | Sensor value <= Critical low
5 | H_PARTIAL (The request completed only partially successful. Parameters were valid but some specific hcall function condition prevented fully completing the architected function, see the specific hcall definition for possible reasons.)
4 | H_Constrained (The request called for resources in excess of the maximum allowed. The resultant allocation was constrained to maximum allowed)
3 | H_NOT_AVAILABLE
2 | H_Closed (virtual I/O connection is closed)
1 | H_Busy (Hardware Busy -- Retry Later)
0 | H_Success
-1 | H_Hardware (Error)
-2 | H_Function (Not Supported)
-3 | H_Privilege (Caller not in privileged mode)
-4 | H_Parameter (Outside Valid Range for Partition or conflicting)
-5 | bad_mode (Illegal MSR value)
-6 | H_PTEG_FULL (The requested pteg was full)
-7 | H_Not_Found (The requested entity was not found)
-8 | H_RESERVED_DABR (The requested address is reserved by the hypervisor on this processor)
-9 | H_NOMEM
-10 | H_AUTHORITY (The caller did not have authority to perform the function)
-11 | H_Permission (The mapping specified by the request does not allow for the requested transfer)
-12 | H_Dropped (One or more packets could not be delivered to their requested destinations)
-13 | H_S_Parm (The source parameter is illegal)
-14 | H_D_Parm (The destination parameter is illegal)
-15 | H_R_Parm (The remote TCE mapping is illegal)
-16 | H_Resource (One or more required resources are in use)
-17 | H_ADAPTER_PARM (invalid adapter)
-18 | H_RH_PARM (resource not valid or logical partition conflicting)
-19 | H_RCQ_PARM (RCQ not valid or logical partition conflicting)
-20 | H_SCQ_PARM (SCQ not valid or logical partition conflicting)
-21 | H_EQ_PARM (EQ not valid or logical partition conflicting)
-22 | H_RT_PARM (invalid resource type)
-23 | H_ST_PARM (invalid service type)
-24 | H_SIGT_PARM (invalid signalling type)
-25 | H_TOKEN_PARM (invalid token)
-27 | H_MLENGTH_PARM (invalid memory length)
-28 | H_MEM_PARM (invalid memory I/O virtual address)
-29 | H_MEM_ACCESS_PARM (invalid memory access control)
-30 | H_ATTR_PARM (invalid attribute value)
-31 | H_PORT_PARM (invalid port number)
-32 | H_MCG_PARM (invalid multicast group)
-33 | H_VL_PARM (invalid virtual lane)
-34 | H_TSIZE_PARM (invalid trace size)
-35 | H_TRACE_PARM (invalid trace buffer)
-36 | H_TRACE_PARM (invalid trace buffer)
-37 | H_MASK_PARM (invalid mask value)
-38 | H_MCG_FULL (multicast attachments exceeded)
-39 | H_ALIAS_EXIST (alias QP already defined)
-40 | H_P_COUNTER (invalid counter specification)
-41 | H_TABLE_FULL (resource page table full)
-42 | H_ALT_TABLE (alternate table already exists / alternate page table not available)
-43 | H_MR_CONDITION (invalid memory region condition)
-44 | H_NOT_ENOUGH_RESOURCES (insufficient resources)
-45 | H_R_STATE (invalid resource state condition or sequencing error)
-46 | H_RESCINDED
-54 | H_Aborted
-55 | H_P2
-56 | H_P3
-57 | H_P4
-58 | H_P5
-59 | H_P6
-60 | H_P7
-61 | H_P8
-62 | H_P9
-63 | H_NOOP
-64 | H_TOO_BIG
-65 | Reserved
-66 | Reserved
-67 | H_UNSUPPORTED (Parameter value outside of the range supported by this implementation)
-68 | H_OVERLAP (unsupported overlap among passed buffer areas)
-69 | H_INTERRUPT (Interrupt specification is invalid)
-70 | H_BAD_DATA (uncorrectable data error)
-71 | H_NOT_ACTIVE (Not associated with an active operation)
-72 | H_SG_LIST (A scatter/gather list element is invalid)
-73 | H_OP_MODE (There is a conflict between the subcommand and the requested operation notification)
-74 | H_COP_HW (co-processor hardware error)
-75 | H_STATE (invalid state)
-76 | H_RESERVED (a reserved value was specified)
-77 | H_IN_USE (a specified resource is already in use)
-78 : -255 | Reserved
-256 -- -511 | H_UNSUPPORTED_FLAG (An unsupported binary flag bit was specified. The returned value is -256 - the bit position of the unsupported flag bit [high order flag bit is 0 etc.])
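Several entries in the return code table encode information in the value itself: the H_LongBusyOrder codes carry a delay hint that grows by a factor of ten per code, and H_UNSUPPORTED_FLAG carries the offending flag bit position. A minimal decoding sketch follows; the function names and the category labels returned by `classify` are illustrative choices, not architected terms.

```python
H_SUCCESS = 0
H_BUSY = 1
LONG_BUSY_FIRST = 9900            # H_LongBusyOrder1msec
LONG_BUSY_LAST = 9905             # H_LongBusyOrder100sec
IN_PROGRESS_FIRST = 0x0100000
IN_PROGRESS_LAST = 0x0FFFFFFF


def long_busy_hint_ms(rc: int):
    """Suggested delay in milliseconds for an H_LongBusyOrder code, else None."""
    if LONG_BUSY_FIRST <= rc <= LONG_BUSY_LAST:
        return 10 ** (rc - LONG_BUSY_FIRST)   # 1 ms, 10 ms, ... 100 s
    return None


def unsupported_flag_bit(rc: int):
    """Flag bit position encoded in an H_UNSUPPORTED_FLAG code, else None.

    The table defines the value as -256 minus the bit position.
    """
    if -511 <= rc <= -256:
        return -256 - rc
    return None


def classify(rc: int) -> str:
    """Coarse, illustrative grouping of a return code from R3."""
    if rc == H_SUCCESS:
        return "success"
    if rc == H_BUSY or long_busy_hint_ms(rc) is not None:
        return "busy"
    if IN_PROGRESS_FIRST <= rc <= IN_PROGRESS_LAST:
        return "in-progress"
    if rc < 0:
        return "error"
    return "informational"
```

As a cross-check against the text, the Compression flag is bit 56 of the flags field, so `unsupported_flag_bit(-312)` yields 56, matching the value -312 cited for H_UNSUPPORTED_FLAG in the H_REMOVE semantics.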
Hypervisor Call Functions

Page Frame Table Access

All hypervisor Page Frame Table (PFT) access routines are called
using 64 bit linkage conventions and apply to all page sizes that the
platform supports as specified by the
“ibm,processor-page-sizes” property. (See
for more details.)
The Page actual size is encoded in the PFT entry
per the architecture Book IIIs along with the
segment base page size per the
Book IVa.
The hypervisor PFT
access functions carefully update a given Page Table Entry (PTE) with at
least 64 bit store operations since an invalid update sequence could
result in machine checks. To guard against multiple conflicting
allocations of a PTE that could result in a check stop condition, the
hypervisor PTE allocation routine (H_ENTER) reserves the first two (high
order) software PTE bits for use as PTE locks while the low order two
software PTE bits are reserved for OS use (not used by firmware). If a
firmware PTE bit is on, the OS is to assume that the PTE is in use, just
as if the V bit were on. The hypervisor PFT access routines often execute
the tlbie instruction. On certain platforms, this instruction may only be
executed by one processor in a partition at a time; the hypervisor uses
locks to ensure this. The tlbie instruction flushes a specific translation
lookaside buffer (TLB) entry from all processors participating in the
protocol. All the processors participating in the tlbie protocol are
defined as a translation domain. All processors of a given partition that
are in a given translation domain share the same hardware PFT. Book III
of the PA specifies the code sequences needed to safely access the PFT,
in its chapter titled “Storage Control Instructions and Table
Updates”. These code sequences are part of this specification by
reference. The hypervisor PFT access routines are in the critical
performance path of the machine, therefore, extraordinary care must be
given to their performance, including machine dependent coding, minimal
run time checking, and code path length optimization. For performance
reasons, all parameter linkage is through registers, and no indirect
parameter linkage is allowed. This requires special glue code on the part
of the caller to pick up the return parameters. The hypervisor PFT access
routines modify the calling processor’s partition PFT on the
calling node. On NUMA systems, if an LPAR partition spans multiple
Central Electronics Complexes (CECs), the partition’s processors
may be in separate translation domains. Each platform translation domain
has a separate PFT. Therefore, the partition’s OS must modify each
PFT individually. This is done either by making hcall() accesses
specifying the NUMA CEC Cookie (which identifies the translation domain)
in the high order 16 bits of the flags parameter (H_ENTER and H_READ
only) or by issuing the hcall() from a processor within the translation
domain as identified by the processor’s NUMA CEC Cookie field of
the
“ibm,pft-size” property.The PFT is preallocated based upon the value of the
partition’s PFT_size configuration variable. This configuration
variable is initialized to 4 PTEs per node local page frame and 2 PTEs
per remote node page frame. The size of the PFT per node is communicated
to the partition’s OS image via the
“ibm,pft-size” property of the
node.The value of the configuration variable
PFT_size consists of two comma separated integers,
the first is the number of hardware PFT entries to allocate per CEC local
page, and the second is the number of hardware PFT entries to allocate
per remote CEC page (if NUMA configured). These allocations are made at
partition boot time based upon the initial partition memory allocation.
In specific situations (such as low page table usage or a future
need for dynamic memory addition), the OS may wish to override the
platform default values.

R1--1. For the LPAR option: The platform must allocate the
partition’s page frame table. The size of this table is determined
by the PFT_size configuration variable in the OS image’s
“common” NVRAM partition.R1--2.For the LPAR option: The platform must provide the
“ibm,pft-size” property in the processor
nodes of the device tree (children of type
cpu of the
/cpus node).

Register Linkage (For hcall() tokens 0x04 - 0x18)

On Call:
R3 function call token
R4 flags (see )
R5 Page Table Entry Index (PTEX)
R6 Page Table Entry High word (PTEH) (on H_ENTER only)
R7 Page Table Entry Low word (PTEL) (on H_ENTER only)

On Return:
R3 Status Word
R4 chosen PTEX (from H_ENTER) / High Order Half of old PTE
R5 Low Order Half of old PTE
R6

Semantic checks for all hypervisor PTE access routines:

- Hypervisor checks that the caller was in privileged mode, else
  H_Privilege return code.
- On NUMA platforms for the H_ENTER and H_READ calls only, the
  hypervisor checks that the NUMA CEC Cookie is within the range of values
  assigned to the partition, else return H_Parameter.
- Hypervisor checks that the PTEX is zero or greater and less than
  the partition maximum, else H_Parameter return code.
- Hypervisor checks the logical address contained in any PTE to be
  entered into the PFT to ensure that it is valid and then translates the
  logical address into the assigned physical address.
- When hypervisor returns the contents of a PTE, the contents of
  the RPN are usually architecturally undefined. It is expected that
  hypervisor implementations leave the contents of this field as it was
  read from the PTE since it cannot be used by the OS to directly access
  real memory. The exception to this rule is when the R-XLATE flag is
  specified to the H_READ hcall(); then the RPN in the PTE is reverse
  translated into the LPN prior to return.

Logical addressing:

LPAR adds another level of virtual address translation managed by
the hypervisor. The OS is never allowed to use the physical address of
its memory; this includes System Memory, MMIO space, NVRAM, etc. The OS
sees System Memory as N regions of contiguous logical memory. Each
logical region is mapped by the hypervisor into a corresponding block of
contiguous physical memory on a specific node. All regions on a specific
system are the same size though different systems with different amount
of memory may have different region sizes since they are the quantum of
memory allocation to partitions. That is, partitions are granted memory
in region size chunks and if a partition’s OS gives up memory, it
is in units of a full region. On NUMA platforms, groups of regions may be
associated with groups of processors forming logical CECs for allocation
and migration purposes.Logical addresses are divided into two fields, the logical region
identifier and the region offset. The region offset is the low order bits
needed to represent the region size. The logical region identifier is
the remaining high order bits.

Logical addresses start at zero. When control is initially passed
to the OS from the platform, the first region is the single RMA. The
first region has logical region identifier of zero. This first region is
specified by the first address - length pair of the
“reg” property of the
/memory node of the OF device tree. Subsequent
regions each have their own address - length pair. At initial program
load time, the logical region identifiers are sequential starting at zero
but over time, with dynamic memory reconfiguration, holes may appear in
the partition’s address space.

Logical to physical translation: This translation is based upon a
simple indexed table per partition of the physical addresses associated
with the start of each region (in logical region identifier order). At
least two special values are recognized:

- The invalid value for those regions that do not have a physical
  mapping (so that there can be holes in the logical address map for
  various reasons such as memory expansion).
- The I/O region value, which calls for further checking against
  partition I/O address range allocations.

- The logical region identifier is checked for being less than the
  maximum size, and then used to index the logical to physical translation
  table.
- If the physical region identifier is valid (certain values are
  reserved, say 0 and all F’s) then it replaces the logical region
  identifier in the PTE and the PTE access function continues.
- If the physical region identifier is the I/O region, then proceed
  to the I/O translation algorithm (implementation dependent based upon
  platform characteristics).
- If the physical region identifier is invalid, return
  H_Parameter.

R1--3. For the LPAR option: The OS must make no assumptions
about the logical to physical mapping other than the low order
bits.R1--4.For the LPAR option: Each logical region must have
its own address - length pair in the
“reg” property of the OF
/memory node.R1--5.For the LPAR option: When control is initially passed
to the OS from the platform, the first logical region (having logical
region identifier 0) must be the region accessed when the OS operates
with translate off.R1--6.For the LPAR option: When control is initially passed
to the OS from the platform, the size of the logical region must be equal
to a real mode length size supported by the platform.R1--7.For the LPAR option:
Each logical region must start and end on a boundary of the largest page size that the
logical region supports (see
“ibm,dynamic-memory” and
“ibm,lmb-page-sizes” in
as well as
for more details).R1--8.For the LPAR option: The pages that contain the
platform’s per processor interrupt management areas or any other
device marked
“used-by-rtas” must not be mapped into
the partition virtual address space.R1--9.For the LPAR option:
Each logical region must support all page sizes presented in the
“ibm,processor-page-sizes” property in
that are less than or equal to the
size of the logical region as specified by either the OF standard
“reg” property of the logical
region’s OF
/memory node, or the
“ibm,lmb-size” property of the logical
region’s
/ibm,dynamic-reconfiguration-memory node in
.Implementation Note: 32 bit versions of AIX only support 36 bit
logical address memory spaces. Providing such a partition with a larger
logical memory address space may cause OS failures.Implementation Note: Requirement
may be met by ensuring that all
logical regions start and end on a boundary of the largest page size
supported by the platform.

H_REMOVE

This hcall is for invalidating an entry in the page table. The PTEX
identifies a specific page table entry. If the PFO option is implemented,
an optional flag causes the hypervisor to compress the page contents to
one or more data blocks after invalidating the page table entry, given
that a compression coprocessor is available and the page is small enough
to be synchronously compressed. If the compression coprocessor is busy,
or the page is too large, the compression can be subsequently performed
using the H_COP_OP hcall() see
. If the page contents are
compressed, then a checksum may be appended by setting the checksum flag;
if the compression flag is not set, the checksum flag is
ignored.

Syntax:

Parameters:

- flags: AVPN, andcond, and for the CMO option: CMO Option flags as
  defined in
  and for the PFO option the
  compression and checksum flags.
- PTEX (index of the PTE in the page table to be used)
- AVPN: Optional “Abbreviated Virtual Page Number” --
  used as a check for the correct PTE.
  When the AVPN flag is set, the contents of the AVPN parameter are
  compared to the first double word of the PTE (after bits 57-63 of the PTE
  have been masked). Note, the low order 7 bits are undefined and should be
  zero; otherwise the likely result is a return code of H_Not_Found.
  When the andcond flag is set, the contents of the AVPN parameter
  are bit anded with the first double word of the PTE. If the result is
  non-zero the return code is H_Not_Found.
- out: For the PFO option, the output data block logical real
  address when the compression flag bit is on.
- outlen: For the PFO option, the length of the compression data
  block or compression data block descriptor list when the compression flag
  bit is on.

Semantics:

- Check that the PTEX accesses within the PFT else return
  H_Parameter
- If the AVPN flag is set, and the AVPN parameter bits 0-56 do not
  match that of the specified PTE then return H_Not_Found.
- If the andcond flag is set, the AVPN parameter is bit anded with
  the first double word of the specified PTE; if the result is non-zero,
  then return H_Not_Found.
- The hypervisor synchronizes the PTE specified by the PTEX and
  returns its value
- Use the architected “Deleting a Page Table Entry”
  sequence such that the first double word of the resultant PFT entry is
  all 0s.
- Use the proper tlbie instruction for the page size within a
  critical section protected by the proper lock (per large page bit in the
  specified PTE).
- The synchronized value of the old PTE value ends up in R4 and R5
  for return to the caller.
- For the CMO option: set the page usage state per the CMO Option
  flags field of the flags parameter as defined in
  .
- For the PFO option: If the Compression flag is on:
  - Check that the calling partition is authorized to use the
    compression co-processor else return H_Function.
  - If the page is not “main store memory” then return
    H_UNSUPPORTED_FLAG TBD (value - 312)
  - Check that the page size is <= the compression value in
    “ibm,max-sync-cop” else return
    H_Constrained.
  - Build CRB for compression of the page size indicated in the
    PTE
  - If the checksum flag is on, command that a checksum be
    built
  - Verify that the “out” parameter represents a valid
    logical real address within the caller’s partition else return
    H_P3
  - If the “outlen” parameter is non-negative verify that
    the logical real address of (out + outlen) is a valid logical real
    address within the same 4K page as the “out” parameter else
    return H_P4.
  - If the “outlen” parameter is negative:
    - Verify that the absolute value of outlen meets all of the following
      else return H_P4:
      - Is <= the value of
        “ibm,max-sg-len”
      - Is an even multiple of 16
      - That out + the absolute value of outlen represents a valid
        logical real address within the same 4K page as the out parameter.
    - Verify that each 16 byte scatter gather list entry meets all of
      the following else return H_SG_LIST:
      - Verify that the first 8 bytes represents a valid logical real
        address within the caller’s partition.
      - Verify that the logical real address represented by the sum of
        the first 8 bytes and the second 8 bytes is a valid logical real address
        within the same 4K page as the first 8 bytes.
  - For the Shared Logical Resource Option if any of the memory
    represented by the out/outlen parameters have been rescinded then return
    H_RESCINDED.
  - Fill in the destination DDE list from the converted
    out/outlen parameters.
  - Issue icswx instruction to execute CRB
  - Check coprocessor busy - retry / return H_PARTIAL if
    execution time expired / return H_COP_HW if compressor is broken
  - Wait for coprocessor to complete
  - If compressor hardware error return H_COP_HW
  - Check that the compressor had enough room to house the compressed
    image else return H_TOO_BIG
  - Save compression block size in R6
- Return H_Success

H_ENTER

This hcall adds an entry into the page frame table. PTEH and PTEL
contain the new entry. PTEX identifies either the page table entry group
or the specific PTE where the entry is to be added, depending upon the
setting of the Exact flag. If the Exact flag is off, the hypervisor
selects the first free (invalid) PTE in the page table entry group. For
pages with sizes less than or equal to 64 K, Flags further provide the
option to zero the page, and provide two levels of programmed I-Cache
coherence management before activating the page table mapping. This hcall
returns the PTE index of the entered mapping. If the PFO option is
implemented an optional compression flag causes the hypervisor to
initialize the page from one or more compressed data blocks and
optionally (checksum flag) check the end to end block data integrity
prior to adding the entry to the page table. If the compression flag is
not set the checksum flag is ignored. If the Memory Usage Instrumentation
(MUI) option is implemented,
flags allow for initializing MUI state for the page when the translation is
entered.Syntax:Parameters:FlagsCEC CookieZero Page: Zero the System Memory page in real mode before
placing its mapping into the PTE. This flag is ignored for memory mapped
I/O space pages; as an attempt to zero missing memory might result in a
machine check or worse. This function should use a processor dependent
algorithm optimized for maximum performance on the specific hardware.
This usually is a sequence of dcbz instructions. Setting this flag for a
page with a size larger than 64 K will result in return code of
H_TOO_BIG.I-Cache-Invalidate: Issue an icbi etc. instruction sequence to
manage the I-Cache coherency of the cachable page. This flag is ignored
for memory mapped I/O pages. For use when the D-Cache is known to be
clean, before placing its mapping into the PTE. Setting this flag for a
page with a size larger than 64 K will result in return code of
H_TOO_BIG.I-Cache-Synchronize: Issue dcbst and icbi, etc., instruction
sequence to manage the I-Cache coherency of the cachable page. This flag
is ignored for memory mapped I/O pages. For use when the D-Cache may
contain modified data, before placing its mapping into the PTE. Setting
this flag for a page with a size larger than 64 K will result in return
code of H_TOO_BIG.Exact: Place the entry in the exact PTE specified by PTEX if it
is empty else return H_PTEG_FULL.For the CMO option: CMO Option flags as defined in
.For the MUI option: The HBA bits specify settings of implementation dependent PTE bits and
associated MUI array entries for the page whose translation is being entered.
For the MUI option: The Affinity-Clear and Page-Age-Clear bits clear associated MUI array
entries for the page whose translation is being entered.
PTEX (index of the first PTE in the page table entry group to be
used for the PTE insertion)PTEH -- the high order 8 bytes of the page table entry.PTEL -- the low order 8 bytes of the page table entry.Semantics:The hypervisor checks that the logical page number is within the
bounds of partition allocated memory resources, else returns
H_Parameter.If the Shared Logical Resource option is implemented and the
logical page number represents a page that has been rescinded by the
owner, return H_RESCINDED.The hypervisor checks that the address boundary matches the
setting of the input PTE’s large page bits; else return
H_Parameter.The hypervisor checks that the page size described by the setting
of the input PTE’s page size bits is less than or equal to the
largest page size supported by the logical region that is being mapped;
else return H_Parameter.The hypervisor checks that the WIMG bits within the PTE are
appropriate for the physical page number else H_Parameter return. (For
System Memory pages WIMG=0010, or, 1110 if the SAO option is enabled, and
for IO pages WIMG=01**.)For pages with sizes greater than 64 K, the hypervisor checks
that the Zero Page, I-Cache-Invalidate, and I-Cache-Synchronize bits of
the Flags parameter are not set; else return H_TOO_BIG.Force off RS mode reserved PTEL bits (1In addition, bits 52 and 53 are forced off on platforms that
either do not contain an
“ibm,processor-storage-keys” property,
or contain an
“ibm,processor-storage-keys” property
with the value of zero in both cells. Bit 0 is forced off on platforms
that do not have the “Support for the “110” value of
the Page Protection (PP) bits” bit set to a value of 1 in the
“ibm,pa-features” property.) as well as hypervisor reserved software bits (57 and 58) in
PTEH.If the Exact flag is off, set the low order 3 bits of the PTEX to
zero (ensures that the algorithm stays inside the partition’s PFT and
is faster than a check and error code response).If the Zero Page flag is set, use optimized routine to clear page
(usually series of dcbz instructions).For the PFO option: if the compression flag is on thenCheck that the calling partition is authorized to use the
compression co-processor else return H_Function.If the page is not “main store memory” then return
H_UNSUPPORTED_FLAG.Build CRB for decompressionIf the checksum flag is on command that a checksum be
verified.Validate the inlen/in parameters and build the source DDEVerify that the “in” parameter represents a valid
logical real address within the caller’s partition else return
H_P4If the “inlen” parameter is non-negative verify that
the logical real address of (in + inlen) is a valid logical real address
within the same 4K page as the “in” parameter else return
H_P5.If the “inlen” parameter is negative: Verify that the
absolute value of inlen meets all of the following else return H_P5:
Is <= the value of
“ibm,max-sg-len”Is an even multiple of 16That in + the absolute value of inlen represents a valid logical
real address within the same 4K page as the in parameter.Verify that each 16 byte scatter gather list entry meets all of
the following else return H_SG_LIST:Verify that the first 8 bytes represents a valid logical real
address within the caller’s partition.Verify that the logical real address represented by the sum of
the first 8 bytes and the second 8 bytes is a valid logical real address
within the same 4K page as the first 8 bytes.Verify that the sum of all the scatter gather length fields
(second 8 bytes of each 16 byte entry) is <= the decompression value
in
“ibm,max-sync-cop” else return
H_TOO_BIG.For the Shared Logical Resource Option if any of the memory
represented by the in/inlen parameters have been rescinded then return
H_RESCINDED.
Fill in the source DDE list from the converted in/inlen
parameters.Build the destination DDE referencing the start of the PTE page
with the length of the PTE page size.Issue icswx instruction to execute CRBCheck coprocessor busy - retry / return H_Busy if execution
time exhausted / return H_Hardware if compressor is brokenWait for coprocessor to completeIf compressor ran out of destination space return
H_TOO_BIGCheck that the decompression filled the full page else return
H_AbortedIf the checksum flag is on check that the data is valid else
return H_BAD_DATAIf hardware error return H_HardwareIf the I-Cache-Invalidate flag is set, issue icbi instructions
for all of the page’s cache lines
If the I-Cache-Synchronize flag is set, issue dcbst and icbi
instructions for all of the page’s cache lines. Implementations may
need to issue a sync instruction to complete the coherency management of
the I-Cache.The hypervisor selects a PTE within the page table entry group
using the following.
If the MUI option is enabled
{
    Switch (HBA flags field)
    {
        Case 0b11: return H_UNSUPPORTED_FLAG;
        Case 0b10:
            enable implementation dependent HBA update bit in PTE;
            set implementation dependent PTE PUT field to previous time;
            Set Reference History Bit Array entry for physical page to configured initial value;
        Case 0b01:
            enable implementation HBA update in PTE;
            set implementation dependent PTE PUT field to previous time;
        Default: disable implementation dependent HBA update bit in PTE;
    }
    If the Affinity-Clear flag is on clear the MUI ALA for the page;
    If the Page-Age-Clear flag is on clear the MUI PAG for the page;
}
Algorithm:
if Exact flag is on then set t to 0 else set t to 7
for i=0; i<=t; i++
    Combine page table base, PTEX and offset based on (i) into R3
    R8 <- ldarx PTEH(R3)    /* prepare to take a lock on the PTE */
    if PTE is valid (R8 (bit 63) is set) then continue
    if PTE is locked (R8 (bit 57) is set) then continue
    set R8 (bit 57)         /* prepare to lock PTE */
    PTEH(R3) <- stdcx R8    /* attempt to take lock */
    if stdcx failed continue
    goto insert
return H_PTEG_FULL
insert: use code sequence from PA Book III
construct return PTEX (R4 <- (R3 - PFTbase) shifted down 4 places)
For the CMO option: set the page usage state per the CMO Option flags field of the flags parameter as defined in .
return H_Success
H_READ
This hcall returns the contents of a specific PTE in registers R4
and R5.
Syntax:
Parameters:
flags:
    CEC Cookie: Cross CEC PFT access
    READ_4: Return 4 PTEs
    R-XLATE: Include a valid logical page number in the PTE if the valid bit is set, else the contents of the logical page number field are undefined.
    For the CMO option: CMO Option flags as defined in .
PTEX (index of the PTE in the page table to be used -- if the READ_4 flag is set the low order two bits of the PTEX are forced to zero by the hypervisor to ensure that they are in the range of the PTEG and it is faster than checking.)
Semantics:
Checks that the PTEX is within the defined range of the partition’s PFT else return H_Parameter
If the READ_4 bit is clear Then load the specified PTE into R4 and R5
    If R-XLATE flag is set, then reverse translate the RPN field into the logical page number.
Else
    clear the two low order bits of the PTEX (faster than checking them)
    load the 4 PTEs starting at PTEX into R4 through R11.
    If R-XLATE flag is set, then reverse translate the RPN fields into the logical page number.
For the CMO option: set the page usage state per the CMO Option flags field of the flags parameter as defined in .
Set H_Success in R3 and return
H_CLEAR_MOD
This hcall clears the modified bit in the specific PTE. The second
double word of the old PTE is returned in R4.Syntax:Parameters:flags: For the CMO option: CMO Option flags as defined in
.
PTEX (index of the PTE in the page table to be used)
Semantics:
Check that the PTEX accesses within the PFT, else return
H_ParameterCheck that the “V” bit is one, else return
H_Not_Found.Fetch the low order double word of the PTE into R4. If the
“C” bit is zero, then return H_Success.The hypervisor synchronizes the PTE specified by the PTEX, clears
the mod bit, and returns its old value:Use the architected “Modifying a Page Table Entry General
Case” sequence from PA Book III.Only PTE bits to be modified are:In double word 0 SW bit 57 and the V bit (63)In double word 1, C bit (56).Use the proper tlbie instruction for the page size (per large
page flag within PTE) within a critical section protected by the proper
lock.The second double word of the old PTE value ends up in R4.At the point where the new values are to be activated, use the
old values with the “C” bit cleared.For the CMO option: set the page usage state per the CMO Option
flags field of the flags parameter as defined in
.Return H_SuccessH_CLEAR_REFThis hcall clears the reference bit in the specific PTE. The second
double word of the old PTE is returned in R4.Syntax:Parameters:flags: For the CMO option: CMO Option flags as defined in
.PTEX (index of the PTE in the page table to be used)Semantics:Check that the PTEX accesses within the PFT, else return
H_Parameter.Check that the “V” bit is one, else return
H_Not_Found.Only PTE bits to be modified are:In double word 1 the R bit (55)Use the architected “Resetting the Reference Bit”
sequence from PA Book III with the original second double word of the PTE
ending up in R4.For the CMO option: set the page usage state per the CMO Option
flags field of the flags parameter as defined in
.
Return H_Success
H_PROTECT
This hcall sets the page protect bits in the specific PTE.
Syntax:
Parameters:
flags: AVPN, pp0, pp1, pp2, key0-key4, n, and for the CMO option: CMO Option flags as defined in .
    Note: The pp0 portion of the flags parameter is ignored on platforms that do not have the “Support for the “110” value of the Page Protection (PP) bits” bit set to a value of 1 in the “ibm,pa-features” property.
    Note: The key0-key4 portion of the flags parameter is ignored on platforms that either do not contain an “ibm,processor-storage-keys” property, or contain an “ibm,processor-storage-keys” property with the value of zero in both cells.
PTEX (index of the PTE in the page table to be used)
AVPN: Optional “Abbreviated Virtual Page Number” -- used as a check for the correct PTE when the AVPN flag is set.
Semantics:
Check that the PTEX accesses within the PFT, else return
H_ParameterCheck that the “V” bit is one, else return
H_Not_Found.If the AVPN flag is set, and the AVPN parameter bits 0-56 do not
match that of the specified PTE, then return H_Not_Found.
The hypervisor synchronizes the PTE specified by the PTEX, sets the pp0, pp1, pp2, key0-key4, and n bits per the flags parameter.
    Note: The pp0 bit is not modified on platforms that do not have the “Support for the “110” value of the Page Protection (PP) bits” bit set to a value of 1 in the “ibm,pa-features” property.
    Note: The key0 - key4 bits are not modified on platforms that either do not contain an “ibm,processor-storage-keys” property, or contain an “ibm,processor-storage-keys” property with the value of zero in both cells.
    Only PTE bits to be modified are:
        In double word 0 SW bit 57 and the V bit (63)
        In double word 1 pp0, pp1, pp2, key0-key4, and n
    Use the architected “Modifying a Page Table Entry General
Case” sequence.Use the proper tlbie instruction for the page size (per value in
PTE) within a critical section protected by the proper lock.At the point where the new values are to be activated use the old
values with the “R” bit cleared and the
pp0,
pp1, pp2, key0-key4,
and n bits set as specified in
the flags parameter.For the CMO option: set the page usage state per the CMO Option
flags field of the flags parameter as defined in
.Return H_SuccessH_BULK_REMOVEThis hcall is for invalidating up to four entries in the page
table. The PTEX in the translation specifier high parameters identifies
the specific page table entries.
Prototype:
Syntax:
Translation specifiers:
Each is 16 bytes long made up of two 8 byte double words; a translation specifier high and a translation specifier low.
Translation Specifier High double word:
    First byte (0) is a control/status byte:
        High order two bits (0 and 1) are type code:
            00 Unused -- if found stop processing and return H_PARAMETER
            01 Request -- Processes As per H_REMOVE as modified by low order two control bits.
            10 Response -- written by hypervisor as a return status from processing individual “request” translation specifier
            11 End of String -- if found stop processing and return H_Success.
        Next two bits (2 and 3) are response code (in response to processing an individual “request” translation specifier (type code modified to 10)):
            00 Success -- the specified translation was removed as per H_REMOVE with the PTE's RC bits in the next two status bits.
            01 Not found -- the specified translation was not found as per H_REMOVE.
            10 H_PARM -- one or more of the parameters of the specified translation were invalid per H_REMOVE -- processing of the bulk entries stops at this point and the hypervisor returns H_PARAMETER.
            11 H_HW -- The hardware experienced an uncorrected error processing this translation specifier -- processing of the bulk entries stops at this point and the hypervisor returns H_HARDWARE.
        Next two bits (4 and 5) are the Reference/Change bits from the removed PTE (these bits are only valid if bits 0-3 are 1000).
        Low order two bits (6 and 7) are request modification flags:
            00 absolute -- remove the specified PTEX entry unconditionally
            01 andcond -- remove the specified PTEX entry as with the andcond flag of H_REMOVE
            10 AVPN -- remove the specified PTEX entry as with the AVPN flag of H_REMOVE
            11 not used -- if found stop processing and return H_PARAMETER.
    Bytes 1 through 7 are the PTEX (PFT byte offset divided by 16)
Translation Specifier Low double word:
    Bytes 0 through 7 are the AVPN as per H_REMOVE
Semantics:
For each translation specifier, while the translation specifier
is not “end of string”:Check that the PTEX accesses within the PFT else set H_PARM
response status in the specific translation specifier high register and
return H_ParameterIf the AVPN flag is set, and the AVPN parameter bits 0-56 do not
match that of the specified PTE then set response status Not found in the
specific translation specifier high register, Continue.If the andcond flag is set, the AVPN parameter is bit anded with
the first double word of the specified PTE (after bits 57-63 of the PTE
have been masked), if the result is non-zero, then set response status
Not found in the specific translation specifier high register, Continue.
(Note the low order 7 bits of the AVPN parameter should be zero otherwise
the likely result is a response status of Not found).The hypervisor Synchronizes the PTE specified by the PTEX.Use the architected “Deleting a Page Table Entry”
sequence.Use the proper tlbie instruction for the page size within a
critical section protected by the proper lock (per large page bit in the
specified PTE).The synchronized value of the old PTE RC bits ends up in bits 4
and 5 of the individual translation specifier high register along with
success response status.return H_SuccessH_BLOCK_REMOVEThis hcall removes up to eight sequential virtual
page table entries. Some platforms that support this hcall() might remove
fewer than 8 entries for a given actual page size / base page
size combination as communicated by the “Block Invalidate
Characteristics” system parameter (see
).
The virtual pages are all within
the same naturally aligned 8 page virtual address block and
have the same page and segment size encodings. The AVA parameter,
if used, covers the entire block of virtual page
addresses. If another processor is currently accessing
the page table entry, the entry is not removed. The availability of
this hcall() might change after partition migration, so the
caller should be prepared for an H_Function return code. The
PTEX field of the translation specifier parameters
identifies the specific page table entries.Syntax:The AVA parameter is the 8 byte AVPN as per H_REMOVE.Each Translation Specifier is 8 bytes long:
H_BLOCK_REMOVE Control Byte FormatControlDescriptionType00Unused01Request0absolute -- remove the specified PTEX entry unconditionally1AVPN -- remove the specified PTEX entry as with the AVPN flag of H_REMOVEPage State00Inhibit page usage state change01Reserved10For CMO option set page usage state to “Unused” if Success11For CMO option set page usage state to “Loaned” if Success10Response000Success -- the specified translation was removed as per H_REMOVE
with the PTE's RC bits in the next two status bits.001Not found -- the specified translation was either not found as
per H_REMOVE, Invalid (V bit = 0), or entry was “bolted” (PTE bit 59 = 1)010H_PARM Parameter is invalid011Inconsistent Page/Segment Size (does not match the L||LP and B
fields of the block anchor Page Table Entry)100Busy (The page table entry is being modified by another processor)101Cross Boundary (The page table entry crosses an 8 page virtual address boundary)110Beyond Capacity (The page table entry exceeded the number supported on this platform)111The hardware experienced an uncorrected error processing this translation specifier --
processing of the bulk entries stops at this point and the hypervisor returns H_HARDWARE.RReference bit from the removed PTE (bit is only valid if bits 0-4 are 10000)RReference bit from the removed PTE (bit is only valid if bits 0-4 are 10000)11End of String -- if found stop processing and return the accumulated return code.
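As an illustration of the response encoding in the table above, the following Python sketch decodes a control byte as written back by the hypervisor. It assumes IBM bit numbering within the byte (bit 0 is the most significant bit), takes the type code from bits 0-1 and the response code from bits 2-4 (as implied by the "bits 0-4 are 10000" validity condition), and takes the two status bits from bits 5 and 6 (as stated in the Semantics). The function and dictionary names are hypothetical and are not defined by this architecture.

```python
# Hypothetical decoder for an H_BLOCK_REMOVE control byte (illustrative only).
# Assumption: bit 0 is the most significant bit of the byte.

TYPE_NAMES = {
    0b00: "Unused",
    0b01: "Request",
    0b10: "Response",
    0b11: "End of String",
}

RESPONSE_NAMES = {
    0b000: "Success",
    0b001: "Not found",
    0b010: "H_PARM",
    0b011: "Inconsistent Page/Segment Size",
    0b100: "Busy",
    0b101: "Cross Boundary",
    0b110: "Beyond Capacity",
    0b111: "Hardware error",
}


def decode_control_byte(control):
    """Return a dict describing one control byte (an int in 0..255)."""
    if not 0 <= control <= 0xFF:
        raise ValueError("control byte must be in the range 0..255")

    type_code = (control >> 6) & 0b11          # bits 0-1
    decoded = {"type": TYPE_NAMES[type_code]}

    if type_code == 0b10:                      # Response
        response = (control >> 3) & 0b111      # bits 2-4
        decoded["response"] = RESPONSE_NAMES[response]
        if response == 0b000:
            # Status bits 5 and 6 are only meaningful on Success.
            decoded["rc_bits"] = (control >> 1) & 0b11
    return decoded
```

A caller would apply this to the control byte of each returned translation specifier and stop at the first "End of String" entry. Request-side fields are deliberately not decoded here, because their bit positions are not spelled out in the text above.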
Semantics:
Initialize return code to H_Success (overwritten if appropriate)
Initialize “anchored” flag to false and PTECOUNT to zero.
For each translation specifier, while the translation specifier type is not “end of string”:
If the translation specifier type is not “Request” Return H_PARAMETER.
Check that the PTEX accesses within the PFT else set H_PARM response status in the specific translation specifier
high register set return code to H_PARTIAL and Continue.If the lock for the associated page table entry can not be immediately obtained, then set the TSn response code to
“Busy”, set return code to H_PARTIAL and Continue.If the PTEX specified entry is either invalid (PTE V bit = 0) or “bolted” (PTE bit 59 = 1) then set response status
“Not found” in the specific translation specifier high register, set return code to H_PARTIAL and Continue.Check that actual page size / base page size combination of the PTEX specified entry is supported for
H_BLOCK_REMOVE as communicated by the “Block Invalidate Characteristics” system parameter else set
H_PARM response status in the specific translation specifier high register set return code to H_PARTIAL and
Continue.If the AVPN flag is set, and the AVPN parameter bits 0-56 do not match that of the specified PTE then set response
status “Not found” in the specific translation specifier high register, set return code to H_PARTIAL and
Continue.If NOT Anchored:then:Establish the block L||LPEstablish the block segment size encodingFor the CMO option: if the TS Control byte Page State bits are a reserved value then set H_PARM response
status in the specific translation specifier high register, set return code to H_PARTIAL and Continue; else If
the block segment encoding is an MPSS segment then set the page usage state for the large page per the
CMO Page State bits of the TS Control byte; else set the page usage state per the CMO Page State bits of the
TS specified page per the TS Control byte.Establish the block plus high order virtual addressEstablish the number of TLBs that the platform can invalidate in one operation from the associated page table
entrySet the “anchored” flag to true;else:If the associated page table entry L||LP and segment size encoding does not match the established anchored
values then set the TSn response code to “Inconsistent Page/Segment Size“, set return code to H_PARTIAL
and Continue.If the associated page table entry high order virtual address of the 8 page block does not match the established
anchored values then set the TSn response code to “Cross Boundary“, set return code to H_PARTIAL
and Continue.If PTECOUNT is greater than the number of TLBs that the platform can invalidate in one operation, then set
the TSn response code to “Beyond Capacity“, set return code to H_PARTIAL and Continue.For the CMO option: if the TS Control byte Page State bits are a reserved value then set H_PARM response
status in the specific translation specifier high register, set return code to H_PARTIAL and Continue; else if
the block segment size encoding is not MPSS then set the page usage state per the CMO Page State bits of
the TS Control byte.Add the PTEX to the validated list of PTEX’s to be removedIncrement PTECOUNTThe hypervisor resets the valid bit in the PTEs specified by the validated list of PTEX’s to be removed.The hypervisor issues a single instance of the PTE Synchronization sequence outlined in the
architecture Book IIIS under “Deleting a Page Table Entry” using the proper tlbie instruction for the page size within a critical
section protected by the proper lock (per large page bit in the specified PTE) to cover all the PTEs specified by
the validated list of PTEX’s to be removed.The synchronized value of the old PTE RC bits, for the PTEs specified by the validated list of PTEX’s to be removed,
ends up in bits 5 and 6 of the individual translation specifier high register along with success response status.
Release acquired page table entry locks.
Return the accumulated return code and TS values.
Hash Page Table Resize Option
The hash page table (HPT) for an operating system needs to be
sized depending on the size of the partition's memory. The usual rule of
thumb is that the HPT should be 1/64th of the size of memory (although
Linux will typically work well with 1/128th or even less depending on
available page sizes). An HPT which is too small will lead to poor performance,
or even crashes, if the OS is unable to fit necessary bolted mappings into
the table. An HPT which is too large wastes memory and leads to slower
TLB misses due to increased cache misses on table walks.
With the size of the HPT fixed at boot, a partition allowing memory
reconfiguration must size the HPT according to the partition's maximum
possible memory size. If the partition has a very large potential maximum
memory size, but is unlikely to reach that in practice, this can lead to
significant wastage of resources in the oversized hash table. By allowing
a partition to change its HPT size at runtime, it can start with an HPT
sized just for its initial memory, and change it if necessary when more
memory is added dynamically.
If the platform supports the Hash Page Table Resize Option, then it supports
the two hcalls defined in this section, which allow an OS to request that its
HPT should be resized. The resize operation is performed in two phases, the
“prepare” phase and the “commit” phase. The prepare phase may take place
concurrently with normal guest operation. The commit phase requires that the
guest perform no concurrent updates or accesses to the HPT (which in practice
requires no non-real mode memory accesses).
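The sizing rule of thumb given earlier in this section can be sketched in Python as follows. This is only an illustration: the function name, the choice to round the target size up to the next power of two, and the clamping to the 18 to 46 range (taken from the shift parameter of H_RESIZE_HPT_PREPARE) are assumptions, not requirements of this architecture.

```python
# Hypothetical helper: pick an HPT "shift" (log base 2 of the HPT size in
# bytes) for a given amount of partition memory.

def hpt_shift_for_memory(mem_bytes, ratio_shift=6, min_shift=18, max_shift=46):
    """Return log2 of an HPT size of roughly mem_bytes / 2**ratio_shift.

    ratio_shift=6 corresponds to the 1/64 rule of thumb; ratio_shift=7
    corresponds to the 1/128 ratio mentioned for Linux.
    """
    if mem_bytes <= 0:
        raise ValueError("mem_bytes must be positive")

    target = mem_bytes >> ratio_shift
    # Round up to the next power of two (assumption), then take log2.
    shift = max(target - 1, 0).bit_length()
    # Clamp to the range accepted by the shift parameter.
    return min(max(shift, min_shift), max_shift)
```

For example, hpt_shift_for_memory(4 << 30) returns 26, that is, a 64 MiB HPT for 4 GiB of memory. An OS that adds memory dynamically could recompute the shift and request a resize only when the value changes.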
During operation a partition may have a “Pending HPT”, a block of platform
memory organized as a Power hash page table which may become the partition's
HPT in future. The following data are associated with a Pending HPT:
Does a Pending HPT currently exist?
The Pending HPT's size
Flags associated with the Pending HPT (this is for future extension; no flags are currently defined)
Whether the Pending HPT is fully prepared or not
In order to prevent a partition from tying up platform resources indefinitely with a Pending HPT, the platform is permitted to discard a Pending HPT at any time. Operating systems should be prepared to deal with a failure of either hcall because of this.
The platform is permitted to start a partition with its HPT sized
for the current memory allocation, rather than the maximum memory for
the partition, provided that if the OS indicates via the
ibm,client-architecture-support call that it does not support HPT resizing,
the platform must then resize the HPT according to the partition's maximum
memory, using a reconfiguration reboot if necessary.
H_RESIZE_HPT_PREPARE
This hcall controls the prepare phase of HPT resizing. After successful completion of this hcall, the partition has a Pending HPT which can be made the partition's current HPT.
Syntax:
Parameters:
flags: 0, as no flags are currently defined.
shift: Log base 2 of the total size in bytes of the requested new HPT, either 0 (used to discard a Pending HPT) or else between 18 and 46.
Semantics:
Check that the partition is permitted to resize its HPT, else return H_AUTHORITY.
Check if there is a Pending HPT; if there is, then:
    If the Pending HPT size and flags match the parameters requested in this call, then:
        If the Pending HPT is not fully prepared, return H_LONG_BUSY_xxx with an estimate of the time remaining to complete preparation of the Pending HPT
        If preparation of the Pending HPT has terminated due to two bolted HPTEs needing to occupy the same slot of the same HPTEG, then return H_PTEG_FULL
        Else return H_SUCCESS
    Else discard the Pending HPT and continue
If the flags parameter is non-zero, then return H_PARAMETER
If shift is zero, then return H_SUCCESS
If (shift < 18) or (shift > 46), then return H_PARAMETER
Check that the partition is permitted to have an HPT of 2^shift bytes; if not, return H_RESOURCE
Create a new Pending HPT of size 2^shift bytes. Preparation of the new HPT may continue asynchronously.
If the Pending HPT is not fully prepared, return H_LONG_BUSY_xxx with an estimate of the time remaining to complete preparation.
Else return H_SUCCESS.
H_RESIZE_HPT_COMMIT
This hcall executes the commit phase of HPT resizing, making the Pending HPT the partition's current HPT. The caller must ensure that while it is executing, none of the partition's virtual CPU threads will update or access the HPT; that is, all threads must be executing in real mode, or stopped.
Syntax:
Parameters:
flags: 0, as no flags are currently defined.
shift: Log base 2 of the total size in bytes of the requested new HPT, between 18 and 46.
Semantics:
Check that the partition is permitted to resize its HPT, else return H_AUTHORITY.
Check if there is a Pending HPT; if there is not, then return H_CLOSED.
Check that the flags parameter is zero and the shift parameter matches the size of the Pending HPT; if not, then return H_CLOSED.
Check that the Pending HPT is fully prepared; if not, return H_BUSY.
If preparation of the Pending HPT was terminated due to finding two bolted HPTEs that need to occupy the same slot of the same HPTEG, then return H_PTEG_FULL.
Ensure that all bolted HPTEs from the partition's existing HPT also exist, correctly hashed, in the Pending HPT. HPTEs transferred from the existing HPT must have the same slot within their HPTEG in the Pending HPT as they did in the existing HPT. If the Pending HPT is smaller than the existing HPT, it is possible that two bolted HPTEs that are in the same slot of two separate HPTEGs in the existing HPT need to be put into the same HPTEG in the Pending HPT. If this occurs, return H_PTEG_FULL.
Discard the partition's existing HPT.
Make the Pending HPT be the partition's current HPT.
Mark the partition as having no Pending HPT.
Return H_Success.
Translation Control Entry Access
The Translation Control Entry (TCE) access hcall()s take as a parameter the Logical I/O Bus Number (LIOBN) that is the logical bus
number value derived from the
“ibm,dma-window” property associated with
the particular IOA. For the format of the
“ibm,dma-window” property, reference
.
H_GET_TCE
This hcall() returns the contents of a specified Translation Control Entry.
Syntax:
Parameters:
LIOBN (Logical I/O Bus Number for TCE table to be accessed)
IOBA (I/O Bus Address for indexing into the TCE table)
Semantics:
If the LIOBN, or IOBA are outside of the range of calling partition assigned values return H_PARAMETER.
If the Shared Logical Resource option is implemented and the LIOBN, or IOBA represents a logical resource that has been rescinded by the owner, return H_RESCINDED.
Load R4 with the specified TCE contents.
If specified TCE’s Page Mapping and Control bits (see ) specify “Page Fault” then return H_Success
Reverse translate the TCE’s RPN field into its logical page number
If the logical page number is owned by the calling partition then replace the RPN field of R4 with the logical page number and return H_Success.
Logically OR the contents of R4 with 0xFFFFFFFFFFFFF000 placing the result into R4.
Return H_PARTIAL.
H_PUT_TCE
This hcall() enters the mapping of a single 4 K page into the specified Translation Control Entry.
Syntax:
Semantics:
If the LIOBN or IOBA parameters are outside of the range of calling partition assigned values return H_PARAMETER.
If the Shared Logical Resource option is implemented and the LIOBN, or IOBA represents a logical resource that has been rescinded by the owner, return H_RESCINDED.
If the Page Mapping and Control field of the TCE is not “Page Fault” (see )
    Then if the logical address within the TCE parameter is outside of the range of calling partition assigned values
        Then return H_PARAMETER.
        Else translate the logical address within the TCE parameter into the corresponding physical real address.
The hypervisor stores the TCE resultant value in the TCE table specified by the LIOBN and IOBA parameters; returning H_Success. (In the “Page Fault” case the RPN remains untranslated.)
Software Note: The PA requires the OS to issue a sync instruction
to precede the signalling of an IOA to start an IO operation involving
DMA to guarantee the global visibility of both DMA and TCE data. This
hcall() does not include a sync instruction to guarantee global
visibility of TCE data and in no way diminishes the requirement for the
OS to issue it.
H_STUFF_TCE
This hcall() duplicates the mapping of a single 4 K page throughout
a contiguous range of Translation Control Entries, which is useful for
initializing and/or invalidating many entries. To retain interrupt
responsiveness, this hcall() should be called with a count parameter of no
more than 512; the LoPAR architecture enforces this
restriction to aid in client code debug.
Syntax:
Semantics:
If the LIOBN, or IOBA, are outside of the range of calling
partition assigned values return H_PARAMETER.If the Shared Logical Resource option is implemented and the
LIOBN, or IOBA represents a logical resource that has been rescinded by
the owner, return H_RESCINDED.If the count parameter is greater than 512 then return
H_P4.If the count parameter added to the TCE index specified by IOBA
is outside of the range of the calling partition assigned values return
H_PARAMETER.If the Page Mapping and Control field of the TCE is not
“Page Fault” (see
)Then if the logical address within the TCE parameter is outside
of the range of calling partition assigned valuesThen return H_PARAMETER.If the Shared Logical Resource option is implemented and the
logical address’s page number represents a page that has been
rescinded by the owner, return H_RESCINDED.Else translate the logical address within the TCE parameter into
the corresponding physical real address.The hypervisor stores the TCE resultant value in the TCE table
entries specified by the LIOBN, IOBA and count parameters; returning
H_Success. (In the “Page Fault” case the RPN remains
untranslated.)Implementation Note: The PA requires the OS to issue a sync
instruction to precede the signaling of an IOA to start an IO operation
involving DMA to guarantee the global visibility of both DMA and TCE
data. This hcall() does not include a sync instruction to guarantee
global visibility of TCE data and in no way diminishes the requirement
for the OS to issue it.H_PUT_TCE_INDIRECTThis hcall() enters the mapping of up to 512 4 K pages into the
specified Translation Control Entry. The LIOBN parameter if positive is
the cookie (LIOBN) of the specific TCE table to load. For the Multi-TCE
Table (MTT) option, if the LIOBN parameter is negative, CNT = the
absolute value of LIOBN (up to 128), and the first CNT 8 byte entries of
the buffer referenced by the TCE parameter contains the TCE table cookies
(LIOBNs) for the various TCE tables to load (up to a maximum of 128 TCE
tables).Note: Users of the MTT option that are subject to
partition migration should be prepared for the loss of support for the
MTT option after partition migration.
Syntax:
Semantics:
/* Validate the input parameters */
If the LIOBN parameter is non-negative then do
    If the count parameter is > 512 then return H_Parameter.
    If the Shared Logical Resource option is implemented and the LIOBN parameter represents a TCE table that has been rescinded by the owner, return H_RESCINDED.
    If the LIOBN parameter represents a TCE table that is not valid for the calling partition, return H_Parameter.
    Liobns[0] = the LIOBN parameter.
    If the Shared Logical Resource option is implemented and any of the I/O bus address range represented by the IOBA parameter plus count pages within the TCE table represented by the LIOBN parameter represents a rescinded resource, return H_RESCINDED.
    If any of the I/O bus address range represented by the IOBA parameter plus count pages within the TCE table represented by the LIOBN parameter is not valid for the calling partition then return H_Parameter.
end
Else do
    If the MTT Option is not enabled return H_Function.
    If the LIOBN parameter < -128 then return H_Parameter.
    If the sum of the count parameter plus |LIOBN| is > 512 then return H_Parameter.
end
If the Shared Logical Resource option is implemented and the TCE parameter represents a logical page address of a page that has been rescinded by the owner, return H_RESCINDED.
If the TCE parameter represents the logical page address of a page that is not valid for the calling partition, return H_Parameter.
Copy the contents of the page referenced by the TCE parameter to a temporary hypervisor page (Temp) for validation without the potential for caller manipulation.
/* Validate the indirect parameters */
VAL = 0
If the LIOBN parameter is negative then do
    For CNT = 1, |LIOBN|, 1
        T = 8 byte entry Temp[VAL]
        If the Shared Logical Resource option is implemented and T as an LIOBN represents a TCE table that has been rescinded by the owner, return H_RESCINDED.
        If T as an LIOBN represents a TCE table that is not valid for the calling partition, return H_Parameter.
        Liobns[VAL++] = T.
        If the Shared Logical Resource option is implemented and any of the I/O bus address range represented by the IOBA parameter plus count pages within the TCE table represented by “T” as an LIOBN represents a rescinded resource, return H_RESCINDED.
        If any of the I/O bus address range represented by the IOBA parameter plus count pages within the TCE table represented by “T” as an LIOBN is not valid for the calling partition then return H_Parameter.
    loop
end
/* Translate the logical page addresses to physical */
For CNT = 1, count, 1
    T = 8 byte entry Temp[VAL++]
    If the Page Mapping and Control field of the 8 byte entry “T” is not “Page Fault” (see ) then do
        If the Shared Logical Resource option is implemented and the value of “T” as a logical address represents a page that has been rescinded by the owner, then return H_RESCINDED.
        If “T” as a logical address is outside of the range of calling partition assigned values then return H_PARAMETER.
        Translate the logical address within the TCE buffer entry into the corresponding physical real address.
        Temp[CNT - 1] = translated physical real address.
    end
loop
/* Fill the TCE table(s) */
If LIOBN parameter is negative then VAL = |LIOBN| else VAL = 1.
For TABS = 1, VAL, 1
    The TCE table to fill is that referenced by Liobns[TABS - 1] as an LIOBN.
    INDEX = the page index within the TCE table represented by the IOBA parameter.
    For CNT = 1, count, 1
        TCE_TABLE[Liobns[TABS - 1], INDEX++] = Temp[CNT - 1]
    Loop
Loop
Return H_Success.
Implementation Note: The PA requires the OS to issue a sync
instruction to precede the signaling of an IOA to start an IO operation
involving DMA to guarantee the global visibility of both DMA and TCE
data. This hcall() does not include a sync instruction to guarantee
global visibility of TCE data and in no way diminishes the requirement
for the OS to issue it.Processor Register Hypervisor Resource AccessCertain processor registers are architecturally hypervisor
resources, in the following cases the hypervisor provides controlled
write access services.H_SET_SPRG0Syntax:Parameters:value: The value to be written into SPRG0. No parameter checking
is done against this value.H_SET_DABRNote: Implementations reporting compatibility to ISA versions less
than 2.07 are required to implement this interface; however, this
interface is being deprecated in favor of
for newer
implementations.Syntax:Semantics:If the platform does not implement the extended DABR facility
then:Validate the value parameter else return H_RESERVED_DABR and the
value in the DABR is not changed:The DABR BT bit (Breakpoint Translation) is checked for a value
of 1.Else (The platform does implement the extended DABR
facility):Load the DABRX register with 0b0011.place the value parameter into the DABR.Return H_Success.H_PAGE_INITSyntax:Parameters:flags: zero, copy, I-Cache-Invalidate, I-Cache-Synchronize, and
for the CMO option: CMO Option flags as defined in
.destination: The logical address of the start of the page to be
initializedsource: The logical address of the start of the page use as the
source on a page copy initialization. This parameter is only checked and
used if the copy flag is set.Semantics:The logical addresses are checked, they must both point to the
start of a 4 K system memory page associated with the partition or return
H_Parameter.If the Shared Logical Resource option is implemented and the
source/destination logical page number represents a page that has been
rescinded by the owner, return H_RESCINDED.If the zero flag is set, clear the destination page using a
platform specific routine (usually a series of dcbz instructions).If the copy flag is set, execute a platform specific optimized
copy of the full 4 K page from the source to the destination.If I-Cache-Invalidate flag is set, issue icbi instructions for
all of the page’s cache linesIf I-Cache-Synchronize flag is set, issue dcbst and icbi
instructions for all of the page’s cache lines. Implementations may
need to issue a sync instruction to complete the coherency management of
the I-Cache.For the CMO option: set the page usage state per the CMO Option
flags field of the flags parameter as defined in
.Return H_SuccessNote: For the CMO option, the CMO option flags may be
used to notify the platform of the page usage state of a page without
regard to its hardware page table entry or lack there of independent of
any other option flags.H_SET_XDABRNote: Implementations reporting compatibility to ISA versions less
than 2.07 are required to implement this interface; however, this
interface is being deprecated in favor of
for newer
implementations.This hcall() provides support for the extended Data Address
Breakpoint facility. It sets the contents of the Data Address Breakpoint
Register (DABR) and its companion Data Address Breakpoint Register
Extension (DABRX). A principal advantage of the extended DABR facility is
that it allows setting breakpoints for LPAR addresses that the hypervisor
had to preclude using the previous facility.
Syntax:
Semantics:
Validates the extended parameter else return H_Parameter:
Reserved Bits (0-59) are zero.
The HYP bit (61) is off.
The rest of the PRIVM field (Bits 62-63) is one of those supported:
0b01 Problem State
0b10 Privileged non-hypervisor
0b11 Privileged or Problem State
(Specifying neither Problem nor Privileged state is not supported)
Load the validated extended parameter into the DABRX
Load the value parameter into the DABR
Return H_Success.
H_SET_MODE
This hcall() is used to set hypervisor processing resource mode
registers such as breakpoints and watchpoints. The modes supported by the
hardware processor are a function of the processor architectural level as
reported in the
“cpu-version” property.
presents the valid parameter
ranges for the architectural level reported in the
“cpu-version” property
and the LoPAR architecture level as reported in the
“/chosen” property.Setting breakpoints: A breakpoint is set for a hardware thread.
Should the hardware thread complete an instruction whose effective
address matches that of the set breakpoint, a trace interrupt is signaled.
When setting the breakpoint resource, the mflags and value2 parameters
are zero. The value1 parameter is the effective address of the breakpoint
to be set.Setting watchpoints: A watchpoint is set for a hardware thread.
Should the hardware thread attempt to access within the specified double
word range of the effective address specified by the value1 parameter as
qualified by the conditions specified in the value2 parameter a Data
Storage type interrupt is signaled. When setting the watchpoint resource,
the mflags parameter is zero. The value1 parameter is the effective
double word address of the watchpoint to be set. The value2 parameter
specifies the qualifying conditions for the access, these are a subset of
the POWER ISA conditions that are relevant within the context of a
logical partition. This subset includes the MRD field, DW, DR, WT, WTI,
PNH, and PRO bits. All other value2 fields are zero.Setting Interrupt Vector Location Modes: The Alternate Interrupt
Location (AIL) Mode for the calling partition is set. Since this function
has partition wide scope, it may take longer for the hypervisor to
perform the function on all processors than is permissible during a
synchronous call; therefore, the call might return long busy. In that
case the caller should repeat the call with the same parameters after the
specified time period until the H_SUCCESS return code is received. A call
with different parameters indicates the beginning of a new partition wide
mode setting. The desired AIL mode is encoded in the two low order mflags
bits (all other mflags bits are 0) while both value1 and value2
parameters are zero.The POWER ISA requires that the setting of the LPCR ILE bit be the same
in all partition processors when not in hypervisor mode; thus all partition
processors need to be operating with MSR[EE] = 0 when changing LPCR ILE so
that the OS can change the contents of the interrupt vectors prior to any
interrupts being taken in the new mode.
Syntax:
Semantics:
switch (resource) {
    case 0: /* not used */
        return H_P2;
        break;
    case 1: /* Completed Instruction Address Breakpoint Register */
        if value2 <> 0 then return H_P4;
        if mflags <> 0 then return H_UNSUPPORTED_FLAG;
        if low order two bits of value1 are 0b11 then return H_P3; /* not hypervisor instruction address */
        move value1 into CIABR; /* note the value2 parameter is not used for this resource */
        break;
    case 2: /* Watchpoint 0 registers */
        if mflags <> 0 then return H_UNSUPPORTED_FLAG;
        if value2 bit 61 == 0b1 then return H_P4; /* not hypervisor addresses */
        move value1 into DAWR0;
        move value2 into DAWRX0;
        break;
    case 3: /* Address Translation Mode on Interrupt */
        if value1 <> 0 then return H_P3;
        if value2 <> 0 then return H_P4;
        switch (mflags) {
            case 0: /* IR = DR = 0 */
                Set LPCR AIL field of calling partition processors to 0b00;
                break;
            case 1: /* reserved */
                return H_UNSUPPORTED_FLAG (-318);
                break;
            case 2: /* IR = DR = 1 interrupt vectors at E.A. 0x18000 */
                Set LPCR AIL field of calling partition processors to 0b10;
                break;
            case 3: /* IR = DR = 1 interrupt vectors at E.A. 0xC000 0000 0000 4000 */
                Set LPCR AIL field of calling partition processors to 0b11;
                break;
            default:
                return H_UNSUPPORTED_FLAG (value based on most convenient unsupported bit);
                break;
        }
        break;
    /* Starting with PAPR Version 2.8 add the following */
    case 4: /* Set the LPCR ILE bit. */
        if (value1) return H_P3;
        if (value2) return H_P4;
        switch (mflags) {
            case 0:
                Marshal all partition processors into the hypervisor
                If any partition processors were not running MSR[EE] = 0 then
                    Release all marshaled processors and return bad_mode
                Else
                    On each partition processor set LPCR[ILE] = 0
                return H_Success;
            case 1:
                Marshal all partition processors into the hypervisor
                If any partition processors were not running MSR[EE] = 0 then
                    Release all marshaled processors and return bad_mode
                Else
                    On each partition processor set LPCR[ILE] = 1
                return H_Success;
            default:
                return H_UNSUPPORTED_FLAG (value based on most convenient unsupported bit);
                break;
        }
        break;
    default:
        return H_P2;
        break;
}
H_SET_MODE Parameters per ISA Level
PAPR level | Supported Resource Values | Values Supported mflags | Value 1 | Value 2 | Comments
2.7 | 1 | None | Breakpoint Address | None |
2.7 | 2 | None | Watchpoint Double Word Address | Watchpoint Qualifying Conditions |
2.7 | 3 | 0 | None | None | IR=DR=0 No offset
2.7 | 3 | 1 | None | None | Reserved
2.7 | 3 | 2 | None | None | IR=DR=1 offset 0x18000
2.7 | 3 | 3 | None | None | IR=DR=1 offset 0xC000 0000 0000 4000
2.7 | 3 | All Others | All Others | All Others | Reserved
2.8 | All Others | All Others | All Others | All Others | Reserved
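
The parameter checks for resources 1 to 3 in the H_SET_MODE semantics above can be sketched in Python as follows. This is a minimal illustration, not hypervisor code: the function name `check_set_mode`, the string return codes, and the dictionary of register writes are all hypothetical. It assumes that the non-error paths end in H_Success and that "bit 61" uses big-endian numbering of a 64-bit value (bit 0 is the most significant bit). Resource 4 is left out because it needs all partition processors to be marshaled, so the sketch reports it as H_P2, as the default case does.

```python
# Symbolic return codes only; their numeric values are not modelled here.
H_SUCCESS = "H_Success"
H_P2, H_P3, H_P4 = "H_P2", "H_P3", "H_P4"
H_UNSUPPORTED_FLAG = "H_UNSUPPORTED_FLAG"

# Resource 3: mflags value -> LPCR AIL field (mflags 1 is reserved).
AIL_FIELD_BY_MFLAGS = {0: 0b00, 2: 0b10, 3: 0b11}

# Assumption: bit 61 in big-endian 64-bit numbering (bit 0 = MSB) is 1 << 2.
VALUE2_BIT_61 = 1 << (63 - 61)


def check_set_mode(resource, mflags, value1, value2):
    """Return (return_code, register_writes) for resources 1 to 3."""
    if resource == 1:  # Completed Instruction Address Breakpoint Register
        if value2 != 0:
            return H_P4, {}
        if mflags != 0:
            return H_UNSUPPORTED_FLAG, {}
        if value1 & 0b11 == 0b11:  # would name a hypervisor instruction address
            return H_P3, {}
        return H_SUCCESS, {"CIABR": value1}
    if resource == 2:  # Watchpoint 0 registers
        if mflags != 0:
            return H_UNSUPPORTED_FLAG, {}
        if value2 & VALUE2_BIT_61:  # would name hypervisor addresses
            return H_P4, {}
        return H_SUCCESS, {"DAWR0": value1, "DAWRX0": value2}
    if resource == 3:  # Address Translation Mode on Interrupt
        if value1 != 0:
            return H_P3, {}
        if value2 != 0:
            return H_P4, {}
        if mflags not in AIL_FIELD_BY_MFLAGS:
            return H_UNSUPPORTED_FLAG, {}
        return H_SUCCESS, {"LPCR[AIL]": AIL_FIELD_BY_MFLAGS[mflags]}
    # Resource 0 and anything not modelled here (including resource 4).
    return H_P2, {}
```

Returning the intended register writes as data, rather than performing them, keeps the sketch testable without any hardware model.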
R1--1.For implementations supporting POWER
ISA level 2.07 and beyond: the platform must implement the H_SET_MODE
hcall() per the syntax and semantics of section
per the LoPAR level supported.
Implementation Dependent Optimizations
All platforms contain implementation specific switches and values that affect the performance of the platform. The default
settings for switches and values are tuned during platform development to achieve the desired performance characteristics
across a wide range of workloads. However, the performance of specific workloads might be further
optimized by adjusting some of these implementation specific switches and values when those workloads are being
run. Others of these switches and values might have negative effects on other platform workloads, so those switches and
values are protected from modification lest innocent partitions become victims of one of their neighbors. LoPAR version
2.8 and above provide the hcall()s defined below to set a subset of the implementation specific switches and adjust a
subset of the tuning values within a range that has been proven to be safe. The caller is expected to understand the
switch banks and resources implemented by the specific platform and the functionality of each individual switch and resource.
Special consideration is required of the caller of these functions during partition migration and micro-checkpoint/
failover operations since the underlying implementation might change. During these operations, the implementation
dependent switches and values are set to their default values for the implementation that is receiving the
partition. After a migration or failover event the availability of the Implementation Dependent Optimization hcall()s
might change, along with the resources and/or switch banks that might be adjusted, and the supported values for those
adjustments.H_ADJUST_RESOURCEThis hcall() is used to adjust the value of a given implementation dependent resource in contiguous unit steps between
the minimum and maximum supported values. These steps are not necessarily uniform either in physical values set into
the implementation dependent resource or the resultant effect they have on workload performance.Syntax:Semantics:If the value of the Resource parameter is zero then return H_PARAMETERIf the value of the Resource parameter is greater than the maximum defined
value then return H_RESERVED
If the value of the Resource parameter is not supported on this implementation
then return H_UNSUPPORTED
RC = H_Success
Current = current value of the Resource
If Current + Value > max supported value of Resource then
{ RC = H_Constrained; Current = max supported value of Resource }
Else if Current + Value < min supported value of Resource then
{ RC = H_Constrained; Current = min supported value of Resource }
Else Current = Current + Value
Set Resource to Current
On return:
R3: Contains Return Code (RC)
R4: Contains the Resource value (Current) (Return codes H_Success & H_Constrained)
R5: Contains the number of steps to minimum supported resource value
R6: Contains the number of steps to maximum supported resource value
H_SET_SWITCHES
This hcall() provides for the setting of an implementation dependent subset of the switches in an implementation dependent
bank of switches.Syntax:Semantics:If the value of the Bank parameter is zero then return H_PARAMETERIf the value of the Bank parameter is greater than the maximum defined
value then return H_RESERVED
If the value of the Bank parameter is not supported on this implementation
then return H_UNSUPPORTED
If (Mask & not (Supported-bits-in[Bank])) then RC = H_Constrained
else RC = H_Success
Turn on the bits in switches[Bank] that are ones in all three of
Supported-bits-in[Bank], Mask, and Setting
Turn off the bits in switches[Bank] that are ones in both
Supported-bits-in[Bank] and Mask but are zeros in Setting
On return:
R3: Contains Return Code (RC)
R4: Contains the Mask value representing all switches whose setting is supported for the bank
R5: Contains the Bank value (Return codes H_Success & H_Constrained)
Debugger Support hcall()s
The real mode debugger needs to be able to get to its async port
and beyond the real mode limit register without turning on virtual
address translation. The following hcall()s provide that
capability.
H_LOGICAL_CI_LOAD
Syntax:
Parameters:
size: The size of the cache inhibited load: byte = 1, half = 2, full = 4, double = 8. All other size values are illegal and return H_Parameter.
addr: The logical address of the cache inhibited location to be
read. The hypervisor checks that the address is within the range of
addresses valid for the partition, on a boundary equal to the requested
length, is not to the location BA+4 within an interrupt management area,
and mapped as cache inhibited (cache paradoxes are to be avoided)-- Else
H_Parameter.On successful return (H_Success), the read value is low order
justified in register R4.
H_LOGICAL_CI_STORE
Syntax:
Parameters:
size: The size of the cache inhibited store: byte = 1, half = 2, full = 4, double = 8. All other size values are illegal and return H_Parameter.
addr: The logical address of the cache inhibited location to be
written. The hypervisor checks that the address is within the range of
addresses valid for the partition, on a boundary equal to the requested
length, is not to the location BA+4 within an interrupt management area,
and mapped as cache inhibited (cache paradoxes are to be avoided).
value: The value to be written is low order justified in register
R6.Virtual Terminal SupportThis section has been moved to
.Architecture and Implementation Note: The requirement
to provide the
“ibm,termno” property in the
/rtas node, has been removed (it is now necessary to
look for
vty nodes and use their unit address from the
“reg” property to get the same
information). The
“ibm,termno” property called for
sequential terminal numbers, but with the use of unit addresses from the
“reg” property, such is not the
case.Dump Support hcall()sTo allow the OS to dump hypervisor data areas in support of field
problem diagnosis the hcall-dump support function set contains the
H_HYPERVISOR_DATA hcall(). This hcall() is enabled or disabled (default
disabled) via the Hardware Management Console. If the hcall-dump function
set is disabled an attempt to make a H_HYPERVISOR_DATA hcall() returns
H_Function. When the function is enabled, the hcall-dump function set is
specified in the
“ibm,hypertas-functions” property. The
requester calls repeatedly starting with a control value of zero getting
back 64 bytes per call and setting the control parameter on the next call
to the previous call’s return code until the hcall() returns
H_Parameter indicating that all hypervisor data has been dumped. The
precise meaning of the sequence of data is implementation dependent. The
H_HYPERVISOR_DATA hcall() need only return data in the firmware working
storage that is not contained in the PFT or TCE tables since the contents
of these tables are available to the OS.Starting with LoPAR Version 2.8 PAPR platforms support the
H_CLEAR_HPT hcall() independent of Client Architecture support negotiation.H_HYPERVISOR_DATASyntax:Parameters:control: A value passed to establish the progress of the
dump.Semantics:If the control value is zero, the data returned is the first
segment of the hypervisor’s working storage, with a non-negative
return code.If the control value is equal to the return code of the last
H_HYPERVISOR_DATA call, and the return code is non-negative, the data
returned in R4 through R11 is the next sequential segment of the
hypervisor’s working storage. The contents of R4 through R11 are
undefined if the return code is negative.Implementation Note: It is expected that the control
value is used by the H_HYPERVISOR_DATA routine as an offset into the
hypervisor’s data area. For the expected implementation, the hypervisor
checks the value of the control parameter to ensure that the resultant
pointer is within hypervisor’s data area else it returns
H_Parameter.H_CLEAR_HPTThis hcall() clears the hash page table (HPT) for a partition in preparation for a restart. The Virtual Real
Mode Area and Partition Adjunct mappings are exempted. The performance class of this hcall() is
“Terminal”, that is, it is allowed to take as long as it needs to perform the operation in a single call,
however, it is also allowed to return H_CONTINUE, at which time the caller needs to again make the call
until it receives H_Success else the partition HPT might be left in an inconsistent state. Nevertheless, the
reason for this hcall() is to optimize the performance of this function relative to a series of H_REMOVE
calls, therefore, hypervisors are encouraged to perform portions of the function in parallel using as many
partition processor threads as is practical.The hypervisor clears the partition’s HPT entries (sets them to invalid) except for those entries mapping the
VRMA, or a Partition Adjunct, performs a TLBIA on all partition processor threads, and returns H_Success
on the calling thread.To avoid translation exceptions, attempting to access pages whose translations are being cleared, all OS
processor threads should be operating MSR[IR] = MSR[DR] = MSR[PR] = 0b0. Any attempt to use one of
the HPT access hcall(s) (See ) during the
clearing process might result in an H_Busy return code, and/or the processor might be pressed into
service clearing the HPT.
Syntax:
Semantics:
Disable other HPT access hcall()s
For each HPT entry
    If the entry does not map the VRMA or Partition Adjunct clear the V bit
For each partition processor perform a TLBIA
Enable other HPT access hcall()s
Return H_Success
Interrupt Support hcall()s
below describes legacy vs exploitation differences related to the client
architecture, device tree and hcalls. The symbol @ESB refers to a logical
real address returned from the hcall()
. The symbol
@TIMA refers to the logical real address in the
“reg”
property of the
External Interrupt Virtualization Node.
XIVE Legacy vs. Exploitation Mode Hypervisor Call Function TableLegacyExploitation ModeClient Architecture Support
Option Vector 5 Byte 23 bits 0-1 undefined or 0b00/chosen“ibm,architecture-vec” Byte 23 bits 0-1 undefined or 0b00See ibm,architecture vector 5, byte 23 in
for more details.
Client Architecture Support
Option Vector 5 Byte 23 bits 0-1 undefined or 0b00/chosen“ibm,architecture-vec” Byte 23 bits 0-1 value 0b01“ibm,get-xive”hcall() “ibm,set-xive”hcall()
hcall() “ibm,int-off”CI load double to @ESB + 0xD00See XIVE specification for details on the CI operations.“ibm,int-on”CI load double to @ESB + 0xC00“set-indicator” with indicator 9005Same“get-sensor-state” with indicator 9005SameH_EOICI load double to @ESB + 0xC00 if store EOI is not
enabled
CI store double to @ESB + 0x400 if store EOI is
enabled
CI store byte to @TIMA + 0x11 of pre-interrupt
CPPRH_CPPRCI store byte to @TIMA + 0x11 of new CPPRH_IPICI store byte to @ESB + 0x00 H_IPOLL CI load byte from @TIMA + 0x10H_XIRRCI load half word from @TIMA + 0x810H_XIRR-XDeprecatedH_VIO_SIGNALSameThe following are clarified in the hcall definition:A race between a VIO virtual adapter generating a new
interrupt and a H_VIO_SIGNAL() or H_VIOCTL() hcall
could have multiple outcomes. H_VIO_SIGNAL()/H_VIOCTL()
wins and the interrupt does not occur, or the new interrupt
wins and one device interrupt occurs after H_VIO_SIGNAL()/H_VIOCTL()
call returns.Interrupting events that occur while H_VIO_SIGNAL()/H_VIOCTL()
has disabled interrupts will never generate an interrupt.H_VIOCTL with sub functions:
DISABLE_ALL_VIO_INTERRUPTS
DISABLE_VIO_INTERRUPT
ENABLE_VIO_INTERRUPTSame
Hypervisor Call Functions Unique to XIVE ExploitationFunction DescriptionHCALLGet the ESB addresses for a LISNAssign a target and priority to a LISNGet the target and priority assigned to a LISNGet the notification management page for a LISNSet/Reset an EQ for a target and priorityGet the EQ set for a target and prioritySet the OS reporting cache line pair for a targetGet the OS reporting cache line pair for a target Load or store operation on the ESB page for a LISNIssue the requested syncReset interrupt state to the initial state
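
For the rows of the legacy vs. exploitation table that give a concrete cache-inhibited (CI) access, the correspondence can be written down as a lookup. The sketch below is only an illustration: the names `CiOp`, `EXPLOITATION_OP`, `eoi_ops`, and `ci_address` are hypothetical, the access widths assume the usual Power sizes (byte = 1, half word = 2, double = 8), and the order of the two H_EOI accesses simply follows the order in which the table lists them. The XIVE specification remains the authority on the CI operations themselves.

```python
from typing import NamedTuple


class CiOp(NamedTuple):
    kind: str    # "load" or "store"
    size: int    # access width in bytes
    base: str    # "ESB" (from the ESB hcall) or "TIMA" (from the "reg" property)
    offset: int  # byte offset from the base address


# Legacy interface -> exploitation-mode CI access, as listed in the table.
EXPLOITATION_OP = {
    "ibm,int-off": CiOp("load", 8, "ESB", 0xD00),
    "ibm,int-on": CiOp("load", 8, "ESB", 0xC00),
    "H_CPPR": CiOp("store", 1, "TIMA", 0x11),
    "H_IPI": CiOp("store", 1, "ESB", 0x00),
    "H_IPOLL": CiOp("load", 1, "TIMA", 0x10),
    "H_XIRR": CiOp("load", 2, "TIMA", 0x810),
}


def eoi_ops(store_eoi_enabled):
    """CI accesses that replace H_EOI: an ESB access, then the CPPR store."""
    if store_eoi_enabled:
        esb_access = CiOp("store", 8, "ESB", 0x400)
    else:
        esb_access = CiOp("load", 8, "ESB", 0xC00)
    # The value stored at @TIMA + 0x11 is the pre-interrupt CPPR.
    return [esb_access, CiOp("store", 1, "TIMA", 0x11)]


def ci_address(op, esb_base, tima_base):
    """Resolve a CiOp to a logical real address given the two base addresses."""
    base = esb_base if op.base == "ESB" else tima_base
    return base + op.offset
```

For example, `ci_address(EXPLOITATION_OP["H_IPOLL"], esb_base, tima_base)` yields the address of the byte load that replaces H_IPOLL.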
Injudicious values written to the interrupt source controller may
affect innocent partitions. The following hcall()s monitor the
architected functions.H_EOISoftware Implementation Note: Issuing more H_EOI
calls than actual interrupts may cause undesirable behavior, including
but not limited to lost interrupts, and excessive phantom
interrupts.Syntax:Parameters:xirr: The low order 32 bits is the value to be written into the
calling processor’s interrupt management area’s external
interrupt request register (xirr).Semantics:If the platform implements the Platform Reserved Interrupt
Priority Level Option, and the priority field of the xirr parameter
matches one of the reserved interrupt priorities then return
H_Resource.If the value of the xirr parameter is such that the low order 3
bytes (xisr) is one of the interrupt source values assigned to the
partition, and the high order byte xirr byte (cppr) is equal or less
favored than the current cppr contents, then the value is written into
the calling processor’s xirr causing the interrupt source
controller to signal an “end of interrupt” (EOI) to the
specified interrupt source logic, then hypervisor returns H_Success or
H_Hardware (if an unrecoverable hardware error occurred). If the xirr
value is not legal, hypervisor returns H_Parameter.If the Shared Logical Resource option is implemented and the xirr
parameter represents a shared logical resource location that has been
rescinded by the owner, return H_RESCINDED.If the partition is not in XIVE legacy mode, the Hypervisor returns
H_Function.H_CPPRSyntax:Parameters:cppr: The low order byte is the value to be written into the
calling processor’s interrupt management area’s current
processor priority register (cppr).Semantics:If the platform implements the Platform Reserved Interrupt
Priority Level Option, and the priority field of the xirr parameter
matches one of the reserved interrupt priorities then return
H_Resource.The value of the cppr parameter is written into the calling
processor’s cppr causing the interrupt source controller to reject
any interrupt of equal or less favored priority. Then hypervisor returns
H_Success or H_Hardware (if an unrecoverable hardware error
occurred).If the partition is not in XIVE legacy mode, the Hypervisor returns
H_Function.H_IPISyntax:Parameters:server#: The server number gotten from the
“ibm,ppc-interrupt-server#s”
property associated with the processor and/or thread
to be interrupted.mfrr: The priority value of the inter-processor interrupt to be
signaled.Semantics:If the platform implements the Platform Reserved Interrupt
Priority Level Option, and the priority field of the xirr parameter
matches one of the reserved interrupt priorities then return
H_Resource.If the value of the server# parameter specifies one of the
processors in the calling processor’s partition, then the value in
the low order byte of the mfrr parameter is written into the mfrr
register (BA+12) of the processor’s interrupt management area
causing that interrupt source controller to signal an
“inter-processor interrupt” (IPI) to the processor associated
with the specified interrupt management area. Hypervisor then returns
H_Success or H_Hardware (if an unrecoverable hardware error occurred). If
the server# value is not legal, hypervisor returns H_Parameter.If the Shared Logical Resource option is implemented and the
server# parameter represents a shared logical resource location that has
been rescinded by the owner, return H_RESCINDED.If the partition is not in XIVE legacy mode, the Hypervisor returns
H_Function.
H_IPOLL
Syntax:
Parameters:
server#: The server number obtained from the
“ibm,ppc-interrupt-server#s”
property associated with the processor and/or thread
to be polled.
Semantics:
If the value of the server# parameter specifies one of the
processors in the calling processor’s partition, then hypervisor
reads the 4 byte contents of the processor’s interrupt management
area port at offset BA+0 into the low order 4 bytes of register R4 and
the one byte of the mfrr (BA+12) into the low order byte of R5. Reading
these addresses has no side effects and is used to poll for pending
interrupts. Hypervisor then returns H_Success or H_Hardware (if an
unrecoverable hardware error occurred). If the server# value is not
legal, hypervisor returns H_Parameter.If the Shared Logical Resource option is implemented and the
server# parameter represents a shared logical resource location that has
been rescinded by the owner, return H_RESCINDED.If the partition is not in XIVE legacy mode, the Hypervisor returns
H_Function.
H_XIRR / H_XIRR-X
These hcall()s provide the same base function; that is, they return
the interrupt source number associated with the external interrupt.
H_XIRR-X further supplies the time stamp of the interrupt. Legacy
implementations implement only H_XIRR, returning H_Function for a call to
H_XIRR-X. POWER8 implementations also implement H_XIRR-X.
Syntax:
Parameters:
H_XIRR: no input parameters defined.
H_XIRR-X: cppr: the internal current processor priority of the
calling virtual processor. Valid values are in the range 0x00 (most
favored) to 0xFF (least favored), less those values specified by the
“ibm,plat-res-int-priorities” property in
the root node.
Semantics:
Hypervisor reads the 4 byte contents of the processor’s
interrupt management area port at offset BA+4 into the low order 4 bytes
of the register R4. Reading this address has the side effect of accepting
the interrupt and raising the current processor priority to that of the
accepted interrupt.Place the timestamp when the hypervisor first received the
interrupt into R5.Hypervisor then returns H_Success or H_Hardware (if an
unrecoverable hardware error occurred).If the partition is not in XIVE legacy mode, the Hypervisor returns
H_Function.H_INT_GET_SOURCE_INFOThe H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical real
address of the MMIO page through which the Event State Buffer entry
associated with the value of the “lisn” parameter is managed.
The initial state of the ESB PQ bits will be the architected off value
of 0b01. The “lisn” parameter can come from several different properties
or hcalls. For example, the “lisn” parameter value for I/O adapters is
passed to a partition through the
“interrupts” and
“interrupt-map” properties
in the device tree node describing the I/O adapter. Alternatively, for
inter processor interrupts, the “lisn” parameter is a value the OS chooses
from a range of LISNs from the
“ibm,xive-lisn-ranges” property.
While for platform accelerators, the “lisn” parameter is a value returned
by the allocating hcall(), H_ALLOCATE_VAS_WINDOW. Depending upon the specific
Logical Interrupt Source, there might be either one or two page addresses
assigned to the Logical Interrupt Source as indicated by the returned values
of this hcall(). The hcall() returns four values in addition to the return code.
The first value is the logical real address of the full function page address,
which allows both trigger and reset functions. The second value is either the
logical real address of the trigger only page, or the reserved value -1 (all ones).
The value of -1 indicates that the Event State Buffer does not have a trigger only page.
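As an illustration of how a caller might interpret the four values described here (and itemized under Return Values below), the following Python sketch decodes R4 through R7. It is a model for reading the text, not hypervisor or OS code. The names `ppc_bit`, `SourceInfo`, and `decode_source_info` are hypothetical, and the sketch assumes the bit numbering used in this document counts bit 0 as the most significant bit of a 64-bit register, so that bit 63 is the least significant.

```python
from dataclasses import dataclass
from typing import Optional

MASK64 = (1 << 64) - 1
NO_PAGE = MASK64  # the reserved value -1 (all ones) seen as an unsigned 64-bit value


def ppc_bit(n: int) -> int:
    """Mask for bit n, assuming bit 0 is the most significant of 64 bits."""
    return 1 << (63 - n)


@dataclass
class SourceInfo:
    esb_hcall: bool               # bit 60: H_INT_ESB must be used for ESB management
    lsi: bool                     # bit 61: level sensitive (True) or message signaled (False)
    trigger: bool                 # bit 62: full function page supports trigger
    store_eoi: bool               # bit 63: Store EOI supported
    full_page: Optional[int]      # R5, or None when the register holds -1
    trigger_page: Optional[int]   # R6, or None when the register holds -1
    page_size: int                # bytes, from the power-of-2 value in R7


def decode_source_info(r4: int, r5: int, r6: int, r7: int) -> SourceInfo:
    def page(value: int) -> Optional[int]:
        value &= MASK64
        return None if value == NO_PAGE else value

    return SourceInfo(
        esb_hcall=bool(r4 & ppc_bit(60)),
        lsi=bool(r4 & ppc_bit(61)),
        trigger=bool(r4 & ppc_bit(62)),
        store_eoi=bool(r4 & ppc_bit(63)),
        full_page=page(r5),
        trigger_page=page(r6),
        page_size=1 << r7,
    )
```

For instance, an R7 value of 12 decodes to a 4096-byte page, and an R6 value of -1 decodes to "no trigger only page".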
See the XIVE specification for more details on the full function
page versus the trigger only page.Syntax:Parameters:flags: bits 0-63 reserved“lisn” is per
“interrupts”,
“interrupt-map”, or
“ibm,xive-lisn-ranges” properties,
or as returned by the
ibm,query-interrupt-source-number
RTAS call, or as returned by the H_ALLOCATE_VAS_WINDOW hcall
Return Values:
R4: “flags”:
Bits 0-59: Reserved
Bit 60: ESB hcall: ESB hcall==1, hcall() H_INT_ESB must be
used for Event State Buffer management
Bit 61: LSI: LSI==1, the interrupt associated with the
“lisn” is a LSI (Level Sensitive Interrupt), LSI==0, the
interrupt associated with the “lisn” is a MSI (Message Signaled Interrupt)
Bit 62: Trigger: Trigger==1, the full function page supports trigger
Bit 63: Store EOI Supported
R5: Logical Real address of full function Event State Buffer
management page, -1 if ESB hcall flag is set to 1.
R6: Logical Real Address of trigger only Event State Buffer
management page or -1 if ESB hcall flag is set to 1.
R7: Power of 2 page size for the ESB management pages returned
in R5 and R6. For example, a 4K page size is represented by the value
of 12 (4K = 2^12). There is a minimum page size of 4K.
Semantics:
Verify that no reserved flag bits are on else return H_Parameter.
Verify that a H_INT_RESET is not in progress else return H_State.
Validate the “lisn” parameter per the list of interrupt sources
allocated to the calling partition else return H_P2.Load R4 with the return flags, setting the reserved bits to 0.Load R5 with the logical real address of the full function Event
State Buffer management page.If the associated Event State Buffer has two management pages
defined load the logical real address of the trigger only page into
R6 else load R6 with -1.Load R7 with the power of 2 page size of the ESB management pages.Return H_Success.H_INT_SET_SOURCE_CONFIGThe H_INT_SET_SOURCE_CONFIG hcall() is used to assign a Logical Interrupt
Source to a target. The Logical Interrupt Source is designated with the
“lisn” parameter and the target is designated with the “target” and
“priority” parameters. Upon return from the hcall(), no additional interrupts
will be directed to the old EQ. The old EQ should be investigated for
interrupts that occurred prior to or during the hcall().
Syntax:
Parameters:
“flags”:
Bits 0-61: Reserved
Bit 62: setEisn: setEisn==1, set the “eisn” in the EA
Bit 63: M: M==1 masks the interrupt source in the hardware
interrupt control structure, as defined in Section 3.7 "Processing an EAS" in
the XIVE Specification.
An interrupt masked by this mechanism will be dropped, but
its source state bits will still be set. There is no race-free
way of unmasking and restoring the source. Thus this should only
be used for interrupts that are also masked at the source, and
only in cases where the interrupt is not meant to be used for
a long time (for example, because no valid target exists for it).
“lisn” is per
“interrupts”,
“interrupt-map”, or
“ibm,xive-lisn-ranges” properties,
or as returned by the
ibm,query-interrupt-source-number RTAS call, or
as returned by the H_ALLOCATE_VAS_WINDOW hcall“target” is per
“ibm,ppc-interrupt-server#s” or
“ibm,ppc-interrupt-gserver#s”“priority” is a valid priority not in
“ibm,plat-res-int-priorities”“eisn” is the guest EISN associated with the “lisn”Return Values:NoneSemantics:Verify that no reserved flag bits are on else return H_Parameter.Verify that a H_INT_RESET is not in progress else return H_State.Validate the “lisn” parameter per the list of interrupt sources
allocated to the calling partition else return H_P2.
If “priority” is not 0xFF:
Validate the “target” parameter per the list of threads
allocated to the calling partition else return H_P3.
If the partition thread count is greater than the hardware
thread count, validate the “target” has a corresponding hardware
thread else return H_Not_Available.
Validate the “priority” parameter is a valid priority
and not listed in the
“ibm,plat-res-int-priorities”
property else return H_P4.
Fill the Event Assignment Structure associated with “lisn” with:
Block and Event Notification Descriptor Table Index
associated with the “target”/“priority” pair.
Set the “M” bit to the value of the flags “M” bit.
If setEisn==1, store “eisn”.
Issue syncs required to ensure all in-flight
interrupts are complete.
Else:
Reset the Event Assignment Structure associated with “lisn” by:
Issuing syncs required to ensure all in-flight interrupts
are complete.
Invalidating the Block and Event Notification Descriptor
Table Index.
Resetting the eisn.
Return H_Success.
H_INT_GET_SOURCE_CONFIG
The H_INT_GET_SOURCE_CONFIG hcall() is used to determine which
target/priority pair is assigned to the specified Logical Interrupt Source.
Syntax:
Parameters:
“flags”: bits 0-63 Reserved
“lisn” is per
“interrupts”,
“interrupt-map”, or
“ibm,xive-lisn-ranges” properties,
or as returned by the
ibm,query-interrupt-source-number RTAS call, or as returned by the
H_ALLOCATE_VAS_WINDOW hcallReturn Values:R4: Target to which the specified Logical Interrupt Source is
assigned, else this is undefined.R5: Priority to which the specified Logical Interrupt Source is
assigned, else this is set to 0xFF (disabled).R6: EISN for the specified Logical Interrupt Source (this will
be equivalent to the LISN if not changed by H_INT_SET_SOURCE_CONFIG).Semantics:Verify that no reserved flag bits are on else return H_Parameter.Verify that a H_INT_RESET is not in progress else return H_State.Validate the “lisn” parameter per the list of interrupt sources
allocated to the calling partition else return H_P2.Load R4 with the target associated with the “lisn” parameter.Load R5 with the priority associated with the “lisn” parameter.Load R6 with the EISN associated with the “lisn” parameter.Return H_Success.H_INT_GET_QUEUE_INFOThe H_INT_GET_QUEUE_INFO hcall() is used to get the logical real
address of the notification management page associated with the
specified target and priority.
Syntax:Parameters:“flags”: bits 0-63 Reserved“target” is per
“ibm,ppc-interrupt-server#s” or
“ibm,ppc-interrupt-gserver#s”“priority” is valid priority not in the
“ibm,plat-res-int-priorities”Return Values:R4: Logical real address of notification pageR5: Power of 2 page size of the notification pageSemantics:Verify that no reserved flag bits are on else return H_Parameter.Verify that a H_INT_RESET is not in progress else return H_State.Validate the “target” parameter per the list of threads allocated
to the calling partition else return H_P2.If the partition thread count is greater than the hardware thread
count, validate the “target” has a corresponding hardware thread else
return H_Not_Available.
Validate the “priority” parameter is a valid priority and not
listed in the
“ibm,plat-res-int-priorities”
property else return H_P3.Load R4 with the ESn page from the Event Notification Descriptor
Table associated with “target” and “priority”.Load R5 with the page size of the ESn page.Return H_Success.H_INT_SET_QUEUE_CONFIGThe H_INT_SET_QUEUE_CONFIG hcall() is used to set or reset an EQ for a
given “target” and “priority”. It is also used to set the notification
config associated with the EQ. An EQ size of 0 is used to reset the EQ
config for a given target and priority. If resetting the EQ config, the
END associated with the given “target” and “priority” will be changed to
disable queuing.Upon return from the hcall(), no additional interrupts will be directed
to the old EQ (if one was set). The old EQ (if one was set) should be
investigated for interrupts that occurred prior to or during the hcall().Syntax:Parameters:“flags”:
Bits 0-62: ReservedBit 63: Unconditional Notify (n) per the XIVE spec“target”: is per
“ibm,ppc-interrupt-server#s” or
“ibm,ppc-interrupt-gserver#s”“priority”: is valid priority not in the
“ibm,plat-res-int-priorities”“eventQueue”: The logical real address of the start of the EQ“eventQueueSize”: The power of 2 EQ size per
“ibm,xive-eq-sizes”Return Values:NoneSemantics:Verify that no reserved flag bits are on else return H_Parameter.Verify that a H_INT_RESET is not in progress else return H_State.Validate the “target” parameter per the list of threads allocated
to the calling partition else return H_P2.
If the partition thread count is greater than the hardware thread
count, validate the “target” has a corresponding hardware thread else
return H_Not_Available.
Validate the “priority” parameter is a valid priority and not listed in the
“ibm,plat-res-int-priorities”
property else return H_P3.Validate the “eventQueueSize” parameter per
“ibm,xive-eq-sizes”,
else return H_P5.Validate that if “eventQueueSize” is not 0 then the calling partition owns the logical real address in “eventQueue” for the length of “eventQueueSize” else return H_P4.If “Unconditional Notify” = 0 notification is conditioned by the
notification page from H_INT_GET_QUEUE_INFO.
If the “eventQueueSize” is not 0 then:
The memory pointed to by “eventQueue” must be zeroed by the OS.
The generation bit for the EQ will start at 1.
The EQ page offset counter will start at 0.
The EQ config will be set to “eventQueue” and “eventQueueSize”.
If the “eventQueueSize” is 0 then:
The EQ config will be reset.
Queuing of interrupts will be disabled.
Issue syncs required to ensure all in-flight interrupts are complete.
Return H_Success.
H_INT_GET_QUEUE_CONFIG
The H_INT_GET_QUEUE_CONFIG hcall() is used to get an EQ and the EQ size
for a given target and priority. If requested via the “Debug” flag,
this will also return the current generation value and event queue offset.
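The following Python sketch shows one way a caller could interpret the registers this hcall() returns, using the return values and semantics listed below: R6 is 0 and R5 is -1 when no EQ has been specified, and the generation number and offset counter are meaningful only when “Debug” was requested. The function and constant names are hypothetical, and the bit masks assume bit 0 is the most significant bit of a 64-bit register (so bit 63 is the least significant).

```python
MASK64 = (1 << 64) - 1


def ppc_bit(n: int) -> int:
    """Mask for bit n, assuming bit 0 is the most significant of 64 bits."""
    return 1 << (63 - n)


DEBUG = ppc_bit(63)              # input flag: return debug data
RET_GENERATION = ppc_bit(62)     # return flag: Event Queue Generation Number (g)
RET_UNCOND_NOTIFY = ppc_bit(63)  # return flag: Unconditional Notify (n)


def interpret_queue_config(input_flags: int, r4: int, r5: int, r6: int, r7: int) -> dict:
    """Turn the R4-R7 outputs into a small dictionary (illustrative model only)."""
    configured = r6 != 0  # a size of 0 means no EQ has been specified
    result = {
        "configured": configured,
        "eq_address": (r5 & MASK64) if configured else None,
        "eq_bytes": (1 << r6) if configured else 0,
        "unconditional_notify": bool(r4 & RET_UNCOND_NOTIFY),
    }
    if input_flags & DEBUG:
        # Only valid when the caller asked for debug data.
        result["generation"] = 1 if r4 & RET_GENERATION else 0
        result["offset_counter"] = r7
    return result
```

Keeping the debug fields out of the result unless “Debug” was set avoids mistaking the architected 0 in R7 for a real offset counter value.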
Syntax:Parameters:"flags":
Bits 0-62: ReservedBit 63: Debug: Return debug data“target”: is per
“ibm,ppc-interrupt-server#s” or
“ibm,ppc-interrupt-gserver#s”“priority”: is valid priority not in the
“ibm,plat-res-int-priorities”
Return Values:
R4: “flags”:
Bits 0-61: Reserved
Bit 62: The value of Event Queue Generation Number (g) per
the XIVE spec if “Debug” = 1
Bit 63: The value of Unconditional Notify (n) per the XIVE spec
R5: The logical real address of the start of the EQ
R6: The power of 2 EQ size per
“ibm,xive-eq-sizes”
R7: The value of Event Queue Offset Counter per XIVE spec if
“Debug” = 1, else 0
Semantics:
Verify that no reserved flag bits are on else return H_Parameter.
Verify that a H_INT_RESET is not in progress else return H_State.
Validate the “target” parameter per the list of threads
allocated to the calling partition else return H_P2.
If the partition thread count is greater than the hardware thread
count, validate the “target” has a corresponding hardware thread
else return H_Not_Available.
Validate the “priority” parameter is a valid priority and not
listed in the
“ibm,plat-res-int-priorities”
property else return H_P3.Load R4 with the return flags, setting the reserved bits to 0.Load R5 with the logical real address of the EQ associated with
“target” and “priority”. Set to -1 if no EQ has
been specified.
Load R6 with the size of the EQ associated with the “target” and
“priority”. Set to 0 if no EQ has been specified.If “Debug” = 1
Load the event queue generation number into the return flagsLoad R7 with the event queue offset counterUse the appropriate hardware facility to get an atomic
view of the generation number and offset counter.Return H_Success.H_INT_SET_OS_REPORTING_LINEThe H_INT_SET_OS_REPORTING_LINE hcall() is used to set the reporting
cache line pair for the calling thread. The reporting cache lines will
contain the OS interrupt context when the OS issues a CI store byte to
@TIMA+0xC10 to acknowledge the OS interrupt. The reporting cache lines
can be reset by inputting -1 in “reportingLine”. Issuing the CI store byte
without reporting cache lines registered will result in the data not being
accessible to the OS.Syntax:Parameters:“flags”: bits 0-63 Reserved“reportingLine”: The logical real address of the reporting cache line pairReturn Values:NoneSemantics:Verify that no reserved flag bits are on else return H_Parameter.Verify that a H_INT_RESET is not in progress else return H_State.If the partition thread count is greater than the hardware thread count,
validate the “target” has a corresponding hardware thread else return
H_Not_Available.If the “reportingLine” is not -1
Validate the calling partition owns the logical real address
in “reportingLine” for two cache lines else return H_P2.
Validate that the “reportingLine” is cache-line aligned, else
return H_P2.Set the “reportingLine” in the NVT associated with the input
“target”.If the “reportingLine” is -1
Reset the NVT’s reporting line.Return H_Success.H_INT_GET_OS_REPORTING_LINEThe H_INT_GET_OS_REPORTING_LINE hcall() is used to get the logical real
address of the reporting cache line pair set for the input “target”. If no
reporting cache line pair has been set, -1 is returned.Syntax:Parameters:“flags”: bits 0-63 Reserved“target”: is per
“ibm,ppc-interrupt-server#s”Return Values:R4: The logical real address of the reporting line if set, else -1Semantics:Verify that no reserved flag bits are on else return H_Parameter.Verify that a H_INT_RESET is not in progress else return H_State.Validate the “target” parameter per the list of threads allocated
to the calling partition else return H_P2.
Validate the thread indicated by “target” is online else
return H_Not_Available.If the partition thread count is greater than the hardware
thread count, validate the “target” has a corresponding hardware
thread else return H_Not_Available.Load R4 with the logical real address of the reporting line
associated with “target”. Load R4 with -1 if no reporting line has been set.Return H_Success.H_INT_ESBSyntax:Parameters:“flags”:
bits 0-62: Reservedbit 63: Store: Store=1, store operation, else load operation“lisn” is per
“interrupts”,
“interrupt-map”, or
“ibm,xive-lisn-ranges” properties, or as returned by the
ibm,query-interrupt-source-number
RTAS call, or as returned by the H_ALLOCATE_VAS_WINDOW hcall“esbOffset” is the offset into the ESB management page for the load or store operation“storeData” is the data to write for a store operationReturn Values:R4: The value of the load if load operation, else -1Semantics:Verify that no reserved flag bits are on else return H_Parameter.Verify that a H_INT_RESET is not in progress else return H_State.Validate the “lisn” parameter per the list of interrupt sources
allocated to the calling partition else return H_P2.Validate the “esbOffset” parameter is valid per the XIVE Spec
else return H_P3.If bit 63 of flags is 0
Issue the load operation to the “esbOffset” of the ESB
management page associated with “lisn”.Load R4 with the results of the load operation.If bit 63 of flags is 1
Issue the store operation to the “esbOffset” of the ESB
management page associated with “lisn”, storing “storeData”.Load R4 with -1.Return H_Success.H_INT_SYNCThe H_INT_SYNC hcall() is used to issue hardware syncs that will
ensure any in flight events for the input lisn are in the event queue.Syntax:Parameters:“flags”: bits 0-63 Reserved“lisn” is per
“interrupts”,
“interrupt-map”, or
“ibm,xive-lisn-ranges” properties, or as returned by the
ibm,query-interrupt-source-number
RTAS call, or as returned by the H_ALLOCATE_VAS_WINDOW hcallReturn Values:NoneSemantics:Verify that no reserved flag bits are on else return H_Parameter.Verify that a H_INT_RESET is not in progress else return H_State.Validate the “lisn” parameter per the list of interrupt sources
allocated to the calling partition else return H_P2. Perform the appropriate hardware syncs to ensure any in flight
events for the input “lisn” are in the event queue.Return H_Success.H_INT_RESETThe H_INT_RESET hcall() is used to reset all of the partition’s interrupt
exploitation structures to their initial state. This means losing all
interrupt state previously set via H_INT_SET_SOURCE_CONFIG and
H_INT_SET_QUEUE_CONFIG.
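The interaction between H_INT_RESET and the other exploitation hcall()s (each of which verifies that no H_INT_RESET is in progress, else returns H_State) can be modeled with a small Python sketch. This is a simplified illustration, not hypervisor code: the class, its dictionaries, and the `while_resetting` hook are hypothetical, the status values are symbolic rather than the architected numeric return codes, and only one exploitation hcall() is modeled.

```python
import threading
from enum import Enum


class Status(Enum):
    # Symbolic stand-ins, not the architected numeric return codes.
    H_SUCCESS = "H_Success"
    H_PARAMETER = "H_Parameter"
    H_STATE = "H_State"


class PartitionInterruptState:
    """Toy model of one partition's interrupt exploitation state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._reset_in_progress = False
        self.source_config = {}  # lisn -> (target, priority)
        self.queue_config = {}   # (target, priority) -> (eq_address, eq_size)

    def h_int_set_source_config(self, flags, lisn, target, priority):
        if flags != 0:  # this toy model treats every flag bit as reserved
            return Status.H_PARAMETER
        with self._lock:
            if self._reset_in_progress:
                return Status.H_STATE
            self.source_config[lisn] = (target, priority)
        return Status.H_SUCCESS

    def h_int_reset(self, flags, while_resetting=None):
        if flags != 0:
            return Status.H_PARAMETER
        with self._lock:
            if self._reset_in_progress:  # another thread is already resetting
                return Status.H_STATE
            self._reset_in_progress = True  # blocks the other hcalls
        try:
            if while_resetting is not None:
                while_resetting()  # hypothetical hook to observe the blocked window
            self.source_config.clear()
            self.queue_config.clear()
        finally:
            with self._lock:
                self._reset_in_progress = False  # unblock when finished
        return Status.H_SUCCESS
```

A call made through the hook while the reset is running sees H_State, and after the reset returns the previously configured sources are gone.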
Syntax:
Parameters:
“flags”: bits 0-63 Reserved
Return Values:
None
Semantics:
Verify that no reserved flag bits are on else return H_Parameter.
Block all other exploitation hcalls (they all will return H_State
if called while H_INT_RESET is in progress).
Verify that no other threads are currently in the middle of an
H_INT_RESET for this partition else return H_State.
Reset the following:
All EAs
All ESB states
All ENDs, specifically including clearing out each END’s EQ pointer and size
All OS Reporting Lines
Unblock all other exploitation hcalls when finished.
Return H_Success.
Memory Migration Support hcall()s
To assist an OS in memory migration, the following hcall() is
provided. During the migration process, it is the responsibility of the
OS to not change the DMA mappings referenced by the translations buffer
(for example by using the H_GET_TCE, H_PUT_TCE hcall()s, or other DMA
mapping hcall()s). Failure of the OS to serialize such DMA mapping access
may result in undesirable DMA mappings within the caller’s
partition (but not outside of the caller’s partition). Further, it
is the responsibility of the OS to serialize calls to the H_MIGRATE_DMA
service relative to the logical bus numbers referenced. Failure of the OS
to serialize relative to the logical bus numbers may result in DMA data
corruption within the caller’s partition.On certain implementations, DMA read operations targeting the old
page may still be in process for some time after the H_MIGRATE_DMA call
returns; this requires that the OS not reuse/modify the data within the
old page until the worst case DMA read access time has expired. The
“ibm,dma-delay-time” property (see
) gives the OS this implementation
dependent delay value. Failure to observe this delay time may result in
data corruption as seen by the caller’s I/O adapter(s).R1--1.For the LPAR option supporting the hcall-migrate function
set: The platform must supply the
“ibm,dma-delay-time” property under the
/rtas node of the device tree.Memory pages may be simultaneously mapped by multiple DMA agents,
with different translation table formats and operation characteristics.
The H_MIGRATE_DMA hcall() atomically performs the memory migration
process so that the new page contains the old page contents (as updated
by any DMA write operations allowed during migration), with all DMA
mappings and engines directed to access the new page. The entries in the
mapping list contain the logical bus number associated with the mapping
and the I/O address of the mapping. From these two data, the hcall()
associates the using DMA agent, that agent’s DMA control
procedures, the specific mapping table and mapping table entry.R1--2.For the LPAR option supporting the hcall-migrate function
set: The platform must support migration of pages mapped for
DMA using any of the platform supported DMA agents.R1--3.For the LPAR option supporting the hcall-migrate function
set: All the platform’s DMA agents must support
mechanisms that enable the platform to meet the syntax, semantics and
requirements set forth in section 14.5.4.8.1.Implementation Note: The minimal hardware mechanisms
to support the hcall-migrate function set are to quiesce DMA operation,
flush outstanding data to their targets (both reads and writes), modify
their DMA mapping and re-enable operation utilizing said modified DMA
mapping without introducing unrecoverable operational failures. Provision
for the hardware to direct DMA write operations to both old and new pages
provides a significantly more robust implementation.It is the intent of this architecture to have all memory in the
platform have the capability to be migrated. However, on the rare
implementation that cannot meet that intent, the
“ibm,no-h-migrate-dma” property may be
provided in
memory nodes for which H_MIGRATE_DMA cannot be
implemented.R1--4.For the LPAR option supporting the hcall-migrate function
set: If a memory node cannot support H_MIGRATE_DMA, then that
memory node must contain the
“ibm,no-h-migrate-dma” property.For the I/O Super Page option the I/O page size is an attribute of
the specified LIOBN (I/O pages mapped by a given LIOBN are a uniform
size), also the syntax and semantics of H_MIGRATE_DMA are extended to
allow migration of I/O pages that are larger than 4K bytes and have more
than 256 xlates translation entries. Specifying more than 256 translation
entries requires a sequence of calls to H_MIGRATE_DMA with the same
“newpage” address. Making a call in the sequence with a
length parameter of zero terminates the operation - should this
termination happen after the start of the physical migration, the
resulting state of the calling partition’s memory is unpredictable.
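The sequence of “length” values for such a multi-call migration (defined with the “length” parameter of H_MIGRATE_DMA below) can be sketched in Python. The function name is hypothetical, and the sketch makes one interpretive assumption that the text does not settle: the "entries yet to be passed" count is taken to include the 256 entries carried by the current call.

```python
MAX_ENTRIES_PER_CALL = 256


def migrate_length_sequence(total_entries: int) -> list:
    """Return the "length" value for each H_MIGRATE_DMA call in a sequence.

    Assumption: a negative length is the negative of the entries not yet
    passed, counted before the current call's 256 entries are consumed.
    """
    if total_entries < 1:
        raise ValueError("at least one translation entry is required")
    remaining = total_entries
    lengths = []
    while remaining > MAX_ENTRIES_PER_CALL:
        lengths.append(-remaining)         # continuing call carrying 256 entries
        remaining -= MAX_ENTRIES_PER_CALL
    lengths.append(remaining)              # final call: positive entry count
    return lengths
```

Under that assumption, 600 entries would be passed as three calls with lengths -600, -344, and 88; a list of 256 or fewer entries needs a single call with a positive length.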
Failure to make a continuing call in the sequence for more than one
second aborts the operation; again the resulting state of the calling
partition’s memory is unpredictable.The introduction of super pages introduces the case where portions
of the super page may be I/O mapped and thus require the use of
H_MIGRATE_DMA to move the logical super page from one physical page to
another even though the super page as a whole may not be I/O mapped. To
handle this case, the LIOBN value of 0xFFFFFFFF is reserved to allow the
specification, within a translation entry (passed to H_MIGRATE_DMA via
the xlates parameter), of a super page that is not currently I/O mapped.
In this case, the normally reserved byte at xlates entry offset 4 is used
to specify the power of two size of the super page.R1--5.For the I/O Super Page option: the platform must
support the setting by the client of byte 3 bit 0 of the
ibm,architecture.vec 5 as input to the
ibm,client-architecture-support method.H_MIGRATE_DMASyntax:Parameters:newpage (The Logical address of the new page to be the target of
the TCE translations)
xlates (The Logical address of a list of translations against the
target page); the format of this list is:
List starts on a page (4 K) boundary.
Contains up to 256 translation entries:
First 4 bytes of a translation entry is the logical bus number as
from either the “ibm,dma-window” property or the reserved LIOBN 0xFFFFFFFF.
Next 12 bytes of a translation entry is the logical bus offset
(I/O bus address). The format of the I/O bus address is dependent upon
the DMA agent:For 32 bit PCI, the high order 8 bytes are reserved with the low
order 4 bytes containing a 4 K aligned address (low order 12 bits
=zero).For 64 bit PCI, the high order 4 bytes are reserved with the low
order 8 bytes containing a 4 K aligned address (low order 12 bits
=zero).For the I/O Super Page option the very first translation entry
passed is for the largest I/O page to be migrated by this sequence of
calls; else all translation entries are for the single 4K byte logical
page being migrated. The first translation entry may either be a current
I/O mapping for the largest I/O page that the caller wishes to migrate,
or the first translation entry may use the reserved LIOBN number of
0xFFFFFFFF, with the next byte indicating the page size as 2**N where N
is the numeric value of the byte at offset 4 into the translation entry
with the low order 8 bytes of the translation entry being the logical
real address of the start of the page to be migrated (the low order N
bits = zero).length (Number of entries in translation list is less than or
equal to 256)If the total number of translation entries in the xlates list is
less than or equal to 256 then the “length” parameter is the
number of translation entries.For the I/O Super Page option and specifying more than 256
translation entries, the client makes a series of calls, each passing 256
translation entries with the “length” parameter being the
negative of the total number of translation entries yet to be passed
until there are less than or equal to 256 remaining then for the final
call in the initiating sequence the “length” parameter is
positive as above.Semantics:For the I/O Super Page option: determine if a migration operation
is in process for this “newpage” address:Then:If the previous hcall() for the migration operation was more than
1 second ago, return H_Aborted.If the length parameter value is zero then abort the migration
operation and return H_TERM.If the length parameter value is not the next expected in the
sequence return H_P3.Record the new xlatesIf the length parameter is less than zero return
H_CONTINUE.ElseIf the number of outstanding operations is more than an
implementation specific number as communicated in the
“ibm,vec-5” property then return
H_Resource.
If the length parameter is less than zero, initiate a new
migration operation for the “newpage” address. (Note:
resources for the operation may be allocated at this point and freed when
the operation terminates either normally, in error, or via timeout.
Implementations may, in unusual cases, use a busy return code to wait for
the release of resources from an imminently completing operation.)
The first xlate entry specifies the length and starting address
of the page to be migrated, if this specification is invalid (unsupported
length, the address is invalid for the partition, or not aligned to the
length) return H_MEM_PARM.If the operation specifies more than an implementation specific
number of xlates as communicated in the
“ibm,vec-5” property then return
H_Resource.Check that the page to be migrated can be migrated, else
H_PARAMETER.Check that the newpage is within the allocated logical page range
of the calling partition and the address is aligned to the I/O page size
of the first translation entry passed else H_PARAMETER.If the Shared Logical Resource option is implemented and the
newpage parameter represents a shared logical resource location that has
been rescinded by the owner, return H_RESCINDED.The contents of the xlates buffer are checked.This may be done as each entry is used, or it may be done prior
to starting the operation.If the former, then partial processing must be backed out in the
case of a detected parameter error.If the later, then the translation entries must be copied into an
area that is not accessible by the calling OS to prevent parameter
corruption after they have been verified. The OS perceived reentrancy of
the function is not diminished if this option is chosen.The xlates buffer starts on a 4 K boundary within the
partition’s logical address range else H_PARAMETER.The length parameter is between (for the I/O Super Page option:
the negative of the maximum number of xlate entries supported as
indicated in the
“ibm,architecture-vec-5” property of the
/chosen device tree node else 1) and 256 else
H_PARAMETER.For the I/O Super Page option: the length of the physical page to
be migrated is the length of the I/O page of the first translation entry;
else the length of the physical page to be migrated is 4K bytes.Each translation originally references the same physical page, or
a portion thereof, else H_PARAMETER.
Each logical bus offset is within the allocated range of the
calling partition else H_PARAMETER.If the Shared Logical Resource option is implemented and the
logical bus offset represents a shared logical resource location that has
been rescinded by the owner, return H_RESCINDED.Check the logical bus number:Is allocated to the calling partition else H_PARAMETER.Or: If the Shared Logical Resource option is implemented and the
logical bus number represents a shared logical resource location that has
been rescinded by the owner, return H_RESCINDED.For the I/O Super Page option: if the LIOBN implies a larger page
size than that specified by the first translation entry for this migrate
operation, place the index of the translation entry (0-255) into register
R4 and return H_PGSB_PARM.If the LIOBN referenced an unsupported DMA agent, place the index
of the translation entry (0-255) into register R4 and return
H_Function.If the logical bus number is not supported, return
H_PARAMETER.Note: The following is written from the perspective
of a PCI DMA agent; other DMA agents may require a different sequence of
operations to achieve equivalent results.The hypervisor disables arbitration for the IOA(s) associated
with the translation entries. (In some cases, where multiple IOAs share a given TCE range, arbitration must be disabled for multiple IOAs. The firmware assigned the bus address ranges to each IOA, so it knows which IOAs correspond to which translation.)
- The hypervisor waits for outstanding DMA write activity to complete. (This is accomplished by doing a load from an appropriate register in the bridge(s) closest to the IOA -- when the load completes (dependency on load data is satisfied) all DMA write activity has completed.)
- The hypervisor copies the contents of the 4 K page originally accessed by the TCE(s) to the page referenced by the newpage value.
- The hypervisor translates the logical address within the newpage parameter and stores the resultant value in the TCE table entries specified by the translation entries.
- The hypervisor executes a sync operation to ensure that the new TCE data is visible.
- The hypervisor enables arbitration on the IOA(s) associated with the translation entries and returns H_Success.

Implementation Notes:
- The firmware should be written to minimize the arbitration disable time. The old page should be read into cache (possibly using the data cache touch operations) prior to disabling the arbitration. Implementation dependent algorithms can significantly improve the page copy time.
- The firmware does not have to serialize this hcall() with other hcall()s as long as it updates the TCE using atomic eight (8) byte write operations. However, if the OS does not serialize this call with H_PUT_TCE to the same TCE, and with other H_MIGRATE_DMA calls to the same IOA(s), the calling LPAR's DMA buffers could be corrupted.
- To minimize the effect of such unsupported DMA agents, the platform designer should isolate such agents on their own bus with their own “ibm,dma-window” property
specification.

Performance Monitor Support hcall()s

H_PERFMON

To manage the Performance Monitor Function:

Syntax:

Parameters:
- mode-set: Platform specific modes to be set by this call
- mode-reset: Platform specific modes to be reset by this call

Semantics:
- mode-set bit(s) check for platform specific validity else H_PARAMETER
- mode-reset bit(s) check for platform specific validity else H_PARAMETER
- If any mode-set bits are set, activate corresponding mode(s) - if logically capable else H_RESOURCE
- If any mode-reset bits are on, deactivate corresponding mode(s) - if logically capable else H_RESOURCE
- Place current state of platform specific modes in R4, return H_Success

Defined Perfmon mode bits:
- bit 0: 1 = Enable Perfmon
- bit 1: 0 = Low threshold granularity, 1 = High threshold granularity

H_GET_DMA_XLATES_LIMITED

This hcall returns the I/O bus address of the first entry defined for the specified LIOBN and the corresponding logical address within the range beginning with the Start logical address and less than the End logical address. The search is limited to the range of I/O bus addresses specified by the SIOBA and EIOBA parameters.

R1--1. For the LRDR Option: The platform must implement the H_GET_DMA_XLATES_LIMITED hcall() per the syntax and semantics specified in section
.

R1--2. For the LRDR Option: The platform must present the “ibm,h-get-dma-xlates-limited-supported” property in all PCI host bridge OpenFirmware nodes for which the H_GET_DMA_XLATES_LIMITED hcall() is supported for all child LIOBNs.

Syntax:

Parameters:

Register R4: Logical I/O Bus Number (LIOBN)
- Bits 0-31 are reserved and set to zero.
- Bits 32-63 contain a 32-bit unsigned binary integer that
identifies a translation which may have one or more entries that translate to a page within a range specified by the Start and End logical addresses.

Register R5: Start Logical Address (SLA)

Register R6: End Logical Address (ELA)

Register R7: Start I/O Bus Address (SIOBA) of the translation specified by the LIOBN
- The SIOBA register may specify a special value of -1 or a starting IOBA.

Register R8: End I/O Bus Address (EIOBA) of the translation specified by the LIOBN
- The EIOBA register may specify a special value of -1 or an ending IOBA.

Semantics:
- Check that the specified LIOBN is supported and allocated to the calling logical partition, else H_PARAMETER.
- Check that the specified start logical address (SLA) is within the allocated range of the calling logical partition, and is designated on a 4 K-byte boundary, else H_P2.
- Check that the End logical address (ELA) minus 4 K is within the allocated range of the calling logical partition, and is designated on a 4 K-byte boundary, else H_P3. (May point no further than one page beyond the maximum partition logical real address in order to stay within the partition yet include the last partition page in the range of the test.)
- Check that the specified starting logical address (SLA) is less than the specified ending logical address (ELA), else H_P2.
- Check that the page specified by the logical addresses within the specified range is within the allocated range of the calling logical partition and the address is 4 K-byte aligned else H_P2.
- Check the content of SIOBA:
  - If a value other than -1 is specified, check that the specified start I/O bus address (SIOBA) is not outside of the range of IOBAs for the specified LIOBN, else H_P4.
  - If the SIOBA specifies a value of -1, the hypervisor starts the search at the lowest IOBA in the translation table, otherwise the search starts at the address specified by the SIOBA.
- Check the content of EIOBA:
  - If a value other than -1 is specified, check that the specified ending I/O bus address (EIOBA) is not outside of the range of IOBAs for the specified LIOBN, else H_P5.
  - If the EIOBA specifies a value of -1, the hypervisor ends the search at the highest IOBA in the translation table, otherwise the search ends at the address specified by the EIOBA.

Outputs:

Place the I/O bus address and corresponding logical address into
the respective registers:

Register R4: I/O Bus Address (IOBA)
- This register contains a 64-bit unsigned binary integer that specifies the I/O bus address of the page within the specified logical address range for the specified LIOBN.
- The IOBA is returned when the hcall() completes with either H_PARTIAL, H_PAGE_REGISTERED, or H_IN_PROGRESS return codes.

Register R5: Corresponding Logical Address (CLA)
- This register contains a 64-bit unsigned binary integer that designates the logical address of a page within the specified range that corresponds to the I/O bus address.
- If the hcall() completes with H_IN_PROGRESS return code, the corresponding logical address (CLA) is not returned.

When the hcall() completes with H_PARTIAL or H_PAGE_REGISTERED return code:
- The I/O bus address (IOBA) and corresponding logical address (CLA) are returned.

When the hcall() completes with H_PAGE_REGISTERED return code:
- The I/O bus address (IOBA) is for the final page of the translation table for the specified LIOBN as limited by the EIOBA parameter.

When the hcall() completes with H_IN_PROGRESS return code:
- The current IOBA being searched against the specified range is returned, but the corresponding logical address is not returned.
- The hcall can be reissued by specifying the IOBA as the starting IOBA without incrementing the IOBA by the resource page size.

Firmware Implementation Notes:
- When the H_GET_DMA_XLATES_LIMITED hcall() is issued, the
hypervisor searches the translation table designated by the specified LIOBN, from the entry for SIOBA through the entry for EIOBA in IOBA order, for the entries that translate to a page within a given range of logical addresses. If an entry is found, the hcall() completes with the H_PAGE_REGISTERED return code if the page found is the last entry in the translation table, or the H_PARTIAL return code for all other pages, and the IOBA with the corresponding logical address are returned in output registers R4 and R5 respectively.
- The hypervisor searches the translation table in IOBA order, and proceeds in that order until an entry that translates to a physical address within the specified range of logical addresses is found, in which case the hcall() completes with H_PARTIAL or H_PAGE_REGISTERED return code, or until the end of the translation table, as specified by the EIOBA parameter, is reached, in which case the hcall() completes with H_Success.

Software Implementation Notes:
- When the hcall() completes with H_PARTIAL return code, the stored IOBA is incremented by the page size of the resource corresponding to the specified LIOBN, and then specified as the starting I/O bus address on a subsequent call, where the hypervisor then proceeds with the search until the end of the translation table, specified by the EIOBA parameter, is reached. The caller can accumulate a full list of the IOBAs for the specified LIOBN that translate into the specified range of logical addresses, which then forms part of the xlate translation entries specified as an input to the H_MIGRATE_DMA function.
- When the hcall() completes with H_PAGE_REGISTERED return code, this indicates that the page is contained in the specified range of logical addresses, and that it is the last page of the translation table, so the search for that LIOBN is complete.
- If a value other than -1 is specified in the starting I/O bus address register, the program should check that the specified SIOBA value is not the same as the returned IOBA.

RTAS Requirements

RTAS function as specified in this architecture is still required for
a LoPAR LPAR partition. RTAS is instantiated via an OF client interface call. RTAS operates without memory translation; therefore, the OS should instantiate it within the RMA. However, the OF client interface does not enforce this limitation. The RTAS calling sequences remain unchanged. However, in LPAR configurations RTAS code is implemented differently than in non-LPAR systems. LPAR RTAS has a part which is replicated in each partition and, since RTAS has the capability to manipulate hardware system resources, a part which is implemented in the hypervisor. The hypervisor checks the RTAS parameters for validity before execution. Therefore, the function of the partition replicated RTAS call is to marshal the arguments and make the required hidden hcall()s to the hypervisor. In a non-LPAR system, RTAS calls are assumed to be made with valid parameters. This cannot be assumed with LPAR. The LPAR RTAS operates by all the rules of non-LPAR RTAS relative to it running real, with real mode pointers to arguments and the same serialization requirement relative to a single partition. However, the hypervisor may not assume that the caller is following these serialization rules. Failure on the part of the OS to properly serialize is allowed to cause unpredictable results within the scope of the calling partition but may not affect the proper operation of other platform partitions.

The following is a list of RTAS functions that are not defined or
implemented when the LPAR option is active:
- restart-rtas

R1--1. For the LPAR option: The platform must implement the PowerPC External Interrupt option.

R1--2. For the LPAR option: The Firmware must initialize each processor’s interrupt management area’s CPPR to the most favored level and its MFRR to the least favored level before passing control of the processor to the OS.

R1--3. For the LPAR option: The RTAS rules of serialization of RTAS calls must only apply to a partition and not to the system.

R1--4. For the LPAR option: The hypervisor cannot trust the RTAS calls to have no errors, therefore, the hypervisor must check a partition’s RTAS call parameters for validity.

R1--5. For the LPAR option: RTAS must be instantiated within the RMA of partition storage.

R1--6. For the LPAR option: RTAS arguments must be within the RMA of partition storage unless specifically specified in the RTAS call definition.

R1--7. For the LPAR option: If one or more hcalls fail due to hardware error (return status -1), the platform must make available, prior to the completion of the next boot sequence, via an event-scan/check-exception, an error log indicating the hardware FRU responsible for such failures. Due to the asynchronous nature of error analysis, there is not a direct correlation between the log and a specific failing hcall(); indeed, the error log may precede the failing hcall().

OF Requirements

The hypervisor is initialized and configured prior to the loading of
OF into the partition and boot of any client program (OS) in the partition by OF. The NVRAM data base that describes the platform’s partitioning is used to trigger the loading and initialization of the hypervisor. When Logical Partitioning is enabled, a copy of OF code is loaded into each partition where it builds the per partition device tree within the partition’s RAM. The per partition device tree contains only entries for platform components actually assigned to or used by the partition. The invocation of the subset of the OF Client interface specified below appears the same to the OS image regardless of the state of the LPAR option.

A model of the boot sequence is as follows:
1. Support processor runs chip tests and configures the CPU chips.
2. The support processor loads the boot ROM image into System Memory along with the configuration information.
   - POST code
   - Initialization Firmware
   - Hardware configuration reporting structures
   - OF
   - Hypervisor RTAS
3. Boot ROM executes POST and Initialization Firmware.
4. Processor initialization code synchronizes the time bases of all platform processors to a small value (approaching zero).
5. Initialization Firmware accesses the NVRAM Partition Database to determine if the LPAR option is enabled.
6. Initialization Firmware initializes the hypervisor.
7. The hypervisor configures itself using the hardware configuration reporting structures.
8. The hypervisor configures the various partitions with resources as required by the NVRAM Partition Database.
9. The hypervisor loads a copy of OF into each partition passing to OF a resource reporting structure known as the NACA/PACA.
10. OF notices in the NACA/PACA that a specific partition table is specified.
11. OF scans the configuration and walks the buses to build the partition device tree.
12. OF requests the specific partition table from the NVRAM Partition Database.
13. OF loads RTAS into the partition’s memory.
14. OF pulls in the configuration variables from the partition’s NVRAM area and uses them to determine the partition’s boot device.
15. OF then loads the client program and starts executing it with one of the partition’s processors.
16. The client program notices that it is running on an LPAR capable machine but does not have the hypervisor bit on in the MSR so must use hcall() routines for its PFT and TCE accesses. The presence of the “ibm,hypertas-functions” property is a duplicate indication of LPAR mode.

R1--1. For the LPAR option: The OF code state must be retained
after all partitions are initialized pending future boot requests.

R1--2. For the LPAR option: The OF code must recognize that logical partitioning is required as opposed to a non-LPARed system.

R1--3. For the LPAR option: The OF must generate the device tree for the partition within the partition’s RAM.

R1--4. For the LPAR combined with Dynamic Reconfiguration option: The “interrupt-ranges” property for any reported interrupt source controller must report all possible interrupt source numbers.

R1--5. For the LPAR option: The OF device tree for a partition must include in the root node, the “ibm,partition-no” property.

R1--6. For the LPAR option: The OF device tree for an LPAR capable model not running in a partition must include in the root node, the “ibm,partition-no” property when the default partition number for the first partition created is not 1.

R1--7. For the LPAR option: The “ibm,partition-no” property value must be an integer in the range of 1 to 2^20 - 1.

R1--8. For the LPAR option: The OF device tree for a partition must include in the root node, the “ibm,partition-name” property.

R1--9. For the LPAR option: When the platform does not provide a partition manager and the one and only partition in the system owns all the partition visible system resources, then the default value of the “ibm,partition-name” property must be the content of the SE keyword (as displayed in the same form as the root node “system-id” property) with a hyphen added between the plant of manufacture and sequence number.

R1--10. For the LPAR option: The nodes of the OF device tree for a partition that represent platform resources that are not explicitly allocated for the control of the platform’s OS image must be marked “used-by-rtas”. This includes, but is not limited to, memory controllers and IO bridges that are a part of the platform’s infrastructure common to more than one partition and commonly represented in the OF device tree, but does not include read only resources such as environmental sensors.

R1--11. For the LPAR option: The OF must, at the OS’s request, load the required RTAS into the partition’s real addressable memory region.

R1--12. For the LPAR option: The OF must use the partition’s segment of the NVRAM to establish the partition’s boot device and configuration variables.

R1--13. For the LPAR option: The OF must load the client program and choose the partition’s processor on which to begin execution.

Note: It is the responsibility of the client program to recognize whether or not to use LPAR page management.

R1--14. For the LPAR option: The platform must initialize the time base of the first processor to a small (approaching zero) value prior to turning over control of the processor to a client program.

R1--15. Reserved

R1--16. For the LPAR option: The OF Client Interface must restrict access to only resources contained within the calling partition’s version of the device tree.

R1--17. For the LPAR option: The OF Client Interface must prevent the calls of one partition’s client program from interfering with the operation of another partition’s client program.

R1--18. For the LPAR option: The OF Client Interface must restrict its supported calls and methods to those specified in
.

R1--19. For the LPAR option: Any hidden hcall()s which firmware may use to implement OF functions must check their parameters to ensure compliance with all of the architecturally mandated OF requirements.

R1--20. For the LPAR option: The OF Client Interface functions “start-cpu” and “resume-cpu” must restrict their operation to processors assigned to the calling Client’s partition.
OF Client Interface Functions Supported under the LPAR Option

test
canon
child
finddevice
getprop
getproplen
instance-to-package
instance-to-path
nextprop
package-to-path
parent
peer
setprop
call-method
test-method
close
open
read
seek
write
claim
release
boot
enter
exit
start-cpu
milliseconds
size (/chosen/nvram)
get-time
instantiate-rtas
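
The restriction in R1--18 can be pictured as a simple allow-list check in front of the client interface dispatcher. The following is only a minimal sketch under stated assumptions: the names `SUPPORTED_LPAR_CLIENT_SERVICES`, `is_supported_under_lpar` and `dispatch_client_call` are hypothetical and do not come from LoPAR, the service names are transcribed from the table above (using the IEEE 1275 spelling `canon`), and raising an exception for an unlisted call is an illustrative choice, since the architecture does not say here how a rejected call is reported.

```python
# Hypothetical allow-list built from the table of OF Client Interface
# functions supported under the LPAR option. "size" is the method listed
# in the table for the /chosen/nvram node.
SUPPORTED_LPAR_CLIENT_SERVICES = frozenset({
    "test", "canon", "child", "finddevice", "getprop", "getproplen",
    "instance-to-package", "instance-to-path", "nextprop",
    "package-to-path", "parent", "peer", "setprop", "call-method",
    "test-method", "close", "open", "read", "seek", "write", "claim",
    "release", "boot", "enter", "exit", "start-cpu", "milliseconds",
    "size", "get-time", "instantiate-rtas",
})


def is_supported_under_lpar(service: str) -> bool:
    """Return True if the named call or method appears in the table."""
    return service in SUPPORTED_LPAR_CLIENT_SERVICES


def dispatch_client_call(service, handlers, *args):
    """Run a handler only when the service is on the allow-list.

    `handlers` is a hypothetical mapping from service name to a callable.
    Rejecting with PermissionError is an assumption made for this sketch.
    """
    if not is_supported_under_lpar(service):
        raise PermissionError(f"{service!r} is not supported under LPAR")
    return handlers[service](*args)
```

For example, `dispatch_client_call("milliseconds", {"milliseconds": lambda: 0})` runs the handler, while a name that is absent from the table is rejected before any handler is looked up.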
NVRAM Requirements

The NVRAM is divided into multiple partitions each containing
different categories of data similar to files in a file system (these NVRAM
partitions are not to be confused with LPAR partitions). Each NVRAM
partition is structured with a self identifying header followed by its
partition unique data. Many of these NVRAM partitions contain data only
relevant to the platform firmware, while others contain data that either is
for OS image use from boot to boot or is used to communicate operational
parameters from the OS image to the platform. The platform firmware on LPAR
supporting platforms structures the NVRAM as per
. Each LPAR partition is assigned
a region of NVRAM space. This includes space for LPAR partition specific
configuration variables as well as the minimum 4 K space reserved for the
OS image. The hypervisor restricts access for the LPAR partition, through
logical address translation and range checking, to its assigned NVRAM
region. Other regions of NVRAM are reserved for firmware use including, for
instance, information about how the system should be partitioned.
LPAR NVRAM Map

Real Address Range | Per Partition NVRAM access routine rtas call Logical Address Range -- outside of legal range return 0x00 and discard write data. | Contents
0x00 to F-1 | NA | Firmware only partitions (Signatures 0x00 to 0x6F)
F to (F-1+P) | 0x00 to P | Per LPAR partition copies of supported NVRAM partitions with signatures 0x70 to 0x7F
(F+P) to (F-1+2P) | 0x00 to P | Per LPAR partition copies of supported NVRAM partitions with signatures 0x70 to 0x7F
... | |
(F+(P*(n-1))) to ((F-1)+nP) | 0x00 to P | Per LPAR partition copies of supported NVRAM partitions with signatures 0x70 to 0x7F
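
The address arithmetic implied by the LPAR NVRAM map can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: `nvram_real_address` and `PartitionedNvram` are hypothetical names, `f` and `p` stand for the F and P of the map, partitions are numbered 1 to n, and the logical range is treated as the half-open interval 0 to P-1 so that it matches the P bytes spanned by each real address range. Out-of-range reads return 0x00 and out-of-range writes are discarded, as the map's column heading states.

```python
from typing import Optional


def nvram_real_address(f: int, p: int, partition_index: int,
                       logical: int) -> Optional[int]:
    """Map a partition-relative NVRAM offset to a real NVRAM address.

    f: size of the firmware-only region (real addresses 0x00 to F-1)
    p: size of each per-partition region
    partition_index: 1-based LPAR partition position in the map
    Returns None when the logical address is outside the legal range.
    """
    if partition_index < 1:
        raise ValueError("partition_index is 1-based")
    if not 0 <= logical < p:  # assumption: legal offsets are 0 .. P-1
        return None
    return f + (partition_index - 1) * p + logical


class PartitionedNvram:
    """Toy model of the range-checked per-partition NVRAM access."""

    def __init__(self, f: int, p: int, n: int) -> None:
        self.f, self.p, self.n = f, p, n
        self.cells = bytearray(f + n * p)

    def _real(self, partition_index: int, logical: int) -> Optional[int]:
        if partition_index > self.n:
            raise ValueError("no such partition")
        return nvram_real_address(self.f, self.p, partition_index, logical)

    def read(self, partition_index: int, logical: int) -> int:
        real = self._real(partition_index, logical)
        # Outside of the legal range: return 0x00.
        return 0x00 if real is None else self.cells[real]

    def write(self, partition_index: int, logical: int, value: int) -> None:
        real = self._real(partition_index, logical)
        # Outside of the legal range: discard the write data.
        if real is not None:
            self.cells[real] = value & 0xFF
```

With F = 0x1000 and P = 0x400, the last legal offset of the second partition, 0x3FF, maps to 0x17FF, which agrees with the (F-1+2P) upper bound in the map.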
NVRAM partitions on LPAR platforms

Visible to | Partition Signatures | Partition Name | Comments
Only to the Platform firmware | 0x00 - 0x6F | |
Only to Platform firmware and the OS image running in the owning LPAR Partition. The read and write NVRAM RTAS routines | 0x70 | Common | This partition is duplicated per partition.
Only to Platform firmware and the OS image running in the owning LPAR Partition. The read and write NVRAM RTAS routines | 0x7F | 0x7777777777777777-77777777 | This partition is duplicated per partition and is at least 4 KB long when the OS is first installed.
R1--1. For the LPAR option: Platform OF must locate configuration variables that the OS must manipulate to select options as to how the specific OS image interfaces or relates to the platform in the partition’s “system” partition signature (0x70) named “common”; specifically, none may be located in the “OF” signature (0x50).

R1--2. For the LPAR option: The NVRAM region assigned to an LPAR partition must contain, after any platform required NVRAM partitions have been allocated, a free space partition a minimum of 4 KB long prior to the installation of the partition’s OS image.

Administrative Application Communication Requirements

The platform needs to communicate with an administrative application (outside of the scope of LoPAR) to manage the platform resources. The administrative application may run in an external computer such as a Hardware Management Console, or it may be integrated into a service partition. Many system facilities are not dedicated to an LPAR partition but are managed through the HMC and the administrative application.

R1--1. For the LPAR option: The platform must provide a communications means to the administrative application.

R1--2. For the LPAR option: The platform must respond to messages received from the administrative application.

RTAS Access to Hypervisor Virtualized Resources

All allocatable platform resources are always assigned to a
partition. There always exists a dummy partition that is never active. Resources assigned to partitions that are inactive may be reassigned to other partitions by mechanisms implemented in the Hardware Management Console.

R1--1. For the LPAR option: The nvram-fetch RTAS call must restrict access to only the LPAR partition’s assigned “OS”, “System” and “Error Log” NVRAM partitions.

R1--2. For the LPAR option: The nvram-store RTAS call must restrict access to only the LPAR partition’s assigned “OS”, “System” and “Error Log” NVRAM partitions.

R1--3. For the LPAR option: The get-time-of-day RTAS call must return the LPAR partition’s specific time of day clock value.

R1--4. For the LPAR option: The set-time-of-day RTAS call must set the LPAR partition’s specific time of day clock value.

Firmware Implementation Note: The model implementation keeps time of day on a partition basis. What is really changed is the offset from the hardware TOD clock, which is not normally written (it is only written if for some reason it is approaching its maximum value, such as after a battery failure).

R1--5. For the LPAR option: The event-scan RTAS call must report global events to each LPAR partition and LPAR partition local events only to the affected LPAR partition.

R1--6. For the LPAR option: The check-exception RTAS call must report global events to each LPAR partition and LPAR partition local events only to the affected LPAR partition.

R1--7. For the LPAR option: The rtas-last-error RTAS call must
report only RTAS errors affecting the calling LPAR partition.

R1--8. For the LPAR option: The ibm,read-pci-config RTAS calls must restrict access to only IOAs assigned to the calling LPAR partition, and if the configuration address is not available to the caller, must return a status of Success with all ones as the output value.

R1--9. For the LPAR option: The ibm,write-pci-config RTAS calls must restrict access to only IOAs assigned to the calling LPAR partition, and if the configuration address is not available to the caller, must be ignored and must return a status of Success.

R1--10. For the LPAR option: The ibm,write-pci-config RTAS calls must prevent changing of the firmware assigned interrupt message number on IOAs configured to use message signaled interrupts.

R1--11. For the LPAR option: The platform must virtualize the display-character RTAS call such that the operator can distinguish and selectively read messages from each partition without interference with messages from other partitions.

R1--12. For the LPAR option: The set-indicator RTAS call must restrict access to only indicators assigned to the calling LPAR partition.

R1--13. For the LPAR option: The effects of the system-reboot RTAS call must be restricted to only the calling LPAR partition.

Firmware Implementation Note: One standard OS response to a machine check is to reboot, expecting the firmware to reset any error conditions, such as those in the I/O sub-system. When the I/O sub-system, or parts thereof, are shared among multiple partitions, the platform cannot allow the boot of one partition to prevent another partition from detecting that it was also affected by an I/O error.

R1--14. For the LPAR option: The platform must deliver machine check and other event notifications to all affected partitions before initiating recovery operations such as rebooting and resetting hardware fault isolation circuits.

R1--15. For the LPAR option: The
start-cpu RTAS call must be restricted to only the processors assigned to the calling LPAR partition.

R1--16. For the LPAR option: The query-cpu-stopped-state RTAS call must be restricted to only the processors assigned to the calling LPAR partition.

R1--17. For the LPAR option: The power-off and ibm,power-off-ups RTAS calls must deactivate the calling partition and not power off the platform if other partitions remain active.

R1--18. For the LPAR option: The set-time-for-power-on RTAS call must activate the platform when the partition requesting the earliest activation time is to be activated.

R1--19. For the LPAR option: The ibm,os-term RTAS call must adjust support processor surveillance to account for the termination of the LPAR partition’s OS.

R1--20. For the LPAR option: The ibm,set-xive RTAS call must restrict access to only interrupt sources assigned to the calling LPAR partition by silently failing if the interrupt source is not owned by the calling partition (return success without modifying the state of the unowned interrupt logic).

R1--21. For the LPAR option: The ibm,set-xive RTAS call must restrict the written queue values to only interrupt processors assigned to the calling LPAR partition.

R1--22. For the LPAR option: The ibm,get-xive RTAS call must restrict access to only interrupt sources assigned to the calling LPAR partition by silently failing if the interrupt source is not owned by the calling partition (return success with the least favored interrupt level, the interrupt server number is undefined -- possibly all ones).

R1--23. For the LPAR option: The ibm,int-on RTAS call must restrict access to only interrupt sources assigned to the calling LPAR partition by silently failing if the interrupt source is not owned by the calling partition (return success without modifying the state of the unowned interrupt logic).

R1--24. For the LPAR option: The ibm,int-off RTAS call must restrict access to only interrupt sources assigned to the calling LPAR partition by silently failing if the interrupt source is not owned by the calling partition (return success without modifying the state of the unowned interrupt logic).

R1--25. For the LPAR option: The
ibm,configure-connector RTAS call must restrict access to only Dynamic Reconfiguration Connectors assigned to the calling LPAR partition.

R1--26. For the LPAR option: The platform must either define or virtualize the power domains used by the set-power-level RTAS call such that power level settings do not affect other partitions.

R1--27. For the LPAR option: The set-power-level and get-power-level RTAS calls must restrict access to only power domains assigned to the calling partition.

R1--28. For the LPAR option: The platform must restrict the availability of the ibm,exti2c RTAS call to at most one partition (like any IOA slot).

R1--29. For the LPAR option: The ibm,set-eeh-option RTAS call must restrict access to only IOAs assigned to the calling partition.

R1--30. For the LPAR option: The ibm,set-slot-reset RTAS call must restrict access to only IOAs assigned to the calling partition.

R1--31. For the LPAR option: The ibm,read-slot-reset-state2 RTAS call must restrict access to only IOAs assigned to the calling partition.

R1--32. For the LPAR option: The ibm,configure-bridge RTAS call must restrict access to only configuration addresses assigned to the calling partition.

R1--33. For the LPAR option: The ibm,set-eeh-option RTAS call must restrict access to only IOAs assigned to the calling partition.

R1--34. For the LPAR option: The platform must restrict the ibm,open-errinjct, ibm,close-errinjct, and ibm,errinjct RTAS calls, as well as the errinjct properties, to be available on at most one partition as defined by a platform wide firmware configuration variable.

R1--35. For the LPAR option: Any hidden hcall()s which firmware may use to implement RTAS functions must check their parameters to ensure compliance with all of the architecturally mandated RTAS requirements.

Shared Processor LPAR Option

The Shared Processor LPAR (SPLPAR) option allows the hypervisor to
generate multiple virtual processors by time slicing a single physical
processor. These multiple virtual processors may be assigned to one or more
OS images. There are two primary customer advantages to SPLPAR over the
standard LPAR. Most obviously, the assigned processing capacity of the
partition can scale downwards to allow for more OS images to be supported
on a single platform. The second customer advantage is that a SPLPAR
platform can achieve higher processor utilization by providing partitions,
that can use extra processing capacity, with the spare capacity ceded from
other partitions. This allows the customer to take advantage of the
variable nature of the instantaneous load on any one OS image to achieve an
increase in the average utilization of the platform’s capacity. While
the peak capacity (directly related to the platform cost) stays constant,
the customer may see a significant improvement in the average capacity
among all the platform’s workloads. However, since the peak capacity
cannot be physically exceeded, the customer may experience a wider variance
in performance when exercising the SPLPAR option.In principal, the OS images running on the virtual processors of an
SPLPAR platform need not be aware that they are sharing their physical
processors, however, in practice, they experience significantly better
performance if they make a few optimizations. Specifically, if the OS
images cedes their virtual processor to the platform when they are idle,
and confers their processor to the holder of a spin lock for which the
virtual processor must wait. Another significant change due to SPLPAR is
that there may not be a fixed relationship between a virtual processor and
the physical processor that actualizes it. In those cases, such physical
information as location codes are undefined, affinity and associativity
values are indistinguishable, relationships to secondary caches are
meaningless, and any attempt by an OS to characterize the quality of its
processor (such as running diagnostics or performance comparisons to other
virtual processors) provides unreliable results. OF entities that represent
physical characteristics of a virtual processor that do not remain fixed
take on altered definitions/requirements in an SPLPAR environment.

To provide input to the capacity planning and quality of service
tools, the hypervisor reports certain statistics to an OS. These include
the minimum processor capacity that the OS can expect (the OS may cede any
unused capacity back to the platform), the maximum processor capacity that
the platform grants to the OS, the portion of spare capacity (up to the
maximum) that the platform grants to the OS, and the maximum latency to a
dispatch via an hcall().

The OS image optionally registers a data area (VPA) for each virtual
processor using the H_REGISTER_VPA hcall(). The hypervisor maintains a
variable, within the data area, that is incremented each time the virtual
processor is dispatched/preempted, such that the dispatch variable is
always even when the virtual processor is dispatched and always odd when it
is not dispatched. The architectural intent for the usage of the dispatch
count variable is described below in the paragraph devoted to conferring the
processor. Additionally, this hcall() may register a trace buffer which the
OS may activate to gain detailed information about virtual processor
preemption and dispatching.

Both the VPA and the trace log buffer contain statistics on how long
the virtual processor has waited (not been dispatched on a physical
processor). Architecturally, the virtual processor wait time is divided
into three intervals:

1. The time that the virtual processor waited to become logically ready to run again, for example:
   - The time needed to resolve a fault
   - The time needed to process a hypervisor preemption
   - The time until a wake up event after voluntarily relinquishing the physical processor
2. The time spent waiting after interval 1 until virtual processor capacity was available. Shared processor partitions are granted a quantum of virtual processor capacity (execution time) each dispatch wheel rotation; thus if the partition has used its capacity, the ready to run virtual processor has to wait until the next quantum is granted.
3. The time spent waiting after interval 2 until the virtual processor was dispatched on a physical processor. This arises from the fact that multiple ready to run virtual processors with virtual processor capacity may be competing for a single physical processor.

Two other performance statistics are available via hcall()s: the
Processor Utilization Register and the Pool Idle Count, returned by the
H_PURR and H_PIC hcall()s respectively. These two statistics are counts in
the same units as counted by the processor time base. Like the time base,
the PUR and PIC are 64 bit values that are set to a numerically low value
during system initialization. The difference between their values at the
end and beginning of monitored operations provides data on virtual
processor performance. The value of the PUR is a count of processor cycles
used by the calling virtual processor. The PUR count is intended to provide
an indication to the partition software of the computation load supported
by the virtual processor. SPLPAR virtual processors are created by
dispatching the virtual processor’s architectural state on one of the
physical processors from a pool of physical processors. The value of the
PIC is the summation of the physical processor pool idle cycles, that is
the number of time base counts when the pool could not dispatch a virtual
processor. The PIC count is intended to provide an indication to platform
management software of the pool capacity to perform more work.

A well behaved OS image normally cedes its virtual processor to the
platform using the H_CEDE hcall() after it determines that it currently has
run out of useful work. The H_CEDE hcall() gives up the virtual processor
until either an external interrupt (including decrementer and
Inter-Processor Interrupt) occurs or another one of the partition’s
processors executes an H_PROD hcall() (see below). Note that the
decrementer appears to continue to run during the time that the virtual
processor is ceded to the platform. The H_CEDE hcall() always returns to
the next instruction; however, prior to executing the next instruction,
any pending interrupt is taken. To simulate atomic testing for work, the
H_CEDE call may be called with interrupts disabled; however, the H_CEDE
call activates the virtual processor’s MSR[EE] bit to avoid going into a
wait state with interrupts masked.

A multi-processor OS uses two methods to initiate work on one
processor from another; in both cases the requesting processor places a
unit of work structure on a queue, and then either signals the serving
processor via an Inter-Processor Interrupt to service the work queue, or
waits until the serving processor polls the work queue. The former method
translates directly to the SPLPAR environment; the second method may
experience significant performance degradation if the serving processor has
ceded. To provide a solution to this performance problem, the SPLPAR
provides the H_PROD hcall(). The H_PROD hcall() takes as a parameter the
virtual processor number of the serving processor. Waking a potentially
ceded or ceding processor is subject to many race conditions. The semantic
of the H_PROD hcall() attempts to minimize these race conditions. First, the
H_CEDE and H_PROD hcall()s serialize on each other per target virtual
processor. Second, the H_PROD firmware sets a per virtual processor
memory bit before attempting to determine if the target virtual
processor is preempted. If the processor is not preempted, the H_PROD
hcall() immediately returns; otherwise the processor is dispatched and the
memory bit is reset. If the processor was dispatched, and subsequently the
virtual processor does a H_CEDE operation, the H_CEDE hcall() checks the
virtual processor’s memory bit and if set, resets the bit and returns
immediately (not ceding the physical processor to another virtual
processor). An OS might choose to always do an H_PROD after an enqueue to a
polled queue or it might qualify making the H_PROD hcall() with a status
bit set by the target processor when it decides to cede its virtual
processor.

Locking in a SPLPAR environment presents a problem for
multi-programming OSs, in that the virtual processor that is holding a lock
may have been preempted. In that case, spinning, waiting for the lock,
simply wastes time since the lock holder is in no position to release the
lock -- it needs processor cycles and cannot get them for some period of
time, and the spinner is using up processor cycles waiting for the lock. This
condition is known as a live lock; however, it eventually resolves itself.
The SPLPAR optimization to alleviate this problem is to have the waiting
virtual processor “confer” its processor cycles to the lock
holder’s virtual processor until the lock holder has had a chance to
run another dispatch time slice.

As with the cede/prod pair of functions above, the confer function is
subject to timing window races between the waiting process determining that
the lock holder has been preempted and execution of the H_CONFER hcall()
during which time the originally holding virtual processor may have been
dispatched, released the lock and ceded the processor. To manage this
situation, the H_CONFER takes two parameters, one that specifies the
virtual processor(s) that are to receive the cycles and the second
parameter (valid only when a single processor is specified) which
represents the dispatch count of the holding virtual processor atomically
captured when the waiting processor decided to confer its cycles to the
holding processor.

The semantic of H_CONFER checks the processor parameter for validity,
then if it is the “all processors” code proceeds to the
description below. If the processor parameter refers to a valid virtual
processor owned by the calling virtual processor’s partition, that is
not dispatched, that has not conferred its cycles to all other processors,
and whose current dispatch count matches that of the second
parameter, the time remaining from the calling processor’s time slice is
conferred to the specified virtual processor.

If the first parameter of H_CONFER specifies the “all
processors” code, then it marks the calling virtual processor to
confer all its cycles until all of the partition’s virtual
processors, that have not ceded or conferred their cycles, have had a
chance to run a dispatch time slice. The “all processors”
version may be viewed as having the hypervisor record the dispatch counts
for all the other platform processors in the calling virtual
processor’s hypervisor owned “confer structure”, then
prior to any subsequent dispatch of the calling processor, if the confer
structure is not clear, the hypervisor does the equivalent of removing one
entry from the confer structure and calling H_CONFER for the specific
virtual processor. If the specific virtual processor confer is rejected
(because the virtual processor is running, ceded, conferred, or the
dispatch count does not match) then the next entry is tried until the
confer structure is clear before the originally calling virtual processor
is re-dispatched.

Virtual processors may migrate among the physical processor pool from
one dispatch cycle to the next. OF device tree properties that relate the
characteristics of the specific physical processor such as location codes,
and other vital product data cannot be consistent and are not reported in
the nodes of type cpu if the partition is running in SPLPAR mode. Most
processor characteristic properties, such as time base increment rate, are
consistent for all processors in the system, physical and virtual, and so
are still reported via their standard properties. Additionally, nodes of
type L2 are not present in the tree since they are shared
with other virtual processors making optimizations based upon their
characteristics impossible. The Processor Identification Register (PIR)
should not be accessed by the OS since from cycle to cycle the OS may get
different readings. Instead, the virtual processor number (the number from
the “ibm,ppc-interrupt-server#s” property, contained in the nodes of type
cpu, associated with this virtual processor) is used
as the processor number to be passed as parameters to RTAS and hcall()
routines for managing interrupts, etc.

Software Note: When the client program (OS) first gets
control during the boot sequence, the virtual processor number of the
single processor that is operational is identified by the
/chosen node of the device tree. The cpu nodes list the other virtual
processors that the
first processor may start. These are started one at a time, giving the
virtual processor number as an input parameter to the call. As each
processor starts, it starts executing a program that picks up its virtual
processor number from a memory structure initialized by the processor that
called the start-cpu function. The newly started processor then records the
location of its per processor memory structure (where it saves its virtual
processor number) in one of the SPRG registers.

Virtual Processor Areas

The per processor areas are registered with the H_REGISTER_VPA
hcall() that takes three parameters. The first parameter is a flags field
that specifies the specific sub function to be performed, the second is
the virtual processor number of one of the processors owned by the
calling virtual processor’s partition for whom the area is being
registered. The third parameter is the logical address, within the
calling virtual processor’s partition, of the contiguous logically
addressed storage area to be registered. Registered areas are aligned on
a cache line (L1) size boundary, may not span an LMB boundary, and for
the CMO option may not span an entitlement granule boundary. The length
of the area is provided to the hcall() starting at byte offset 4 of
the area being registered. The H_REGISTER_VPA hcall() registers various
types of areas, and after verifying the parameters, initializes the
structure’s variables.

- Per Virtual Processor Area: This area contains shared processor operating parameters as defined in the Per Virtual Processor Area table. A shared processor LPAR aware OS registers this area early in its initialization. The other types of virtual processor areas can only be registered after the Per Virtual Processor Area has been successfully registered. The minimum length of the Per Virtual Processor Area is 640 bytes and the structure may not span a 4096 byte boundary.
- Dispatch Trace Log Buffer: This area is optionally registered by OSs that desire to collect performance measurement data relative to their shared processor dispatching. The minimum size of this area is 48 bytes while the maximum is 4 KB. See the Dispatch Trace Log Buffer section for more details.
- SLB Shadow Buffer: This area is optionally registered by OSs that support the SLB-shadow function set. The structure may not span a 4096 byte boundary. This function set allows the hypervisor to significantly reduce the overhead associated with virtual processor dispatch in a shared processor LPAR environment, and to provide enhanced recovery from SLB hardware errors. See the SLB Shadow Buffer section for more details.

Software Note: Registering, deregistering or changing
the value of a variable in one of the Virtual Processor Areas for a
different virtual processor (i.e. changing a value in the VPA of
processor A from processor B) may be problematic. In no case is
partition integrity compromised, but results may be imprecise if such
a change is made during the virtual processor preempt/dispatch window. If
the owning processor is started, registration or deregistration should
only be done by the owning processor; if the processor is stopped,
registration or deregistration can safely be done by other processors.
Also, for example, changing the number of persistent SLB Shadow Buffer
entries causes uncertainty in the number of currently valid SLB entries in
that virtual processor. In some cases, such as turning on and off
dispatch tracing, such uncertainty may be acceptable.

Per Virtual Processor Area

Table: Per Virtual Processor Area
Byte Offset | Length in Bytes | Variable Description
0x00 | 4 | Descriptor: This field is supplied for OS identification use; it may be set to any value that may be useful (such as a pattern that may be identified in a dump) or it may be left uninitialized. Historic values include: 0xD397D781
0x04 | 2 (unsigned) | Size: The size of the registered structure (640)
0x6 - 0x17 | 18 | Reserved for Firmware Use
0x18 - 0x1B | 4 | Physical Processor FRU ID
0x1C - 0x1F | 4 | Physical Processor on FRU ID
0x20 - 0x57 | 56 | Reserved for Firmware Use
0x58 - 0x5F | 8 | Virtual processor home node associativity changes counters (changes in the 8 most important associativity levels)
0x60 - 0xAF | 80 | Reserved for Firmware Use
0xB0 | 1 | Cede Latency Specifier
0xB1 | 1 | Maintain EBB registers: =0 architected state of the event based branch facility may be discarded at any time; =1 architected state of the event based branch facility must be maintained; all other values are reserved
0xB2 | 6 | Reserved For LoPAR Expansion
0xB8 | 1 | Dispatch Trace Log Enable Mask: (Note this entry is valid only if a Dispatch Trace Log Buffer has been registered). A Trace Log Entry is created when the virtual processor is dispatched following its preemption for an enabled cause. =0 no dispatch trace logging; Bit 7 = 1 Trace voluntary (OS initiated) virtual processor waits; Bit 6 = 1 Trace time slice preempts; Bit 5 = 1 Trace virtual partition memory page faults. All other values are reserved
0xB9 | 1 | Bits 0-6 Reserved; Bit 7 = 0 -- Dedicated processor cycle donation disabled; Bit 7 = 1 -- Dedicated processor cycle donation enabled.
0xBA | 1 | Maintain FPRs: =0 architected state of floating point registers may be discarded at any time; =1 architected state of floating point registers must be maintained; all other values are reserved. Note: When set in conjunction with offset 0xFF the 128 bit VSX space is saved on processors supporting the VSX option (2.06 and beyond).
0xBB | 1 | Maintain PMCs: =0 architected state of performance monitor counters may be discarded at any time; =1 architected state of performance monitor counters must be maintained; all other values are reserved
0xBC - 0xD7 | 28 | Reserved For Firmware Use
0xD8 - 0xDF | 8 | Any non-zero value is taken by the firmware to be the OS estimate, in PURR units, of the cumulative number of cycles that it has consumed on this virtual processor, while idle, since it was initialized.
0xE0 - 0xFB | 28 | Reserved for Firmware Use
0xFC | 2 (unsigned) | Maintain #SLBs: This number of Segment Lookaside Buffer Registers (up to the platform implementation maximum) are maintained; all others (up to the platform implementation maximum) may be discarded at any time. The value 0xFFFF maintains all SLBs
0xFE | 1 | Idle: =0 The OS is busy on this processor; =1 The OS is idle on this processor; all other values are reserved
0xFF | 1 | Maintain VMX state: =0 architected state of the processor’s VMX facility may be discarded at any time; =1 architected state of the processor’s VMX facility must be maintained; all other values are reserved. Note: When set in conjunction with offset 0xBA the 128 bit VSX space is saved on processors supporting the VSX option (2.06 and beyond).
0x100 | 4 (unsigned) | Virtual Processor Dispatch Counter: (Even when virtual processor is dispatched, odd when it is preempted/ceded/conferred)
0x104 | 4 (unsigned) | Virtual Processor Dispatch Dispersion Accumulator: Incremented on each virtual processor dispatch if the physical processor differs from that of the last dispatch.
0x108 | 8 (unsigned) | Virtual Processor Virtual Partition Memory Fault Counter: Incremented on each Virtual Partition Memory page fault.
0x110 | 8 (unsigned) | Virtual Processor Virtual Partition Memory Fault Time Accumulator: Time, in Time Base units, that the virtual processor has been blocked waiting for the resolution of virtual Partition Memory page faults.
0x118 - 0x11F | 8 | Unsigned accumulation of PURR cycles expropriated by the hypervisor when VPA byte offset 0xFE = 1
0x120 - 0x127 | 8 | Unsigned accumulation of SPURR cycles expropriated by the hypervisor when VPA byte offset 0xFE = 1
0x128 - 0x12F | 8 | Unsigned accumulation of PURR cycles expropriated by the hypervisor when VPA byte offset 0xFE = 0
0x130 - 0x137 | 8 | Unsigned accumulation of SPURR cycles expropriated by the hypervisor when VPA byte offset 0xFE = 0
0x138 - 0x13F | 8 | Unsigned accumulation of PURR cycles donated to the processor pool when VPA byte offset 0xFE = 1
0x140 - 0x147 | 8 | Unsigned accumulation of SPURR cycles donated to the processor pool when VPA byte offset 0xFE = 1
0x148 - 0x14F | 8 | Unsigned accumulation of PURR cycles donated to the processor pool when VPA byte offset 0xFE = 0
0x150 - 0x157 | 8 | Unsigned accumulation of SPURR cycles donated to the processor pool when VPA byte offset 0xFE = 0
0x158 - 0x15F | 8 | Accumulated virtual processor wait interval 3 timebase cycles. (waiting for physical processor availability)
0x160 - 0x167 | 8 | Accumulated virtual processor wait interval 2 timebase cycles. (waiting for virtual processor capacity)
0x168 - 0x16F | 8 | Accumulated virtual processor wait interval 1 timebase cycles. (waiting for virtual processor ready to run)
0x170 - 0x177 | 8 | Reserved for Firmware Use
0x178 - 0x17F | 8 | Reserved for Firmware Use
0x180 - 0x183 | 4 | For the CMO option: The OS may report in this field as a hint to the hypervisor the accumulated number, since the virtual processor was started, of ‘page in’ operations initiated for pages that were previously swapped out.
0x184 - 0x187 | 4 | Reserved for Firmware Use
0x188 - 0x18F | 8 | Reserved for Firmware Use
0x190 - 0x197 | 8 | Reserved for Firmware Use
0x198 - 0x217 | 128 | Reserved for Firmware Use
0x218 - 0x21F | 8 | Dispatch Trace Log buffer index counter.
0x220 - 0x27F | 96 | Reserved for Firmware Use
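
As an illustration of the layout in the Per Virtual Processor Area table, the following minimal Python sketch decodes a few fields from a 640 byte image of a VPA. It is not part of the architecture: the names `FIELDS`, `read_vpa_fields` and `is_dispatched` are hypothetical, only a subset of fields is shown, and big-endian byte order is an assumption of this sketch.

```python
import struct

VPA_SIZE = 640  # minimum registered length of the Per Virtual Processor Area

# name -> (byte offset, struct format); offsets and lengths come from the table.
# Big-endian (">") is an assumption of this sketch.
FIELDS = {
    "size": (0x04, ">H"),
    "maintain_slbs": (0xFC, ">H"),
    "idle": (0xFE, ">B"),
    "dispatch_counter": (0x100, ">I"),
    "dispatch_dispersion": (0x104, ">I"),
    "wait3_physical": (0x158, ">Q"),  # waiting for physical processor
    "wait2_capacity": (0x160, ">Q"),  # waiting for virtual processor capacity
    "wait1_ready": (0x168, ">Q"),     # waiting to become ready to run
    "dtl_index": (0x218, ">Q"),       # Dispatch Trace Log buffer index counter
}


def read_vpa_fields(vpa: bytes) -> dict:
    """Decode the selected fields from a raw VPA image."""
    if len(vpa) < VPA_SIZE:
        raise ValueError("VPA image shorter than 640 bytes")
    return {
        name: struct.unpack_from(fmt, vpa, offset)[0]
        for name, (offset, fmt) in FIELDS.items()
    }


def is_dispatched(dispatch_counter: int) -> bool:
    """The dispatch counter is even while the virtual processor is
    dispatched and odd while it is preempted, ceded or conferred."""
    return dispatch_counter % 2 == 0
```

A lock waiter could, for example, sample the holder's dispatch counter and consider conferring only when `is_dispatched` returns false, passing the sampled count as the second H_CONFER parameter.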
R1--1. For the SPLPAR option: If the OS registers a Per
Virtual Processor Area, it must correspond to the format specified in
the Per Virtual Processor Area table.

Dispatch Trace Log Buffer

The optional virtual processor dispatch trace log buffer is a
circularly managed buffer containing individual 48 byte entries, with the
first entry starting at byte offset 0. Therefore, the 4 byte registration
size field is overwritten by the first Trace Log Buffer entry. (Note the
hypervisor rounds down the dispatch trace log buffer length to a multiple
of 48 bytes and wraps when reaching that boundary.) A VPA location
(byte offset 0x218) contains the index counter that
the hypervisor increments each time that it makes a dispatch trace log
entry such that it always indicates the next entry to be filled. The low
order bits (modulo the buffer length divided by 48) of the counter
provide the index of the next entry to be filled, therefore, the buffer
wraps each (buffer length divided by 48 entries), while the high order
counter bits indicate how many buffer wraps have occurred. Prior to
enabling dispatch trace logging, the OS should initialize the VPA index
counter to the value of 0. The format of dispatch trace log buffer
entries is given in the Dispatch Trace Log Buffer Entry table.

The architectural intent is that OS trace tools keep a shadow index
counter into the log buffer of the next entry to be filled by the
hypervisor. Prior to making an entry of their own, such tools compare
their index counter with that of the hypervisor from the VPA. If they
are equal, no preempts/dispatches have occurred since the last OS trace
hook. If the two index counters are not equal, then the OS trace tool
processes the intermediate time stamps into the OS’s trace log,
updating its dispatch trace log buffer index until all have been
processed, then the new trace entry is added to the OS’s trace log.
Note that, because of races, the processor may be preempted just prior to
the OS trace tool adding the new trace log entry. To handle this case, the
OS trace tool can examine the dispatch trace log buffer index immediately
after adding the new trace log entry and, if needed, adjust its own
trace log. In the extremely unlikely event that the two counters are off
by trace buffer length divided by 48 or more counts, the OS
trace tool can detect that a dispatch trace log buffer overflow has
occurred, and trace data has been lost.
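
The shadow index comparison described above can be sketched as follows. This is a hypothetical helper, not an architected interface: `pending_entries` and its parameters are illustrative names, and the sketch ignores wrap of the 8 byte counter itself.

```python
DTL_ENTRY_SIZE = 48  # bytes per dispatch trace log entry


def pending_entries(hv_index: int, shadow_index: int, buffer_len: int):
    """Return (slots, overflowed).

    hv_index     -- index counter read from the VPA (next entry to be filled)
    shadow_index -- the OS trace tool's copy of the next entry it expects
    buffer_len   -- registered length of the dispatch trace log buffer
    slots        -- buffer slot numbers to process, oldest first
    overflowed   -- True when the counters differ by the number of entries
                    in the buffer or more, the overflow condition in the text
    """
    n_entries = buffer_len // DTL_ENTRY_SIZE  # length is rounded down to 48s
    if n_entries == 0:
        raise ValueError("buffer shorter than one 48 byte entry")
    behind = hv_index - shadow_index
    if behind < 0:
        raise ValueError("shadow index is ahead of the hypervisor index")
    overflowed = behind >= n_entries
    # After an overflow only the most recent n_entries slots are still present.
    first = hv_index - n_entries if overflowed else shadow_index
    slots = [i % n_entries for i in range(first, hv_index)]
    return slots, overflowed
```

After processing the returned slots, the trace tool would set its shadow index to the `hv_index` value it sampled, then re-read the VPA counter after adding its own entry to catch a preemption that raced with the update.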
Table: Dispatch Trace Log Buffer Entry
Byte Offset | Length in Bytes | Variable Description
0x0 | 1 | Reason Code for the virtual processor dispatch:
  0: The virtual processor was dispatched at the external interrupt vector location to handle an IOA interrupt, Virtual interrupt, or interprocessor interrupt.
  1: The virtual processor was dispatched to handle firmware internal events.
  2: The virtual processor was dispatched at the next sequential instruction due to an H_PROD call by another partition processor.
  3: The virtual processor was dispatched at the DECR interrupt vector due to a decrementer interrupt.
  4: The processor was dispatched at location specified in load module (boot) or at the system reset interrupt vector (virtual yellow button).
  5: The virtual processor was dispatched to handle firmware internal events.
  6: The virtual processor was dispatched at the next sequential instruction to use cycles conferred from another partition processor.
  7: The virtual processor was dispatched at the next sequential instruction for its entitled time slice.
  8: The virtual processor was dispatched at the faulting instruction following a virtual partition memory page fault.
  10: The virtual processor was dispatched at the privileged doorbell interrupt vector location to handle a privileged doorbell interrupt.
0x1 | 1 | Reason Code for virtual processor preemption:
  0: Not used (for compatibility with earlier versions of the facility)
  1: Firmware internal event
  2: Virtual processor called H_CEDE
  3: Virtual processor called H_CONFER
  4: Virtual processor reached the end of its timeslice (HDEC)
  5: Partition Migration/Hibernation page fault
  6: Virtual memory page fault
0x2 - 0x3 | 2 | Processor index of the physical processor actualizing the thread on this dispatch.
0x4 - 0x7 | 4 | Time Base Delta between enqueued to dispatcher and actual dispatch on a physical processor
0x8 - 0xB | 4 | Time Base Delta between ready to run and enqueue to dispatcher
0xC - 0xF | 4 | Time Base Delta between waiting and ready to run (preempt/fault resolution time)
0x10 - 0x17 | 8 | Time Base Value at the time of dispatch/wait
0x18 - 0x1F | 8 | For virtual processor preemption reason codes 5 & 6: Logical real address of faulting page; else reserved.
0x20 - 0x27 | 8 | SRR0: At the time of preempt/wait
0x28 - 0x2F | 8 | SRR1: At the time of preempt/wait
R1--1. For the SPLPAR option: If the OS registers a Dispatch
Trace Log Buffer, it must correspond to the format specified in
the Dispatch Trace Log Buffer Entry table.

SLB Shadow Buffer

On platforms supporting the SLB-Buffer function set, the OS may
optionally register an SLB shadow buffer area. When the OS takes this
option, it allows the hypervisor to optimize the saving of SLB entries,
thus reducing overhead and providing more processor capacity for the OS,
and also allows the platform to recover from certain SLB hardware faults.
When the OS registers an SLB shadow buffer for its virtual processor, the
processor’s SLB is architecturally divided into three categories
relative to their durability, as depicted in the following figure.

[Figure: OS may dynamically change M and N (for (N+1)*16 <= Length of SLB Shadow Buffer)]

Each category of SLB entries consists of 0-n contiguous SLBs.

- Persistent Entries: The first N (starting at SLB index 0, N specified by the numeric content of the first 4 bytes of the registered SLB Shadow Buffer) SLBs are maintained persistent across all virtual processor dispatches unless an unrecovered SLB error is noted. The OS maintains a shadow copy of those SLB entries in the registered SLB shadow buffer. The OS sizes its SLB Shadow buffer for the largest number of persistent entries it can ever maintain. If the OS registers an SLB Shadow buffer, the hypervisor does not save the contents of the Persistent entries on virtual processor preempt, cede, or confer. The OS should minimally record as persistent the entries it needs to handle its SLB fault interrupts to fill in required Volatile (and potentially Transient) entries.
- Volatile Entries: The next M-N SLBs (beginning at the next higher SLB index after the last Persistent entry up through the entry specified by the “Maintain #SLBs” parameter of the VPA) may disappear. The OS needs to be prepared to recover these entries via SLB fault interrupts. For performance optimization, the hypervisor normally maintains the state of these entries across H_DECR interrupts and most hcall()s; they may be lost on H_CEDE calls.
- Transient Entries: The platform makes no attempt to maintain the state of these entries and they may be lost at any time.

The OS may dynamically change the number of Persistent entries by
atomically changing the value of the 4 byte parameter at SLB Shadow
Buffer offset 0.

The hypervisor does not explicitly check the value of this
parameter; however, the hypervisor limits the number of SLBs that it
attempts to load from the shadow buffer to the lesser of the maximum
number of SLB entries implemented by the platform, or the maximum number
of entries containable in the SLB Shadow buffer length when it was
registered.

R1--1. For the SPLPAR option: If the OS
registers an SLB Shadow Buffer, it must correspond to the format
specified in
.

Shared Processor LPAR OF Extensions

Shared Processor LPAR Function Sets in “ibm,hypertas-functions”

- hcall-splpar
- hcall-pic
- SLB-Buffer

Device Tree Variances

If an SPLPAR implementation does not maintain a fixed relationship
between the virtual processor that it reports to the OS image in the OF
device tree properties and the physical processor that it uses to
actualize the virtual processor, then OF entities that imply a fixed
physical relationship are not reported. These may include those listed in
the OF Variances due to SPLPAR table.

Table: OF Variances due to SPLPAR
Entity | Variance to standard definition
“ibm,loc-code” property | If the physical relationship between virtual processors and physical processors is not constant, this property is omitted from the virtual processor’s node. If missing, the OS should not run diagnostics on the virtual processor.
“l2-cache” property | If the physical relationship between virtual processors and physical processors is not constant, the secondary cache characteristics are not relevant and this property is omitted from the virtual processor’s node.
Nodes named l2-cache | If the physical relationship between virtual processors and physical processors is not constant, the secondary cache characteristics are not relevant and this node is omitted from the partition’s device tree.
“ibm,associativity” property | If the physical relationship between virtual processors and physical processors is not constant, the “ibm,associativity” property reflects the same domain for all virtual processors actualized by a given physical processor pool. Note, even though the associativity of virtual processors may be indistinguishable, the associativity among other platform resources may be relevant.
R1--1. For the SPLPAR option: If the platform does not
maintain a fixed relationship between its virtual processors and the
physical processors that actualize them, then the platform must vary the
device tree elements as outlined in
the OF Variances due to SPLPAR table.

Shared Processor LPAR Hypervisor Extensions

Virtual Processor Preempt/Dispatch

A new virtual processor is dispatched on a physical processor when
one of the following conditions occurs:

- The physical processor is idle and a virtual processor was made ready to run (interrupt or prod).
- The old virtual processor exhausted its time slice (HDECR interrupt).
- The old virtual processor ceded/conferred its cycles.

Note, for the H_REGISTER_VPA hcall(): If the subfunction is a Register VPA
or a Deregister VPA or SLB shadow buffer, verify that the proc-no parameter
references an offline virtual proc or that the proc-no parameter matches the
current virtual processor making the hcall, else return H_STATE.

When one of the above conditions occurs, the hypervisor, by
default, records all the virtual processor architected state including
the Time Base and Decrementer values and sets the hypervisor timer
services to wake the virtual processor per the setting of the
decrementer. The virtual processor’s Processor Utilization Register
value for this dispatch is computed. The VPA’s dispatch count is
incremented (such that the result is odd). Then the hypervisor selects a
new virtual processor to dispatch on the physical processor using an
implementation dependent algorithm having the following characteristics
given in priority order:

1. The virtual processor is “ready to run” (has not ceded/conferred its cycles or exhausted its time slice).
2. Ready to run virtual processors are dispatched prior to waiting in excess of their maximum specified latency.
3. Of the non-latency critical virtual processors ready to run, select the virtual processor that is most likely to have its working set in the physical processor’s cache or for other reasons runs most efficiently on the physical processor.
4. If no virtual processor is “ready to run” at this time, start accumulating the Pool Idle Count (PIC) of the total number of idle processor cycles in the physical processor pool.

Optionally, flags in the VPA may be set by the OS to indicate to
the hypervisor that selected architected state of the virtual processor
need not be maintained (that is, the contents of these architected
facilities may be lost at any time without notice). The hypervisor may
then optimize its preempt/dispatch routines accordingly. Refer to
the Per Virtual Processor Area table and SLB Shadow Buffer
description for the definition of these flags and values. The hypervisor
modifies any such OS settable and readable processor state that is not
explicitly saved and restored on a virtual processor dispatch so as to
prevent a covert channel between partitions.

When the virtual processor is dispatched, the virtual
processor’s “prod” bit is reset, the saved architected
state of the virtual processor is restored from that saved when the
virtual processor was preempted, ceded, or conferred, except for the time
base which retains the current value of the physical processor and the
decrementer which is reduced from the state saved value per current Time
Base value minus saved Time Base value. The hypervisor sets up for
computing the PUR value increment for the dispatch.

At this time, the hypervisor increments the virtual
processor’s VPA dispatch count (such that the value is even). The
hypervisor checks the VPA’s dispatch log flag, if set, the
hypervisor creates a pair of log entries in the dispatch log and stores
the circular buffer index in the first buffer entry.

If the virtual processor was signaled with an interrupt condition
and the physical interrupt has been reset, then the hypervisor adjusts
the virtual processor architected state to reflect that of a physical
processor taking the same interrupt prior to executing the next
sequential instruction and execution starts with the first instruction in
the appropriate interrupt vector. If no interrupt has been signaled to
the virtual processor or the physical interrupt is still active, then
execution starts at the next sequential instruction following the
instruction as noted by the hypervisor when the virtual processor ceded,
conferred, or was preempted.The Platform allocates processor capacity to a partition’s
virtual processors using the architectural metaphor of a “dispatch
wheel” with a fixed implementation dependent rotation period. Each
virtual processor receives a time slice each rotation of the dispatch
wheel. The length of the time slice is determined by a number of
parameters; the OS image has direct control, within constraints, over
three of these parameters (number of virtual processors, Entitled
Processor Capacity Percentage, Variable Processor Capacity Weight). The
constraints are determined by partition and partition aggregate
configurations that are outside the scope of this architecture. For
reference, partition definitions provide the initial settings of these
parameters while the aggregation configurations provide the constraints
(including the degenerate case where an aggregation encapsulates only a
single member LPAR).
Entitled Processor Capacity Percentage: The percentage of a
physical processor that the hypervisor guarantees to be available to the
partition’s virtual processors (distributed in a uniform manner
among the partition’s virtual processors -- thus the number of
virtual processors affects the time slice size) each dispatch cycle.
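As a rough illustration of how the entitlement and the number of virtual processors interact, the following sketch computes the guaranteed time slice of each virtual processor for one rotation of the dispatch wheel. The function name, the 10 ms rotation period, and the treatment of the entitlement as percent of one physical processor are assumptions made for the example only; the rotation period is implementation dependent.

```python
def entitled_time_slice_us(entitled_pct, num_virtual_procs,
                           rotation_period_us=10_000):
    """Guaranteed time slice, in microseconds, per virtual processor.

    entitled_pct       -- Entitled Processor Capacity Percentage of the
                          partition (100 == one whole physical processor)
    num_virtual_procs  -- number of virtual processors in the partition
    rotation_period_us -- hypothetical dispatch wheel rotation period
    """
    if num_virtual_procs <= 0:
        raise ValueError("a partition needs at least one virtual processor")
    if entitled_pct < 0:
        raise ValueError("entitlement cannot be negative")
    # Capacity guaranteed to the whole partition in one rotation.
    partition_us = rotation_period_us * entitled_pct / 100.0
    # The entitlement is distributed uniformly among the virtual processors.
    return partition_us / num_virtual_procs
```

With these assumed values, a partition entitled to 50% of a processor and running two virtual processors would be guaranteed 2500 microseconds per virtual processor per rotation; doubling the number of virtual processors halves each slice.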
Capacity ceded or conferred from one partition virtual processor extends
the time slices offered to other partition processors. Capacity ceded or
conferred after all of the partition’s virtual processors have been
dispatched is added to the variable capacity kitty. The initial, minimum
and maximum constraint values of this parameter are determined by the
partition configuration definition. The H_SET_PPP hcall() allows the OS
image to set this parameter within the constraints imposed by the
partition configuration definition minimum and maximum plus constraints
imposed by partition aggregation.
Variable Processor Capacity Weight: The unitless factor that the
hypervisor uses to assign processor capacity in addition to the Entitled
Processor Capacity Percentage. This factor may take the values 0 to 255.
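The weight acts as a proportional share of the variable capacity kitty, as the remainder of this definition states. A minimal sketch of that proportion follows; the function name, the dictionary of partition names, and the unit of the kitty are hypothetical, and the sketch ignores capping and the H_CONFER/H_CEDE semantics.

```python
def variable_capacity_shares(weights, kitty):
    """Split the variable capacity kitty among competing partitions.

    weights -- mapping of partition name to Variable Processor Capacity
               Weight (0 to 255)
    kitty   -- amount of unused capacity to distribute (any unit)
    """
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name} must be in 0..255")
    total = sum(weights.values())
    if total == 0:
        # No competing partition may take variable capacity.
        return {name: 0.0 for name in weights}
    # share = (weight of partition) / (sum of weights of all competitors)
    return {name: kitty * weight / total for name, weight in weights.items()}
```

A partition with weight 0 receives nothing from the kitty in this sketch, which matches the note that a weight of 0 effectively caps the partition at its entitlement.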
A virtual processor’s time slice may be extended to allow it to use
capacity unused by other partitions, or not needed to meet the Entitled
Processor Capacity Percentage of the active partitions. A partition is
offered a portion of this variable capacity kitty equal to: (Variable
Processor Capacity Weight for the partition) / (summation of Variable
Processor Capacity Weights for all competing partitions). The initial
value of this parameter is determined by the partition configuration
definition. The H_SET_PPP hcall() allows the OS image to set this
parameter within the constraints imposed by the partition configuration
definition maximum. Certain partition definitions may not allow any
variable processor capacity allocation.
Unallocated Processor Capacity Percentage: The amount of processor
capacity that is currently available within the constraints of the LPAR's
current environment for allocation to Entitled Processor Capacity
Percentage. Race conditions may change the current environment before a
request for this capacity can be performed, resulting in a constrained
return from such a request.
Unallocated Variable Processor Capacity Weight: The amount of
variable processor capacity weight that is currently available within the
constraints of the LPAR's current environment for allocation to the
partition's variable processor capacity weight. Race conditions may
change the current environment before a request for this capacity can be
performed, resulting in a constrained return from such a request.
System Parameters readable via the
ibm,get-system-parameter RTAS call (see )
communicate a variety of configuration and
constraint parameters, among which are those determined by the partition
definition.
By means that are beyond the scope of this architecture, various
partitions may be organized into aggregations, for example “LPAR
groups”, for the purposes of load balancing. These aggregations may
impose constraints such as: “The summation of the minimum available
capacity for all virtual processors supported by the LPAR group cannot
exceed 100% of the group’s configured capacity”.
R1--1. For the SPLPAR option:
The platform must dispatch each of the partition's virtual processors each
dispatch cycle unless prevented by the semantics of the H_CONFER hcall().
R1--2. For the SPLPAR option:
The summation of the processing capacity that the platform dispatches to the virtual
processors of each partition must be at least equal to that partition's
Entitled Processor Capacity Percentage unless prevented by the semantics
of the H_CONFER and H_CEDE hcall()s.
R1--3. For the SPLPAR option:
The processing capacity that the platform dispatches to each of the partition's virtual
processors must be substantially equal unless prevented by the semantics
of the H_CONFER and H_CEDE hcall()s.
R1--4. For the SPLPAR option: The platform must distribute
processor capacity allocated to SPLPAR virtual processor actualization
not consumed due to Requirements
,
, and
to partitions in strict
accordance with the definition of Variable Processor Capacity Weight
unless prevented by the LPAR's definition (capped) or the semantics of
the H_CONFER and H_CEDE hcall()s.
Note: A value of 0 for a Variable Processor Capacity Weight
effectively caps the partition at its Entitled Processor Capacity
Percentage value.
R1--5. For the SPLPAR option on platforms: The platform must
increment the counters in VPA offsets 0x158-0x16F per their definitions
in
.
R1--6. For the SPLPAR option on platforms: To maintain
compatibility across partition migration and firmware version levels the
OS must be prepared for platform implementations that do not increment
VPA offsets 0x158-0x16F.
H_REGISTER_VPA
Register Virtual Processor Areas (these include the parameter area
known as the VPA, the Dispatch Trace Log Buffer, and if the SLB-Buffer
function set is supported, the SLB Shadow Buffer). Note that if the caller
makes multiple registration requests for a given per virtual processor
area for a given virtual processor, the last registration wins, and if
the same memory area is registered for multiple processors, the area
contents are unpredictable; however, LPAR isolation is not
compromised.
The syntax of the H_REGISTER_VPA hcall() is given below.
Syntax:
Semantics:
Verify that the flags parameter is a supported value else return
H_Parameter. (Verify that the subfunction field (Bits 16-23) is one of
the values supported by this call. Optionally verify that all other bits
are zero. Callers should not set any bits other than those specifically
defined; however, implementations are not required to check the value of
bits outside of the subfunction field.)
Verify that the proc-no parameter references a virtual processor
owned by the calling virtual processor’s partition else return
H_Parameter.
If the subfunction is a register, verify that the addr parameter
is an L1 cache line aligned logical address within the memory owned by
the calling virtual processor’s partition else return
H_Parameter.
If the Shared Logical Resource option is implemented and the addr
parameter represents a shared logical resource location that has been
rescinded by the owner, return H_RESCINDED.
Case on subfunction in flags parameter:
  Register VPA:
    Verify that the size field (2 bytes) at offset 0x4 is at least
    640 bytes else return H_Parameter.
    Verify that the entire structure (per the size field and vpa)
    does not span a 4096 byte boundary else return H_Parameter.
    Record the specified processor’s vpa logical address for
    access by other SPLPAR hypervisor functions.
    Initialize the contents of the area per
    .
    Return H_Success.
  Register Dispatch Trace Log Buffer:
    Verify that the size field (4 bytes) at offset 0x4 is at least 48
    bytes else return H_Parameter.
    For the CMO option, verify that the entire structure (per the
    size field and vpa parameter) does not span a memory entitlement granule
    boundary else return H_MLENGTH_PARM.
    Verify that a VPA has been registered for the specified virtual
    processor else return H_RESOURCE.
    Initialize the specified processor’s preempt/dispatch trace
    log buffer pointers and index.
    Return H_Success.
  Register SLB Shadow Buffer (if SLB-Buffer function set is supported):
    Verify that the size field (4 bytes) at offset 0x4 is at least 8
    bytes and that the entire structure (per the size and vpa parameters)
    does not span a 4096 byte boundary else return H_Parameter.
    Verify that a VPA has been registered for the specified virtual
    processor else return H_RESOURCE.
    Initialize the specified processor’s SLB Shadow buffer
    pointers and set the maximum persistent SLB restore index to the lesser
    of the maximum number of processor SLBs or the maximum number of entries
    in the registered SLB Shadow buffer.
    Return H_Success.
  Deregister VPA:
    Verify that a Dispatch Trace Log buffer is not registered for the
    specified processor else return H_RESOURCE.
    Verify that an SLB Shadow buffer is not registered for the
    specified processor else return H_RESOURCE.
    Clear any partition memory pointer to the specified
    processor’s VPA (note no check is made that a valid VPA
    registration exists).
    Return H_Success.
  Deregister Dispatch Trace Log Buffer:
    Clear any partition memory and/or hypervisor pointer to the
    specified processor’s Dispatch Trace Buffer (note no check is made
    that a valid Dispatch Trace Buffer registration exists).
    Return H_Success.
  Deregister SLB Shadow Buffer (if SLB-Buffer function set is supported):
    Clear any hypervisor pointer(s) to the specified
    processor’s SLB Shadow buffer (note no check is made that a valid
    SLB Shadow buffer registration exists).
    Zero the hypervisor’s maximum persistent SLB restore index
    for the specified processor.
    Return H_Success.
  Else return H_Function.
R1--1. For the SPLPAR option: The platform must implement
the H_REGISTER_VPA hcall() following the syntax and semantics of
.
R1--2. For the SPLPAR plus SLB Shadow Buffer options: The
platform must register and deregister the optional SLB Shadow buffer per
the syntax and semantics of
.
R1--3. For the SPLPAR plus SLB Shadow Buffer options: The
platform must make persistent the SLB entries recorded by the OS within
the SLB Shadow buffer as described in
.
H_CEDE
The architectural intent of this hcall() is to have the virtual
processor, which has no useful work to do, enter a wait state ceding its
processor capacity to other virtual processors until some useful work
appears, signaled either through an interrupt or a prod hcall(). To help
the caller reduce race conditions, this call may be made with interrupts
disabled but the semantics of the hcall() enable the virtual
processor’s interrupts so that it may always receive wake up
interrupt signals. As a hint to the hypervisor, the cede latency
specifier indicates how long the OS can
tolerate the latency to an H_PROD hcall() or interrupt; this may affect
how the hypervisor chooses to use or even power down the actualizing
physical processor in the meantime.
Software Note: The floating point registers may not
be preserved by this call if the “Maintain FPRs” field of the
VPA =0, see
.
Syntax:
Semantics:
Enable the virtual processor’s MSR[EE]
bit as if it was on at the time of the call.
Serialize for the virtual processor’s control structure
with H_PROD.
If the virtual processor’s “prod” bit is set, then:
  Reset the virtual processor’s “prod” bit.
  Release the virtual processor’s control structure.
  Return H_Success.
Record all the virtual processor architected state including the
Time Base and Decrementer values.
Set hypervisor timer services to wake the virtual processor per
the setting of the decrementer.
Mark the virtual processor as non-dispatchable until the
processor is the target of an interrupt (system reset, external including
decrementer or IPI) or PROD.
Cede the time remaining in the virtual processor’s time
slice preferentially to the virtual processor’s partition.
Release the virtual processor’s control structure.
Dispatch some other virtual processor.
Return H_Success.
R1--1. For the SPLPAR option: The platform must implement
the H_CEDE hcall() following the syntax and semantics of
.
H_CONFER
The architectural intent of this hcall() is to confer the caller's
processor capacity to the holder of a lock or the initiator of an event
that the caller is waiting upon. If the caller knows the identity of the
lock holder then the holder’s virtual processor number is supplied
as a parameter; if the caller does not know the identity of the lock
holder then the “all processors” value of the proc parameter
is specified. If the caller is conferring to the initiator of an event,
the proc parameter value is that of the calling processor. This call may be made
with interrupts enabled or disabled. This call provides a reduced
“kill set” of volatile registers; GPRs r0 and r4-r13 are
preserved.
Software Note: The floating point registers may not
be preserved by this call if the “Maintain FPRs” field of the
VPA =0, see
.
Syntax:
Semantics:
Validate the proc number else return H_Parameter. Valid Values:
  -1 (all partition processors)
  0 through N: one of the processor numbers of the calling processor's
  partition
  The calling processor's number forces a confer until the calling
  processor is PRODed
If the proc number is for a single processor and the single
processor is not the calling processor, then:
  If the dispatch parameter is not equal to the specified
  processor’s hypervisor copy of the dispatch number or the
  hypervisor copy of the dispatch number is even, then return
  H_Success.
  If the target processor has conferred its cycles to all others,
  then return H_Success.
  Firmware Implementation Note: If one were to confer
  to a processor that had conferred to all, then a deadlock could occur;
  however, there are valid cases with nested locks where this could happen,
  therefore, the hypervisor call silently ignores the confer.
Record all the virtual processor architected state including the
Time Base and Decrementer values.
If the MSR[EE] bit is on,
set hypervisor timer services to wake
the virtual processor per the setting of the decrementer.
Mark the virtual processor as non-dispatchable until one of the
following:
  System reset interrupt.
  The MSR[EE] bit is
  on and the virtual processor is the target
  of an external interrupt (including decrementer or IPI).
  The virtual processor is the target of a PROD operation.
  The specified target processor (or all partition processors if
  the proc parameter value is a minus 1) has had the opportunity of a
  dispatch cycle.
Confer the time remaining in the virtual processor’s time
slice to the virtual processor’s partition.
Dispatch the/a partition target virtual processor.
Return H_Success.
R1--1. For the SPLPAR option: The platform must implement
the H_CONFER hcall() following the syntax and semantics of
.
R1--2. For the SPLPAR option: The platform must implement
the H_CONFER hcall() such that the only GPR that is modified by the call
is r3.
H_PROD
Awakens the specific processor. This call provides a reduced
“kill set” of volatile registers; GPRs r0 and r4-r13 are
preserved.
Syntax:
Semantics:
Verify that the target virtual processor specified by the proc
parameter is owned by the calling virtual processor’s
partition.
Serialize for the Target Virtual Processor’s control
structure with H_CEDE.
Set “prod” bit in the target virtual
processor’s control structure.
If the target virtual processor is not ready to run, mark the
target virtual processor ready to run.
Release the target virtual processor’s control
structure.
Return H_Success.
R1--1. For the SPLPAR option: The platform must implement
the H_PROD hcall() following the syntax and semantics of
.
R1--2. For the SPLPAR option: The platform must implement
the H_PROD hcall() such that the only GPR that is modified by the call is
r3.
H_GET_PPP
This hcall() returns the partition’s performance parameters.
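The register packing that this section specifies for R6, R7, and R8 can be illustrated with a small decoder. This is only a sketch: the function and key names are hypothetical, and it assumes that byte 0 is the most significant byte of a 64-bit register and that bit 0 is the most significant bit of a byte, so that "Bit 7" of the capping byte is its least significant bit.

```python
def _field(reg, first_byte, last_byte):
    """Extract bytes first_byte..last_byte of a 64-bit register value.

    Assumes big-endian byte numbering: byte 0 is the most significant.
    """
    width = 8 * (last_byte - first_byte + 1)
    shift = 8 * (7 - last_byte)
    return (reg >> shift) & ((1 << width) - 1)


def unpack_ppp(r4, r5, r6, r7, r8=None):
    """Decode the H_GET_PPP output registers into a dictionary.

    r8 is only meaningful when the partition performance parameters
    level is >= 1; pass None otherwise.
    """
    ppp = {
        "entitled_capacity_pct": r4,
        "unallocated_capacity_pct": r5,
        "group_number": _field(r6, 4, 5),
        "pool_number": _field(r6, 6, 7),      # 0xFFFF: not applicable
        # Assumption: "Bit 7" is the least significant bit of byte 1.
        "capped": bool(_field(r7, 1, 1) & 0x01),
        "capacity_weight": _field(r7, 2, 2),
        "unallocated_weight": _field(r7, 3, 3),
        "pool_active_procs": _field(r7, 4, 5),
        "platform_active_procs": _field(r7, 6, 7),
    }
    if r8 is not None:
        ppp["virtualization_phys_procs"] = _field(r8, 0, 1)
        ppp["pool_max_capacity_pct"] = _field(r8, 2, 4)
        ppp["pool_entitled_capacity_pct"] = _field(r8, 5, 7)
    return ppp
```

The register values themselves would come from the hcall() return; the decoder only shows how the byte ranges map onto fields.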
The parameters are packed into registers:
Register R4 contains the Entitled Processor Capacity Percentage
for the partition. In the case of a dedicated processor partition this
value is 100 * the number of processors owned by the partition.
Register R5 contains the Unallocated Processor Capacity
Percentage for the calling partition’s aggregation.
Register R6 contains the aggregation numbers of up to 4 levels of
aggregations of which the partition may be a member.
  Bytes 0-1: Reserved for future aggregation definition, and set to
  zero -- in the future this field may be given meaning.
  Bytes 2-3: Reserved for future aggregation definition, and set to
  zero -- in the future this field may be given meaning.
  Bytes 4-5: 16 bit binary representation of the “Group
  Number”.
  Bytes 6-7: 16 bit binary representation of the “Pool
  Number”. In the case of a dedicated processor partition the
  “Pool Number” is not applicable which is represented by the
  code 0xFFFF.
Register R7 contains the platform resource capacities:
  Byte 0: Reserved for future platform resource capacity
  definition, set to zero -- in the future this field may be given
  meaning.
  Byte 1: A bit field representing the capping mode of the
  partition’s virtual processor(s):
    Bits 0-6 are reserved, and set to zero -- in the future these
    bits may be given meaning as new capping modes are defined.
    Bit 7 -- The partition’s virtual processor(s) are capped at
    their Entitled Processor Capacity Percentage. In the case of dedicated
    processors this bit is set.
  Byte 2: Variable Processor Capacity Weight. In the case of a
  dedicated processor partition this value is 0x00.
  Byte 3: Unallocated Variable Processor Capacity Weight for the
  calling partition’s aggregation.
  Bytes 4-5: 16 bit unsigned binary representation of the number of
  processors active in the caller’s Processor Pool. In the case of a
  dedicated processor partition this value is 0x00.
  Bytes 6-7: 16 bit binary representation of the number of
  processors active on the platform.
When the value of the
“ibm,partition-performance-parameters-level” (see
) is >= 1, register R8 contains
the processor virtualization resource allocations. In the case of a
dedicated processor partition R8 contains 0:
  Bytes 0-1: 16 bit unsigned binary representation of the number of
  physical platform processors allocated to processor
  virtualization.
  Bytes 2-4: 24 bit unsigned binary representation of the maximum
  processor capacity percentage that is available to the partition's
  pool.
  Bytes 5-7: 24 bit unsigned binary representation of the entitled
  processor capacity percentage available to the partition's pool.
Syntax:
Semantics:
Place the partition’s performance parameters for the
calling virtual processor’s partition into the respective
registers:
  R4: The calling partition’s Entitled Processor Capacity
  Percentage.
  R5: The calling partition’s aggregation’s Unallocated
  Processor Capacity Percentage.
  R6: The aggregation numbers.
  R7: The platform resource capacities.
  R8: When
  “ibm,partition-performance-parameters-level” is
  >= 1 in the device tree, R8 is loaded with the processor
  virtualization resource allocations.
Return H_Success.
R1--1. For the SPLPAR option: The platform must implement
the H_GET_PPP hcall() following the syntax and semantics of
.
H_SET_PPP
This hcall() allows the partition to modify its entitled processor
capacity percentage and variable processor capacity weight within limits.
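The limiting behavior specified for this hcall() can be sketched as a clamp followed by a choice of return code. The sketch below is illustrative only: the `Partition` class, its limit fields, and the use of strings for return codes are assumptions, and a real hypervisor would also apply aggregation constraints and update its state atomically.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical per-partition state and configuration limits."""
    entitlement: int        # current Entitled Processor Capacity Percentage
    weight: int             # current Variable Processor Capacity Weight
    min_entitlement: int
    max_entitlement: int
    max_weight: int


def h_set_ppp(part, entitlement, weight):
    """Model of the H_SET_PPP semantics; returns a symbolic return code."""
    # The weight must be a legal value before any constraint is applied.
    if not 0 <= weight <= 255:
        return "H_Parameter"
    # Limit each request to the partition's configured constraints.
    new_entitlement = min(max(entitlement, part.min_entitlement),
                          part.max_entitlement)
    new_weight = min(weight, part.max_weight)
    part.entitlement, part.weight = new_entitlement, new_weight
    if (new_entitlement, new_weight) == (entitlement, weight):
        return "H_Success"
    return "H_Constrained"
```

After an H_Constrained return, a caller would read the values actually in effect with H_GET_PPP rather than assume its request was honored.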
If one or both request parameters exceed the constraints of the calling
LPAR’s environment, the hypervisor limits the set value to the
constrained value and returns H_Constrained. The H_GET_PPP call may be
used to determine the actual current operational values. By the
hypervisor constraining the actual values, the calling partition does not
need special authority to make the H_SET_PPP hcall().
See
for definitions of these
values.
Syntax:
Semantics:
Verify that the variable processor capacity weight is between 0
and 255 else return H_Parameter.
Verify that the capacities specified are within the constraints of
the partition:
  If yes, atomically set the partition’s entitled and
  variable capacity per the request and return H_Success.
  If not, set the partition’s entitled and variable capacity
  as constrained by the partition’s configuration and return
  H_Constrained.
Firmware Implementation Note: If the dispatch algorithm requires
that the summation of variable capacities be updated, it is atomically
updated with the set of the partition’s weight.
R1--1. For the SPLPAR option: The platform must implement
and make available to selected partitions, the H_SET_PPP hcall()
following the syntax and semantics of
.
H_PURR
The Processor Utilization of Resources Register (PURR) is
compatibly read through the H_PURR hcall(). In those implementations
running on processors that do not implement the register in hardware,
firmware simulates the function. On platforms that present the property
“ibm,rks-hcalls” with bit 2 set (see
), this call provides a reduced
“kill set” of volatile registers; GPRs r0 and r5-r13 are
preserved.
Syntax:
Semantics:
If the platform presents the
“ibm,rks-hcalls” property with bit 2 set,
then honor a kill set of volatile registers r3 & r4.
Compute the PURR value for the calling virtual processor up to
the current point in time and place in R4.
Return H_Success.
R1--1. For the SPLPAR option: The platform must implement
the H_PURR hcall() following the syntax and semantics of
.
H_POLL_PENDING
Certain implementations of the hypervisor steal processor cycles to
perform administrative functions in the background. The purpose of the
H_POLL_PENDING hcall() is to provide an OS, running atop such an
implementation, with a hint of pending work so that it may more
intelligently manage use of platform resources. The use of this call by
an OS is totally optional since such an implementation also uses hardware
mechanisms to ensure that the required cycles can be transparently
stolen. It is assumed that the caller of H_POLL_PENDING is idle; if all
threads of the processor are idle (as indicated by the idle flag at byte
offset 0xFE of
), the hypervisor may choose to
perform a background administrative task. The hypervisor returns
H_PENDING if there is pending administrative work, at the time of the
call, that it could dispatch to the calling processor if the calling
processor were ceded; if there is no such pending work, the return code
is H_Success. Due to race conditions, this pending work may have grown or
disappeared by the time the calling OS makes a subsequent H_CEDE
call.
There is NO architectural guarantee that ceding a processor exempts
a virtual processor from preemption for a given period of time. That may
indeed be the characteristic of a given implementation, but cannot be
expected from all future implementations.
Syntax:
Semantics:
Return H_PENDING if there is work pending that could be
dispatched to the calling processor if it were ceded, else return
H_Success.
R1--1. For the SPLPAR option: The platform must implement
the H_POLL_PENDING hcall() following the syntax and semantics of
.
Pool Idle Count Function Set
The hcall-pic function set may be configured via the partition
definition in none or any number of partitions as the weights
administrative policy dictates.
H_PIC
Syntax:
Semantics:
Verify that calling partition has the authority to make the call
else return H_Authority.
Compute the PIC value for the processor pool implementing the
calling virtual processor up to the current point in time and place into
R4.
Place the number of processors in the caller’s processor
pool in R5.
When the value of the
“ibm,partition-performance-parameters-level” (see
) is >= 1 then:
  Place the summation of time base ticks for all platform
  processors, allocated to the caller's processor pool, into register
  R6.
  Place the summation of all PURR ticks accumulated by all
  dispatched (not idle) platform processor threads, allocated to the
  caller's processor pool, into register R7.
  Place the summation of all SPURR ticks accumulated by all dispatched
  (not idle) platform processor threads, allocated to the caller's
  processor pool, into register R8. (Machines that do not have a SPURR
  mechanism are assumed to run at a constant speed, in which case the
  PURR value is substituted.)
  Place the caller's processor pool ID into low order two bytes of
  register R9 (high order 6 bytes are reserved - set to zero).
  If the calling partition has the authority to monitor total
  processor virtualization then:
    Place the summation of time base ticks for all platform physical
    processors, allocated to processor virtualization, in register
    R10.
    Place the summation of all PURR ticks accumulated by all
    dispatched (not idle) platform physical processor threads, allocated to
    processor virtualization, in register R11.
    Place the summation of all SPURR ticks accumulated by all dispatched (not idle) platform
    physical processor threads, allocated to processor virtualization, in
    register R12.
  Else load R10, R11 and R12 with -1.
Return H_Success.
R1--1. For the SPLPAR option: The platform must implement
and make available to selected partitions, the H_PIC hcall() following
the syntax and semantics of
.
Thread Join Option
H_JOIN
The H_JOIN hcall() performs the equivalent of an H_CONFER
(proc=self) hcall() (see
) unless called by the sole
unjoined (active) processor thread, at which time the H_JOIN hcall()
returns H_CONTINUE. H_JOIN is intended to establish a single threaded
operating environment within a partition; to prevent external interrupts
from complicating this environment, H_JOIN returns “bad_mode”
if called with the processor MSR[EE] bit set to 1. Joined (inactive)
threads are activated by H_PROD (see
) which starts execution at the
instruction following the hcall; or a system reset non-maskable interrupt
which appears to interrupt between the hcall and the instruction
following the hcall.
Syntax:
Semantics:
If MSR[EE]=1 return bad_mode.
If other processor threads are active in the calling partition,
then emulate H_CONFER (proc=self).
Else return H_CONTINUE.
R1--1. For the Thread Join option: The platform must
implement the H_JOIN hcall() following the syntax and semantics of
.
R1--2. For the Thread Join option: The platform must
implement the hcall-join and hcall-splpar function sets.
R1--3. For the Thread Join option: The platform must support
the H_PROD hcall even if the partition is operating in dedicated
processor mode.
Virtual Processor Home Node Option (VPHN)
The SPLPAR option allows the platform to dispatch virtual
processors on physical processors that due to the variable nature of work
loads are temporarily free, thus improving the utilization of computing
resources. However, SPLPAR implies inconsistent mapping of virtual to
physical processors; defeating resource allocation software that attempts
to optimize performance on platforms that implement the NUMA
option.
To bridge the gap between these two options, the VPHN option
maintains a substantially consistent mapping of a given virtual processor
to a physical processor or set of processors within a given associativity
domain. Thus the OS can, when allocating computing resources, take
advantage of this statistically consistent mapping to improve processing
performance.
VPHN mappings are substantially consistent but not static. For any
given dispatch cycle, a best effort is made to dispatch the virtual
processor on a physical processor within a targeted associativity domain
(the virtual processor's home node). However, if processing capacity
within the home node is not available, some other physical processor is
assigned to meet the processing capacity entitlement. From time to time,
to optimize the total platform performance, it may be necessary for the
platform to change the home node of a given virtual processor.
To enable the OS to determine the associativity domain of the home
node of a virtual processor, platforms implementing the VPHN option
provide the H_HOME_NODE_ASSOCIATIVITY hcall(). The presence of the
hcall-vphn function set in the
“ibm,hypertas-functions” property
indicates that the platform supports the VPHN option. The OS should be
prepared for the support of the VPHN option to change with functions such
as partition migration, after which a call to H_HOME_NODE_ASSOCIATIVITY may
end with a return code of H_FUNCTION. Additionally, the VPHN option
defines a VPA field that the OS can poll to determine if the
associativity domain of the home node has changed. When the home node
associativity domain changes, the OS might choose to call the
H_HOME_NODE_ASSOCIATIVITY hcall() and adjust its resource allocations
accordingly.
R1--1. For the Virtual Processor Home Node option: The
platform must support the H_HOME_NODE_ASSOCIATIVITY hcall() per the
syntax and semantics specified in section
.
R1--2. For the Virtual Processor Home Node option: For the
OS to operate properly across such functions as partition migration, the
OS must be prepared for the target platform to not support the Virtual
Processor Home Node option.
R1--3. For the Virtual Processor Home Node option: The
platform must support the “virtual processor home node
associativity changes counters” field in the VPA per section
.
R1--4. For the Virtual Processor Home Node option: The
platform must support the “Form 1” of the
“ibm,associativity-reference-points” property per
. The client program may call
H_HOME_NODE_ASSOCIATIVITY hcall() with a valid identifier input parameter
(such as from the device tree or from the
ibm,configure-connector RTAS call) even if the
corresponding virtual processor has not been started so that the client
program can allocate resources optimally with respect to the to be
started virtual processor.
H_HOME_NODE_ASSOCIATIVITY
The H_HOME_NODE_ASSOCIATIVITY hcall() returns the associativity
domain designation associated with the identifier input parameter. The
client program may call H_HOME_NODE_ASSOCIATIVITY hcall() with a valid
identifier input parameter (such as from the device tree or from the
ibm,configure-connector RTAS call) even if the
corresponding virtual processor has not been started so that the client
program can allocate resources optimally with respect to the to be
started virtual processor.
Syntax:
Parameters:
Input:
  flags:
    Note: this parameter does not share format with the flags
    parameter of the Page Frame Table Access hcall()s.
    Defined Values:
      0x0 Invalid
      0x1 id parameter is as proc-no parameter of H_REGISTER_VPA
      hcall()
      0x2 id parameter is as processor index from byte offsets 0x2-0x3
      of a trace log buffer entry
      all other values reserved.
  id: processor identifier per the form indicated by the flags
  parameter.
Output:
  R3: return code
  R4-R9: associativity domain identifier list of the specified
  processor’s home node.
    Only the “primary” connection (as would be reported
    in the first string of the
    “ibm,associativity” property) is
    reported.
    The associativity domain numbers are reported in the sequence
    they would appear in the
    “ibm,associativity” property; starting
    from the high order bytes of R4 proceeding toward the low order bytes of
    R9.
    Each of the registers R4-R9 is divided into 4 fields each 2 bytes
    long.
    The high order bit of each 2 byte field is a length
    specifier:
      1: The associativity domain number is contained in the low order
      15 bits of the field.
      0: The associativity domain number is contained in the low order
      15 bits of the current field concatenated with the 16 bits of the next
      sequential field.
    All low order fields not required to specify the associativity
    domain identifier list contain the reserved value of all ones.
Semantics:
Verify that the “flags” parameter is valid else
return H_Parameter.
Verify that the “id” parameter is valid for the
“flags” and the partition else return H_P2.
Pack the associativity domain identifiers for the home node
associated with the “id” parameter starting with the highest
level reported in the
“ibm,associativity” property in the high
order field of R4.
All remaining fields through the low order field of R9 are filled
with 0xFFFFFFFF.
Return H_Success.
VPA Home Node Associativity Changes Counters
For the VPHN option, the platform maintains within each VPA the
Virtual Processor Home Node Associativity Change Counters field. See
Table
. This eight (8) byte field is
maintained as 8 one byte long counters. The number of counters that are
supported is implementation dependent up to 8, and corresponds to the
entries in the form 1 of the
“ibm,associativity-reference-points” property. If
the platform implements fewer than 8 associativity reference points, only
the corresponding low offset counters within the field are used and the
remaining high offset counters within the field are unused.

Should the associativity of the home node of the virtual processor
change, for each changed associativity level that corresponds to a level
reported in the
“ibm,associativity-reference-points” property, the
corresponding counter in the Virtual Processor Home Node Associativity
Change Counters field is incremented.

Virtualizing Partition Memory

This section describes the various high level functions that are
enabled by the virtualization of the logical real memory of a partition. In
principle, virtualization of partition memory can be totally transparent to
the partition software; however, partition software that is migration aware
can cooperate with the platform to achieve higher performance, and enhanced
functionality.

Partition Migration/Hibernation

Virtualizing partition memory allows a partition to be moved via
migration or hibernation. In the case of partition migration from one
platform to another, the source and destination platforms cooperate to
minimize the time that the partition is non-responsive; the goal is to be
non-responsive no more than a few seconds. In the case of hibernation,
the intent is to put the partition to sleep for an extended period;
during this time the partition state is stored on secondary storage for
later restoration.

R1--1. For the Partition Migration and Partition Hibernation
options: The platform must implement the Partition Suspension
option (See ).

R1--2. For the Partition Migration and Partition Hibernation
options: The platform must implement the VASI option (See ).

R1--3. For the Partition Migration and Partition Hibernation
options: The platform must implement the Update OF Tree option.

R1--4. For the Partition Migration and Partition Hibernation
options: The platform must implement the Version 6 Extensions
of Event Log Format for all reported events (See ).

R1--5. For the Partition Migration and Partition Hibernation
options: The platform must prevent the migration/hibernation of
partitions that own dedicated platform resources in addition to
processors and memory; this includes physical I/O resources, the BSR
facility, and physical indicators and sensors (virtualized I/O,
indicators (such as tone), and sensors (such as EPOW) are allowed).

R1--6. For the Partition Migration and Partition Hibernation
options: The platform must implement the Client Vterm option.

R1--7. For the Partition Migration and Partition Hibernation
options: The platform “timebase-frequency” must be 512 MHz +/-
50 parts per million.

R1--8. For the Partition Migration and Partition Hibernation
options: The platform must present the “ibm,nominal-tbf” property
(See ) with the value of 512 MHz.

R1--9. For the Partition Suspension option: The platform
must present the properties from , as specified by , to a partition.

R1--10. For the Partition Suspension option: The presence and
value of all properties in must not change while a
partition is suspended except for those properties described by
.
Properties Related to the Partition Suspension Option

Property Name | Requirement
“ibm,estimate-precision” | Shall be present. “ibm,estimate-precision” shall contain the “fre”, “fres”, “frsqrte”, and “frsqrtes” instruction mnemonics.
“ibm,processor-page-sizes” | Shall be present.
“reservation-granule-size” | Shall be present.
“cache-unified” | Shall be present if the cache is physically or logically unified and thus does not require the architected instruction sequence for data cache stores to appear in the instruction cache (See “Instruction Storage” section of Book II of PA); else shall not be present.
“i-cache-size” | Shall be present.
“d-cache-size” | Shall be present.
“i-cache-line-size” | Shall be present.
“d-cache-line-size” | Shall be present.
“i-cache-block-size” | Shall be present.
“d-cache-block-size” | Shall be present.
“i-cache-sets” | Shall be present.
“d-cache-sets” | Shall be present.
“timebase-frequency” | Shall be present if the timebase frequency can fit into the “timebase-frequency” property; else shall not be present.
“ibm,extended-timebase-frequency” | Shall be present if the timebase frequency cannot fit into the “timebase-frequency” property; else shall not be present.
“slb-size” | Shall be present.
“cpu-version” | Shall be present.
“ibm,ppc-interrupt-server#s” | Shall be present.
“l2-cache” | Shall be present if another level of cache exists; else shall not be present.
“ibm,vmx” | Shall be present if VMX is present for the partition; else shall not be present.
“clock-frequency” | Shall be present if the processor frequency can fit into the “clock-frequency” property; else shall not be present.
“ibm,extended-clock-frequency” | Shall be present if the processor frequency cannot fit into the “clock-frequency” property; else shall not be present.
“ibm,processor-storage-keys” | Shall be present.
“ibm,processor-vadd-size” | Shall be present.
“ibm,processor-segment-sizes” | Shall be present.
“ibm,segment-page-sizes” | Shall be present.
“64-bit” | Shall be present.
“ibm,dfp” | Shall be present if DFP is present for the partition; else shall not be present.
“ibm,purr” | Shall be present if a PURR is present; else shall not be present.
“performance-monitor” | Shall be present if a Performance Monitor is present; else shall not be present.
“32-64-bridge” | Shall be present.
“external-control” | Shall not be present.
“general-purpose” | Shall be present.
“graphics” | Shall be present.
“ibm,platform-hardware-notification” | Shall be present.
“603-translation” | Shall not be present.
“603-power-management” | Shall not be present.
“tlb-size” | Shall be present.
“tlb-sets” | Shall be present.
“tlb-split” | Shall be present.
“d-tlb-size” | Shall be present.
“d-tlb-sets” | Shall be present.
“i-tlb-size” | Shall be present.
“i-tlb-sets” | Shall be present.
“64-bit-virtual-address” | Shall not be present.
“bus-frequency” | Shall be present if the bus frequency can fit into the “bus-frequency” property; else shall not be present.
“ibm,extended-bus-frequency” | Shall be present if the bus frequency cannot fit into the “bus-frequency” property; else shall not be present.
“ibm,spurr” | Shall be present if an SPURR is present; else shall not be present.
“name” | Shall be present.
“device_type” | Shall be present.
“reg” | Shall be present.
“status” | Shall be present.
“ibm,pa-features” | Shall be present.
“ibm,negotiated-pa-features” | Shall be present.
“ibm,ppc-interrupt-gserver#s” | Shall be present.
“ibm,tbu40-offset” | Shall be present.
“ibm,pi-features” | Shall be present.
“ibm,pa-optimizations” | Shall be present.
Note on : The values of the properties in Table shall be consistent with
implementation and design of the processor and the platform upon boot as
well as before and after partition suspension.

Programming Note: The “cpu-version” property may contain a
logical processor version value. Therefore, code designed to handle
processor errata should read the
“ibm,platform-hardware-notification” property of
the root node to obtain the physical processor version numbers allowed in
the platform.

Virtualizing the Real Mode Area

PA requires implementations to provide a Real Mode Area of memory
that is accessed when not in hypervisor state (either MSR[HV] = 0, or
MSR[HV] = 1 and MSR[PR] = 1) and the OS address translation mechanism is
disabled (MSR[IR] = 0 or MSR[DR] = 0). PA provides mechanisms to allow
the RMA to consist of discontiguous pages of selectable sizes. Such an
RMA is known as a virtualized RMA. The H_VRMASD hcall() allows the OS to
change the characteristics of the mappings the address translation
mechanism uses to access a virtualized RMA.

H_VRMASD

The caller may need to invoke the H_VRMASD hcall() multiple times
for it to return with a return code of H_Success. Upon receiving a return
code of H_LongBusyOrder10mSec, the caller should attempt to invoke
H_VRMASD in 10 mSec with the same Page_Size_Code value used on the
previous H_VRMASD hcall(). Invoking H_VRMASD with a different
Page_Size_Code value indicates that the caller wants to transition to the
Page_Size_Code value of the most recent H_VRMASD call.

When changing the page size used to map the VRMA using the H_VRMASD
hcall(), the caller is responsible for establishing HPT entries for any
potential real mode accesses prior to calling H_VRMASD with a new value
of Page_Size_Code, and maintaining any HPT entries for the old value of
Page_Size_Code until the hcall() returns H_Success.

R1--1. For the VRMA option: The platform must include the
“ibm,vrma-page-sizes” property (See ) in the /cpu node.

R1--2. For the VRMA option: The platform must implement the
H_VRMASD hcall() following the syntax and semantics of .

R1--3. For the VRMA option: In order to prevent a storage
exception, the calling partition must establish page table mappings for
the Real Mode Area using entries with a page size corresponding to the
new Page_Size_Code value prior to making an H_VRMASD hcall() and must
maintain the old page table mappings using the page size corresponding to
the old Page_Size_Code value until the H_VRMASD hcall() returns
H_Success.

Syntax:

Parameters:

Page_Size_Code: A supported VRMASD field value. Supported VRMASD
field values are described by the “ibm,vrma-page-sizes” property.

Semantics:

Verify that the Page_Size_Code parameter corresponds to a
supported VRMASD field value; else return H_Parameter.

If the Real Mode Area page size specified by the Page_Size_Code
parameter does not match the operating RMA page size of the partition,
then set the operating RMA page size of the partition to the value
specified by the Page_Size_Code parameter and initiate the transition of
the operating RMA page size of all active processing threads to the value
specified by the Page_Size_Code parameter.

If all active threads have transitioned to the partition
operating RMA page size, then return H_Success; else return
H_LongBusyOrder10mSec.

Cooperative Memory Over-commitment Option (CMO)

The over-commitment of logical memory is accomplished by the
platform reassigning pages of memory among the partitions to create the
appearance of more memory than is actually present. This is commonly
known as paging. While paging can, in certain cases, be accomplished
transparently, significantly better memory utilization and platform
performance can be achieved with cooperation from the partition
OS.

CMO introduces the following LoPAR terms:

Expropriation: The act of the platform disassociating a physical
page from a logical page.

Subvention: The act of the platform associating a physical page
with a logical page.

Loaned Memory: Logical real memory that a partition lends to the
hypervisor for reuse. The partition should not gratuitously access loaned
memory as such accesses are likely to experience a significant
delay.

Memory entitlement: The amount of memory that the platform
guarantees that the partition is able to I/O map at any given
time.

R1--1. For the CMO option: The partition must be running
under the SPLPAR option.

R1--2. For the CMO option: The platform must transparently
(except for time delays) handle all effects of any memory expropriation
that it may introduce unless the CMO option is explicitly enabled by the
setting of architecture.vec option vector 5 byte 4 bit 0 (See
for details).

The CMO option consists of the following LoPAR extensions:

Define ibm,architecture.vec-5 option Byte 4 bit 0 as
“Client supports cooperative logical real memory over-commitment”.

Define page usage states to assist the platform in selecting good
victim pages and mechanisms to set such states.

Extend the syntax and semantics defined for the I/O mapping hcall()s:
    Return codes (H_LongBusyOrder1msec, H_LongBusyOrder10msec, and
    H_NOT_ENOUGH_RESOURCES)
    Return parameter extension for memory entitlement management

Define a simulated Special Uncorrectable memory Error machine
check for the case where a page can not be restored due to an
error.

R1--3. For the CMO option: The architected interface syntax
and semantics of all LoPAR hcall()s and RTAS calls except as explicitly
modified per the CMO option architecture must remain invariant when
operating in CMO mode; any accommodation to memory over-commitment by
these firmware functions (potentially any function that takes a logical
real address as an input parameter) is handled transparently.

Note: Requirement
specifically applies to the
debugger support hcall()s.

For maximum performance benefit, an OS that indicates via the
ibm,client-architecture-support interface that it
supports the CMO option will strive to maintain in the
“loaned” state (See
), the amount of logical memory
indicated by the value returned in R9 from the H_GET_MPP hcall (See
), as well as provide page
usage state information via the interfaces defined in
and
.

The Extended Cooperative Memory Over-commitment Option (XCMO)
provides additional features to manage page coalescing. These features
are activated via setting architecture.vec vector 5 byte 4 bit 1 to the
value of 1 in the
ibm,client-architecture-support interface. Given that
the platform supports the XCMO option, the CC flag for page frame table
accesses (see ) and the H_GET_MPP_X hcall() (see ) may be used by the OS.
An OS might understand that a given page is a good candidate for page
coalescing, perhaps because the page contains OS and/or common library
code that is likely to be duplicated in other partitions; if so, it might
choose to set the Coalesce Candidate (CC) flag in the page table access
or H_PAGE_INIT hcall()s as a hint to the hypervisor. Should a given
logical page be mapped multiple times with conflicting Coalesce Candidate
hints, the value in the last mapping made takes precedence.

For a variety of reasons outside the scope of LoPAR, a platform
supporting the XCMO option for a given platform might not actually
perform page coalescing. If this is the case, the first return value from
the H_GET_MPP_X hcall() (see ) is the reserved value zero.

R1--4. For the XCMO Option: The platform must implement the
CMO Option.

R1--5. Reserved for Compatibility For the XCMO Option: The
platform must implement the CC (Coalesce Candidate) flag bit (see ).

R1--6. For the XCMO Option: The platform must implement the
H_GET_MPP_X hcall() (see ).

R1--7. For the XCMO Option with the Partition Migration and
Partition Hibernation options: to ensure proper operation after partition
migration or hibernation, the OS must stop setting the CC flag bit (see )
and stop calling the H_GET_MPP_X hcall() (see ) prior to calling
ibm,suspend-me RTAS and not do so again until after
the OS has determined that the XCMO option is supported on the
destination platform.

CMO Background (Informative)

The following information is provided to be informative of the
architectural intent. Implementations may vary, but should make a best
effort to achieve the goals described.

Ideally, the hypervisor does not expropriate any logical memory
pages that it must later read in from disk. This is based upon the belief
that the OS is in a better position than the hypervisor to determine its
working set relative to the available memory and to page its memory;
thus, when possible, the OS pager should be used. The ideal is approximated,
since it cannot be achieved in all cases. The “Overage” is
defined as the amount of logical address space that cannot be backed by
the physical main storage pool. The overage is equal to the summation of
the logical address space for all partitions using a given VRM main
storage pool (the main storage that the hypervisor uses to back logical
memory pages for a set of partitions) plus the high water mark of the
hypervisor free page list (the free list high water mark is some
implementation dependent ratio of the pool size) less the size of the VRM
main storage pool.

If the summation of the space freed by page coalescing and page
donation is equal to the overage, in the steady state the hypervisor need
not page. In reality the system is seldom, if ever, in the steady state,
but with the free list pages the hypervisor has enough buffer space to
take up most of the transient cases.

Page coalescing is a transparent operation wherein the hypervisor
detects duplicate pages, directs all user reads to a single copy and may
reclaim the other duplicate physical memory pages. Should the page owner
change a coalesced page, the hypervisor needs to transparently provide the
page owner with a unique copy of the page. Read only pages are more
likely to remain identical for a longer period of time and are thus
better coalescing candidates.

To set the value for the partition's page donation, the algorithm
needs to be “fair” and responsive to the partition's
“weight” so that more important work can be helped along. To
be “fair”, the donation needs to be somewhat proportional to
the partition's size since donating x pages is likely to cause greater
pain to a small partition than a large one; yet the reason for
“weight” is to cause greater pain to certain partitions
relative to others.

Thus the initial donation for a partition is set at the partition's
logical address space size as a percentage of the total pool logical
space subscription times the overage.

Each implementation dependent time interval (say single digit
seconds or so), the hypervisor randomly selects 100 pages from each
partition and monitors how many of them were accessed during the next
interval. This, after normalization to account for partition CPU
utilization relative to its recent maximum, becomes an estimate of the
partition's page utilization. It is expected that a partition with higher
page utilization has a higher page fault rate and a lower percentage of
its working set resident -- thus experiences more pain from VRM.

The page utilization method described above may overestimate
memory pressure in certain cases; specifically, it may be slow to realize
that the partition has gone idle. An idle partition reduces its CPU
utilization, which after normalization makes it appear that the partition
memory pressure has risen rather than lowered. For this reason, the
results of the page utilization method are further compared with the OS
reported count of faults against pages that were previously swapped out,
as reported in offsets 0x180 - 0x183 of the VPA for each of the
partition processors. The partition fault count, when normalized with
respect to processor cycles, allows comparisons among the reported values
from other partitions. Since the partition fault count is OS reported,
and thus cannot be trusted, it cannot be the primary value used to
determine page allocation. However, a misreported statistic is likely to
be high, so the memory pressure estimate derived from the OS reported
fault counts can be used to reduce (but not increase) the partition
memory allocation. Note that since the hint might not be reported by a
given OS, a filter should be put in place to detect that the OS is not
reporting faults and to substitute appropriate default values.

This initial donation is then modified over time to force the pain
of higher page utilization upon lower weight partitions based upon
comparing the following ratios:

A: The average partition page utilization over the last interval of
all partitions in the pool / the partition's page utilization over the
last interval

B: The partition's weight / average partition weight of all
partitions in the pool

If A > B: Increase the partition's donation by 1/256 of the
partition's logical address space (limited to the partition’s
logical address size)

If A < B: Reduce the partition's donation by 1/256 of the
partition's logical address space provided that the summation of all
donations >= Overage.

The hypervisor maintains a per partition count of loaned pages
(incremented when a page is removed from the PFT with a
“loaned” state and decremented when/if the page state is
changed); thus it can keep track of how well a partition is doing against
the donation request that has been made of it. Partitions that do not
respond to donation requests need to have their pages stolen to make up
the difference. Pages that are “unused” or “loaned” are automatically
applied to the free list. “Loaned” pages are expected to
raise the partition's free list low water mark so that the OS only
reclaims them in a transient situation, which will then result in the OS
paging out some of its own virtual memory to restore the total donation
in the steady state. When the platform free list gets to the low water
mark, pages are expropriated starting with the partition that has the
greatest percentage discrepancy between its loaned plus expropriated
count and its donation tax. The algorithm used is implementation
dependent. The following is given for reference and is loosely based upon
the AIX method.

For this algorithm, pages that are newly restored are marked as
“referenced” and all “unused” pages have already been harvested.

Step through the partition logical address space until either
the hypervisor free list has gotten to its high water mark or the
partition has been taxed to its donation:
    If the page is I/O mapped and not expropriatable, continue to
    the next page.
    If this is the first pass through the address space on this
    harvest, and the page is marked critical, continue to the next
    page.
    If the page is marked “referenced”, clear the
    reference bit and continue to the next page.
    If the page is backed in the VPM paging space and not modified
    since then, expropriate the page and continue to the next page.
    Queue the page to be copied into the VPM paging space.

Thus partitions that keep up with their page donations seldom, if
ever, experience a hypervisor page-in delay. Those that do not keep up
will not get a free lunch and will be taxed up to the value of their
assigned donation, with the real possibility that they will experience
the pain of hypervisor page-in delays.

CMO Page Usage States

The CMO option defines a number of page states that are set by the
cooperating OS using the flags parameter of the HPT hcall()s. The
platform uses these page states to estimate the overhead associated with
expropriating the specific page.

Note that the first two definitions below represent
base background page states; the third definition is the foreground state
of I/O mapped, which is acquired as a result of an I/O mapping hcall (such
as H_PUT_TCE); and the last two are caller specifiable state
modifiers/extended semantics of the base states.

Unused: The page contains no information that is needed in
the future; its contents need not be maintained by the platform. Normally
set only when the page is unmapped.
Expropriation of “Unused” pages should be a low
overhead operation. However, the OS is likely to reuse these pages,
which means that a clean free page will have to be assigned to the
corresponding logical address.

Active: The page retains data that the OS has no
reasonable way to regenerate. This is the state traditionally assumed by
the OS when mapping a page.
“Active” pages should be expropriated only as a last
resort since they must be paged out and paged back in on a subsequent
access.

I/O Mapped: The page is mapped for access by another
agent. This state is the side effect of registration and/or I/O mapping
functions. The page returns to its background state automatically when
unmapped or deregistered.
Pages in the I/O Mapped state normally may not be expropriated
since they are potentially the target of physical DMA
operations.

Critical: The page is critical to the performance of the
OS, and the hypervisor should avoid expropriating such pages while other
pages are available.
Expropriating pages marked “Critical” may result in
the OS being unable to meet its performance goals.

Loaned: The page contains no information and the OS
warrants that it will not gratuitously access this page such that the
hypervisor may expect to use it for an extended period of time. When the
OS does access the page, it is likely that the access will result in a
subvention delay.
Expropriating pages in the “Loaned” state should
result in the lowest overhead.

R1--1. For the CMO option: The platform must at partition
boot initialize the page usage state of all platform pages to
“Active”.

R1--2. For the CMO option: The platform must preserve data
in pages that are in the “Active” state.

R1--3. For the CMO option: When the OS accesses a page in
the “Unused” state, the platform must present either the
preserved page data or all zeros.

R1--4. For the CMO option: When the OS specifies as input to
an I/O mapping or the H_MIGRATE_DMA hcall() a page in either the
“Unused” or “Loaned” states, the platform must
upgrade the page’s background page state to
“Active”.

Setting CMO Page Usage States using HPT hcall() flags Parameter

The CMO option defines additional flags parameter combinations for
the HPT hcall()s that take a flags parameter. Turning on flags bit 28
activates the changing of page state. Leaving bit 28 at the legacy value
of zero maintains the page state setting, thus allowing legacy code to
operate unmodified with all pages remaining in the initialized
“Active” state.

R1--1. For the CMO option: The platform must extend the
syntax and semantics of the HPT access hcall()s that take a flags
parameter, see , to set the page usage state
of the specified page per .

HPT hcall()s extended with CMO flags

hcall

CMO Page Usage State flags Definition

Flag bit 28 | Flag bits 29 - 30 | Flag bit 31 | Comments
0 | Don’t Care | Don’t Care | Inhibit Page State Change
1 | 00 | 0 | Set page state to Active
1 | 00 | 1 | Set page state to Active Critical
1 | 01 | both 0 and 1 | Reserved
1 | 10 | both 0 and 1 | Reserved
1 | 11 | 0 | Set page state to Unused
1 | 11 | 1 | Set page state to Unused Loaned
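
The flag bit combinations in the table above can be illustrated with a short sketch. It assumes that the flags parameter is a 64-bit register value using IBM bit numbering (bit 0 is the most significant bit), which this section does not state. The names `ibm_bit`, `CMO_FLAGS`, `with_page_state` and `describe` are hypothetical and are not part of LoPAR.

```python
from typing import Optional

# Assumption: flags occupy a 64-bit register with IBM bit numbering,
# so bit 0 is the most significant bit and bit 63 the least significant.
REG_WIDTH = 64


def ibm_bit(n: int) -> int:
    """Return a mask with IBM-numbered bit n set."""
    return 1 << (REG_WIDTH - 1 - n)


STATE_CHANGE = ibm_bit(28)               # bit 28: enable page state change
STATE_BITS = ibm_bit(29) | ibm_bit(30)   # bits 29-30: 00 = Active, 11 = Unused
MODIFIER = ibm_bit(31)                   # bit 31: Critical / Loaned modifier
CMO_MASK = STATE_CHANGE | STATE_BITS | MODIFIER

# Hypothetical names for the four defined (non-reserved) combinations.
CMO_FLAGS = {
    "active": STATE_CHANGE,
    "active_critical": STATE_CHANGE | MODIFIER,
    "unused": STATE_CHANGE | STATE_BITS,
    "unused_loaned": STATE_CHANGE | STATE_BITS | MODIFIER,
}


def with_page_state(flags: int, state: Optional[str] = None) -> int:
    """Return flags with bits 28-31 set for the requested page state.

    state=None leaves bit 28 at zero, which inhibits any page state change.
    """
    base = flags & ~CMO_MASK
    if state is None:
        return base
    return base | CMO_FLAGS[state]


def describe(flags: int) -> str:
    """Classify bits 28-31 of flags according to the table."""
    if not (flags & STATE_CHANGE):
        return "inhibit"
    for name, value in CMO_FLAGS.items():
        if (flags & CMO_MASK) == value:
            return name
    return "reserved"  # bits 29-30 equal to 01 or 10
```

A caller that never sets bit 28 keeps the legacy behavior: `with_page_state(flags)` clears bits 28-31 and leaves every other flag bit untouched.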
Setting CMO Page Usage States with H_BULK_REMOVE

R1--1. For the CMO option: The platform must extend the
syntax and semantics of the H_BULK_REMOVE hcall (see
) to set the page usage state
of the specified pages per
.

H_BULK_REMOVE Translation Specifier control/status
Byte Extended Definition for CMO Option

Bit Numbers: 0 1 2 3 4 5 6 7

Bits 0-1 (type code) | Bits 2-7 | Meaning
00 | r0 r0 r0 r0 r0 r0 | Unused
01 | bits 2-3: page state; bits 4-5: r0 r0; bits 6-7: req. mod. | Request
10 | bits 2-3: return code; bits 4-7: per return code | Response
11 | Reserved (to be zero) | End of String

Request, req. mod. (bits 6-7):
00 | Absolute
01 | andcond
10 | AVPN
11 | not used

Request, page state (bits 2-3):
00 | Inhibit page usage state change
01 | Reserved
10 | For CMO option set page usage state to “Unused” if Success
11 | For CMO option set page usage state to “Loaned” if Success

Response, return code (bits 2-3):
00 | R C r r | Success
01 | r r | Not Found
10 | H_PARM
11 | H_HW

Legend: R=Reference Bit, C=Change Bit, r=reserved ignore,
r0=reserved to be zero
CMO Extensions for I/O Mapping Hcall()s

If an OS were to map an excessive amount of its memory for
potential physical DMA access, little of its memory would be left for
paging; conversely, if the OS was totally prevented from I/O mapping its
memory, it could not do I/O operations. The CMO option introduces the
concept of memory entitlement. The partition’s memory entitlement
is the amount of memory that the platform guarantees that the partition
is able to I/O map at any given time. A given page may be mapped multiple
times through different LIOBNs yet it only counts once against the
partition’s I/O mapping memory entitlement. The syntax of certain
I/O mapping hcall()s is extended to return the change in the
partition’s I/O mapped memory total. The entitlement is intended to
be used to ensure forward progress of I/O operations.

R1--1. For the CMO option: When the partition is operating
in CMO mode, the platform must extend the syntax and semantics of the I/O
mapping hcall()s specified in
as per the specifications in
and
.

I/O Mapping hcall()s Modified by the CMO Option.

hcall() | Base Definition on
H_PUT_TCE |
H_STUFF_TCE |
H_PUT_TCE_INDIRECT |

Note: The I/O mapping hcalls H_PUT_RTCE and
H_PUT_RTCE_INDIRECT do not change the number of pages that are I/O mapped
since they simply create copies of the I/O mappings that already
exist.

CMO I/O Mapping Extended Return Codes

R1--1. For the CMO option: The platform must ensure that the
DMA agent operating through the I/O mappings established by the hcall()s
specified in
can appear to successfully
access the associated page data of any expropriated page referenced by
the input parameters of the hcall() prior to returning the code
H_Success.

R1--2. For the CMO option: The platform must either extend
the return code set for the hcall()s specified in
to include H_LongBusyOrder1msec
and/or H_LongBusyOrder10msec or transparently suspend the calling virtual
processor for cases where the function is delayed pending the restoration
of an expropriated page.

R1--3. For the CMO option: The platform must extend the
return code set for the hcall()s specified in
to include
H_NOT_ENOUGH_RESOURCES for cases where the function would cause more
memory to be I/O mapped than the caller is entitled to I/O map and the
platform is incapable of honoring the request.

CMO I/O Mapping Extended Return Parameter

The syntax and semantics of the hcall()s in
are extended when the partition
is operating in CMO mode by returning in register R4 the change in the
partition’s total number of I/O mapped memory bytes due to the
execution of the hcall(). The number may be positive (increase in the
amount of memory mapped), negative, or zero (the page was/remains mapped
for I/O access by another agent).

R1--1. For the CMO option: The platform must extend the
syntax and semantics of the hcall()s specified in
when operating in CMO mode, to
return in register R4 the change to the total number of bytes that were
I/O mapped due to the hcall().

H_SET_MPP

This hcall() sets, within limits, the partition’s memory
performance parameters. If the request parameter exceeds the constraint
of the calling LPAR’s environment, the hypervisor limits the value
set to the constrained value and returns H_Constrained. The memory weight
is architecturally constrained to be within the range of 0-255.

Syntax:

Semantics:

Verify that the memory performance parameters specified are
within the constraints of the partition:
    If yes, atomically set the partition’s memory performance
    parameters per the request and return H_Success.
    If not, set the partition’s memory performance parameters
    as constrained by the partition’s configuration and return
    H_Constrained.

R1--1. For the CMO option: The platform must initially set
the partition memory performance parameters to their configured maximums
at partition boot time.

R1--2. For the CMO option: The platform must implement the
H_SET_MPP hcall() following the syntax and semantics of
.

R1--3. For the CMO option: The platform must constrain the
partition memory weight to the range 0-255.

H_GET_MPP

This hcall() reports the partition’s memory performance
parameters. The returned parameters are packed into registers.

Command Overview

Syntax:

Semantics:

Place the partition’s memory performance parameters for the
calling virtual processor’s partition into the respective
registers:

R4: The number of bytes of main storage that the calling
partition is entitled to I/O map. In the case of a dedicated memory
partition this shall be the size of the partition’s logical address
space.

R5: The number of bytes of main storage that the calling
partition has I/O mapped. In the case of a dedicated memory partition
this is not applicable which is represented by the code -1.

R6: The calling partition’s virtual partition memory
aggregation identifier numbers, up to 4 levels:
    Bytes 0-1: Reserved for future aggregation definition, and set to
    zero -- in the future this field may be given meaning.
    Bytes 2-3: Reserved for future aggregation definition, and set to
    zero -- in the future this field may be given meaning.
    Bytes 4-5: 16 bit binary representation of the “Group
    Number”.
    Bytes 6-7: 16 bit binary representation of the “Pool
    Number”. In the case of a dedicated memory partition the
    “Pool Number” is not applicable which is represented by the
    code 0xFFFF.

R7: Collection of short memory performance parameters for the
calling partition:
    Byte 0: Memory weight (0-255). In the case of a dedicated
    processor partition this is not applicable which is represented by the
    code 0.
    Byte 1: Unallocated memory weight for the calling
    partition’s aggregation.
    Bytes 2-7: Unallocated I/O mapping entitlement for the calling
    partition’s aggregation divided by 4096.

R8: The calling partition’s memory pool main storage size
in bytes. In the case of a dedicated processor partition this is not
applicable which is represented by the code -1.

R9: The signed difference between the number of bytes of logical
storage that are currently on loan from the calling partition and the
partition’s overage allotment (a positive number indicates a
request to the partition to loan the indicated number of bytes else they
will be expropriated as needed).

R10: The number of bytes of main storage that is backing the
partition logical address space. In the case of a dedicated processor
partition this is the size of the partition’s logical address
space.

Return H_Success.

R1--1. For the CMO option: The platform must implement the
H_GET_MPP hcall() following the syntax and semantics of
.

H_GET_MPP_X

This hcall() provides additional information over and above (not
duplication of) that which is returned by the H_GET_MPP hcall()
. The syntax of this hcall() is
specifically designed to be seamlessly extensible and version to version
compatible both from the view of the caller and the called on an
invocation by invocation basis. To this end, all return registers (R3
(return code) through R10) are defined from the outset, some are defined
as reserved and are set to zero upon return by the hcall(). The caller is
explicitly prohibited from assuming that any reserved register contains
the value zero, so that there will be no incompatibility with future
versions of the hcall() that return non-zero values in those registers.
New definitions for returned values will define the value zero to
indicate a benign or unreported setting.
Syntax:
Semantics:
Place the partition’s extended memory performance parameters for the calling virtual processor’s partition into the respective registers:
R4: The number of bytes of the calling partition’s logical real memory coalesced because they contained duplicated data.
R5: If the calling partition is authorized to see pool wide statistics (set by means that are beyond the scope of LoPAR), then the number of bytes of logical real memory coalesced because they contained duplicated data in the calling partition’s memory pool; else set to zero.
R6: If the calling partition is authorized to see pool wide statistics (set by means that are beyond the scope of LoPAR), then PURR cycles consumed to coalesce data; else set to zero.
R7: If the calling partition is authorized to see pool wide statistics (set by means that are beyond the scope of LoPAR), then SPURR cycles consumed to coalesce data; else set to zero.
R8: If the calling partition is authorized to see pool wide statistics (set by means that are beyond the scope of LoPAR), then the total number of the calling partition’s memory pool bytes currently in use backing the pool's partition logical memory (this value represents the net usage after any and all savings from deduplication or any other future means the hypervisor may employ); else set to zero.
R9: Reserved; shall be set to zero; shall not be read by the caller.
R10: Reserved; shall be set to zero; shall not be read by the caller.
Return H_Success.
R1--1.For the XCMO option: If the platform coalesces memory
pages that contain duplicated data it must implement the H_GET_MPP_X
hcall() following the syntax and semantics of
.R1--2.For the XCMO option: the caller must be prepared for
H_GET_MPP_X to return H_Function or to have a return parameter that was
previously non-zero be consistently returned with the value zero if the
caller wishes to operate properly in a partition migration or fail-over
environment.Restoration Failure InterruptR1--1.For the CMO option: When the platform experiences an
unrecoverable error restoring the association of a physical page with an
expropriated logical page following an attempted access of the
expropriated page by the partition, the platform must signal a Machine
Check Interrupt by returning to the partition’s interrupt vector at
location 0x0200. Note the subsequent firmware assisted NMI and check
exception processing returns a VPM SUE error log (See
).H_MO_PERFThis hcall() applies an artificial memory over-commitment to the
specified pool while monitoring the pool performance for overload,
removing the applied over-commitment if an overload trigger point is
reached. The overload trigger point is designed to double as a dead man
switch, eventually ending the over-commitment condition should the
experiment terminate ungracefully. Only the partition that is authorized
to run platform diagnostics is authorized to make this call.Syntax:Semantics:This description is based upon the architectural model of
, and must be adjusted to
achieve the intent for the specific implementation.
Validate that the caller has the required authority; else return H_Authority.
Validate that the pool parameter references an active memory pool; else return H_Parameter.
Raise the pool’s free list low water mark above its base value by the signed amount in the mem parameter. (The result is constrained to not less than the base low water mark value and no more than the amount of memory in the pool.)
Change the permissible pool memory low event counter by the signed value of the lows parameter.
Return in R4 the accumulated rise in the pool’s free list low water mark above its base value.
Return in R5 the current value of the permissible pool memory low event counter.
On each subsequent low memory event (page allocation where the free list is at or below the low water mark), the permissible pool memory low event counter is decremented. Should the counter ever reach zero, the pool’s free list low water mark is returned to its base value.
Expropriation/Subvention Notification Option
The Expropriation/Subvention Notification Option (ESN) sub option
of the CMO option allows implementing platforms to notify supporting
OSs of delays due to their access of an expropriated VPM page
(such as would be experienced during a “page in” operation).
With an expropriation notification, the OS may block the affected process
and dispatch another rather than having the platform block the virtual
processor that happened to be running the affected process. An
expropriation notification is paired with a subsequent subvention
notification signaled when the original access succeeds. Additionally new
page states allow the OS to indicate pages that it can restore itself,
thus relieving the platform from the burden of making copies of those
pages when they are expropriated and potentially side stepping the
“double paging problem” wherein the platform pages in a page
in response to a touch operation in preparation for an OS page in only to
have the OS immediately discard the page data without looking at
it.The ESN option includes the following LoPAR extensions:Define augmented CMO page statesDefine the per partition Subvention Notification Structure
(SNS)Define H_REG_SNS hcall() to register the SNSDefine Expropriation Notification field definitions within the
VPADefine expropriation and subvention event interrupts.R1--1.For the ESN option: The partition must be running
under the CMO option.R1--2.For the ESN option: The platform must ignore/disable
all other ESN option functions and features unless the OS has
successfully registered the Subvention Notification Structure via the
H_REG_SNS hcall. See
for details.ESN Augmentation of CMO Page Usage StatesThe ESN option augments the set of page states defined by the CMO
option that are set by the cooperating OS using the flags parameter of
the HPT hcall()s. The platform uses these page states to estimate the
overhead associated with expropriating the specific page.Active- An Expropriation notification on this
type of page allows the OS to put the using process to sleep, until the
page is restored, as signaled by a corresponding subvention notification,
at which time the affected instruction is retried.Expendable- the page retains data that the OS
can regenerate, for example, a text page that is backed up on disk;
usually the page is mapped read only. A reflected expropriation
notification on this type of page requires the OS to restore the page
- thus the platform presents somewhat different interrupt status
from that used by an Active page.Expropriating an “Expendable” page should result in
lower overhead than expropriating an “Active” page since
the contents need not be paged out before the page is reused.An Expendable page that is Bolted, while not illegal, has to be
treated as an “Active” page since an access to a Bolted
page may not result in an expropriation notification.Latent- the page contains data that the OS can
regenerate unless the contents have been modified - at which time
the page state appears to be “Active”, this is similar to
“Expendable” for pages mapped read/write. For example, a page
of a mapped file. Expropriation Notification is like “Active”
or “Expendable” above.Loaned- When the OS does access the page, it
is likely that the access will result in an expropriation
notification.
ESN Augmentation of CMO Page Usage State flags Definition
Flag bit 28 | Flag bits 29 - 30 | Flag bit 31 | Comments
1 | 01 | 0 | Set page state to Latent. Note: If Expropriation Notification is disabled, or the bolted bit (HPT bit 59) is set to 1, the page state is set to Active (Active Critical if flag bit 31=1).
1 | 01 | 1 | Set page state to Latent Critical
1 | 10 | 0 | Set page state to Expendable
1 | 10 | 1 | Set page state to Expendable Critical
Expropriation NotificationUnder the ESN option, notice of an attempt to access an
expropriated page is given when the Expropriation Interrupt is enabled in
the virtual processor VPA. Additionally the virtual processor VPA
Expropriation Correlation Number and Expropriation Flags fields are set
to allow the affected program to determine when the access may succeed
and if the program needs to restore data to the Subvened page, see
details in
. Once the VPA has been
updated, the platform presents an Expropriation Fault interrupt to the
affected virtual processor see details in
.ESN VPA FieldsR1--1.For the ESN
option: The platform must support the VPA field definitions of
,
, and
.
Firmware Written VPA Starting at Byte Offset 0x178
Byte offsets: 0x178 | 0x179 | 0x17A | 0x17B | 0x17C | 0x17D | 0x17E | 0x17F
Fields, in order: Reserved for firmware locks | Reserved | Expropriation Correlation Number Field | Expropriation Flags -- See
.
Note: The Expropriation Flags and Expropriation
Correlation Number Fields are volatile with respect to Expropriation
Notifications; thus they should be saved by the OS before executing any
instruction that may access unbolted pages.
Expropriation Flags at VPA Byte Offset 0x17D
Bit Number | Meaning
0-6 | Reserved (0)
7 | 0 = The Subvened page data is/will be zero; 1 = The Subvened page data will be restored.
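As an illustration of how an OS might consume these two fields, the sketch below copies them into a local snapshot and interprets them. It is a hedged example: struct esn_snapshot and both helper functions are hypothetical names, and the treatment of a zero correlation number follows the requirement that the platform reports 0x0000 when it has already associated a physical page before returning control.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 7 in the table's numbering is the least significant bit of the byte. */
#define ESN_FLAG_DATA_RESTORED 0x01u
static_assert(ESN_FLAG_DATA_RESTORED == (1u << (7 - 7)), "bit 7 is the LSB");

/* Hypothetical OS-side copy of the volatile VPA fields, taken before
 * any instruction that may access an unbolted page. */
struct esn_snapshot {
    uint16_t correlation;   /* Expropriation Correlation Number */
    uint8_t  flags;         /* Expropriation Flags byte (VPA offset 0x17D) */
};

/* True when the subvened page data is/will be zero (bit 7 = 0).
 * Reserved bits 0-6 are masked off rather than inspected. */
static bool esn_page_will_be_zero(struct esn_snapshot s)
{
    return (s.flags & ESN_FLAG_DATA_RESTORED) == 0;
}

/* True when the OS should wait for a subvention event carrying the
 * same correlation number before retrying the access. */
static bool esn_must_wait(struct esn_snapshot s)
{
    return s.correlation != 0;
}
```

Taking the snapshot by value keeps later logic independent of further expropriation notifications that may overwrite the VPA fields.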
Expropriation InterruptWhen the platform is running with real memory over-commitment,
eventually a partition virtual processor will access a stolen page. The
transparent solution is to block the virtual processor until the platform
has restored the page. By enabling the Expropriation Interrupt via the
Expropriation Interrupt Enable field of the VPA (see
) the cooperating OS indicates
that it is prepared to make use of its virtual processors for other
purposes during the page restoration and/or restore the contents of
“expendable” and unmodified “latent”
pages.R1--1.For the ESN option: When the partition accesses an
expropriated page and either the page was bolted (PTE bit 59=1) or the
Expropriation Interrupt Enable bit of the affected virtual
processor’s VPA is off see
, then the platform must
recover the page transparently without an Expropriation Interrupt.R1--2.For the ESN option:
When the partition accesses an expropriated page and the summation of the
partition’s in use subvention event queue entries plus outstanding
subvention events is equal to or greater than the size of the
partition’s subvention event queue, the platform must recover the
page prior to issuing any associated Expropriation Interrupt.Note: Requirement
prevents the overflow of the
subvention queue.R1--3.For the ESN option: When the partition accesses an
expropriated “Unused” or “Expendable” page, the
platform must, unless prevented by
, set bit 7 of the affected
processor’s Expropriation Flags VPA byte (see
) to 0b0; else the platform
must set the bit to 0b1.R1--4.For the ESN option: When the partition accesses an
expropriated page and the platform associates a physical page with the
logical page prior to returning control to the affected virtual
processor, the platform must, unless prevented by
, set the Expropriation
Correlation Number field of the affected virtual processor’s VPA to
0x0000 (see
).R1--5.For the ESN option: When the partition accesses an
expropriated page, the platform is not prevented by
, does not associate a physical
page with the logical page prior to returning control to the affected
virtual processor, and the restoration of the logical page has NOT
previously been reported to the OS with an expropriation notification,
the platform must set the Expropriation Correlation Number field of the
VPA to a non-zero unique value for all outstanding recovering pages for
the affected partition.R1--6.For the ESN option: When the partition accesses an
expropriated page, the platform is not prevented by
, does not associate a physical
page with the logical page prior to returning control to the affected
virtual processor, and the restoration of the logical page has previously
been reported to the OS with an expropriation notification, the platform
must set the Expropriation Correlation Number field of the VPA to the
same value as was supplied with the previous expropriation notification
event associated with the outstanding recovering page for the affected
partition.R1--7.For the ESN option: When the partition performs an
instruction fetch from an expropriated page, the platform must, unless
prevented by
, signal the affected virtual
processor with an Expropriation Interrupt by returning to the affected
virtual processor’s interrupt vector at location 0x0400 with the
processor’s MSR, SRR0 and SRR1 registers set as if the instruction
fetch had experienced a translation fault type of Instruction Storage
Interrupt except that SRR1 bit 46 (Trap) is set to a one.R1--8.For the ESN option: When the partition performs a
load or store instruction that accesses an expropriated page, the
platform must, unless prevented by
, signal the affected virtual
processor with an Expropriation Interrupt by returning to the affected
virtual processor’s interrupt vector at location 0x0300 with the
processor’s MSR, DSISR, DAR, SRR0 and SRR1 registers set as if the
storage access had experienced a translation fault type of Data Storage
Interrupt except that SRR1 bit 46 (Trap) is set to a one.ESN Subvention Event NotificationESN uses an event queue within the Subvention Notification
Structure (SNS) to notify the OS of page subvention operations.
Subvention events have a two byte SNS-EQ entry which has the value of the
expropriation correlation number from the associated expropriation
notification event.R1--1.For the ESN option: The platform must implement the
structures, syntax and semantics described in
,
, and
.SNS Memory AreaR1--1.For the ESN option: The platform
must support the 4K byte aligned SNS not spanning its page boundary
defined by
.
Subvention Notification Structure
Access | Offset | Usage
Written by OS, Read by Hypervisor | 0x00 | Notification Control: Bit 0 = Notification Trigger; Bits 1-7 = Reserved
Written by Hypervisor, Read by OS | 0x01 | Event Queue State: Bit 0: 0 = Operational, 1 = Overflow; Bits 1-7 = Reserved
Set to non-zero by Hypervisor, Read and cleared to zero by OS | 0x02-0x03 | First SNS EQ Entry
 | ... | ...
 | (SNS Length - 2) - (SNS Length - 1) | Last SNS EQ Entry
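The structure above, combined with the posting sequence given under SNS Event Processing, can be modelled in a few lines of C. This is a sketch under stated assumptions, not a platform implementation: struct sns_model, its field names, and the functions sns_post and sns_take are hypothetical; the event queue is modelled as a separate array rather than mapped onto the registered page; and C11 atomics stand in for the platform's atomic update protocol.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one registered SNS. */
struct sns_model {
    bool trigger;           /* Notification Control bit 0 (written by OS) */
    bool overflow;          /* Event Queue State bit 0 (written by platform) */
    _Atomic uint16_t *eq;   /* two-byte EQ entries; zero marks an unused entry */
    size_t entries;         /* number of EQ entries, (SNS Length - 2) / 2 */
    size_t fill;            /* platform private: next EQ entry to fill */
    bool toggle;            /* platform private: SNS interrupt toggle */
    bool irq_enabled;       /* platform private: virtual interrupt enabled */
};

/* Platform side: post one subvention event.
 * Returns true when a virtual interrupt would be signalled. */
static bool sns_post(struct sns_model *s, uint16_t correlation)
{
    assert(correlation != 0);            /* zero is reserved for "unused" */
    if (s->overflow)
        return false;                    /* overflow drops all new events */

    uint16_t expected = 0;
    if (!atomic_compare_exchange_strong(&s->eq[s->fill], &expected, correlation)) {
        s->overflow = true;              /* slot still in use: drop the event */
        return false;
    }
    s->fill = (s->fill + 1) % s->entries;    /* adjust fill pointer */

    if (s->toggle == s->trigger)
        return false;                    /* no event queue transition */
    s->toggle = !s->toggle;              /* remember the transition */
    return s->irq_enabled;               /* signal only when enabled */
}

/* OS side: take the entry at *head and clear it to zero.
 * Returns 0 when the entry is unused (queue empty at head). */
static uint16_t sns_take(struct sns_model *s, size_t *head)
{
    uint16_t v = atomic_exchange(&s->eq[*head], 0);
    if (v != 0)
        *head = (*head + 1) % s->entries;
    return v;
}
```

The model follows the text literally: the toggle starts equal to the Notification Trigger at registration, so an interrupt is only signalled after the OS has inverted the trigger bit.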
SNS Registration (H_REG_SNS)Syntax:Semantics:If the Address parameter is -1 then deregister any previously
registered SNS for the partition, disable ESN functions and return
H_Success. (Care is required on the part of the OS not to create any
Restoration Paradox Failures prior to registering a new SNS. See
for details.)If the Shared Logical Resource option is implemented and the
Address parameter represents a shared logical resource that has been
rescinded, then return H_RESCINDED.If the Address parameter is not 4K aligned in the valid logical
address space of the caller, then return H_Parameter.If the Length parameter is less than 256 or the Address plus
Length spans the page boundary of the page containing the starting
logical address, then return H_P2.Register the SNS structure for the calling partition by saving
the partition specific information:Record the SNS starting addressRecord the SNS ending addressRecord the next EQ entry to fill address (SNS starting address
+2)Set the SNS interrupt toggle = SNS Notification TriggerSet the SNS Event Queue State to “Operational”Return:R3: H_SuccessR4: Value to be passed in the “unit address”
parameter of the H_VIO_SIGNAL hcall() to enable/disable the virtual
interrupt associated with the transition of the SNS from empty to
non-empty.R5: Interrupt source number associated with the SNS empty to
non-empty virtual interrupt.SNS Event ProcessingThe following sequence is used by the platform to post an SNS
event. The SNS-EQ used corresponds to the EEN event type. This sequence
refers to fields described in
.
If the SNS EQ overflow state is set, exit. /* An EQ overflow drops all new events until software recovers the EQ */
Using atomic update protocol, store the event identifier into the location indicated by the SNS next EQ entry to fill pointer if the original contents of the location were zero; else set the associated EQ overflow state and exit. /* The value of zero is reserved for an unused entry -- an EQ overflow drops the new event */
Increment the SNS next EQ entry to fill pointer by the size of the EQ entry (2) modulo the size of the EQ. /* Adjust fill pointer */
If the SNS interrupt toggle = SNS Notification Trigger then exit. /* Exit on no event queue transition */
Invert the SNS interrupt toggle. /* Remember event queue transition */
If the SNS interrupt is enabled, signal a virtual interrupt to the associated partition. /* Signal transition when enabled */
ESN Interrupts
The ESN option may generate several interrupts to the partition OS.
Defined in this section are those in addition to the Expropriation
Notification interrupts defined above.Subvention Notification Queue Transition
InterruptR1--1.For the ESN option: When the
platform has restored the association of a physical page with the logical
page that caused an Expropriation Notification interrupt with a non-zero
Expropriation Correlation Number, the platform must post the
corresponding Expropriation Correlation Number to the Subvention Event
Queue see
.Restoration Paradox FailureRestoration Paradox Failures result in an unrecoverable memory
failure machine check.R1--1.For the ESN option: When the platform finds that
Expropriation Notification has been disabled after it has discarded the
contents of an “Expendable” page, it must treat any access to
such a page as an unrecoverable error restoring the association of a
physical page with the expropriated logical page.Virtual Partition Memory Pool Statistics Function
SetThe hcall-vpm-pstat function set may be configured via the
partition definition in none or any number of partitions as the VPM
administrative policy dictates.H_VPM_PSTATThis hcall() returns statistics on the physical shared memory pool.
Since these statistics can be manipulated by the processing of a single
partition, there is a risk of creating a covert channel through this
call. To mitigate this risk, the call is contained in a separate function
set that can be protected by authorization methods outside the scope of
LoPAR.Syntax:Parameters:Input: NoneOutput:R4: Total VM Pool Page FaultsR5: Total Page Fault Wait Time (Time Base Ticks)R7: Total Pool Physical MemoryR8: Total Pool Physical Memory that is I/O mappedR9: Total Logical Real Memory that is Virtualized by the VM
Pool
Semantics:
Verify that the calling partition has the authority to make the call
else return H_Authority.Report the statistics for the memory pool used to instantiate the
virtual real memory of the calling partition.Place in R4 the summation of the virtual partition memory page
faults against the memory pool since the initialization of the
pool.Place in R5 the summation of timebase ticks spent waiting for the
page faults indicated in R4.Place in R6 the total amount of physical memory in the memory
pool.Place in R7 the summation of the entitlement of all active
partitions served by the memory pool.Place in R8 the summation of the I/O mapped memory of all active
partitions served by the memory pool.Place in R9 the summation of the logical real memory of all
active partitions served by the memory pool.Return H_Success.Logical Partition Control ModesSelected logical partition control modes may be modified by the
client program.Secondary Page Table Entry Group (PTEG) SearchThe page table search algorithm, described by the
, consists of searching for a
Page Table Entry (PTE) in up to two PTEGs. The first PTEG searched is the
“primary PTEG”. If a PTE match does not occur in the primary
PTEG, the hardware may search the “secondary PTEG”. If a PTE
match is not found in the searched PTEGs, the hardware signals a
translation exception.Code is not required to place any PTEs in secondary PTEGs.
Therefore, if a PTE match does not occur in a primary PTEG there is no
need for the hardware to search a secondary PTEG to determine that a
search has failed. The “Secondary Page Table Entry Group” bit
of
“ibm,client-architecture-support” allows
code to indicate that there is no need to search secondary PTEGs to
determine that a PTE search has failed.
Memory Table Translation Option Exploitation
Starting with platforms built upon POWER processors supporting ISA level 3.0,
the platform supports the In-Memory Table Translation option. This option allows
the memory management unit to perform effective to physical address translation
based off of a single tree of in-memory translation tables, rooted by a single
physical memory address pointer. The option also supports two level radix tree
page tables as well as traditional POWER hash page tables. As initially configured,
partitions that use hash page tables run with legacy Segment Lookaside Buffers
(LPCR [UPRT] = FALSE). To fully exploit the In-Memory Table Translation option,
the hash page table client program registers a process table from its own memory
which sets (LPCR [UPRT] = TRUE). On the other hand, radix page table client programs
need to register a process table before they turn on address translation.
Each guest partition in the system may register, within the tree of in-memory
translation tables, its own table (process table) which controls translation of its
process effective addresses to guest virtual / real. The process table is then used
by the nest memory management unit for nest accelerator and CAPI attached device accesses,
and optionally for processor memory management unit translations. Additionally the
platform might support the client program to directly invalidate cached process
table translation data (when the client program modifies the in-memory table).
If the platform does not support the client program directly issuing process
table cache invalidate instructions, then the client program must use the set
of in-memory table cache invalidate hcall()s in sections
and .Note:
The CAS option vector processing associated with the In-Memory
Table Translation option (vector 5 byte 23) carries a special semantic.H_CLEAN_SLBThe Segment Lookaside Buffers (SLB) are a software managed
coherency cache of the per process segment table. The client may
directly issue instructions to clear the SLB on the issuing processor;
however, clearing entries on other processors or the nest memory
management unit requires hypervisor assistance. The H_CLEAN_SLB hcall()
provides the client program with the means for clearing SLB
contents that might be stale. The platform provides through the flags parameter
options as to the scope of the entries that are cleaned, these include:Clean the nest MMU SLBs of all entries associated with a
specified caller process (esid parameter is set to zero).Clean all platform SLBs of a specific ESID for a specified caller
process (flags parameter C and B fields specify SLB Class and Size respectively).
Syntax:
Semantics:
If a reserved flags bit == TRUE return H_Parameter
If flags[62] == flags[63] return H_Parameter
If flags[63] and esid <> 0 return H_P3
Validate that the calling partition is not mounting a denial of service attack else return H_LongBusyOrder1mSec.
Perform the following sequence:
  ptesync
  If flags[62] then slbieg with RS = pid parameter || caller’s LPID, RB = esid || C || 0b0 || B || 0b0 || 0x000000
  If flags[63] then slbiag with RS = pid parameter || caller’s LPID
  eieio
  slbsync
  ptesync
Return H_SUCCESS.
H_INVALIDATE_PID
The H_INVALIDATE_PID hcall() invalidates any system translation lookaside buffer entries from the caller’s specified (pid parameter) process table entry.
Syntax:
Semantics:
If flags [0:61] <> 0 return H_Parameter
Validate that the calling partition is not mounting a denial of service attack else return H_LongBusyOrder1mSec.
RB = 0x400 /* Invalidation Selector (IS) = 01 (Invalidate matching PID.) */
If flags[62] then RB = RB + RB /* Invalidation Selector (IS) = 10 (Invalidate matching LPID.) */
Perform the following sequence:
  ptesync
  tlbie (RIC=2, PRS=1, R=flags[63]), RS=pid||caller’s_LPID, RB
  eieio
  tlbsync
  ptesync
Return H_SUCCESS.
H_REGISTER_PROCESS_TABLE
This hcall() is used by the client program to manage its virtual
address translation mode including registration of its process table. The
calling program needs to be prepared for the change in address translation
that is being requested, for instance, the calling program might choose to
be running with relocation off and with all other processors either spinning
with relocation off or in the stopped state.The caller might need to invoke the H_REGISTER_PROCESS_TABLE hcall()
multiple times for it to return with a return code of H_Success. Upon
receiving a return code of H_LongBusyOrder10mSec, the caller should attempt
to invoke H_REGISTER_PROCESS_TABLE in 10mSec with the same parameter values
used on the previous H_REGISTER_PROCESS_TABLE hcall(). Invoking
H_REGISTER_PROCESS_TABLE with different parameter values indicates
that the caller wants to transition to the parameter values of the most
recent H_REGISTER_PROCESS_TABLE call.The platform may implement a subset of the functions implied by the
flags parameter definition below. This subset is reported in the value of
byte 23 of the
“ibm,architecture-vec-5”
property of the
/chosen node. A
request for an unimplemented function results in an H_Parameter return code.Syntax: 0b11 then the following parameters shall */
/* be = 0 else */
uint64 base, /* Base address of the process table */
/* For flags 61 = 0 the VSID number of a one terabyte */
/* segment (right justified in the register) */
/* For flags 61 = 1 the 4K aligned guest real address */
uint64 page_size, /* For flags 61 = 0 Size of the pages within the table */
/* encoded as per the L||LP device tree encoding */
/* else = 0 */
uint64 table_size); /* Size of the process table in bytes */
/* Encoded as the integer */
/* (log2 (total table length in bytes)) – 12 */
                                   /* (table_size <= 24) */
Semantics:
Validate that no reserved flags parameter bits are TRUE and that the defined bits setting is supported else RETURN H_Parameter.
If “flags” indicate change to process table (flags[59] is TRUE) then:
  If “flags” indicate deregistration (flags[58] is FALSE) then set Partition_Table[calling-partition,word_2] to a platform dependent benign value; else, based upon the mode specified in “flags[61-62]”:
    Validate “base” parameter else RETURN H_P2
    Validate “page_size” parameter relative to platform support else RETURN H_P3
    Validate (0 <= “table_size” parameter <= 24) else RETURN H_P4
    Set Partition_Table[calling-partition,word_2] to the value specified by the “flags[61-62]”, “base”, “page_size”, and “table_size” parameters.
  Endif
Endif
If “flags” indicate HPT/SLB mode (flags[61-62] is 0b00) then set LPCR[UPRT] to FALSE else set LPCR[UPRT] to TRUE
Input parameters:
The “flags” parameter communicates the desired operation. The “base”
parameter specifies the VSID number of a one terabyte segment (right
justified in the register). The “page_size” parameter specifies the size
of the pages within the table encoded as per the L||LP encoding used by the
HPT hcalls that is presented in the page size info in the device tree. The
“table_size” parameter specifies the total size of the process table encoded
as the integer (log2 (total table length in bytes)) – 12 (table_size <= 24).Partition Energy Management Option (PEM)This section describes the functional interfaces that are available
to assist the partition Operating System in optimizing trade-offs between energy
consumption and performance requirements.Long Term Processor CedeTo enable the hypervisor to effectively reduce the power draw from
unused partition processors, the concept of cede wakeup latency is
introduced with the Partition Energy Management Option. A one byte cede
latency specifier VPA field communicates the maximum latency class that
the OS can tolerate on wakeup from H_CEDE. In general the longer the
wakeup latency the greater the savings that can be made in power drawn by
the processor during a cede operation. However, due to implementation
restrictions, the platform might be unable to take full advantage of the
latency that the OS can tolerate thus the cede latency specifier is
considered a hint to the platform rather than a command. The platform may
not exceed the latency state specified by the OS. Calling H_CEDE
, with value of the cede
latency specifier set to zero denotes classic H_CEDE behavior. Calling
H_CEDE with the value of the cede latency specifier set greater than zero
allows the processor timer facility to be disabled (so as not to cause
gratuitous wake-ups - the use of H_PROD, or other external
interrupt is required to wake the processor in this case). An External
interrupt might not wake the ceded processor at some of the higher (above
the value 1) cede latency specifier settings. Platforms that implement
cede latency specifier settings greater than the value of 1 implement the
cede latency settings system parameter see
. The hypervisor is then free to take energy
management actions with this hint in mind.R1--1.For the PEM option: The platform must honor the OS
set cede latency specifier value per the definition of
.R1--2.For the PEM option: The platform must map any OS set
cede latency specifier value into one of its implemented values that does
not exceed the latency class set by the OS.R1--3.For the PEM option: The platform must implement the
cede latency specifier values of 0 and 1 per
.R1--4.For the PEM option: If the platform implements cede
latency specifier values greater than 1 it must implement the cede
latency settings values sequentially without holes.R1--5.For the PEM option: If the platform implements cede
latency specifier values greater than 1 each sequential cede latency
settings value must represent a cede wake up latency not less than its
predecessor, and no less restrictive than its predecessor.R1--6.For the PEM option: If the platform implements cede
latency specifier values greater than 1 it must implement the cede
latency settings system parameter see
.H_GET_EM_PARMSThis hcall() returns the partition’s energy management
parameters. The return parameters are packed into registers.

Programming Note: On platforms that implement the partition migration
option, after partition migration:
  The support for this hcall() might change; the caller should be
  prepared to receive an H_Function return code indicating the platform
  does not implement this hcall().
  Fields that were defined as “reserved” might contain data; calling code
  should be tested to ensure that it ignores fields defined as “reserved”
  at the time of its design, and that it operates properly when
  encountering “zeroed” defined fields that indicate that the field does
  not contain useful data.

Implementation Note: To aid the testing of calling code, implementations
would do well to include debug tools that seed reserved return fields with
random data.

Syntax:

Parameters: (on return)

Status Codes (bit offset within 2 byte field):
  Bits 0:5 Reserved (zero)
  Bits 6:8 Energy Management major code:
    0b000: Non-floor modes:
      Bits 9:15 Energy Management minor code:
        0x00: The energy management policy for this aggregation level is
              not specified.
        0x01: Maximum Performance (Energy Management enabled -
              performance may exceed nominal)
        0x02: Nominal Performance (Energy Management Disabled)
        0x03: Static Power Saving Mode
        0x04: Deterministic Performance (Energy Management enabled -
              consistent performance on a given workload independent of
              environmental factors and component variances)
        0x05 - 0x7F: Reserved
    0b001: Dynamic Power Management:
      Bits 9:15 Performance floor as a percentage of nominal (0% - 100%).
    0b010:0b111 Reserved
  Implementation Note: Status Code Fields are determined by means outside
  the scope of LoPAR. Platform designs may define a hierarchy of
  aggregations in which lower levels by default inherit the energy
  management policy of their parent.

Bytes 0:3 four byte Power Draw Status/Limit for the platform:
  Bit 0: Power Draw Limit is hard/soft: 0 = Soft, 1 = Hard
  Bits 1:7 Reserved.
  Bits 8:31 unsigned binary Power Draw Limit times 0.1 watts

The total processor energy consumed by the calling partition since boot
in Joules times 2**-16. The value zero indicates that the platform does
not support reporting this parameter.

The total memory energy consumed by the calling partition since boot in
Joules times 2**-16. The value zero indicates that the platform does not
support reporting this parameter.

The total I/O energy consumed by the calling partition since boot in
Joules times 2**-16. The value zero indicates that the platform does not
support reporting this parameter.

Semantics:
  Place the partition’s performance parameters for the calling virtual
  processor’s partition into the respective registers:
    R4: Energy Management Status Codes
    R5: Power Draw Limits (Platform and Group)
    R6: Power Draw Limits (Pool and Partition)
    R7: Partition Processor Energy Consumption
    R8: Partition Memory Energy Consumption
    R9: Partition I/O Energy Consumption
  Return H_Success.

R1--1. For the PEM option: The platform must implement the
H_GET_EM_PARMS hcall() following the syntax and semantics of
.

H_BEST_ENERGY

This hcall() returns a hint to the caller as to the probable impact
toward the goal of minimal platform energy consumption for a given level
of computing capacity that would result from releasing or activating
various computing resources. The returned value is a unitless priority:
the lower the returned value, the more likely the goal will be achieved.
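As a rough illustration of how a caller might consume these priorities, the following Python sketch decodes response entries and picks the one with the lowest priority value. It assumes the 8 byte response layout given later in this section (bytes 0-2 reserved, byte 3 priority, bytes 4-7 resource instance identifier) and that byte 0 is the most significant byte of the register. The function names are hypothetical and no hcall is issued.

```python
from typing import List, Tuple


def decode_response(entry: int) -> Tuple[int, int]:
    """Split one 8 byte response into (priority, resource instance id).

    Assumption: byte 0 is the most significant byte, so byte 3 occupies
    bits 39..32 of the integer and bytes 4-7 occupy the low 32 bits.
    Bytes 0-2 are reserved and are ignored here.
    """
    if not 0 <= entry < 1 << 64:
        raise ValueError("response entries are 64 bit values")
    priority = (entry >> 32) & 0xFF
    resource_id = entry & 0xFFFFFFFF
    return priority, resource_id


def best_candidate(entries: List[int]) -> Tuple[int, int]:
    """Return the (priority, resource id) pair with the lowest priority value."""
    return min(decode_response(e) for e in entries)
```

Ignoring the reserved bytes rather than checking them for zero keeps the sketch tolerant of platforms that later define those bytes.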
The accuracy of the returned hint is implementation dependent, and is
subject to change based upon actions of other partitions; thus the
implementation can only provide a “best effort” to be
“substantially correct”. Implementation dependent support for
this hcall() and supported resource codes might change during partition
suspension as in partition hibernation or migration; the client program
should be coded to gracefully handle H_Function, H_UNSUPPORTED, and
H_UNSUPPORTED_FLAG return codes.

H_BEST_ENERGY may be used in one of two modes,
“inquiry” or “ordered” specified by the setting
of bit 54 of the eflags parameter. It is intended that ordered mode be
used when the client program is largely indifferent to the specific
resource instance to be released or activated. In ordered mode,
H_BEST_ENERGY returns a list of resource instances in the order from the
best toward worst to choose to release/activate to achieve minimal energy
consumption starting with an initial resource instance in the ordered
list (if the specified initial resource is the reserved value zero the
returned list starts with the resource having the greatest probability of
minimizing energy consumption). It is intended that inquiry mode be used
when the client program wishes to compare the energy advantage of making
a resource selection from among a set of candidate resource instances. In
inquiry mode, H_BEST_ENERGY returns the unitless priority of
releasing/acquiring each of the specified resource instances. It is
expected that in the vast majority of cases, the client code will receive
data on a sufficient number of resource instances in one H_BEST_ENERGY
call to make its activate/release decision; however, in those rare cases
where more information is needed, a series of H_BEST_ENERGY calls can be
made to accumulate information on an arbitrary number of computing
resource instances.

Platforms may optionally support “buffered ordered”
return data mode. If the platform supports “buffered ordered”
return data mode, a “b” suffix appears at the end of the list
that terminates the hcall-best-energy-1 function set entry. If the
“buffered ordered” return data mode is supported the caller
may specify the “B” bit in the eflags parameter and supply in
P3 the logical address of a 4K byte aligned return buffer.

The probable effects of a given resource instance selection might
vary depending upon the intention of the client program to take other
actions. These other actions include the ability to reactivate a released
resource within a given time latency and number of resources the client
program intends to activate/release as a group. The eflags parameter to
H_BEST_ENERGY contains fields that convey hints to the platform of the
client program intentions in these areas; implementations might take
these hints into consideration as appropriate. The high order four (4)
bytes of the eflags parameter contain the unsigned required reactivation
latency in time base ticks (the reserved value of all zeros indicates an
unspecified reactivation latency).

Calling H_BEST_ENERGY with the eflags “refresh” flag
(bit 54) equal to a one causes the hypervisor to compute the relative
unitless priority value (1 being the best to activate/release with
increasing numbers being poorer choices from the perspective of potential
energy savings) for each instance of the specified resource that is owned
by the calling partition. If the hypervisor can not distinguish a
substantially different estimate for the various resource instances the
call returns H_Not_Available. If the “refresh” flag is equal
to a zero, the list as previously computed is used. Care should be
exercised when using the non-refresh version to ensure that the state of
the partition’s owned resource list has been initialized at some
point and has not changed due to resource instance activation/release
(including dynamic reconfiguration) activities by other partition threads
else the results of the H_BEST_ENERGY call are unpredictable (ranging
from inaccurate prediction values up to and including error code
responses).

The return values for H_BEST_ENERGY are passed in registers.
Following standard convention, the return code is in R3. Register R4
contains the response count. If the call is made in “inquiry”
mode the response count equals the number of non-zero requested resource
instance entries in the call. If the call is made in
“ordered” mode, the response count contains the number of
entries in the ordered list from the first entry returned until the worst
choice entry. If the response count is <= 8 (512 for ordered buffer
mode) then the response count also indicates how many resource instances
are being reported by this call; if the response count is > 8 (512 for
ordered buffer mode) then this call reports eight (512 for ordered buffer
mode) resource instances. Each response consists of three fields: bytes 0
-- 2 are reserved, byte 3 contains the unitless priority for selecting
the indicated resource instance, and bytes 4 -- 7 contain the resource
instance identifier value corresponding to that passed in the
“ibm,my-drc-index” property.

In order to represent more accurately the significance of certain
priority values relative to others, the platform might leave holes in the
ranges of reported priority values. As an example there may be a gap of
several priority numbers between the value associated with a resource
that can be powered down versus one that can only be placed in an
intermediate energy mode, and yet again another gap to a resource that
represents a necessary but not sufficient condition for reducing energy
consumption.

Syntax:

Parameters: (on entry)

(on return)
  R3: Return code
  R4: Response Count. Values <= 8 indicate the number of returned values
      in registers starting with R5. The contents of registers after the
      last returned value as indicated by the Response Count Value are
      undefined.
  R5 -- R12:
      Bytes 0-2: Reserved
      Byte 3: 1 - 255 -- unitless priority value relative to lowest total
              energy consumption for selecting the corresponding resource
              ID.
      Bytes 4-7: Resource instance ID to be used as input to dynamic
              reconfiguration RTAS calls as would the value presented in
              the “ibm,my-drc-index” property.

Semantics:
  If the resource code in the eflags parameter is not supported return
  H_UNSUPPORTED.
  If other binary eflags values are not valid then return
  H_UNSUPPORTED_FLAG with the specific value being (-256 - the bit
  position of the highest order unsupported bit that is a one).
  If the eflags parameter “refresh” bit is zero and the list has not been
  refreshed since the last return of H_Not_Available then return
  H_Not_Available.
  If the eflags parameter “refresh” bit is a one then:
    If energy estimates for the partition owned resources are
    substantially indistinguishable then return H_Not_Available.
    Assign a priority value to each resource of the type specified in the
    resource code owned by the calling partition relative to the probable
    effect that selecting the specified resource to activate/release (per
    eflags code) within the specified latency requirements would have on
    achieving minimal platform energy consumption. (1 being the best,
    increasing values being worse - implementations may choose to use an
    implementation dependent subset of the available values)
    Order the specified resources owned by the calling partition starting
    with those having a priority value of 1, setting the resource pointer
    to reference that starting resource.
  If the eflags parameter bit 54 is a one (“ordered”) then
    If P2 == 0 then set pointer to best resource in ordered list
    Else
      If P2 <> the drc-index of one of the resources in the ordered list
      then return H_P2
      Else set pointer to the resource corresponding to P2
    Set R4 to the number of resources in the ordered list from the pointer
    to the end
    If eflags “B” bit == 0b0 then /* this assumes that the ordered buffer
    option is supported */:
      If R4 > 8 set count to 8 else set count to R4
      Load “count” registers starting with R5 with the priority value and
      resource IDs of the “count” resource instances from the ordered list
      starting with the resource instance referenced by “pointer”.
    Else
      If P3 does not contain the 4K aligned logical address of a calling
      partition memory page then return H_P3
      If R4 > 512 set count to 512 else set count to R4
      Load “count” 8 byte memory fields starting with the logical address
      in P3 with the priority value and resource IDs of the “count”
      resource instances from the ordered list starting with the resource
      instance referenced by “pointer”.
    Return H_SUCCESS
  Else /* “inquiry” mode */
    Set R4 to zero
    For each input parameter P2 -- P9 or until the input parameter is zero
      If the input parameter Px <> the drc-index of one of the resources
      in the ordered list then return H_Px
      Fill in byte 3 of the register containing Px with the priority value
      of the resource instance corresponding to the drc-index (bytes 4 --
      7) of the register.
      Increment R4
    Return H_SUCCESS

R1--1. For the PEM option: The platform must implement the
H_BEST_ENERGY hcall() following the syntax and semantics of
.

Platform Facilities

This section documents the hypervisor interfaces to optional platform
facilities such as special purpose coprocessors.

H_RANDOM

If the platform supports a random number generator platform
facility the
“ibm,hypertas-functions” property of the
/rtas node contains the function set specification
“hcall-random” and the following hcall() is supported.

Syntax:

Co-Processor Facilities

If the platform supports a co-processor platform facility the
“ibm,hypertas-functions” property of the
/rtas node contains
the function set specification “hcall-cop” and the following
hcall()s are supported.

For asynchronous coprocessor operations the caller may either
specify an interrupt source number to signal at completion or the caller
may poll the completion code in the CSB. The hypervisor and caller need
to take into account the processor storage models with explicit memory
synchronization to ensure that the rest of the return data from the
operation is visible prior to setting the CSB completion code, and that
any operation data that might have been fetched prior to the setting of
the CSB completion code is discarded.

Note: The H_MIGRATE_DMA hcall() does not handle data pages subject
to co-processor access; it is the caller’s responsibility to make
sure that outstanding co-processor operations do not target pages that
are being migrated by H_MIGRATE_DMA.

H_COP_OP

The architectural intent of this hcall() is to initiate a
co-processor operation. Co-processor operations may complete with either
synchronous or asynchronous notification. In synchronous notification,
all platform resources associated with the operation are allocated and
released between the call to H_COP_OP and the subsequent return. In
asynchronous notification, operation associated platform resources may
remain allocated after the return from H_COP_OP, but are subsequently
recovered prior to setting the completion code in the CSB. For the
partition migration option no asynchronous notification operation may be
outstanding at the time the partition is suspended.

Syntax:

Flags:
  Reserved (bits 0 -- 38)
  “Rc” (bit 39): On Asymmetric Encryption operations the “Rc” bit
    indicates that the high order 16 bits of the “in” parameter contain
    the “Rc” field specifying the encoded operand length while the
    remainder of the “in” and “inlen” parameter bits are reserved and
    should be 0b0.
  Notification of Operation (bits 40 -- 41):
    00 Synchronous: In this mode the hypervisor synchronously waits for
       the coprocessor operation to complete. To preserve interrupt
       service times of the caller and quality of service for other
       callers, the length of synchronous operations is restricted (see
       inlen parameter).
    01 Reserved
    10 Asynchronous: In this mode the hypervisor starts the coprocessor
       operation and returns to the caller. The caller may poll for
       operation completion in the CSB.
    11 Async Notify: In this mode the hypervisor starts the coprocessor
       operation as with the Asynchronous notification above; however,
       the operation is flagged to generate a completion interrupt to the
       interrupt source number given in the “ibm,copint” property. When
       the interrupt is signaled the caller may check the operation
       completion status in the CSB.
  Interrupt descriptor index for Async Notify (bits 42 -- 55)
  FC field (bits 56 -- 63): The FC field is the co-processor name
    specific function code.

Resource identifier (bits 32 -- 63, as from the “ibm,resource-id”
property)

in/inlen and out/outlen parameters:
  If the *len parameter is non-negative, the respective in/out
  parameter is the logical real address of the start of the respective
  buffer. The starting address plus the associated length may not extend
  beyond the bounds of a 4K page owned by the calling partition. For
  synchronous notification operations, the parameter values may not exceed
  an implementation specified maximum; in some cases these are communicated
  by the values of the “ibm,max-sync-cop” property of the device
  tree node that represents the co-processor to the partition.
  If the *len parameter is negative, the respective in/out
  parameter is the logical real address of the start of a scatter/gather
  list that describes the buffer with a length equal to the absolute value
  of the *len parameter. The starting address of the scatter/gather list
  plus the associated length may not extend beyond the bounds of a 4K page
  owned by the calling partition. Further, the scatter/gather list shall be
  a multiple of 16 bytes in length not to exceed the value of the
  “ibm,max-sg-len” property of the device tree
  node that represents the coprocessor to the partition. Each 16 byte entry
  in the scatter/gather list consists of an 8 byte logical real address of
  the start of the respective buffer segment followed by an 8 byte length
  of that segment. The starting address plus the
  associated length may not extend beyond the bounds of a 4K page owned by
  the calling partition. For synchronous notification operations, the
  summation of the buffer segment lengths for the in scatter/gather list
  may be limited; in some cases these limitations are communicated by the
  value of the “ibm,max-sync-cop” property of the device
  tree node that represents the coprocessor to the partition.

csbcpb: logical real address of the 4K naturally aligned memory
block used to house the co-processor status block and FC field dependent
co-processor parameter block.

Output parameters on return:
  R3 contains the standard hcall() return code. If the return code
  is H_Success then the contents of the 4K naturally aligned page specified
  by the csbcpb parameter are filled from the hypervisor csb and cpb with
  addresses converted from real to calling partition logical real.

Semantics:
  The hypervisor checks that the resource identifier parameter is valid
  for the calling partition else returns H_RH_PARM.
  The hypervisor checks that for the coprocessor type specified by the
  validated resource identifier parameter there are no non-zero reserved
  bits within the function expansion field of the flags parameter else
  returns H_UNSUPPORTED_FLAG for the highest order non-zero unsupported
  flag.
  If the operation notification is asynchronous, check that there are
  sufficient resources to initiate and track the operation else return
  H_Resource.
  The hypervisor checks that the flag parameter notification field is not
  a reserved value and FC field is valid for the specified coprocessor
  type else returns H_ST_PARM.
  If the notification field is “synchronous” the hypervisor checks that
  the FC field is valid for synchronous operations else return H_OP_MODE.
  The hypervisor builds the CRB CCW field per the coprocessor type
  specified by the validated resource identifier parameter and by copying
  the coprocessor type defined number of FC field bits from the low order
  flags parameter FC field to the corresponding low order bits of CCW byte
  3.
  If the resource ID is an asymmetric encryption then (If the Flags
  Parameter “Rc” bit is on then check the high order 16 bits of the “in”
  parameter for a valid “Rc” encoding and transfer to the CRB starting at
  byte 16 else return H_P2) else validate the inlen/in parameters and
  build the source DDE:
    Verify that the “in” parameter represents a valid logical real address
    within the caller’s partition else return H_P2.
    If the “inlen” parameter is non-negative:
      Verify that the logical real address of (in + inlen) is a valid
      logical real address within the same 4K page as the “in” parameter
      else return H_P3.
      If the operation notification is synchronous verify that the
      combination of parameter values request a sufficiently short
      operation for synchronous operation else return H_TOO_BIG.
    If the “inlen” parameter is negative:
      Verify that the absolute value of inlen meets all of the following
      else return H_P3:
        Is <= the value of “ibm,max-sg-len”
        Is an even multiple of 16
        That in + the absolute value of inlen represents a valid logical
        real address within the same 4K caller partition page as the in
        parameter.
      Verify that each 16 byte scatter gather list entry meets all of the
      following else return H_SG_LIST:
        Verify that the first 8 bytes represents a valid logical real
        address within the caller’s partition.
        Verify that the logical real address represented by the sum of
        the first 8 bytes and the second 8 bytes is a valid logical real
        address within the same 4K byte page as the first 8 bytes.
      If the operation notification is synchronous verify that the sum of
      all the scatter gather length fields (second 8 bytes of each 16 byte
      entry) request a sufficiently short operation for synchronous
      operation else return H_TOO_BIG.
    For the Shared Logical Resource Option if any of the memory
    represented by the in/inlen parameters have been rescinded then return
    H_RESCINDED.
    Fill in the source DDE list from the converted in/inlen parameters.
  Validate the outlen/out parameters and build the target DDE:
    Verify that the “out” parameter represents a valid logical real
    address within the caller’s partition else return H_P4.
    If the “outlen” parameter is non-negative verify that the logical real
    address of (out + outlen) is a valid logical real address within the
    same 4K page as the “out” parameter and for symmetric cryptography
    operations that outlen >= inlen else return H_P5.
    If the “outlen” parameter is negative:
      Verify that the absolute value of outlen meets all of the following
      else return H_P5:
        Is <= the value of “ibm,max-sg-len”
        Is an even multiple of 16
        That out + the absolute value of outlen represents a valid logical
        real address within the same 4K caller partition page as the out
        parameter.
      Verify that each 16 byte scatter gather list entry meets all of the
      following else return H_SG_LIST:
        Verify that the first 8 bytes represents a valid logical real
        address within the caller’s partition.
        Verify that the logical real address represented by the sum of
        the first 8 bytes and the second 8 bytes is a valid logical real
        address within the same 4K page as the first 8 bytes.
        Accumulate the sum of the second 8 bytes of each scatter gather
        list entry.
      Verify that for symmetric cryptography operations the accumulated
      sum of the second 8 bytes of each scatter gather list entry >= the
      input data length else return H_P5.
    For the Shared Logical Resource Option if any of the memory
    represented by the out/outlen parameters have been rescinded then
    return H_RESCINDED.
    Fill in the destination DDE list from the converted out/outlen
    parameters.
  If the operation notification is asynchronous then verify that the
  input and output buffers do not overlap else return H_OVERLAP (makes
  the operations transparently restartable).
  Check that the csbcpb parameter is page aligned within the calling
  address space of the calling partition else return H_P6.
  If the operation specifies a CPB and the specified CPB is invalid for
  the operation then return H_ST_PARM.
  Set the CRB CSB address field & C bit to indicate a valid CCB.
  If the operation notification is asynchronous notify, then:
    Check that the flags parameter interrupt index value is within the
    defined range for the validated rid and is not currently in use for
    another outstanding COP operation else return H_INTERRUPT.
    Set the CRB CM field to command a completion interrupt.
    Set the job id field in the Co-processor Completion Block to command
    the signaling via the interrupt source number contained in the
    interrupt specifier indicated by the interrupt index value.
  For the CMO option, if the number of entitlement granules pinned for
  this operation causes the partition memory entitlement to be exhausted
  then return H_NOT_ENOUGH_RESOURCES; else pin and record the entitlement
  granules used by this operation, and increment the partition consumed
  memory entitlement for the number of entitlement granules pinned for
  this operation.
  Set the completion code field in the passed (via csbcpb parameter) CSB
  to invalid (it is subsequently set to valid at the end of the operation
  just after the rest of the contents of the 4K naturally aligned page
  specified by the csbcpb parameter are filled).
  Issue icswx.
  If busy response to icswx, implementation dependent (may be null) retry
  after backoff based upon some usage equality/priority mechanisms else
  return H_Busy.
  If the operation notification is asynchronous then return H_Success.
  Wait for completion posting in CSB (CSB valid bit = 1).
  The contents of the 4K naturally aligned page specified by the csbcpb
  parameter are filled from the hypervisor csb and cpb with addresses
  converted from real to calling partition logical real.
  Return H_Success.

H_STOP_COP_OP

The architectural intent of this hcall() is to terminate a
previously initiated co-processor operation.

Syntax:

Semantics:
  Check the rid parameter for validity for the caller else return
  H_RH_PARM.
  If any reserved flags parameter bits are non zero then return
  H_Parameter.
  Check the csbcpb parameter for pointing within the caller’s partition
  and 4K aligned else return H_P3.
  For the shared logical resource option if the csbcpb parameter
  references a rescinded shared logical resource then return H_RESCINDED.
  If the csbcpb parameter is not associated with an outstanding
  coprocessor operation then return H_NOT_ACTIVE.
  Send a kill operation to the coprocessor handling the outstanding
  operation.
  Wait for the outstanding kill operation to complete.
  For the CMO option, unpin any entitlement granules still pinned for
  this operation and decrement the consumed partition memory entitlement
  for the number of entitlement granules pinned for this operation.
  Return H_Success.

Memory Usage Instrumentation Option (MUI)

The MUI Option enables the platform to generate statistics on
page reference affinity, age, access rate, and
reference history pattern. Client programs can
query and act upon this information in order to guide
decisions for improving memory utilization and placement in a system.The MUI Option consists of a number of distinct per page measures
as described in .
Memory Usage Instrumentation Measures
Name | Abbreviation | Description
Reference History Bit Array | HBA | Measures dormancy patterns: a list of bits each representing a time interval; if the bit is a one the page was referenced during the corresponding interval.
Page Table Entry Update Time | PUT | The timestamp of the last HBA interval during which the corresponding page experienced a TLB miss during a reference.
Access Count Array | ACA | The count of the corresponding page accesses used to compute page access rate.
Page Age Counter | PAC | A saturating count representing the length of time since the page statistics were reset.
Page Age Granule | PAG | The update cycle period of the Page Age Counter in seconds.
Affinity Log Array | ALA | A sampled list of affinity domains that have accessed the page.
Affinity Log Sample Period | ALSP | The period that each Affinity Log Array entry represents in seconds.
Client programs monitor and manage the MUI state through the extensions to the H_ENTER,
H_RETURN_PAGEINFO, H_MEMSTAT_CTRL,
H_RESET_MEMSTATS, and H_BULK_READ_HBA hcall()s. The data returned by the
H_RETURN_PAGEINFO and H_BULK_READ_HBA hcall()s generally reflect actual values. However,
should a logical page be faulted back into a partition when Active Memory Sharing (AMS) is in use, its
MUI state is cleared/set to a fixed value. MUI state is also subject to loss during events that move physical
memory such as dynamic reconfiguration, partition mobility, and fail-over. The flag bits for managing the MUI behavior of pages are detailed below.
MUI Option Flags in Page Frame Table Access flags Detailed Description
Bit(s) | Name | Encoding | Description
43:44 | Reference History Bit Array (HBA) | 0b00 | Disable HBA updates
      |                                   | 0b01 | Enable HBA updates, set PUT to previous time, but do not change current HBA content (as in adding an alias to a logical real page)
      |                                   | 0b10 | Enable HBA updates, set PUT to previous time, and set the HBA to the configured Initial Setting.
      |                                   | 0b11 | Reserved: return H_UNSUPPORTED_FLAG
45 | Affinity-Clear | | Resets Affinity Log Array when entering a PTE.
46 | Page-Age-Clear | | Resets Page Age Counter when entering a PTE.
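The flag encodings in the table above can be sketched in Python as follows. This is an illustrative sketch only: it assumes IBM-style bit numbering in which bit 0 is the most significant bit of a 64 bit flags value, and the names `ibm_bit` and `mui_flags` are hypothetical helpers, not part of any platform interface.

```python
# Two-bit HBA encodings for bits 43:44, taken from the table above.
HBA_DISABLE = 0b00      # disable HBA updates
HBA_ENABLE_KEEP = 0b01  # enable updates, keep current HBA content
HBA_ENABLE_INIT = 0b10  # enable updates, load the configured Initial Setting
HBA_RESERVED = 0b11     # reserved; the platform returns H_UNSUPPORTED_FLAG


def ibm_bit(n: int, width: int = 64) -> int:
    """Mask for bit n, assuming bit 0 is the most significant bit."""
    if not 0 <= n < width:
        raise ValueError("bit number out of range")
    return 1 << (width - 1 - n)


def mui_flags(hba_mode: int, affinity_clear: bool = False,
              page_age_clear: bool = False) -> int:
    """Compose only the MUI-related bits of a flags value."""
    if hba_mode not in (HBA_DISABLE, HBA_ENABLE_KEEP, HBA_ENABLE_INIT):
        raise ValueError("reserved or invalid HBA encoding")
    flags = hba_mode << (63 - 44)  # two-bit field occupying bits 43:44
    if affinity_clear:
        flags |= ibm_bit(45)       # Affinity-Clear
    if page_age_clear:
        flags |= ibm_bit(46)       # Page-Age-Clear
    return flags
```

Rejecting the reserved 0b11 encoding on the caller side avoids a round trip that could only end in H_UNSUPPORTED_FLAG.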
The parameters to Memory Usage Instrumentation hcall()s that specify a given logical page or page range
take the form of an index into the partition’s logical real memory space as if it were a set of 4K pages, the
logical page index being the logical real address of the starting byte of the 4K page, right shifted 12 bits.

It is expected that the MUI function will evolve over time, as will the syntax of the MUI hcall()s. The
following requirements ensure forward compatibility over this expected evolution.

R1--1. For the MUI option:
the platform must for all MUI hcall()s fill all reserved return parameter
registers with all zeros.

R1--2. For the MUI option:
to avoid future incompatibility, the caller of MUI hcall()s must ignore
the contents of all reserved return parameter registers.

R1--3. For the MUI option:
to avoid future incompatibility, the caller of MUI hcall()s must fill all
reserved input parameter fields with zeros.

H_MEMSTAT_CTRL

This hcall() configures Memory Usage Instrumentation, and returns the current configuration. Note that
supplying a value of all zeros returns the current configuration settings of the Memory Usage
Instrumentation facility.

Syntax:

Parameters:
  Input:
    R4: flags
  Output:
    R3: Return code
    R4: Config
    R5::R10: Reserved
Encoded values for Reference History Bit Array (HBA) (flags/config)
Bit 0 | Bit 1 | Comment
0 | 0 | No operation. Do not change the HBA configuration or update cycle time, simply return current settings.
0 | 1 | Disable HBA updates
1 | 0 | Enable an HBA update on the first TLB miss per HBA update cycle, and set the HBA per the configured Initial Setting. (Note: If the platform does not support this setting, H_MEMSTAT_CTRL returns H_RT_PARM, and the value of the HBA field in the returned memory instrumentation configuration “out” parameter is 0b00.)
1 | 1 | Enable an HBA update on the first access* per HBA update cycle and force a TLB miss per HBA update cycle, and set the HBA per the configured Initial Setting. (Note: If the platform does not support this setting, H_MEMSTAT_CTRL returns H_RT_PARM, and the value of the HBA field in the returned memory instrumentation configuration “out” parameter is 0b00.)
Implementation Note:
This may be approximated by performing a TLBIA once per HBA update cycle,
thus forcing a TLB miss on the first subsequent page access.

The Initial Setting field (flags/config) is a 6 bit field that defines the number of high order HBA bits that
are preloaded to a 1 when the HBA is initialized (for instance when the page is assigned a new virtual
address through H_ENTER). This field allows the software to bias the page statistics so that the page will
not be chosen as a victim before it can establish its own usage statistics.

HBA Update Cycle Field (HUC) (flags/config) is a 6 bit field that defines the update cycle period in
microseconds multiplied by the power of two specified in the 6 bit field. The range of supported HUC
values is given in the “ibm,mui-ranges”
property. Note: If the platform does not support this setting, or
the supplied value, H_MEMSTAT_CTRL returns H_RT_PARM, and the value of the HUC field in the
returned memory instrumentation configuration “config” parameter is the reserved value 0b111111.
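To make the Initial Setting and HUC fields concrete, here is a small Python sketch. It rests on two assumptions that the text above does not settle: that the HUC period is read as one microsecond multiplied by two raised to the field value, and that the width of the history bit array is supplied by the caller. The helper names are hypothetical.

```python
HUC_UNSUPPORTED = 0b111111  # reserved value reported when a setting is rejected


def huc_period_us(huc: int) -> int:
    """Update cycle period in microseconds, read as 1 us * 2**huc (assumed)."""
    if not 0 <= huc <= 0b111111:
        raise ValueError("HUC is a 6 bit field")
    if huc == HUC_UNSUPPORTED:
        raise ValueError("reserved HUC value: setting not supported")
    return 1 << huc


def initial_hba(initial_setting: int, hba_bits: int) -> int:
    """HBA value with the `initial_setting` high order bits preloaded to 1.

    hba_bits is the width of the history bit array; it is a parameter
    here because this section does not state the width.
    """
    if not 0 <= initial_setting <= 0b111111:
        raise ValueError("Initial Setting is a 6 bit field")
    if initial_setting > hba_bits:
        raise ValueError("cannot preload more bits than the HBA holds")
    low_zero_bits = hba_bits - initial_setting
    return ((1 << initial_setting) - 1) << low_zero_bits
```

Under these assumptions a HUC value of 10 corresponds to a period of 1024 microseconds, and an Initial Setting of 3 on an 8 bit array yields 0b11100000.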
Encoded values for Access Count Array (ACA) (flags/config)
Bit 0 | Bit 1 | Comment
0 | 0 | No operation. Do not change, simply return current setting of ACA configuration.
0 | 1 | Disable ACA updates. Note: This setting will prevent the partition from seeing ACA data; however, the platform may still accumulate such data for other purposes.
1 | 0 | Enable ACA updates. Note: If the platform does not support this setting, H_MEMSTAT_CTRL returns H_RT_PARM and the value of the ACA field in the returned memory instrumentation configuration “config” parameter is 0b00. On enable the platform is not required to initialize the counters except to preclude a covert channel as in the case of page reassignment between partitions.
1 | 1 | Reserved. Note: If the caller supplies this value, H_MEMSTAT_CTRL returns H_RT_PARM and the value of the ACA field in the returned memory instrumentation configuration “config” parameter is 0b11.
Encoded values for Affinity Log Array (ALA) (flags/config)Bit 0Bit 1Comment00No operationDo not change, simply return current setting of ALA configuration01Disable ALA updatesNote: This setting will prevent the partition from seeing ALA data,
however, the platform may still accumulate such data for other purposes.10Enable ALA updatesNote: If the platform does not support this setting,
H_MEMSTAT_CTRL returns H_RT_PARM and the value of the ALA field in the
returned memory instrumentation configuration “config” parameter is 0b00.On enable the platform is not required to initialize the counters except to preclude a covert
channel as in the case of page reassignment between partitions.)11Reserved:Note: If the caller supplies this value, H_MEMSTAT_CTRL returns
H_RT_PARM and the value of the ALA field in the returned memory instrumentation
configuration “config” parameter is 0b11.
The Page Age Granule (PAG) and Affinity Log Sample Period (ALSP) fields (config only): are 8 bit fields
that define these periods in seconds. Note: On input these are reserved fields, any value other than all
zeros causes H_MEMSTAT_CTRL to return H_UNSUPPORTED_FLAG.
Semantics:
Returns H_UNSUPPORTED_FLAG if a “flags” parameter reserved bit is non-zero (contents of
R4 are undefined).
Returns H_RT_PARM if a defined field of the “flags” parameter is either not supported or invalid
along with the current memory usage instrumentation configuration settings in register R4.
Otherwise sets requested memory usage instrumentation configuration, returns H_Success along
with the current memory usage instrumentation configuration settings in register R4.
H_RESET_MEMSTATS
Resets page age, affinity log, and/or PUT/HBA for up to 6 logical real pages as specified by the logical
page index parameters.
Syntax:
Parameters:
Input:
flags:
bits 0::60: Reserved
bit 61: HBA (History Bit Array): Set Bit Array to configured Initial Setting.
bit 62: PAC (Page Age Counter)
bit 63: ALA (Affinity Log Array)
lpx1::lpx6: Logical page index(s) to be used
Output:
R3: Return code
R4::R10: Reserved
Semantics:
If (Flags AND HBA) and Reference History Bit Array not enabled, returns H_Not_Available.
If (Flags AND ALA) and the Affinity Log Array not enabled, returns H_Not_Available.
If (Flags AND PAA) and reference rate array not enabled, returns H_Not_Available.
If (Flags AND ACA) and reference rate array not enabled, returns H_Not_Available.
for each of lpx1..6
    Exits for each if lpx is FFFF_FFFF_FFFF_FFFF.
    If the lpx is not owned by the calling partition, return H_P(2..7)
    If AMS mode and page is paged out move on as stats are reset on a page in
    If (Flags AND ALA)
        Reset the affinity log array for the lpx
    If (Flags AND Page-Age)
        Clear the page age counter for the lpx
    If (Flags AND HBA)
        Sets the configured Number of Ref Bits in the Reference History Bit Array for the lpx
Return H_Success.
H_RETURN_PAGEINFO
The H_RETURN_PAGEINFO hcall() returns the page usage information for a range of logical pages or a
list of specific logical pages. The results are returned in a 4K aligned buffer, the data for each logical page
occupying one 32 byte record; all other contents of the buffer are volatile and undefined.
The range of logical pages is restricted to a single LMB boundary (the list of specific logical pages does not
have this restriction). If this restriction is violated, the H_RETURN_PAGEINFO hcall() returns H_P4.
The RETURN of a page range might take longer than is allowable for a single call, or it might select more
pages than can be contained in the return buffer. Should either of these cases happen the
H_RETURN_PAGEINFO hcall() returns H_CONTINUE, along with the current results, and the value of
the lpx0 parameter for continuing the RETURN on a subsequent call from the termination point of the prior
call.
The caller may specify a filter to further qualify the pages for which data is returned. For the CMO option
if the logical page has been paged out by the platform, the page is filtered out of the returned data.
Syntax:
Parameters:
Input:
flags:
H_MEMSTAT_CTRL Flag Layout
Word 0 (bytes 0-3): bits 0-1 Ar; bits 2-31 Accesses / Second
Word 1 (bytes 0-3): bits 0-30 RESERVED; bit 31 R
H_MEMSTAT_CTRL Flag Sub-field Layout
Word 0, Byte 0, Bits 0-1: Ar (Access Rate Filter Control)
    0b00 Do not filter based upon Access Rate (ignore flags bits 2-31)
    0b01 Reserved (Returns H_Parameter if this condition value is set)
    0b10 Select if page access per second is greater than the value in bits 2::31. Note: if the
    specified value is greater than the MUI facilities’ maximum access rate capacity per the
    “ibm,mui-ranges”
    property, this condition will not be met.
    0b11 Reserved (Returns H_Parameter if this condition value is set)
Word 0, Byte 0, Bits 2-7 and Bytes 1-3, all bits: Accesses per second
Word 1, Bytes 0-2, all bits, and Byte 3, Bits 0-6: Reserved (Returns H_Parameter if this condition value is set)
Word 1, Byte 3, Bit 7: R
    0: Returns page usage data for up to 5 pages specified one each in parameters
    lpn0::lpn4
    1: Returns page usage data for the range from the logical page in parameter lpn0
    through lpn1
buffer: 4kB-aligned output buffer
Implementation Note:
This buffer is output only and may be initialized with dcbz
instructions to preclude memory error handling.
lpx0: The logical page number index for the first page.
lpx1: For the page Range option, the last logical page number index. For the list option,
if the value is not the list terminator value (-1), continue to return page usage data for the
logical page number specified, else terminate the call.
lpx1::lpx4: For the list option, if the value is not the list terminator value (-1), continue to return page
usage data for the logical page number specified, else terminate the call.
Output:
R3: Return Code
R4: Number of matching entry records in destination buffer
R5: For the range RETURN option, if the return code is H_CONTINUE this value is
the value of the lpn0 parameter for a subsequent call to continue the RETURN.
R6::R10: Reserved
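The continuation behaviour described for the range option can be sketched from the caller's side as follows. This is only an illustration in Python: `fake_return_pageinfo` is a hypothetical stand-in for the hcall (it assumes every page matches the filter and that the buffer holds 128 records), `collect_range` is an invented helper name, and the numeric value used for H_CONTINUE is an assumption rather than something stated in this section.

```python
H_SUCCESS = 0
H_CONTINUE = 18  # assumed numeric value, for illustration only


def fake_return_pageinfo(lpx0, lpx1, capacity=128):
    """Hypothetical model of the range case: every page is assumed to match.

    Returns (return code, R4 = record count, R5 = resume lpx).
    """
    count = min(capacity, lpx1 - lpx0 + 1)
    resume = lpx0 + count
    rc = H_SUCCESS if resume > lpx1 else H_CONTINUE
    return rc, count, resume


def collect_range(return_pageinfo, first_lpx, last_lpx, max_calls=1000):
    """Call repeatedly, restarting from R5 while H_CONTINUE is returned."""
    total = 0
    lpx0 = first_lpx
    for _ in range(max_calls):
        rc, r4, r5 = return_pageinfo(lpx0, last_lpx)
        total += r4  # a real caller would consume r4 records from the buffer here
        if rc != H_CONTINUE:
            return rc, total
        lpx0 = r5  # continue from the termination point of the prior call
    raise RuntimeError("range did not complete within max_calls")


if __name__ == "__main__":
    print(collect_range(fake_return_pageinfo, 0, 299))
```

A real caller would have to process or copy the buffer contents between calls, since each call reuses the same 4K buffer.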
Buffer Record Format
Each entry gives the field name, its byte offset within the record, and a description.
lpx (bytes 0::7): The logical page number index of the selected page
access_rate (bytes 8::15): Accesses per second
affinity_log (bytes 16::23): Eight single byte integers. One integer for each of the past
eight ALSP intervals. The non-zero integer value
representing one of the masters that referenced the page
during the period. A zero value indicating that valid data is
not available for the represented period. The associativity list
for the referencing master is found in the
“ibm,muiassociativity-mapping”
property.
Reserved (bytes 24::29)
flags (byte 30):
    Bit 0 is a 1 if the page is AMS paged out (CMO option) else 0
    Bit 1 is a 1 if the access_rate value is valid else 0
    Bit 2 is a 1 if the affinity_log value is valid else 0
    Bit 3 is a 1 if the age value is valid else 0
    Bits 4::7 Reserved
age (byte 31): Page age in units of the Page Age Granule
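As an illustration of the record layout above, the following Python sketch unpacks 32 byte records from a returned buffer. It assumes big-endian byte ordering and MSB 0 bit numbering for the flags byte (so Bit 0 is mask 0x80); neither is stated for this buffer in the text above. The names `PageInfo` and `decode_records` are invented for the example.

```python
import struct
from collections import namedtuple

RECORD_SIZE = 32
# lpx (8 bytes), access_rate (8), affinity_log (8), 6 reserved bytes, flags (1), age (1).
# Big-endian layout is an assumption of this sketch.
_RECORD = struct.Struct(">QQ8s6xBB")

PageInfo = namedtuple(
    "PageInfo",
    "lpx access_rate affinity_log paged_out "
    "access_rate_valid affinity_log_valid age_valid age",
)


def decode_records(buffer, count):
    """Decode `count` records (the value returned in R4) from `buffer`."""
    records = []
    for i in range(count):
        lpx, rate, log, flags, age = _RECORD.unpack_from(buffer, i * RECORD_SIZE)
        records.append(
            PageInfo(
                lpx=lpx,
                access_rate=rate,
                affinity_log=tuple(log),          # eight single byte integers
                paged_out=bool(flags & 0x80),     # Bit 0 (MSB 0 numbering assumed)
                access_rate_valid=bool(flags & 0x40),   # Bit 1
                affinity_log_valid=bool(flags & 0x20),  # Bit 2
                age_valid=bool(flags & 0x10),     # Bit 3
                age=age,
            )
        )
    return records
```

Only the first R4 records are decoded, because all other contents of the buffer are volatile and undefined.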
Semantics:
If any reserved flags bits are set return H_Parameter.
Returns H_Function if function does not exist.
Returns H_P2 if buffer is not 4K aligned or if the logical page number is outside of the caller’s
range.
count = 0; /* number of entries matching in dest */
if (Range_Search = 1) // lpn0..lpn1 range case
{
    if (lpx0 is not a page owned by the calling partition) return H_P3;
    if (lpx1 is less than lpx0 or outside of the LMB containing lpx0) return H_P4;
    dest = buffer;
    for (mem = lpx0; (mem <= lpx1) && (count < 128); mem++)
    {
        if (hit (mem, filter)) /* returns true if the logical page mem matches the selection criteria */
        {                      /* per the definition of the filter parameter above */
            fillrecord (dest, mem); /* fills in a record per the Buffer Record Format table */
            count++;
            Increment dest to next record (+32 bytes);
        }
    }
    R4 = count; R5 = mem;
    if (mem > lpx1) then return H_Success else return H_CONTINUE;
}
if (Range_Search = 0) // memory buffer case
{
    dest = buffer; count = 0;
    for each of lpx0..4
        Exits for each if the LPX is FFFF_FFFF_FFFF_FFFF.
        if (the LPX is not a page owned by the calling partition) return H_P(3..7);
        if (hit (the LPX, filter)) /* returns true if the logical page LPX matches the selection criteria */
        {                          /* per the definition of the filter parameter above */
            fillrecord (dest, the LPX); /* fills in a record per the Buffer Record Format table */
            count++;
            Increment dest to next record (+32 bytes);
        }
    R4 = count; return H_Success;
}
H_BULK_READ_HBA
H_BULK_READ_HBA returns the Reference History Bit Array entry for multiple pages in one call. The
returned values present the reference history for up to 64 HUC intervals. Starting with the high order bit
representing the most recent HUC interval, the returned value contains a one in the corresponding bit
position if the page was accessed during that interval. Due to hardware limitations, the system might not
have data for a full 64 intervals; in that case low order bits are zero filled.
Syntax:
Parameters:
Input:
flags: bits 0::63: Reserved
lpx1::lpx6: The logical page number index to be used to index into appropriate position in
HBA. HBA value for a given LPX is returned in the same argument register.
Output:
R3: Return code
R4: Reserved
R5: HBA corresponding to lpx1
R6: HBA corresponding to lpx2
R7: HBA corresponding to lpx3
R8: HBA corresponding to lpx4
R9: HBA corresponding to lpx5
R10: HBA corresponding to lpx6
Semantics:
If Reference History is not enabled, returns H_Function.
If a reserved flags bit is set return appropriate H_UNSUPPORTED_FLAG value.
for each of lpx1..6
    Exits for each if the LPX is FFFF_FFFF_FFFF_FFFF.
    If lpx1..6 not owned by the calling partition, return H_P2..7 as appropriate
    Load HBA corresponding to the LPX into the high-order bits of R5-R10
    (corresponding to lpx1..6), zero out remaining lower bits of the register
    and shift the register right (zero insert) by the difference between the current
    time and the page’s PUT timestamp.
// end of for each lpx1..6
Return H_Success.
Coherent Platform Facilities
This section documents the hypervisor interface to optional coherent
platform facilities. If the platform supports a coherent platform facility, the
“ibm,hypertas-functions”
property of the
/rtas
node contains the function set specification “hcall-ca” and the following
hcall()s are supported.
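A partition can test for the “hcall-ca” function set by scanning the string list held in the property. The Python sketch below assumes the property value is a sequence of NUL-terminated ASCII strings, which is the usual encoding for OF string lists; the function names and the sample bytes are made up for illustration.

```python
def parse_string_list(raw):
    """Split a property value into its NUL-terminated strings."""
    return [s.decode("ascii") for s in raw.split(b"\0") if s]


def supports_function_set(raw, name="hcall-ca"):
    """Return True if `name` appears in the function set list."""
    return name in parse_string_list(raw)


if __name__ == "__main__":
    # Illustrative property contents, not taken from a real platform.
    sample = b"hcall-pft\0hcall-term\0hcall-ca\0"
    print(supports_function_set(sample))
```

An exact string comparison is used rather than a substring search, so that a longer function set name containing “hcall-ca” would not produce a false match.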
H_ATTACH_CA_PROCESS
The architectural intent of this hcall is to attach a process element to a
coherent platform function. The process element describes the environment in
which a coherent platform function will operate for a given workload.
Syntax:
Process Token Format
Bytes 0-3: Platform firmware use (opaque to OS)
Bytes 4-7: CAIA process element index (128 byte offset into Scheduled Process Area)
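The token layout can be illustrated with a short Python sketch that splits a 64 bit token into its two halves. It assumes the token is handled as a big-endian 64 bit value, so that bytes 0-3 are the upper 32 bits, and it reads "128 byte offset" as meaning the index counts 128 byte elements; both points are interpretations, and the helper names are invented.

```python
PROCESS_ELEMENT_SIZE = 128  # bytes per process element (interpretation of the table)


def split_process_token(token):
    """Return (firmware_part, element_index) for a 64 bit process token."""
    if not 0 <= token < (1 << 64):
        raise ValueError("process token must fit in 64 bits")
    firmware_part = token >> 32          # bytes 0-3, opaque to the OS
    element_index = token & 0xFFFFFFFF   # bytes 4-7
    return firmware_part, element_index


def element_byte_offset(token):
    """Byte offset of the process element, under the stated interpretation."""
    return split_process_token(token)[1] * PROCESS_ELEMENT_SIZE
```

The OS should treat the upper half as opaque and pass the whole token back unchanged on later hcalls.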
Parameters:uint64 unit-address: Unit Address per the device tree
“reg”
property, element 0, of the coherent platform functionuint64 process-element-struct: Logical real address of the process
element structure. This memory must remain pinned and unchanged throughout
the duration of the H_ATTACH_CA_PROCESS call. All fields in the structure
have big-endian byte ordering and MSB 0 bit ordering.uint64 continue-token: Used to continue a process attach if H_Busy is
returned. Set to zero on first call. If H_Busy is returned then call
again but use the value returned in R4 from the previous call as the
value of continue-token.Semantics:Verify that coherent platform facilities are licensed to be used,
else return H_Authority.Verify that the unit-address parameter is valid else return H_Parameter.Verify the process-element-struct parameter:
Verify that the process-element-struct is 8 byte aligned and
does not cross a 4096 byte boundary, else return H_Parameter.Verify that the Process Element structure version is supported,
else return H_Parameter.Verify that if the isPrivilegedProcess bit is set in the process element,
that the coherent platform function is allowed (via
“ibm,privileged-function”
OF property), else return H_Authority.
Verify that if the aurpValid bit is set to 1, that the coherent
platform function supports AUR (via
“ibm,supports-aur”
OF property), else return H_Parameter.Verify that if the csrpValid bit is set to 1, that the coherent platform
function supports CSRP (via
“ibm,supports-csrp”
OF property), else return H_Parameter.Verify there is adequate space to attach the process for the coherent
platform function else return H_Resource.Verify that the coherent platform function is in a state that allows
attaching of new processes and if necessary has been downloaded via
H_DOWNLOAD_CA_FUNCTION, else return H_State.Verify that the sum of the pslVirtualIsn and application
virtual ISN values are greater than or equal to the
“ibm,min-ints-per-process”
property for the coherent-platform-function and less than or equal to the
“ibm,max-ints-per-process”
property for the coherent-platform-function and the attaching of this
process does not violate the
“ibm,max-ints”
property for the coherent-platform-function, else return H_Parameter.Verify that the pslVirtualIsn is not already in use by another coherent
platform function under the coherent platform facility, if so return
H_Resource.Validate that the application virtual ISN values are valid and not
in use by another coherent platform function under the coherent platform
facility, and do not collide with the specified pslVirtualIsn, if so
return H_Resource.
Application virtual ISN values are calculated by adding the
base virtual ISN value found in the interrupt-ranges property of
the parent coherent platform facility node to the relative offset
(zero-based) of a bit in the bitmap that is set to 1. These
values are programmed into the corresponding CAIA process element
structure.Verify that all the virtual interrupts can be mapped into the CAIA
process element, else return H_Resource. It may be possible to attempt
to attach the process after detaching existing processes.Verify that the virtual interrupts provided will fit into the process
element entry, if not return H_Parameter.Disable the virtual interrupts provided in the process-element-struct.
The partition must use ibm,set-xive (with priority less than 0xFF) to
enable the virtual interrupt source after H_ATTACH_CA_PROCESS completes
successfully.Select a process element to use for the coherent platform function
and perform the procedure to attach a process as defined by the CAIA.
During this procedure, H_Busy or H_LongBusy will be returned if hcall
time limits are exceeded.Once the process element is attached as defined by the CAIA, return
H_Success, R4 contains the process token and if
“ibm,process-mmio”
is set to 1, R5 is the MMIO address, R6 is the MMIO length.Following a reset of the coherent platform facility or coherent platform
function, platform firmware guarantees that the upper 4 byte portion of
the returned process token will be different than it was for any process
token returned since the previous reset.
H_DETACH_CA_PROCESS
The architectural intent of this hcall is to detach a process element
from a coherent platform function. This hcall will remove the workload or
process element that was attached using H_ATTACH_CA_PROCESS.
Syntax:
Parameters:
uint64 unit-address: Unit Address per the device tree
“reg”
property of the coherent platform functionuint64 process-token: process identifier token for the attached
process returned in R4 on H_Success return from H_ATTACH_CA_PROCESS call.uint64 continue-token: Used to continue a process detach if H_Busy
is returned. Set to zero on first call. If H_Busy is returned then
call again but use the value returned in R4 from the previous call
as the value of continue-token.Semantics:Verify that coherent platform facilities are licensed to be used,
else return H_Authority.Verify that the unit-address parameter is valid else return
H_Parameter.Verify that the process-token is currently an attached process to
the coherent platform function, else return H_Parameter.Verify that the coherent platform function is in an error state that
allows detaching processes, else return H_State.If
“ibm,process-mmio”
is set to 1, verify that there are no existing
mappings in the page table for the process
MMIO space, else return H_Resource.If the process is not completed or suspended, the process is
terminated using the process terminate procedure in the CAIA. During
this process the platform can return H_Busy or H_LongBusy and the OS
is responsible for calling back until a non-busy return code is returned.Remove the process from the coherent platform function process
element list according to the process remove procedure defined in the
CAIA. During this process the platform can return H_Busy or H_LongBusy
and the OS is responsible for calling back until a non-busy return code
is returned.Invalidation of the SLB and TLB for the process being detached is
performed. During this process the platform can return H_Busy or
H_LongBusy and the OS is responsible for calling back until a non-busy
return code is returned.If the hardware encounters an error while detaching the process,
H_Hardware is returned.
H_Success is returned.
H_CONTROL_CA_FUNCTION
This H_CONTROL_CA_FUNCTION hypervisor call allows the partition to
manipulate or query certain coherent platform function behaviors.Syntax:Parameters:uint64 unit-address: Unit Address per the device tree
“reg”
property
of the coherent platform facilityuint64 operation: operation to perform to the coherent platform
facility. Valid values are:
Reset: operation = 1, perform a reset to the coherent platform
function, this is used when the partition needs to reset the coherent
platform function to a clean state. All attached processes and state
are cleared by firmware after this reset.Suspend Process: operation = 2, suspend a process from being
executedResume Process: operation = 3, resume a process to be executedRead Error State: operation = 4, read the error state of the
coherent platform functionGet Error Buffer: operation = 5, collect the AFU error buffer
for the coherent platform function.Get Function Configuration Record: operation = 6, collect
configuration record for the coherent platform functionGet Function Download Status: operation = 7, query to return
download status of a programmable coherent platform function.Terminate Process: operation = 8, terminate the process
before completionCollect VPD: operation = 9, collect VPD for the coherent
platform function.Get Function Error Interrupts: operation = 11, read the
function-wide error data based on an interrupt from
“ibm,function-error-interrupt”Acknowledge Function Error Interrupts: operation = 12,
acknowledge function-wide error data based on an interrupt from
“ibm,function-error-interrupt”Get Error Log: operation = 13, retrieve the Platform Log ID
(PLID) of an error log containing error data for the coherent
platform function. This is used after a Temporarily Unavailable or
Permanently Unavailable Error State.uint64 parameter1: parameter 1 for operations, meaning changes based on the operation.uint64 parameter2: parameter 2 for operations, meaning changes based on the operation.uint64 parameter3: parameter 3 for operations, meaning changes based on the operation.uint64 parameter4: parameter 4 for operations, meaning changes based on the operation.uint64 continue-token: Used to continue an operation if H_Busy is returned.
Set to zero on first call. If H_Busy is returned then call again but use the value
returned in R4 from the previous call as the value of continue-token.
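The continue-token protocol can be sketched as a small retry loop. In this Python illustration, `hcall` is a hypothetical callable standing in for the real hypervisor call and returning a (return code, R4) pair; the numeric return codes are assumptions for the example, and treating the long-busy codes the same way as H_Busy follows the Semantics rather than the parameter description.

```python
import time

H_SUCCESS = 0
H_BUSY = 1                  # assumed numeric value
H_LONG_BUSY_START = 9900    # assumed start of the long-busy range
H_LONG_BUSY_END = 9905      # assumed end of the long-busy range


def is_busy(rc):
    """True for H_Busy or any code in the assumed H_LongBusy range."""
    return rc == H_BUSY or H_LONG_BUSY_START <= rc <= H_LONG_BUSY_END


def call_with_continue(hcall, unit_address, operation,
                       params=(0, 0, 0, 0), max_calls=1000, delay=0.0):
    """Repeat the call until it stops reporting busy.

    The continue-token is zero on the first call; afterwards it is the
    value returned in R4 by the previous call.
    """
    token = 0
    for _ in range(max_calls):
        rc, r4 = hcall(unit_address, operation, *params, token)
        if not is_busy(rc):
            return rc, r4
        token = r4
        if delay:
            time.sleep(delay)  # optional back-off between retries
    raise TimeoutError("hcall still busy after max_calls attempts")
```

The `max_calls` bound is a design choice of the sketch, so that a misbehaving stub cannot loop forever; the specification itself places no such limit on the caller.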
Operation: Parameters
Reset: None
Suspend Process:
    Parameter1 = process-token as returned from
    H_ATTACH_CA_PROCESS when process was attached.
Resume Process:
    Parameter1 = process-token as returned from
    H_ATTACH_CA_PROCESS when process was attached.
Read Error State: None
Get Error Buffer:
    Parameter1 = byte offset into error buffer to retrieve,
    valid values are between 0 and (ibm,error-buffer-size – 1)
    Parameter2 = 4K aligned real address of error buffer,
    to be filled in
    Parameter3 = length of error buffer, valid values are
    4K or less
Get Function Configuration Record:
    Parameter1 = # of configuration record to retrieve,
    valid values are between 0 and (ibm,#config-records – 1)
    Parameter2 = byte offset into configuration record
    to retrieve, valid values are between 0 and
    (ibm,config-record-size – 1)
    Parameter3 = 4K aligned real address of configuration
    record buffer, to be filled in
    Parameter4 = length of configuration buffer, valid values are 4K or less
Get Function Download Status: None
Terminate Process:
    Parameter1 = process-token as returned from
    H_ATTACH_CA_PROCESS when process was attached.
Collect VPD:
    Parameter1 = # of VPD record to retrieve, valid
    values are between 0 and (ibm,#config-records – 1)
    Parameter2 = 4K naturally aligned real buffer
    containing scatter/gather list entries. All fields in the scatter/gather list have big-endian byte ordering.
    Parameter3 = number of entries in the scatter/gather
    list, valid values are between 0 and 256
Get Function Error Interrupts: None
Acknowledge Function Error Interrupts:
    Parameter1 = value to write to the function-wide
    error interrupt register
Get Error Log: None
Semantics:
Verify that coherent platform facilities are licensed to be used,
else return H_Authority.Verify that the unit-address parameter is valid else return H_Parameter.If operation is Reset:
If coherent platform function is in Temporarily Unavailable or
Permanently Unavailable error state or is already performing a reset,
return H_State.If partition is not allowed to perform a Reset
(“ibm,privileged-function”
property is 0 or not present), return H_Authority.If coherent platform function has
“ibm,process-mmio”
property set to 1 and partition has any page table mappings existing
for the function, return H_Resource.If coherent platform function is in Normal error state, set to
Disabled error state.Terminate and remove all process elements that were attached
via H_ATTACH_CA_PROCESS. If the termination takes longer than is
allowed for an hcall, R4 is set to the continue-token and H_Busy
or H_LongBusy are returned.If allowed, perform a reset (disable AFU, PSL suspend, PSL
purge, TLB invalidate, SLB invalidate) of the coherent platform
function using CAIA procedures. If the reset takes longer than
is allowed for an hcall, R4 is set to the continue-token and H_Busy
or H_LongBusy are returned.After the reset, if the coherent-platform-function has the
“ibm,programmable”
property set to 1, a download is required via H_DOWNLOAD_CA_FUNCTION.
The Get Function Download Status operation can be used to query
the download state.If the coherent-platform-function does not have the
“ibm,programmable”
property or it is set to 0, the AFU is enabled.If the reset fails while communicating with the hardware,
return H_Hardware.Reset the error log data for the Get Error Log operation.Set coherent platform function Error State to Normal and
return H_SuccessIf operation is Suspend Process:
If the coherent platform function is not in a Normal Error State,
return H_State.If the coherent platform function does not support suspending
processes, return H_Function.If the process associated with the process token cannot be found,
return H_Parameter.If the process is not able to be suspended or is already suspended,
return H_State.The process associated with the process-token is suspended via
the procedure defined in the CAIA. If the suspend takes longer
than is allowed for an hcall, R4 is set to the continue-token and
H_Busy or H_LongBusy are returned.If the Suspend Process procedure encounters a hardware failure,
return H_Hardware.Return H_Success.If operation is Resume Process:
If the coherent platform function is not in a Normal Error State,
return H_State.If the coherent platform function does not support resuming
processes, return H_Function.If the process associated with the process token cannot be found,
return H_Parameter.
If the process is not suspended or resume is not possible, return
H_State.The process associated with the process-token is resumed via
the procedure defined in the CAIA. If the resume takes longer than
is allowed for an hcall, R4 is set to the continue-token and H_Busy
or H_LongBusy are returned.If a hardware error occurs during the Resume Process operation,
return H_Hardware.Return H_Success.If operation is Read Error State:
Platform firmware checks the error state of the coherent platform
function. If already in an error state, H_Success is returned and
R4 contains the error state.Platform firmware checks for errors on the coherent platform
function. If errors exist, error recovery is entered and H_Success
is returned and R4 contains the error state.If operation is Get Error Buffer:
If parameter2 does not describe a valid 4K aligned real address,
return H_Parameter.If parameter3 is greater than 4K, return H_Parameter.If parameter1 plus parameter3 is greater than or equal to
“ibm,error-buffer-size”,
return H_Parameter.If the coherent platform function is in a Temporarily Unavailable
or Permanently Unavailable state, return H_State.Platform firmware collects the error data buffer from the AFU
descriptor associated with the coherent platform function and places
it in the partition buffer described by parameter2 and parameter3.If the Get Error Buffer operation exceeds the time allowed for
an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy
is returned.If the error buffer cannot be read from the hardware due to a
hardware problem, return H_Hardware.Return H_Success.If operation is Get Function Configuration Record:
If parameter1 does not describe a valid configuration record number,
return H_Parameter.If parameter3 does not describe a valid 4K aligned real address,
return H_Parameter.If parameter4 is greater than 4K, return H_Parameter.If parameter2 plus parameter4 is greater than or equal to
“ibm,config-record-size”,
return H_Parameter.If the coherent platform function is not in a Normal Error State,
return H_State.If platform firmware cannot retrieve the configuration record
from the coherent platform function, return H_Function.If platform firmware cannot retrieve the configuration record
due to the coherent platform function not being in a downloaded state,
return H_NOT_AVAILABLE.
Platform firmware collects the configuration record from the
coherent platform function and places it in the partition buffer
described by parameter3 and parameter4. The data is stored as a
byte stream; the first byte in the buffer corresponds to byte 0 of
the configuration record.If the Get Function Configuration Record operation exceeds the
time allowed for an hcall, R4 is set to the continue-token and
H_Busy or H_LongBusy is returned.If the configuration record cannot be read from the hardware,
due to a hardware problem, return H_Hardware.Return H_Success.If operation is Get Function Download Status:
If coherent platform function does not support download,
return H_Function.If the partition does not have the authority to get download status
(“ibm,privileged-function”
property is 0 or not present), return H_Authority.If the coherent platform function is not in a Normal or
Disabled Error State, return H_State.Platform firmware returns the download status in R4, where
0 = no-download-found and 1 = download-found.Return H_Success.If operation is Terminate Process:
If the coherent platform function is not in a Normal Error State,
return H_State.If the coherent platform function does not support terminating
processes, return H_Function.If the process associated with the process token cannot be found,
return H_Parameter.If the process has already completed, return H_State.The process associated with the process-token is terminated
via the procedure defined in the CAIA. If the attempt to terminate
the process takes longer than is allowed for an hcall, R4 is set
to the continue-token and H_Busy or H_LongBusy are returned.If a hardware error occurs during the Terminate Process operation,
return H_Hardware.Return H_Success.If the operation is Collect VPD:
If parameter1 does not describe a valid VPD record number,
return H_Parameter.If parameter2 does not describe a valid 4K aligned real address,
return H_Parameter.If parameter3 is greater than 256, return H_Parameter.If a scatter/gather list entry specifies an invalid address, or
specifies a buffer that crosses a page boundary, return H_SG_LIST.If the coherent platform function is not in a Normal Error State,
return H_State.If platform firmware cannot retrieve the VPD from the coherent
platform function, return H_Function.If platform firmware cannot retrieve the VPD due to the coherent
platform function not in a downloaded state, return H_NOT_AVAILABLE.Platform firmware collects the VPD from the coherent platform
function and places it in the partition buffer described by parameter2
and parameter3. The data will be truncated as necessary to fit in the
provided buffer. The data is stored as a bytestream; the first byte
in the buffer corresponds to byte 0 of the VPD.If the Collect VPD operation exceeds the time allowed for an hcall,
R4 is set to the continue-token and H_Busy or H_LongBusy is returned.
If a hardware error occurs during the Collect VPD operation,
return H_Hardware.
Return H_Success, and R4 is set to the length of the available
VPD, which may be different than the amount of data actually stored
in the partition buffer. It may also be different than the value
reported in the
“ibm,vpd-size”
property, though it will not be greater than that. A length of
0 means no VPD has been provided for the coherent platform function.If the operation is Get Function Error Interrupts:
If the coherent platform function is not in a Normal or
Disabled Error State, return H_State.
If the coherent platform function does not support Get Function
Error Interrupts, return H_Function.If the Function Error Interrupts cannot be retrieved from the
hardware, return H_Hardware.Platform firmware returns the value of Function Error Interrupts
read from hardware in R4.Return H_Success.If the operation is Acknowledge Function Error Interrupts:
If the coherent platform function is not in a Normal or Disabled
Error State, return H_State.If the coherent platform function does not support Acknowledge
Function Error Interrupts, return H_Function. Acknowledge Function Error Interrupts using the value in
parameter1.If the Acknowledge Function Error Interrupts cannot be sent
to the hardware, return H_Hardware.Return H_Success.If operation is Get Error Log:
If the coherent platform function is not in Disabled or
Permanently Unavailable Error State, return H_State.If applicable, platform firmware writes the Platform Log ID
(PLID) in R4 for the error log that is associated with the cause
of the Temporarily Unavailable or Permanently Unavailable Error State.
This data is used to correlate errors between the platform owned
resource and the coherent platform function. If there is no
associated error log to reference, platform firmware writes zero to R4.
Return H_Success.
If operation is unknown, return H_Not_Found.
H_COLLECT_CA_INT_INFO
The architectural intent of this hcall is to collect interrupt info about
a coherent platform function after an interrupt occurred.Syntax:Parameters:uint64 unit-address: Unit Address per the device tree
“reg” property
of the coherent platform facilityuint64 process-token: process identifier token for the attached
process returned in R4 on H_Success return from H_ATTACH_CA_PROCESS call.Return Values:R4 contains the PSL_DSISR_An register value defined in the CAIA on
H_Success.R5 contains the PSL_DAR_An register value defined in the CAIA on
H_Success.R6 contains the PSL_DSR_An register value defined in the CAIA on
H_Success.R7 contains the PSL_PID_An in the upper 32 bits and PSL_TID_An
register in the lower 32 bits.R8 contains the AFU_ERR_An register value defined in the CAIA on
H_Success.R9 contains the PSL_ErrStat_An register value defined in the
CAIA on H_Success.R10 contains a handle for the process element that incurred
the fault on H_Success.Semantics:Verify that coherent platform facilities are licensed to be used,
else return H_Authority.
Verify that the unit-address parameter is valid, else return H_Parameter.
Verify that the process-token parameter is valid, else return H_Parameter.
Verify that the coherent platform function is in the proper state to read interrupt information, else return H_State.
Platform firmware reads the values of PSL_DSISR_An, PSL_DAR_An, PSL_DSR_An, PSL_PID_An, PSL_TID_An, AFU_ERR_An and PSL_ErrStat_An as defined by the CAIA and populates the return registers. The AFU_ERR_An value is only valid if PSL_DSISR_An[AE] is 1 or PSL_SERR_An[AE] is 1. The PSL_ErrStat_An value is only valid if PSL_DSISR_An[PE] is 1. If any of the hardware reads fail, H_Hardware is returned and none of the return registers should be considered valid.
H_Success is returned.

H_CONTROL_CA_FAULTS

The architectural intent of this hcall is to control the operation of a
coherent platform function after a fault occurs.

Syntax:

Parameters:
uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform facility
uint64 operation: operation to perform to the coherent platform facility. Valid values are:
    Respond to page fault - PSL: operation = 1.
    Respond to page fault - AFU: operation = 2.
uint64 parameter1: parameter 1 for operations; meaning changes based on the operation.
uint64 parameter2: parameter 2 for operations; meaning changes based on the operation.
uint64 parameter3: parameter 3 for operations; meaning changes based on the operation.
uint64 parameter4: parameter 4 for operations; meaning changes based on the operation.

Operation: Respond to page fault - PSL
Parameters:
    Parameter1 = process-token as returned from H_ATTACH_CA_PROCESS
    Parameter2 = control-mask
        bits 0-59: reserved
        bit 60: acknowledge non-translation fault interrupt
        bit 61: continue execution, current translation fault is not resolved and must be retried at a later time
        bit 62: restart function and indicate address error
        bit 63: restart the transaction that caused the translation fault
    Parameter3 = reset-mask
        bits 0-62: reserved
        bit 63: reset fault bits for a PSL level process error (PSL_DSISR_An[PE] is set)

Operation: Respond to page fault - AFU
Parameters:
    Parameter1 = process-token
    Parameter2 = process element handle returned from H_COLLECT_CA_INT_INFO.
    Parameter3 = effective address
    Parameter4 = resolution, valid values are:
        0x0 -- Page Fault Resolved
        0x1 -- Addressing Error
        0x2 -- Protection Fault on a Read operation
        0x3 -- Protection Fault on a Write operation

Semantics:
Verify that coherent platform facilities are licensed to be used,
else return H_Authority.
Verify that the unit-address parameter is valid, else return H_Parameter.
If operation is Respond to page fault - PSL:
    Verify that the process-token parameter is valid, else return H_Parameter.
    Verify that the coherent platform function is in a valid state, else return H_State.
    Using the control-mask, set the corresponding bits in PSL_TFC_An as defined by the CAIA. Only bits that are set are written. If no bits are set, no changes are performed. If the setting of the bits in the hardware encounters an error, return H_Hardware.
    If bit 63 of the reset-mask is set, clear the PSL_ErrStat_An bits by reading the register and writing back the value read. If this operation encounters an error with the hardware, return H_Hardware.
    Perform a read from PSL_TFC_An and place the corresponding values in R4. If the read fails, return H_Hardware.
    Return H_Success with the following in R4:
        bits 0-60: reserved
        bit 61: function waiting to continue
        bit 62: address error pending
        bit 63: command reissue pending
If operation is Respond to page fault - AFU:
    Verify that the process-token parameter is valid, else return H_Parameter.
    Verify that the resolution parameter is valid, else return H_Parameter.
    Verify that the coherent platform function is in a valid state, else return H_State.
    Verify that the coherent platform function supports paged resolution response (via the “ibm,supports-prr” OF property), else return H_Function.
    Write the effective address and resolution to the corresponding fields in the PRR registers of the AFU. If this operation encounters an error with the hardware, return H_Hardware.
    Return H_Success.
If operation is unknown, return H_Not_Found.

H_DOWNLOAD_CA_FUNCTION

The architectural intent of this hcall is to provide platform support for
downloading an application image to the coherent platform function. The
partition provides download data to the platform via an image scatter/gather
list. The scatter/gather list can architecturally describe up to 1 megabyte
of data (256 entries of 4096 bytes each). The OS must subdivide the application
image into chunks that are each 1 megabyte or less in size, and call
H_DOWNLOAD_CA_FUNCTION for each of those chunks.

Syntax:

Parameters:
uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform facility
uint64 scatter-gather-list-address: 4K naturally aligned real buffer containing scatter/gather list entries. All fields in the scatter/gather list and all fields in the image header have big-endian byte ordering.
uint64 num-scatter-gather-list-entries: number of entries in the scatter/gather list
uint64 continue-token: Used to continue an operation if H_Busy or H_CONTINUE is returned. Set to zero on the first call. If H_Busy or H_CONTINUE is returned, call again using the value returned in R4 from the previous call as the value of continue-token.
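
The continue-token handling described above can be sketched as a small driver loop. This is a minimal Python model under stated assumptions, not platform code: `hcall` is a hypothetical wrapper that returns a (return code, R4) pair, `Rc` holds symbolic return codes rather than real hcall values, each scatter/gather list is represented by an opaque object instead of an address and entry count, and `FakeHypervisor` is a stand-in invented only so that the loop can be exercised.

```python
from enum import Enum, auto


class Rc(Enum):
    """Symbolic return codes; real numeric hcall values are not modeled."""
    H_Success = auto()
    H_Busy = auto()
    H_LongBusy = auto()
    H_CONTINUE = auto()


def download(hcall, unit_address, sg_lists):
    """Drive a download until H_Success; return the number of lists consumed.

    hcall(unit_address, sg_list, continue_token) is a hypothetical wrapper
    that returns (return_code, r4).
    """
    token = 0  # continue-token is zero on the first call
    index = 0
    while True:
        rc, r4 = hcall(unit_address, sg_lists[index], token)
        if rc is Rc.H_Success:
            return index + 1
        if rc in (Rc.H_Busy, Rc.H_LongBusy):
            token = r4  # call back with the same scatter/gather list
        elif rc is Rc.H_CONTINUE:
            token = r4
            index += 1  # call back with the next scatter/gather list
            if index == len(sg_lists):
                raise RuntimeError("H_CONTINUE returned but no data is left")
        else:
            raise RuntimeError(f"download aborted: {rc}")


class FakeHypervisor:
    """Illustrative stub: reports H_Busy once per list, then moves on."""

    def __init__(self, total_lists):
        self.total = total_lists
        self.done = 0
        self.busy_next = True
        self.token = 0
        self.calls = []

    def __call__(self, unit_address, sg_list, continue_token):
        # The caller must echo the R4 value from the previous call.
        assert continue_token == self.token
        self.calls.append(sg_list)
        self.token += 1
        if self.busy_next:
            self.busy_next = False
            return Rc.H_Busy, self.token
        self.busy_next = True
        self.done += 1
        if self.done == self.total:
            return Rc.H_Success, 0
        return Rc.H_CONTINUE, self.token
```

With the stub, `download(FakeHypervisor(2), 0x1000, ["chunk0", "chunk1"])` presents each list twice (once for the busy return, once for the completed copy) before the final H_Success.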

Image Scatter/Gather List Entry Format
    8 byte logical real address of buffer
    8 byte buffer length in bytes (max length is 4096 bytes)

Image Scatter/Gather List Format
    Logical real address of buffer 0      Buffer 0 length in bytes
    Logical real address of buffer 1      Buffer 1 length in bytes
    ...                                   ...
    Logical real address of buffer N-1    Buffer N-1 length in bytes
    Logical real address of buffer N      Buffer N length in bytes
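
The entry layout above can be illustrated with a short packing routine. This is a sketch, assuming Python's `struct` module as the encoder; the names `pack_sg_list` and `chunk_sizes` are hypothetical, and the rejection of zero-length buffers is an assumption of this sketch rather than something the text states.

```python
import struct

PAGE_SIZE = 4096          # maximum buffer length per entry
MAX_ENTRIES = 256         # entries that fit in one 4K list (16 bytes each)
MAX_CHUNK = MAX_ENTRIES * PAGE_SIZE  # 1 megabyte per hcall

# Big-endian: 8-byte logical real address, 8-byte length.
ENTRY = struct.Struct(">QQ")


def pack_sg_list(buffers):
    """Pack (logical_real_address, length) pairs into list bytes."""
    buffers = list(buffers)
    if len(buffers) > MAX_ENTRIES:
        raise ValueError("more than 256 entries")
    out = bytearray()
    for addr, length in buffers:
        # Assumption: zero-length buffers are treated as invalid here.
        if not 0 < length <= PAGE_SIZE:
            raise ValueError("buffer length must be 1..4096 bytes")
        if addr // PAGE_SIZE != (addr + length - 1) // PAGE_SIZE:
            raise ValueError("buffer crosses a 4K page boundary")
        out += ENTRY.pack(addr, length)
    return bytes(out)


def chunk_sizes(image_length):
    """Sizes of the per-call chunks for an image of image_length bytes."""
    full, rest = divmod(image_length, MAX_CHUNK)
    return [MAX_CHUNK] * full + ([rest] if rest else [])
```

A full list of 256 page-sized buffers packs to exactly 4096 bytes, which is why one 4K naturally aligned list describes at most 1 megabyte.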

Application Image Header, Version 1

Name                 Offset  Length  Description
Version              0       2       Version of the AFU image header, value = 1
Function Number      2       2       Physical function number that the application uses
Application ID       4       2       Application identifier
Reserved             6       2       Set to zero
Vendor ID            8       2       PCI Vendor ID of the adapter owning the coherent platform function
Device ID            10      2       PCI Device ID of the adapter owning the coherent platform function
Subsystem Vendor ID  12      2       PCI Subsystem Vendor ID of the adapter owning the coherent platform function
Subsystem ID         14      2       PCI Subsystem ID of the adapter owning the coherent platform function
Image Offset         16      8       Offset to the application image bitstream
Image Length         24      8       Length of the application image bitstream
Verification Type    32      2       Type of verification required for image: 1 = Bounds Check; all other values reserved
Reserved             34      6       Set to zero
CAIA Version         40      2       Minimum CAIA Version required by this application image
PSL Revision         42      2       Minimum PSL Revision required by this application image
Reserved             44      84      Set to zero
Image Bitstream      X       Y       Application image bitstream, where X = Image Offset and Y = Image Length
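
The header layout above can be checked mechanically. The following is a minimal sketch, assuming Python's `struct` module; the names `APP_HEADER`, `AppImageHeader` and `parse_app_image` are hypothetical, and the check that the bitstream lies inside the supplied data is an illustrative sanity check, not a definition of the Bounds Check verification type.

```python
import struct
from collections import namedtuple

# Big-endian, no padding: 8 x uint16, 2 x uint64, uint16, 6 reserved bytes,
# 2 x uint16, 84 reserved bytes. The fixed fields occupy 128 bytes in total.
APP_HEADER = struct.Struct(">8H2QH6x2H84x")

AppImageHeader = namedtuple("AppImageHeader", [
    "version", "function_number", "application_id", "reserved",
    "vendor_id", "device_id", "subsystem_vendor_id", "subsystem_id",
    "image_offset", "image_length", "verification_type",
    "caia_version", "psl_revision",
])


def parse_app_image(blob):
    """Return (header, bitstream) for a version 1 application image."""
    header = AppImageHeader._make(APP_HEADER.unpack_from(blob, 0))
    if header.version != 1:
        raise ValueError(f"unsupported header version {header.version}")
    end = header.image_offset + header.image_length
    # Illustrative sanity check only; not taken from the specification.
    if end > len(blob):
        raise ValueError("image bitstream extends past the end of the data")
    return header, blob[header.image_offset:end]
```

The reserved byte runs are skipped with `x` pad codes, so the unpacked tuple carries only the thirteen named fields.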

Return Values:
R4 on H_Busy or H_LongBusy or H_CONTINUE contains the continue-token to be used on the next call

Semantics:
Verify that coherent platform facilities are licensed to be used, else return H_Authority.
Verify that the unit-address parameter is valid, else return H_Parameter.
If the coherent platform facility cannot be downloaded at this time due to a resource constraint, H_Resource is returned.
If the coherent platform facility does not support download, return H_Function.
If the coherent platform function is already downloaded, or if a download is in progress, return H_State.
If the partition does not have the authority to perform download (“ibm,privileged-function” property is 0 or not present), return H_Authority.
If the coherent platform facility is in a Temporarily Unavailable Error State or has attached processes, return H_State.
If the scatter-gather-list-address does not describe a 4K byte naturally aligned buffer, return H_Parameter.
If the Application Image Header version is not supported by platform firmware, return H_BAD_DATA.
If necessary, platform firmware disables the coherent platform facility from operation.
For each entry in the scatter/gather list described by scatter-gather-list-address:
    Platform firmware validates the address and length in the scatter/gather list entry. The buffer described should not cross a 4K page boundary. If invalid, returns H_SG_LIST.
    Platform firmware copies data from the scatter/gather list entry to the platform firmware buffer.
    Platform firmware verifies the image bitstream data chunk in the platform buffer. If platform firmware determines the image bitstream data chunk is not valid, return H_BAD_DATA. During this operation, H_Busy or H_LongBusy can be returned due to hcall maximum time limits; the partition should call back until a non-busy return code is returned.
    Platform firmware performs the download for the coherent platform facility, using the image bitstream data chunk. During this operation, H_Busy or H_LongBusy can be returned due to hcall maximum time limits; the partition should call again until a non-busy return code is returned.
    If the coherent platform facility does not accept the download of the image bitstream data chunk or an error occurs while communicating with the hardware, H_Hardware is returned.
    If the hcall time limit is exceeded, but more data is left to copy in the current scatter/gather list, H_Busy or H_LongBusy is returned. The partition should call back with the current scatter/gather list.
Once every entry in the current scatter/gather list is copied, platform firmware returns H_CONTINUE. The partition then calls back with a new scatter/gather list for the next chunk of image data and the previous steps are repeated for each new list. This is repeated as long as H_CONTINUE is returned.
The CAIA AFU descriptor is read for the downloaded AFU; if any fields in the AFU descriptor are not compatible with the PSL, H_UNSUPPORTED is returned.
If the Download operation completes successfully, platform firmware re-enables the coherent platform function for operation if necessary.
H_Success is returned.

Any error in the above steps will cause the download to be aborted.
The partition must retry H_DOWNLOAD_CA_FUNCTION, starting with the
Application Image header, in order to complete the download.

After H_DOWNLOAD_CA_FUNCTION is performed, the partition should call
ibm,update-nodes and ibm,update-properties
to receive the current configuration for the coherent platform facility.

When H_DOWNLOAD_CA_FUNCTION is first called, some AFU or adapter resources
may be reserved for use during the download sequence, which may span
multiple H_DOWNLOAD_CA_FUNCTION calls, until the image download is
complete as indicated by a return of H_Success. When H_CONTINUE is returned,
indicating that more data is needed for the complete AFU image, the OS must call
H_DOWNLOAD_CA_FUNCTION again within 3 seconds, or the download sequence may
be abandoned and the OS may need to reset the AFU and restart the
download sequence from the beginning.

H_DOWNLOAD_CA_FACILITY

The architectural intent of this hcall is to provide platform support for
downloading a base adapter image to the coherent platform facility, and for
validating the entire image after the download. The partition provides download
data to the platform via an image scatter/gather list. The scatter/gather list
can architecturally describe up to 1 megabyte of data (256 entries of 4096
bytes each). The OS must subdivide the base adapter image into chunks that are
each 1 megabyte or less in size, and call H_DOWNLOAD_CA_FACILITY for each of
those chunks.

Base adapter image download requires two separate operations. The first is
the download operation, which processes the entire image, possibly returning
H_CONTINUE a number of times, and completing when H_Success is returned. The
second is the validate operation, which again processes the entire image with
a number of H_CONTINUE returns until it completes with H_Success. The base
adapter image is not usable until both operations have completed successfully.

Syntax:

Parameters:
uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform facility
uint64 operation: operation to perform to the coherent platform facility. Valid values are:
    Download: operation = 1, the base image in the coherent platform facility is first erased, and then programmed using the image supplied in the scatter/gather list.
    Validate: operation = 2, the base image in the coherent platform facility is compared with the image supplied in the scatter/gather list.
uint64 scatter-gather-list-address: 4K naturally aligned real buffer containing scatter/gather list entries. The format of the scatter/gather list is the same as for the H_DOWNLOAD_CA_FUNCTION hcall. All fields in the scatter/gather list and all fields in the image header have big-endian byte ordering.
uint64 num-scatter-gather-list-entries: number of entries in the scatter/gather list
uint64 continue-token: Used to continue an operation if H_Busy or H_CONTINUE is returned. Set to zero on the first call. If H_Busy or H_CONTINUE is returned, call again using the value returned in R4 from the previous call as the value of continue-token.

Base Adapter Image Header, Version 1

Name                 Offset  Length  Description
Version              0       2       Version of the base adapter image header, value = 1
Reserved             2       6       Set to zero
Vendor ID            8       2       PCI Vendor ID of the coherent platform facility
Device ID            10      2       PCI Device ID of the coherent platform facility
Subsystem Vendor ID  12      2       PCI Subsystem Vendor ID of the coherent platform facility
Subsystem ID         14      2       PCI Subsystem ID of the coherent platform facility
Image Offset         16      8       Offset to the base adapter image bitstream
Image Length         24      8       Length of the base adapter image bitstream
Reserved             32      96      Set to zero
Image Bitstream      X       Y       Base adapter image bitstream, where X = Image Offset and Y = Image Length
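
As an illustration of the base adapter header layout, the sketch below assembles a header in front of a bitstream. It assumes Python's `struct` module; `BASE_HEADER` and `build_base_image` are hypothetical names, and placing the bitstream immediately after the 128 bytes of fixed fields is a choice made for this example (the table only requires that Image Offset point at the bitstream).

```python
import struct

# Big-endian, no padding: uint16 version, 6 reserved bytes, 4 x uint16 PCI IDs,
# 2 x uint64 (image offset, image length), 96 reserved bytes.
BASE_HEADER = struct.Struct(">H6x4H2Q96x")


def build_base_image(vendor_id, device_id, subsystem_vendor_id,
                     subsystem_id, bitstream):
    """Return header + bitstream for a version 1 base adapter image."""
    header = BASE_HEADER.pack(
        1,                   # Version
        vendor_id,
        device_id,
        subsystem_vendor_id,
        subsystem_id,
        BASE_HEADER.size,    # Image Offset: bitstream follows the header here
        len(bitstream),      # Image Length
    )
    return header + bitstream
```

The `x` pad codes emit zero bytes on packing, which matches the "Set to zero" requirement for the reserved fields.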

Return Values:
R4 on H_Busy or H_LongBusy or H_CONTINUE contains the continue-token to be used on the next call

Semantics:
Verify that coherent platform facilities are licensed to be used, else return H_Authority.
Verify that the unit-address parameter is valid, else return H_Parameter.
If the coherent platform facility cannot be downloaded at this time due to a resource constraint, H_Resource is returned.
If the coherent platform facility does not support download, return H_Function.
If a download is in progress for the coherent platform facility, return H_State.
If the partition does not have the authority to perform download (“ibm,privileged-function” property is 0 or not present), return H_Authority.
If the coherent platform facility is in a Temporarily Unavailable Error State, return H_State.
If the scatter-gather-list-address does not describe a 4K byte naturally aligned buffer, return H_Parameter.
If the Base Adapter Image Header version is not supported by platform firmware, return H_BAD_DATA.
If necessary, platform firmware disables the coherent platform facility from operation.
For each entry in the scatter/gather list described by scatter-gather-list-address:
    Platform firmware validates the address and length in the scatter/gather list entry. The buffer described should not cross a 4K page boundary. If invalid, returns H_SG_LIST.
    Platform firmware copies data from the scatter/gather list entry to the platform firmware buffer.
    Platform firmware verifies the image bitstream data chunk in the platform buffer. If platform firmware determines the image bitstream data chunk is not valid, return H_BAD_DATA. During this operation, H_Busy or H_LongBusy can be returned due to hcall maximum time limits; the partition should call back until a non-busy return code is returned.
    Platform firmware performs the download or validate operation for the coherent platform facility, using the image bitstream data chunk. During this operation, H_Busy or H_LongBusy can be returned due to hcall maximum time limits; the partition should call again until a non-busy return code is returned.
    If the coherent platform facility does not accept the download of the image bitstream data chunk or an error occurs while communicating with the hardware, H_Hardware is returned.
    If the hcall time limit is exceeded, but more data is left to copy in the current scatter/gather list, H_Busy or H_LongBusy is returned. The partition should call back with the current scatter/gather list.
Once every entry in the current scatter/gather list is copied, platform firmware returns H_CONTINUE. The partition then calls back with a new scatter/gather list for the next chunk of image data and the previous steps are repeated for each new list. This is repeated as long as H_CONTINUE is returned.
If the validate operation completes successfully, platform firmware re-enables the coherent platform facility for operation if necessary.
H_Success is returned.

Any error in the above steps will cause the download to be aborted.
To complete the download, the partition must retry both H_DOWNLOAD_CA_FACILITY
operations, including the Base Adapter Image header for each operation.

After H_DOWNLOAD_CA_FACILITY is performed, the partition should call
ibm,update-nodes and ibm,update-properties
to receive the current configuration for the functions under this
coherent platform facility.

When H_DOWNLOAD_CA_FACILITY is first called, some adapter resources
may be reserved for use during the download sequence, which may span
multiple H_DOWNLOAD_CA_FACILITY calls, until the image download is
complete as indicated by a return of H_Success. When H_CONTINUE is returned,
indicating that more data is needed for the complete image, the OS must call
H_DOWNLOAD_CA_FACILITY again within 3 seconds, or the download sequence may
be abandoned and the OS may need to reset the facility and restart the
download sequence from the beginning.

H_CONTROL_CA_FACILITY

This H_CONTROL_CA_FACILITY hypervisor call allows the partition to manipulate
or query certain coherent platform facility behaviors.

Syntax:

Parameters:
uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform facility
uint64 operation: operation to perform to the coherent platform facility. Valid values are:
    Reset: operation = 1, initiate a reset to the coherent platform facility. This is used when the partition needs to reset the coherent platform facility and all of its child coherent platform functions to a clean state. All attached processes and state are cleared by firmware after this reset. If a new base adapter image has been downloaded, that image will be activated.
    Collect VPD: operation = 2, collect VPD for the coherent platform facility.
uint64 parameter1: parameter 1 for operations; meaning changes based on the operation.
uint64 parameter2: parameter 2 for operations; meaning changes based on the operation.
uint64 parameter3: parameter 3 for operations; meaning changes based on the operation.
uint64 parameter4: parameter 4 for operations; meaning changes based on the operation.
uint64 continue-token: Used to continue an operation if H_Busy is returned. Set to zero on the first call. If H_Busy is returned, call again using the value returned in R4 from the previous call as the value of continue-token.

Operation: Reset
Parameters:
    None

Operation: Collect VPD
Parameters:
    Parameter1 = 4K naturally aligned real buffer containing scatter/gather list entries. All fields in the scatter/gather list have big-endian byte ordering.
    Parameter2 = number of entries in the scatter/gather list, valid values are between 0 and 256

Semantics:
Verify that coherent platform facilities are licensed to be used, else
return H_Authority.
Verify that the unit-address parameter is valid, else return H_Parameter.
If operation is Reset:
    If the coherent platform facility is in Temporarily Unavailable error state or is already performing a reset, return H_State.
    If the partition is not allowed to perform a Reset (“ibm,privileged-facility” property is 0 or not present), return H_Authority.
    Set Temporarily Unavailable error state for the coherent platform facility and all child coherent platform functions.
    Initiate reset of the coherent platform facility.
    If the Reset operation exceeds the time allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy is returned.
    Return H_Success.
If operation is Collect VPD:
    If parameter1 does not describe a valid 4K aligned real address, return H_Parameter.
    If parameter2 is greater than 256, return H_Parameter.
    If a scatter/gather list entry specifies an invalid address, or specifies a buffer that crosses a page boundary, return H_SG_LIST.
    If the coherent platform facility is not in a Normal Error State, return H_State.
    If platform firmware cannot retrieve the VPD from the coherent platform facility, return H_Function.
    Platform firmware collects the VPD from the coherent platform facility and places it in the partition buffer described by parameter1 and parameter2. The data will be truncated as necessary to fit in the provided buffer. The data is stored as a bytestream; the first byte in the buffer corresponds to byte 0 of the VPD.
    If the Collect VPD operation exceeds the time allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy is returned.
    If a hardware error occurs during the Collect VPD operation, return H_Hardware.
    Return H_Success, and R4 is set to the length of the available VPD, which may be different from the amount of data actually stored in the partition buffer. It may also be different from the value reported in the “ibm,vpd-size” property, though it will not be greater than that. A length of 0 means no VPD has been provided for the coherent platform facility.
If operation is unknown, return H_Not_Found.
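
The relationship between the R4 length and the data actually stored by Collect VPD can be modeled in a few lines. This is a hedged sketch rather than firmware behavior: `collect_vpd` and `is_truncated` are hypothetical names, the scatter/gather list is reduced to a list of buffer lengths, and busy returns and hardware errors are left out.

```python
def collect_vpd(vpd, buffer_lengths):
    """Model of Collect VPD: fill buffers in order, truncating as needed.

    Returns (buffers, r4), where r4 is the length of the available VPD,
    independent of how much of it fit in the buffers.
    """
    if len(buffer_lengths) > 256:
        raise ValueError("H_Parameter: more than 256 scatter/gather entries")
    buffers = []
    offset = 0
    for length in buffer_lengths:
        # Byte 0 of the VPD lands in the first byte of the first buffer.
        buffers.append(vpd[offset:offset + length])
        offset += length
    return buffers, len(vpd)


def is_truncated(buffers, r4):
    """True when less data was stored than the R4 length reports."""
    return sum(len(b) for b in buffers) < r4
```

A caller that sees `is_truncated(...)` return true can repeat the operation with enough buffer space for the R4 length; a length of 0 with empty buffers corresponds to the "no VPD provided" case.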