Logical Partitioning Option

Overview

The Logical PARtitioning option (LPAR) simultaneously runs one or more copies of a single OS or multiple heterogeneous LoPAR compliant OSs on a single LoPAR platform. A partition, within which an OS image runs, is assigned a non-overlapping sub-set of the platform’s resources. These platform-allocatable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and I/O adapter bus slots. Partition firmware loaded into each partition generates an OF device tree to represent the resources of the partition to the OS image. Allocatable resources are directly controlled by an OS. This architecture restricts the sharing of allocatable resources between partitions; to do so requires the use of optional facilities presented in . Platform resources, other than allocatable resources mentioned above, that are represented by OF nodes in the device tree of more than one partition (for example, memory controllers and processor host bridges) are marked ‘used-by-rtas’.

Since one of the main purposes of partitioning is isolation of the OSs, the ability to manage real system resources that are common to all partitions is modified for the LPAR option. This means that partition use of RTAS functions which ostensibly use real system resources, such as power and time-of-day clocks, is buffered from actual manipulation of those resources. The RTAS is modified for LPAR, and has hypervisor support, to virtualize real resources for the partitions. Operational management of the platform moves to a Hardware Management Console (HMC), which is an application, either local or remote, that manages platform resources with messages to the hypervisor rather than being under direct control of a partition’s OS.

Platforms supporting LPAR contain PowerPC processors that support the hypervisor addressing mode, in which the physical address is equal to the effective address and all processor resources are available.
The “Real Mode” addressing mode, in these processors, is redefined to translate and limit the physical addresses that the processor can access and to restrict access to certain address translation controlling processor resources. The virtual addressing mode is unchanged. See the (level 2.0 and beyond) for the architecture extensions required for the processor. The I/O subsystems of these platforms contain I/O bridges that restrict the bus addresses that I/O adapters can access. These restricted bus addresses are subsequently translated through the Translation Control Entry (TCE) mechanism to restrict Direct Memory Accesses (DMAs) by I/O devices. This restriction is to system memory allocated to a partition and managed by the OS image owning the device. The interrupt subsystem of these platforms is enhanced with multiple (one per partition) global interrupt queues to direct interrupts to any processor assigned to the I/O adapter’s owning OS image. Logical Partitioning platforms employ a unique firmware component called the hypervisor (that runs in hypervisor mode) to manage the address mapping and common platform hardware facilities, thereby ensuring isolation between partitions. The OS uses new hypervisor interfaces to manage the partition’s page frame and TCE tables. The platform firmware utilizes implementation dependent interfaces to platform hardware common to all partitions. Thus, a system with LPAR has different OS support than a system without LPAR. In addition to generating per partition device trees, the OF component of a logically partitioned platform manages the initial booting and any subsequent booting of the specific OS image associated with each partition. The NVRAM of a platform contains configuration variables, policy options, and working storage that is protected from accesses that might adversely affect one or more partitions and their OS images. 
The hypervisor firmware component restricts an OS image’s access to NVRAM to a region assigned to its partition. This may restrict the number of partitions.

Most system management on systems without LPAR is performed by OS based applications that are given access to modify the platform’s configuration variables, policy options, and firmware flash. For various Reliability, Availability, and Serviceability (RAS) reasons, LoPAR Logical Partitioning platforms do not restrict platform operational management functions to applications running on a preferred partition or OS image. Access to these Operational Management facilities is provided via a Support Processor communication port that is connected to an HMC and/or through a communications port that is connected through a PCI adapter in a partition. The HMC is a set of applications running in a separate stand-alone platform or in one of the platform’s partitions. These HMC applications are required to establish the platform’s LPAR configuration; however, the configuration is stored in the platform and, therefore, the HMC is not required to boot or operate the platform in a pre-configured non-error condition.

Real Mode Accesses

When the OS controlling an LPAR runs with address translation turned off (MSR[DR] or MSR[IR] bit = 0, that is, real mode), the LPAR hardware translates the memory addresses to an LPAR unique area known as the Real Mode Area (RMA). When control is initially passed to the OS from the platform, the RMA starts at the LPAR's logical address 0 and is the first logical memory block reported in the LPAR’s device tree. In general, the RMA is a subset of the LPAR's logical address space. Attempting a non-relocated access beyond the bounds of the RMA results in a storage interrupt (ISI or DSI, depending upon whether the reference is an instruction or a data reference). The RMA hardware translation scheme is platform dependent. The options are given below.

Offset and Limit Registers

The Offset RMA architecture checks the LPAR effective address against the contents of an RMOL register. If the effective address is less than the RMOL value, the access is allowed after an LPAR specific offset is added to form the real address; otherwise a protection exception is signaled.
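The offset and limit check can be pictured with a small model. This is only a sketch: the function name, the exception class, and the idea of passing the RMOL value and the offset as plain integers are assumptions made for illustration, not part of the architecture.

```python
class ProtectionException(Exception):
    """Stands in for the protection exception signaled by the hardware."""


def real_mode_translate(effective_addr: int, rmol: int, offset: int) -> int:
    """Model of the Offset RMA check (hypothetical helper, not an architected interface).

    effective_addr: address issued by the OS with translation off
    rmol:           limit value held in the RMOL register
    offset:         LPAR specific offset added to form the real address
    """
    if effective_addr < 0:
        raise ValueError("addresses are unsigned")
    if effective_addr < rmol:
        # Inside the Real Mode Area: relocate by the partition's offset.
        return effective_addr + offset
    # At or beyond the limit: the access is refused.
    raise ProtectionException(hex(effective_addr))
```

For example, with a limit of 0x10000000 and an offset of 0x40000000, an access to 0x1000 would be relocated to 0x40001000, while an access to 0x10000000 would raise the exception.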

Reserved Virtual Addresses

The platform may map the RMA through the hashed page table via a reserved range of virtual addresses. This mapping from the effective address is done by setting the high order virtual address bits corresponding to the VSID to the 1 TB VSID value 0b00 || 0x001FFFFFF. This virtual address is then translated as other virtual addresses. If the effective address is outside the bounds of the RMA, the storage interrupt signals a PTEG miss. The platform firmware prepopulates the LPAR's page frame table with “bolted” entries representing the real storage blocks that make up the RMA. Note that this method allows the RMA to be discontiguous in real address space. The Virtualized Real Mode Area (VRMA) option gives the OS the ability to dynamically relocate, expand, and shrink the RMA. See for more details.
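As an illustration of how the reserved VSID forms the virtual address, the sketch below places the VSID above the segment offset. It assumes the usual 1 TB segment layout in which the low 40 bits of the effective address are the offset within the segment; the function and constant names are hypothetical.

```python
SEGMENT_OFFSET_BITS = 40          # a 1 TB segment spans 2**40 bytes
RMA_VSID = 0x001FFFFFF            # 0b00 || 0x001FFFFFF, the reserved 1 TB VSID


def rma_virtual_address(effective_addr: int) -> int:
    """Form the virtual address used to look up a real mode access (illustrative only)."""
    if not 0 <= effective_addr < (1 << SEGMENT_OFFSET_BITS):
        # Assumption of this sketch: real mode addresses fit within one 1 TB segment.
        raise ValueError("effective address outside the modeled segment")
    # The VSID supplies the high order bits; the effective address supplies the offset.
    return (RMA_VSID << SEGMENT_OFFSET_BITS) | effective_addr
```

The resulting address is then looked up in the page frame table like any other virtual address, which is why an OS that reused this VSID would collide with the platform's bolted RMA entries.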

General LPAR Reservations and Conventions

This section documents general LPAR reserved facilities and conventions. Other sections document reserved facilities and conventions specific to the function they describe.

R1--1. For the LPAR option: To avoid conflict with the platform’s reserved addresses, the OS must not use the 1 TB (SLB and PTE B field equal to one) 0b00 || 0x001FFFFFF VSID for purposes other than virtualizing the RMA.
R1--2. For the LPAR option: In order to avoid a storage exception, the OS must not remove PTEs marked with the “bolted” indicator (PTE bit 59 = 1) unless the virtual address space can be referenced by another PTE or the OS does not intend to access the virtual address space.
R1--3. For the LPAR option: To avoid conflict with the platform’s hypervisor, the OS must be prepared to share use of SPRG2 as the interrupt scratch register whenever an hcall() is made, or a machine check or reset interrupt is taken.
R1--4. For the LPAR option: If the platform virtualizes the RMA, prior to transferring control to the OS, the platform must select a page size for the RMA such that the platform uses only one page table entry per page table entry group to virtualize the RMA.
R1--5. For the LPAR option: If the platform virtualizes the RMA, prior to transferring control to the OS, the platform must use only the last page table entry of a page table entry group to virtualize the RMA.
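The “bolted” indicator and the last-entry convention can be checked with simple bit arithmetic. The sketch below assumes that “PTE bit 59” follows the big-endian bit numbering of a 64-bit doubleword (bit 0 is the most significant bit) and that the indicator lives in the first doubleword of the PTE; it also assumes eight PTEs per PTEG. The helper names are hypothetical.

```python
PTE_BOLTED_BIT = 59                           # bit number as given in the requirement
PTE_BOLTED_MASK = 1 << (63 - PTE_BOLTED_BIT)  # big-endian bit 59 of 64 -> 0x10
PTES_PER_PTEG = 8                             # assumed group size


def is_bolted(pte_dword0: int) -> bool:
    """True if the PTE carries the bolted indicator (OS should not remove it casually)."""
    return bool(pte_dword0 & PTE_BOLTED_MASK)


def is_last_slot_of_pteg(pte_index: int) -> bool:
    """True if a page frame table index names the last entry of its PTEG."""
    return pte_index % PTES_PER_PTEG == PTES_PER_PTEG - 1
```

An OS walking the page frame table at boot could use checks like these to recognize the entries the platform installed for the RMA before deciding which slots are free for its own use.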

Processor Requirements

R1--1. For the LPAR option: The platform processors must support the Logical Partitioning (LPAR) facilities as defined in (Version 2.0 or later).

I/O Sub-System Requirements

The platform divides the I/O subsystem up into Partitionable Endpoints (PEs). See for more information on PEs. Each PE has its own (separate) error, addressing, and interrupt domains, which allows the assignment of separate PEs to different LPAR partitions. The following are the requirements for I/O subsystems when the platform implements LPAR.

R1--1. For the LPAR option: The platform must provide methods and mechanisms to isolate IOA and I/O bus errors in one PE so that they do not affect another PE, a partition to which the PE is not given access authority by the platform, or system resources (such as the service processor) that are shared between partitions, and must do so with the EEH option programming model.
R1--2. For the LPAR option: The platform must enable the EEH option for all PEs by default.

Software and Firmware Implementation Notes: For the platform (versus the OS or device driver) to enable EEH, there must be some assurance that the device drivers are EEH aware, if not EEH enabled. For example, the device driver or OS may signal its awareness by using the ibm,set-eeh-option RTAS call to enable EEH prior to a configuration cycle, via the ibm,write-pci-config RTAS call, which enables the Memory Space or IO Space enable bits in the PCI Command register. Firmware can then ignore an ibm,write-pci-config RTAS call which enables the Memory Space or IO Space enable bits for an IOA if EEH for that IOA has not been enabled first. To be EEH aware, a device driver does not need to be able to recover from an MMIO Stopped and DMA Stopped state; it only needs to recognize the all-1's condition (from a Load from its IOA or on a PCI configuration read from its IOA) and not use data from operations that may have occurred since the last all-1's checkpoint.
In addition, the device driver under such failure circumstances needs to turn off interrupts (using the ibm,set-int-off RTAS call, or for conventional PCI and PCI-X infrastructures only: by resetting the IOA and keeping it reset with ibm,set-slot-reset or ibm,slot-error-detail) to make sure that any (unserviceable) interrupts from the IOA do not affect the system (MSIs are blocked by the EEH DMA Stopped State, but LSIs are not). Note that if all-1’s data may be valid, the ibm,read-slot-reset-state2 RTAS call should be used to discover the true EEH state of the device.

R1--3. For the LPAR option: The platform must assign a PE to one and only one partition at a time.
R1--4. For the LPAR option: The platform must limit the DMA addresses accessible to a PE to the address ranges assigned to the partition to which the PE is allocated, and, if the PE is used to implement a VIO device, then also to any allowed redirected DMA address ranges.

Architecture and Implementation Notes: Platforms which do not implement either Requirement or Requirement require PE granularity of everything below the PHB, resulting in poor LPAR partition I/O assignment granularity. Requirement has implications in preventing access to both I/O address ranges and system memory address ranges. That is, Requirement requires prevention of peer to peer operations from one IOA to another IOA when those IOA addresses are not owned by the same partition, as well as providing an access protection mechanism to protect system memory. Note that, relative to peer to peer operations, some bridges or switches may not provide the capabilities to limit peer to peer, and the use of such bridges or switches requires the limitation that all IOAs under such bridges or switches be assigned to the same partition.

R1--5. For the LPAR option: The platform must provide a PE the capability of accessing all of the System Memory addresses assigned to the partition to which the PE is allocated.
R1--6.
For the LPAR option: If TCEs are used to satisfy Requirements , and , then the platform must provide the capability to map simultaneously and at all times at least 256 MB for each PE.
R1--7. For the LPAR option: If TCEs are used to satisfy Requirements , and , then the platform must prevent any DMA operations to System Memory addresses which are not translated by TCEs.
R1--8. For the LPAR option: The DMA address range accessible to a PCI IOA on its I/O bus must be defined by the “ibm,dma-window” property in its parent’s OF device tree node.

Platform Implementation Note: To maximize the ability to migrate memory pages underneath active DMA operations, whenever possible, a bridge should create a bus for a single IOA. In that case, for conventional PCI or PCI-X IOAs, the bridge node representing that bus should include the “ibm,dma-window” property specific to the IOA; for PCI Express IOAs, the IOA function nodes should contain the “ibm,my-dma-window” property specific to the IOA function. When the configuration of a bus precludes memory migration, the platform may combine the DMA addresses for multiple IOAs that share a bus into a single “ibm,dma-window” property housed in the bridge node representing the bridge that creates the shared bus.
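The TCE requirements above can be sketched as a toy translation table: DMA addresses inside the window are translated page by page, and anything without a TCE is blocked. The 4 KB I/O page size, the class and method names, and the use of a dictionary as the table are all assumptions made for illustration; real TCE formats and window encodings are defined elsewhere.

```python
TCE_PAGE_SIZE = 4096                       # assumed I/O page size for this sketch
MIN_MAPPABLE_BYTES = 256 * 1024 * 1024     # "at least 256 MB for each PE"


class DmaBlocked(Exception):
    """Stands in for the platform refusing an untranslated DMA access."""


class TceTable:
    """Toy per-PE TCE table covering one DMA window (hypothetical model)."""

    def __init__(self, window_base: int, window_size: int) -> None:
        if window_base % TCE_PAGE_SIZE or window_size % TCE_PAGE_SIZE:
            raise ValueError("window must be page aligned")
        self.window_base = window_base
        self.window_size = window_size
        self.entries = {}                  # I/O page index -> real page address

    def put(self, io_addr: int, real_page_addr: int) -> None:
        """Install a translation for the I/O page containing io_addr."""
        self.entries[self._index(io_addr)] = real_page_addr

    def translate(self, io_addr: int) -> int:
        """Translate a DMA address, blocking anything without a TCE."""
        index = self._index(io_addr)
        if index not in self.entries:
            raise DmaBlocked(hex(io_addr))
        return self.entries[index] + (io_addr % TCE_PAGE_SIZE)

    def _index(self, io_addr: int) -> int:
        if not self.window_base <= io_addr < self.window_base + self.window_size:
            raise DmaBlocked(hex(io_addr))  # outside the window entirely
        return (io_addr - self.window_base) // TCE_PAGE_SIZE


def min_tce_entries() -> int:
    """Number of entries needed to map the 256 MB minimum at the assumed page size."""
    return MIN_MAPPABLE_BYTES // TCE_PAGE_SIZE
```

Under the 4 KB assumption, mapping the required 256 MB simultaneously takes 65536 entries per PE, which gives a feel for the table sizes the platform has to provision.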

Interrupt Sub-System Requirements

R1--1. For the LPAR option: The platform must not assign the same interrupt (LSI or MSI) or same interrupt source number to different PEs (interrupts cannot be shared between partitions).
R1--2. For the LPAR option: The interrupt presentation layer must support at least one global interrupt queue per platform supported partition.
R1--3. For the LPAR option: The interrupt presentation layer must separate the per processor interrupt management areas into separate 4 K pages, one per processor, so that they can each be individually protected by the PTE mechanism and assigned to their respective partitions.
R1--4. For the LPAR option: The platform must restrict the processors that service a global queue to those assigned to a single partition.
R1--5. For the LPAR option: If the interrupt source layer supports message signaled interrupts, the platform must isolate the PCI Message interrupt Input Port (PMIP) in its own 4 K page of the platform’s address space.
R1--6. For the LPAR option: If the interrupt source layer supports message signaled interrupts, the hardware must ignore all writes to the PMIP’s 4 K page except those to the PMIP itself.
R1--7. For the LPAR option: If the interrupt source layer supports message signaled interrupts, the hardware must return all ones on reads of the PMIP’s 4 K page except those to the PMIP itself. Signalling a machine check interrupt to the affected processor on a read that returns all 1s as above is optional.
R1--8. For the LPAR option: The interrupt source layer must support a means in addition to the inter-processor interrupt mechanism for the hypervisor to signal an interrupt to any processor assigned to a partition.

Software Note: While firmware takes all reasonable steps to prevent it, it may be possible, on some hardware implementations, for an OS to erroneously direct an individual IOA’s interrupt to another partition’s processor.
An OS supporting LPAR should ignore such “Phantom” interrupts.
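The isolation rules for the PMIP page (writes elsewhere in the page ignored, reads elsewhere returning all ones) can be modeled in a few lines. The sketch assumes 32-bit accesses, a caller-chosen PMIP offset within the page, and that a read of the PMIP itself returns the last value written; none of these details are stated in the requirements, and the class name is hypothetical.

```python
PAGE_SIZE = 4096
ALL_ONES = 0xFFFFFFFF          # assumed 32-bit access width


class PmipPage:
    """Toy model of the 4 K page that isolates the PMIP."""

    def __init__(self, pmip_offset: int) -> None:
        if not 0 <= pmip_offset < PAGE_SIZE:
            raise ValueError("PMIP must lie inside its 4 K page")
        self.pmip_offset = pmip_offset
        self.pmip_value = 0

    def _check(self, offset: int) -> None:
        if not 0 <= offset < PAGE_SIZE:
            raise ValueError("offset outside the PMIP page")

    def write(self, offset: int, value: int) -> bool:
        """Return True if the write reached the PMIP, False if it was ignored."""
        self._check(offset)
        if offset != self.pmip_offset:
            return False                      # ignored: not the PMIP itself
        self.pmip_value = value & ALL_ONES
        return True

    def read(self, offset: int) -> int:
        """Reads away from the PMIP return all ones."""
        self._check(offset)
        if offset != self.pmip_offset:
            return ALL_ONES
        return self.pmip_value                # assumption: PMIP reads back its value
```

Because the rest of the page neither stores writes nor returns data, mapping the page into a partition exposes nothing except the PMIP.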

Hypervisor Requirements

The purpose of the hypervisor is to virtualize the facilities of platforms with LPAR, such that multiple copies of a LoPAR compliant OS may simultaneously run protected from each other in different logical partitions of the platform. That is, the various OS images may, without explicit knowledge of each other, boot, run applications, handle exceptions and events, and terminate without affecting each other.

The hypervisor is entered by way of three interrupts: the System Reset Interrupt, the Machine Check Interrupt, and the System (hypervisor) Call Interrupt. These use hypervisor interrupt vectors 0x0100, 0x0200, and 0x0C00 respectively. In addition, a processor implementation dependent interrupt, at its assigned address, may cause the hypervisor to be entered. The return from the hypervisor to the OS is via the rfid (Return from Interrupt Doubleword) instruction. The target of the rfid (the instruction at the address contained in SRR0) is either a firmware glue routine (in the case of System Reset or Machine Check) or the instruction immediately following the invoking hypervisor call.

The reason for the firmware glue routines is that the OS must do its own processing because of the asynchronous nature of System Reset or Machine Check interruptions. The firmware glue routine calls an OS registered recovery routine for the System Reset or Machine Check condition. For further details, see (reference to recoverable machine check ACR material to be added when available). The glue routines are registered by the partition’s OS through RTAS. Until the glue routines are registered, the OS does not receive direct reports of either System Reset or Machine Check interrupts but is simply re-booted by the hypervisor. The glue routines contain a register buffer area that the hypervisor fills with register values that the glue routine must pass to the OS when calling the interrupt handler. The last element in this buffer is a lock word.
The lock word is set with the value of the processor using the buffer, and is reset by the glue routine just before it calls the OS interrupt handler. This way only one buffer is needed per partition rather than one per processor.

At the invocation of the hypervisor, footprint records are generated for recovery conditions. Machine Check and Check Stop conditions are, in some cases, isolatable to the affected partition(s). In these cases, the hypervisor can then prove that it was not executing changes to the global system tables on the offending processor when the error occurred. If this cannot be proven, the global state of the complex is in doubt and the error cannot be contained. It is anticipated that check stops that only corrupt the internal state of the affected processor stop that processor only. When the service processor subsequently notices the stopped processor, it notifies one of the other processors in the partition through a simulated recoverable machine check. The hypervisor running on the notified processor then takes appropriate action to log out and restart the partition or, if there is an alternate CPU capability, to continue execution with a substitute for the stopped processor.

The following table presents the functions supplied by the hypervisor.

Architecture Note: Some functions performed by partition firmware (OF and RTAS) require hypervisor assist, but those firmware implementation dependent interfaces do not appear in this document.
Architected hcall()s

Function Name/Section | Comments
/ | Removes a PTE from the partition’s node Page Frame Table
/ | Removes up to four (4) PTEs from the partition’s node Page Frame Table
/ | Removes PTEs of a naturally aligned block of Virtual addresses from the partition’s Page Frame Table
/ | Inserts a PTE into the partition’s node Page Frame Table
/ | Reads the specified PTE from the partition’s node Page Frame Table
/ | Clears the Modified bit in the specified PTE in the partition’s node Page Frame Table
/ | Clears the Referenced bit in the specified PTE in the partition’s node Page Frame Table
/ | Sets the Page Protection and Storage Key bits in the specified PTE in the partition’s node Page Frame Table
/ | Prepares for resizing the partition's HPT
/ | Changes the partition's HPT to a new size
/ | Returns the value of the specified DMA Translation Control Entry
/ | Inserts the specified value into the specified DMA Translation Control Entry
/ | Inserts the specified value into multiple DMA Translation Control Entries
/ | Inserts a list of values into the specified range of DMA Translation Control Entries
/ | SPRG0 is architecturally a hypervisor resource. This call allows the OS to write the register.
/ | DABR is architecturally a hypervisor resource. This call allows the OS to write the register.
/ | Initializes pages in real mode either to zero or to the copied contents of another page.
/ | Manage the Extended DABR facility.
/ | Adjust implementation dependent tuning values
/ | Set implementation dependent tuning switches
/ | Returns the value contained in a cache inhibited logical address
/ | Stores a value into a cache inhibited logical address
/ | Returns up to 16 bytes of virtualized console terminal data.
/ | Sends up to 16 bytes of data to a virtualized console terminal.
/ | Gets a list of possible client Vterm IOA connections.
/ | Associates server Vterm IOA to client Vterm IOA.
/ | Breaks association between server Vterm IOA and client Vterm IOA.
/ | Returns internal hypervisor work areas for code maintenance.
/ | Clears the hash page table for a partition in preparation for a restart
/ | Generates an End Of Interrupt
/ | Sets the Processor’s Current Interrupt Priority
/ | Generates an Inter-processor Interrupt
/ | Polls for pending interrupt
/ | Accepts pending interrupt
/ | Get the ESB addresses for a LISN
/ | Assign a target and priority to a LISN
/ | Get the target and priority assigned to a LISN
/ | Get the notification management page for a LISN
/ | Set/Reset an EQ for a target and priority
/ | Get the EQ set for a target and priority
/ | Set the OS reporting cache line pair for a target
/ | Get the OS reporting cache line pair for a target
/ | Load or store operation on the ESB page for a LISN
/ | Issue the requested sync
/ | Reset interrupt state to the initial state
/ | Open a terminal session with a Vterm IOA
/ | Get data from a Vterm session
/ | Put data to a Vterm session
/ | Close an existing session with a Vterm IOA
/ | Migrates the page underneath an active DMA operation.
/ | Manages the performance monitor facility.
/ | Registers the virtual processor area that contains the virtual processor dispatch count
/ | Makes processor virtual processor cycles available for other uses (called when an OS image is idle)
/ | Causes a virtual processor’s cycles to be transferred to a specified processor. (Called by a blocked OS image to allow a lock holder to use virtual processor cycles rather than waiting for the block to clear.)
/ | Awakens a virtual processor that has ceded its cycles.
/ | Returns the partition’s virtual processor performance parameters.
/ | Sets the partition’s virtual processor performance parameters (within constraints).
/ | Returns the value of the virtual processor utilization register.
/ | Polls the hypervisor for the existence of pending work to dispatch on the calling processor.
/ | Returns the summation of the physical processor pool’s idle cycles.
/ | Register Command/Response Queue
/ | Frees the memory associated with the Command/Response Queue
/ | Controls the virtual interrupt signaling of virtual IOAs
/ | Sends a message on the Command/Response Queue
/ | Loads a Redirected Remote DMA Remote Translation Control Entry
/ | Loads a list of Redirected Remote DMA Remote Translation Control Entries
/ | Unmaps a redirected TCE that was previously built with H_PUT_RTCE or H_PUT_RTCE_INDIRECT
/ | Allows modification of LIOBN Attributes.
/ | Copies data between partitions as if by TCE mapped DMA.
/ | Write parameter data to remote DMA buffer.
/ | Read data from remote DMA buffer to return registers.
/ | Registers the partition’s logical LAN control structures with the hypervisor
/ | Releases the partition’s logical LAN control structures
/ | Adds receive buffers to the logical LAN receive buffer pool
/ | Sends a logical LAN message
/ | Controls the reception and filtering of non-broadcast multicast packets.
/ | Changes the MAC address for an ILLAN virtual IOA.
/ | Allows modifications of ILLAN Attributes.
/ | Constructs a cookie, specific to the intended client, representing a shared resource.
/ | Invalidates a cookie representing a shared resource.
/ | Maps a shared resource into the client’s logical address space
/ | Removes a shared resource from a client’s logical address space.
/ | Removes receive buffers of specified size from the logical LAN receive buffer pool.
/ | Allows the partition to manipulate or query certain virtual IOA behaviors.
/ | Join active threads and return H_CONTINUE to final calling thread
/ | Use the calling processor to perform platform operations.
/ | Transition VASI operation stream state.
/ | Return the VASI operation stream state.
/ | Reactivate a suspended CRQ.
/ | Change the page mapping characteristics of the Virtualized Real Mode Area.
/ | Returns Virtual Partition Memory pool statistics
/ | Set Memory Performance Parameters
/ | Get Memory Performance Parameters
/ | Determine Memory Overcommit Performance
/ | Register a Sub-CRQ.
/ | Free a Sub-CRQ.
/ | Send a message to a Sub-CRQ.
/ | Send a list of messages to a Sub-CRQ.
/ | Report the home node associativity for a given virtual processor
/ | Get the partition energy management parameters
/ | Returns hints for activating/releasing resource instances to achieve the best energy efficiency.
/ | Registers Subvention Notification Structure
/ | Get a random number
/ | Initiate a co-processor operation
/ | Stop a co-processor operation
/ | Get Extended Memory Performance Parameters
/ | Set Processing resource mode
/ | Search TCE table for entries within a specified range
/ | Configure Memory Usage Instrumentation
/ | Reset page age and affinity log, and/or PUT/HBA
/ | Return the page usage information for a logical address range of pages
/ | Return multiple HBA
/ | Invalidate the specified process segment from all segment lookaside buffers in the system.
/ | Invalidate the specified process table entry.
/ | Manage the virtual address translation mode including registration of a process table.
/ | Attach a process to a coherent platform function.
/ | Detach a process from a coherent platform function.
/ | Control a coherent platform function.
/ | Collect interrupt information for a coherent platform function.
/ | Control faults for a coherent platform function.
/ | Download an application to a coherent platform function.
/ | Download an application to a coherent platform facility.
/ | Control a coherent platform facility.
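The register buffer lock word described earlier in this section (one buffer per partition, locked with the value of the processor using it, released by the glue routine just before the OS handler runs) can be sketched as follows. The class and function names are hypothetical, the "unlocked" state is represented by None, and a real implementation would use atomic operations and would spin rather than return False.

```python
class NmiRegisterBuffer:
    """Toy model of the per-partition register buffer with its trailing lock word."""

    def __init__(self) -> None:
        self.registers = {}
        self.lock_word = None      # None models "unlocked" (representation is assumed)


def hypervisor_fill(buf: NmiRegisterBuffer, cpu_id: int, registers: dict) -> bool:
    """Hypervisor side: claim the buffer for cpu_id and store the register values."""
    if buf.lock_word is not None:
        return False               # another processor still owns the buffer
    buf.lock_word = cpu_id         # lock word takes the value of the using processor
    buf.registers = dict(registers)
    return True


def glue_routine(buf: NmiRegisterBuffer, os_handler):
    """Glue routine side: copy the values out, release the lock, then call the OS."""
    saved = dict(buf.registers)
    buf.lock_word = None           # reset just before calling the OS interrupt handler
    return os_handler(saved)
```

Releasing the lock before the handler runs is what lets a second processor in the same partition reuse the single buffer without waiting for the first processor's OS handler to finish.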

System Reset Interrupt

Hypervisor code saves all processor state by saving the contents of one register in SPRG2 (SPRG1 if ibm,nmi-register-2 was used), multiplexing the use of this resource with the OS. The processor’s stack and data area are found by processing the Processor Identification Register.

R1--1. For the LPAR option: The platform must support signalling system reset interrupts to all processors assigned to a partition.
R1--2. For the LPAR option: The platform must support signalling system reset interrupts individually as well as collectively to all supported partitions.
R1--3. For the LPAR option: The system reset interrupts signaled to one partition must not affect operations of another partition.
R1--4. For the LPAR option: The hypervisor must intercept all system reset interrupts.
R1--5. For the LPAR option: The platform must implement the FWNMI option.
R1--6. For the LPAR option: The hypervisor must maintain a count, reset when the partition’s OS, through RTAS, registers for system reset interrupt notification, of system reset interrupts signaled to a partition’s processor.
R1--7. For the LPAR option: Once the partition’s OS has registered for system reset interrupt notification, the hypervisor must forward the first and second system reset interrupts signaled to a partition’s processor.
R1--8. For the LPAR option: The hypervisor must on the third and all subsequent system reset interrupts signaled to a partition’s processor invoke OF to initiate the partition’s reboot policy.
R1--9. For the LPAR option: The hypervisor must have the capability to receive and handle the system reset interrupts simultaneously on multiple processors in the same or different partitions up to the number of processors in the system.
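The counting policy in the system reset requirements (count reset on registration, first and second interrupts forwarded, third and later handed to the reboot policy) amounts to a small state machine. The sketch below is illustrative only: the class name and the string results are hypothetical, and the behavior before registration follows the statement elsewhere in this document that an unregistered OS is simply re-booted.

```python
class SystemResetTracker:
    """Toy per-partition model of the hypervisor's system reset bookkeeping."""

    def __init__(self) -> None:
        self.registered = False
        self.count = 0

    def os_registers(self) -> None:
        """OS registers for notification through RTAS; the count starts over."""
        self.registered = True
        self.count = 0

    def on_system_reset(self) -> str:
        """Decide what happens to one system reset interrupt for this partition."""
        if not self.registered:
            return "reboot"            # no glue routine yet: partition is re-booted
        self.count += 1
        if self.count <= 2:
            return "forward"           # first and second interrupts go to the OS
        return "reboot-policy"         # third and later: OF initiates reboot policy
```

Under this reading, an OS that fails to respond to two forwarded interrupts loses control to the partition's reboot policy on the third.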

Machine Check Interrupt

Hypervisor code saves all processor state by saving the contents of one register in SPRG2 (SPRG1 if ibm,nmi-register-2 was used), multiplexing the use of this resource with the OS. The processor’s stack and data area are found by processing the Processor Identification Register. The hypervisor investigates the cause of the machine check. The cause is either a recoverable event on the current processor, or a non-recoverable event either on the current processor or on one of the other processors in the logical partition. The hypervisor must also determine whether the machine check may have corrupted its own internal state (by looking at the footprints, if any, that were left in the per processor data area of the errant processor).

R1--1. For the LPAR option: The hypervisor must have the capability to receive and handle the machine check interrupts simultaneously on multiple processors in the same or different partitions up to the number of processors in the system.

Hypervisor Call Interrupt

The hypervisor call (hcall) interrupt is a special variety of the system call instruction. The parameters to the hcall() are passed in registers using the PA ABI definitions (Reg 3-12 for parameters). In contrast to the PA ABI, pass by reference parameters are avoided to or from hcall(). This minimizes the address translation problem that pointer parameters cause. Some input parameters are indexes. Output parameters, when generated, are passed in registers 4 through 12 and require special in-line assembler code on the part of the caller. The first parameter to hcall() is the function token. The Hypervisor Call Function Table specifies the valid hcall() function names and token values.

Some of the hcall() functions are optional. To indicate whether the platform is in LPAR mode, and which functions are available on a given platform, the OF property “ibm,hypertas-functions” is provided in the /rtas node of the partition’s device tree. The property is present if the platform is in LPAR mode, while its value specifies which function sets are implemented by a given implementation. If a platform implements any hcall() of a function set, it implements the entire function set. Additionally, certain values of the “ibm,hypertas-functions” property indicate that the platform supports a given architecture extension to a standard hcall().

The floating point registers along with the FPSCR are in general preserved across hcall() functions, unless the “Maintain FPRs” field of the VPA = 0, see . The general purpose registers r0 and r3-r12, and the CTR and XER registers, are volatile, along with the condition register fields 0 and 1 plus 5-7. Specific hcall()s may specify a more restricted “kill set”; refer to the specific hcall() specification below.

R1--1. For the LPAR option: The platform’s /rtas node must contain an “ibm,hypertas-functions” property as defined below.
R1--2.
For the LPAR option: If a platform reports in its “ibm,hypertas-functions” property (see ) that it supports a function set, then it must support all hcall()s of that function set as defined in .

Hypervisor Call Function Table

Hypervisor Call Function Name/Section | Hypervisor Call Function Token | Hypervisor Call Performance Class | Function Mandatory? | Function Set
UNUSED | 0x0
/ | 0x4 | Critical | Yes | hcall-pft
/ | 0x8 | Critical | Yes | hcall-pft
/ | 0xC | Critical | Yes | hcall-pft
/ | 0x10 | Critical | Yes | hcall-pft
/ | 0x14 | Critical | Yes | hcall-pft
/ | 0x18 | Critical | Yes | hcall-pft
/ | 0x1C | Critical | Yes | hcall-tce
/ | 0x20 | Critical | Yes | hcall-tce
/ | 0x24 | Critical | Yes | hcall-sprg0
/ | 0x28 | Critical | Yes - if DABR exists | hcall-dabr
/ | 0x2C | Critical | Yes | hcall-copy
/ | 0x3C | Normal | Yes | hcall-debug
/ | 0x40 | Normal | Yes | hcall-debug
/ | 0x54 | Critical | Yes | hcall-term
/ | 0x58 | Critical | Yes | hcall-term
/ | 0x60 | Normal | Yes if enabled by HMC (default disabled) | hcall-dump
/ | 0x64 | Critical | Yes | hcall-interrupt
/ | 0x68 | Critical | Yes | hcall-interrupt
/ | 0x6C | Critical | Yes | hcall-interrupt
/ | 0x70 | Critical | Yes | hcall-interrupt
H_XIRR / | 0x74 | Critical | Yes | hcall-interrupt
H_XIRR-X / | 0x2FC | Critical | Yes | hcall-interrupt
/ | 0x78 | Normal | If LRDR option is implemented | hcall-migrate
/ | 0x7C | Normal | If performance monitor is implemented | hcall-perfmon
Reserved | 0x80 - 0xD8
/ | 0xDC | Normal | If SPLPAR or SLB Shadow Buffer option is implemented | hcall-splpar SLB-Buffer
/ | 0xE0 | Critical | If SPLPAR option is implemented | hcall-splpar
/ | 0xE4 | Critical | If SPLPAR option is implemented | hcall-splpar
/ | 0xE8 | Critical | If SPLPAR option is implemented | hcall-splpar
/ | 0xEC | Normal | If SPLPAR option is implemented | hcall-splpar
/ | 0xF0 | Normal | If SPLPAR option is implemented | hcall-splpar
/ | 0xF4 | Critical | If SPLPAR option is implemented | hcall-splpar
/ | 0xF8 | Normal | If SPLPAR option is implemented | hcall-pic
/ | 0xFC | Normal | If VSCSI option is implemented | hcall-crq
/ | 0x100 | Normal | If VSCSI option is implemented | hcall-crq
/ | 0x104 | Critical | If either the VSCSI or logical LAN option is implemented | hcall-vio
/ | 0x108 | Critical | If VSCSI option is implemented | hcall-crq
/ | 0x10C | Critical | If VSCSI option is implemented | hcall-rdma
/ | 0x110 | Critical | If VSCSI option is implemented | hcall-rdma
/ | 0x114 | Normal | If logical LAN option is implemented | hcall-lLAN
/ | 0x118 | Normal | If logical LAN option is implemented | hcall-lLAN
/ | 0x11C | Critical | If logical LAN option is implemented | hcall-lLAN
/ | 0x120 | Critical | If logical LAN option is implemented | hcall-lLAN
/ | 0x124 | Critical | New designs as of 01/01/2003 | hcall-bulk
/ | 0x128 | Critical | If VSCSI option is implemented | hcall-rdma
/ | 0x12C | Critical | If VSCSI option is implemented | hcall-rdma
/ | 0x130 | Critical | If logical LAN option is implemented | hcall-lLAN
/ | 0x134 | Normal | If Extended DABR option is implemented | hcall-xdabr
/ | 0x138 | Critical | | hcall-multi-tce
/ | 0x13C | Critical | | hcall-multi-tce
/ | 0x140 | Critical | | hcall-multi-tce
Reserved | 0x144
Reserved | 0x148
/ | 0x14C | Normal | If Logical LAN option is implemented | hcall-ILAN
/ | 0x150 | Normal | If the Server Vterm option is implemented | hcall-vty
/ | 0x154 | Normal | If the Server Vterm option is implemented | hcall-vty
/ | 0x158 | Normal | If the Server Vterm option is implemented | hcall-vty
/ | 0x1C4 | Normal | If Shared Logical Resource option is Implemented | hcall-slr
/ | 0x1C8 | Normal | If Shared Logical Resource option is Implemented | hcall-slr
/ | 0x1CC | Normal | If Shared Logical Resource option is Implemented | hcall-slr
/ | 0x1D0 | Normal | If Shared Logical Resource option is Implemented | hcall-slr
/ | 0x1D4 | Critical | If logical LAN option is implemented | hcall-lLAN
/ | 0x1D8 | Critical | If SPLPAR option is implemented | hcall-poll-pending
Reserved | 0x1DC - 0x1E0 | Varies
Reserved | 0x1E8 - 0x1EC
Reserved | 0x1F0 - 0x23C | Varies
/ | 0x240 | Normal | If LIOBN Attributes are implemented | hcall-liobn-attributes
/ | 0x244 | Normal | If ILLAN Checksum Offload Support is implemented; If ILLAN Backup Trunk Adapter option is implemented | hcall-illan-options
Reserved | 0x248
/ | 0x24C | Critical | If H_PUT_RTCE is implemented | hcall-rdma
Reserved | 0x27C
Reserved | 0x280
Reserved | 0x28C-0x294
/ | 0x298 | Normal | If Thread Join option is implemented | hcall-join
/ | 0x29C | Normal | If VASI option is implemented | hcall-vasi
/ | 0x2A0 | Normal | If VASI option is implemented | hcall-vasi
/ | 0x2A4 | Normal | If VASI option is implemented | hcall-vasi
/ | 0x2A8 | Normal | If any virtual I/O options are implemented | hcall-vioctl
/ | 0x2AC | Normal | If the VRMA option is implemented. | hcall-vrma
/ | 0x2B0 | Continued | If partition suspension option is implemented | hcall-suspend
Reserved | 0x2B4
/ | 0x2B8 | Normal | If the Partition Energy Management Option is implemented | hcall-get-emparm
/ | 0x2BC | Normal | If the Cooperative Memory Over-commitment Option is implemented | hcall-cmo
/ | 0x2D0 | Normal | If the Cooperative Memory Over-commitment Option is implemented | hcall-cmo
/ | 0x2D4 | Normal | If the Cooperative Memory Over-commitment Option is implemented | hcall-cmo
/ | 0x2D8 | Normal | If the Cooperative Memory Over-commitment Option is implemented and the calling partition is authorized. | hcall-cmo
/ | 0x2DC | Normal | If the Subordinate CRQ Option is implemented | hcall-sub-crq
/ | 0x2E0 | Normal | If the Subordinate CRQ Option is implemented | hcall-sub-crq
/ | 0x2E4 | Normal | If the Subordinate CRQ Option is implemented | hcall-sub-crq
/ | 0x2E8 | Normal | If the Subordinate CRQ Option is implemented | hcall-sub-crq
/ | 0x2EC | Normal | If the VPHN Option is implemented | hcall-vphn
Reserved | 0x2F0
/ | 0x2F4 | Normal | If the Partition Energy Management Option is implemented | hcall-best-energy-1<list>

The <list> suffix for hcall-best-energy indicates an optional dash delimited series (may be null) of supported resource codes encoded as ASCII decimal values in addition to the minimal support value of 1 for processors; other values are defined in .
/ 0x2F8 Normal If the Expropriation Subvention Notification Option is implemented hcall-esn X_XIRR-X / 0x2FC Critical Yes hcall-interrupt / 0x300 Normal If a random number generator Platform Facilities Option is implemented hcall-random Reserved 0x310 / 0x304 Normal If one or more Coprocessor Platform Facilities Options are implemented hcall-cop / 0x308 Normal If one or more Coprocessor Platform Facilities Options are implemented hcall-cop / 0x314 Normal If the Extended Cooperative Memory Overcommittment Option is implemented hcall-cmo-x / 0x31C Normal If the platform supports POWER ISA version 2.07 or higher hcall-set-mode Reserved 0x320 / 0x324 normal If the plaform implements the LRDR option at LoPAR Version 2.7 or higher hcall-xlates-limited / 0x328 Critical if the platform supports the "block invalidate" option hcall-block-remove / 0x32C normal If the Memory Usage Instrumentation Option is implemented hcall-mui / 0x330 normal If the Memory Usage Instrumentation Option is implemented hcall-mui / 0x334 normal If the Memory Usage Instrumentation Option is implemented hcall-mui / 0x338 normal If the Memory Usage Instrumentation Option is implemented hcall-mui / 0x33C Normal If the plaform implements the LRDR option at LoPAR Verstion 2.8 or higher hcall-implementation-dependent-tuning / 0x340 Normal If the plaform implements the LRDR option at LoPAR Verstion 2.8 or higher hcall-implementation-dependent-tuning / 0x344 Normal If one or more Coherent Platform Facilities Options are implemented hcall-ca / 0x348 Normal If one or more Coherent Platform Facilities Options are implemented hcall-ca / 0x34C Normal If one or more Coherent Platform Facilities Options are implemented hcall-ca / 0x350 Normal If one or more Coherent Platform Facilities Options are implemented hcall-ca / 0x354 Normal If one or more Coherent Platform Facilities Options are implemented hcall-ca / 0x358 Terminal For LoPAR Verstion 2.8 and higher hcall-clr-hpt / 0x35C Normal If one or more Coherent 
Platform Facilities Options are implemented hcall-ca Reserved 0x360 / 0x364 Normal If one or more Coherent Platform Facilities Options are implemented hcall-ca / 0x368 Normal If one or more Coherent Platform Facilities Options are implemented hcall-ca / 0x36C Normal If the platform supports the Hash Page Table Resize Option hcall-hpt-resize / 0x370 Normal If the platform supports the Hash Page Table Resize Option hcall-hpt-resize / 0x374 Normal If the platform supports the In-Memory Table Translation Option hcall-imtt / 0x378 Normal If the platform supports the In-Memory Table Translation Option hcall-imtt / 0x37C Normal If the platform supports the In-Memory Table Translation Option hcall-imtt / 0x3A8 Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x3AC Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x3B0 Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x3B4 Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x3B8 Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x3BC Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x3C0 Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x3C4 Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x3C8 Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x3CC Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x3D0 Normal If the OS enabled XIVE exploitation hcall-int-exploitation / 0x408 Normal If VSM is implemented hcall-vsm / 0x40C Critical If VSM is implemented hcall-vsm / 0x410 Critical If VSM is implemented hcall-vsm / 0x414 Normal If VSM is implemented hcall-vsm hcalls to support an Ultravisor 0xEF00 - 0xEF80 Reserved for platform-dependent hcall()s / 0xF000 - 0xFFFC ILLEGAL Any token value having a one in either of the low order two bits Reserved 0x380 - 0xEEFF and 0x10000 - 0xFFFFFFFF-FFFFFFFC: RTAS implementations may assigns values in these 
ranges to their own internal interfaces, as long as they are prepared for the growth of architected functions into this range.
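As a rough illustration of token validation: the table above treats any token with a one in either of the low order two bits as illegal, so legal tokens are multiples of four and can index a branch table after a shift. The sketch below assumes a hypothetical power-of-two table size (`table_entries`) and a hypothetical helper name; neither is defined by this architecture.

```python
def branch_table_index(token, table_entries=1024):
    """Return the branch table index for an hcall token, or None if rejected.

    Assumption: the branch table holds a power-of-two number of entries, one
    per 4-byte token step. A single mask test then rejects both tokens with
    either low order bit set and tokens beyond the end of the table.
    """
    if table_entries <= 0 or table_entries & (table_entries - 1):
        raise ValueError("table_entries must be a power of two")
    valid_bits = (table_entries - 1) << 2   # bits a legal in-range token may use
    if token < 0 or token & ~valid_bits:    # the single mask operation
        return None
    return token >> 2                       # entry may still be "unimplemented"
```

For example, `branch_table_index(0x2FC)` gives 191, while `branch_table_index(0x5)` and `branch_table_index(0xF000)` are rejected. An accepted index can still refer to an unimplemented code point, which the table entry itself has to handle.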

Firmware Implementation Note: The assignment of function tokens is designed such that a single mask operation can validate that the value is within the range of a reasonable size branch table. Entries within the branch table can handle unimplemented code points.

The hypervisor routines are optimized for execution speed. In some rare cases, locks are taken, and specific hardware designs require short wait loops. However, if a needed resource is truly busy, or processing is required by an agent, the hypervisor returns to the caller, either to have the function retried or continued at a later time. The Performance Class establishes specific performance requirements against each specific hcall() function as defined below.

Hypervisor Call Performance Classes:

Critical: Must make continuous forward progress; encountering any busy resource must cause the function to back out and return with a “hardware busy” return code. When subsequently called, the operation begins again. Short loops for lwarx and stwcx to acquire an apparently unheld lock are allowed. These functions may not include wait loops for slow hardware access.

Normal: Similar to critical; however, wait loops for slow hardware access are allowed. These functions may not include wait loops for an agent such as an external micro-processor or message transmission device.

Continued: This class of functions is expected to serialize on the use of external agents. If the external agent is busy, the function returns “hardware busy”. If the interface to the external agent is not busy, the interface is marked busy and used to start the function. The function returns one of the “function in progress” return codes. Later, the caller may check on the completion of the function by issuing the “check” Hcall function with the “function in progress” parameter code. If the function completed properly, the hypervisor maintains no status and the “check” Hcall returns success. If the operation is still in progress, the same “function in progress” code is returned. If the function completed in error, the completion error code is returned. The hypervisor maintains room for at least one outstanding error status per external agent interface per processor. If there is no room to record the error status, the hypervisor returns “hardware busy” and does not start the function.

Terminal: This class of functions is used to manage a partition when the OS is not in regular operation. These events include postmortems and extensive recoveries.

The hypervisor performance classes are ordered in decreasing restriction.

R1--3. For the LPAR option: The caller must perform properly given that the hypervisor meets the performance class specified.

R1--4. For the LPAR option: The hypervisor implementation must meet the specified performance class or higher.

R1--5. For the LPAR option: Platform hardware designs must take the allowable performance classes into account when choosing the hardware access technology for the various facilities.

R1--6. For the LPAR option: The hypervisor must have the capability to receive and handle the hypervisor call interrupts simultaneously on multiple processors in the same or different partitions up to the number of processors in the system.

R1--7. For the LPAR option: The hypervisor must check the state of the MSR register bits that are not set to a specific value by the processor hardware during the invoking interrupt per .

MSR State on Entrance to Hypervisor

MSR Bit                              Required State                   Error-Code
HV - Hypervisor                      1                                None
Bits 2, 4:46, 57, and 60 Reserved    Set to 0 by hardware             None
ILE - Interrupt Little Endian        As last set by the hypervisor    None
ME - Machine check Enable            As last set by the hypervisor    None
LE - Little-Endian Mode              0 forced by ILE                  None

R1--8. For the LPAR option: The Hcall() flags field must meet the definition in: ; the hypervisor may safely ignore flag field values not explicitly defined by the specific hcall() semantic.

R1--9. For the LPAR option: The platform must ensure that flag field values not defined for a specific hcall() do not compromise partitioning integrity.

R1--10. For the LPAR option: Implementations that normally choose to ignore invalid flag field values must provide a “debug mode” that does check for invalid flag field values and returns H_Parameter when any are found.

Architecture Note: The method for invocation of a platform’s “debug mode” is beyond the scope of this architecture.

Page Frame Table Access flags field definition

Bit      Function
0-15     NUMA CEC Cookie
16-23    Subfunction Codes
24       Exact
25       R-XLATE
26       READ-4
27       Reserved
28-31    CMO Option Flags
32       AVPN
33       andcond
34-39    Reserved
40       I-Cache-Invalidate
41       I-Cache-Synchronize
42       CC (Coalesce Candidate)
43-46    MUI Options (see note 1)
47       Reserved
48       Zero Page
49       Copy Page
50-54    key0-key4 (see note 2)
55       pp0 (see note 3)
56       Compression
57       Checksum
58-60    Reserved
61       N
62       pp1
63       pp2

Note 1: Bits 43-46 (MUI Options) detail is provided in .

Note 2: Bits 50-54 (key0 - key4) shall be treated as reserved on platforms that either do not contain an “ibm,processor-storage-keys” property, or contain an “ibm,processor-storage-keys” property with the value of zero in both cells.

Note 3: Bit 55 (pp0) shall be treated as reserved on platforms that do not have the “Support for the “110” value of the Page Protection (PP) bits” bit set to a value of 1 in the “ibm,pa-features” property.
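The flags table numbers bits from the high order end of the 64 bit register (the NUMA CEC Cookie in bits 0-15 occupies the high order 16 bits), so bit n corresponds to the value 1 << (63 - n). The following is a minimal sketch of building a flags value under that reading; the dictionary keys and function names are illustrative and are not defined by this architecture.

```python
# Bit positions copied from the flags field definition (bit 0 = high order bit).
FLAG_BITS = {
    "EXACT": 24, "R_XLATE": 25, "READ_4": 26,
    "AVPN": 32, "ANDCOND": 33,
    "ICACHE_INVALIDATE": 40, "ICACHE_SYNCHRONIZE": 41,
    "ZERO_PAGE": 48, "COPY_PAGE": 49,
    "COMPRESSION": 56, "CHECKSUM": 57,
    "N": 61, "PP1": 62, "PP2": 63,
}


def flag_mask(bit):
    """Convert a big-endian bit number (0..63) into a 64 bit mask."""
    if not 0 <= bit <= 63:
        raise ValueError("bit number must be in 0..63")
    return 1 << (63 - bit)


def build_flags(*names, numa_cec_cookie=0):
    """Combine named single-bit flags with an optional NUMA CEC Cookie."""
    if not 0 <= numa_cec_cookie <= 0xFFFF:
        raise ValueError("NUMA CEC Cookie is a 16 bit field")
    value = numa_cec_cookie << 48          # bits 0-15 are the high order 16 bits
    for name in names:
        value |= flag_mask(FLAG_BITS[name])
    return value
```

For instance, `build_flags("AVPN", "ANDCOND")` yields 0xC0000000, and `build_flags("EXACT")` sets only bit 24, that is 1 << 39.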

R1--11. For the LPAR option: The caller of Hcall must be in privileged mode (MSR[PR] = 0) or the hypervisor immediately returns an H_Privilege return code. See for this and other architected return codes.

R1--12. For the LPAR option: The caller of hcall() must be prepared for a return code of H_Hardware from all functions.

R1--13. For the LPAR option: In order for the platform to return H_Hardware, the error must not have resulted in an undetectable state/data corruption nor will continued operation propagate an undetectable state/data corruption as a result of the original error.

Notes:

A detectable corruption, when accessed, results in either a H_Hardware return code, machine check or check stop per platform policy.

Among other implications of Requirement are: the effective state of the partition appears to not change due to the failed hcall() (any partial changes to persistent state/data are backed out); and the recovery of platform resources that held lost state/data does not hide the state/data loss to subsequent users of that state/data.

The operating system is not expected to log a serviceable event due to an H_Hardware return code from an hcall(), and treats the hcall() as failing due to nonspecific hardware reasons. Any logging of a serviceable event in response to the underlying cause is handled by separate platform initiated operations.

Hypervisor Call Return Code Table

Hypervisor Call Return Code Values (R3)    Meaning
0x0100000 - 0x0FFFFFFF    Function in Progress
9905    H_LongBusyOrder100sec - Similar to H_LongBusyOrder1msec, but the hint is a 100 second wait this time.
9904    H_LongBusyOrder10sec - Similar to H_LongBusyOrder1msec, but the hint is a 10 second wait this time.
9903    H_LongBusyOrder1Sec - Similar to H_LongBusyOrder1msec, but the hint is a 1 second wait this time.
9902    H_LongBusyOrder100mSec - Similar to H_LongBusyOrder1msec, but the hint is a 100 millisecond wait this time.
9901    H_LongBusyOrder10mSec - Similar to H_LongBusyOrder1msec, but the hint is a 10 millisecond wait this time.
9900    H_LongBusyOrder1msec - This return code is identical to H_Busy, but adds a hint to the partition OS. If the partition OS can delay for 1 millisecond, a new hcall will likely succeed with no further busy return codes. If the partition OS cannot handle a delay, it is free to retry immediately.
18      H_CONTINUE
17      H_PENDING
16      H_PARTIAL_STORE
15      H_PAGE_REGISTERED
14      H_IN_PROGRESS
13      Sensor value >= Critical high
12      Sensor value >= Warning high
11      Sensor value normal
10      Sensor value <= Warning low
9       Sensor value <= Critical low
5       H_PARTIAL (The request completed only partially. Parameters were valid but some specific hcall function condition prevented fully completing the architected function; see the specific hcall definition for possible reasons.)
4       H_Constrained (The request called for resources in excess of the maximum allowed. The resultant allocation was constrained to the maximum allowed.)
3       H_NOT_AVAILABLE
2       H_Closed (virtual I/O connection is closed)
1       H_Busy (Hardware Busy -- Retry Later)
0       H_Success
-1      H_Hardware (Error)
-2      H_Function (Not Supported)
-3      H_Privilege (Caller not in privileged mode)
-4      H_Parameter (Outside Valid Range for Partition or conflicting)
-5      bad_mode (Illegal MSR value)
-6      H_PTEG_FULL (The requested pteg was full)
-7      H_Not_Found (The requested entity was not found)
-8      H_RESERVED_DABR (The requested address is reserved by the hypervisor on this processor)
-9      H_NOMEM
-10     H_AUTHORITY (The caller did not have authority to perform the function)
-11     H_Permission (The mapping specified by the request does not allow for the requested transfer)
-12     H_Dropped (One or more packets could not be delivered to their requested destinations)
-13     H_S_Parm (The source parameter is illegal)
-14     H_D_Parm (The destination parameter is illegal)
-15     H_R_Parm (The remote TCE mapping is illegal)
-16     H_Resource (One or more required resources are in use)
-17     H_ADAPTER_PARM (invalid adapter)
-18     H_RH_PARM (resource not valid or logical partition conflicting)
-19     H_RCQ_PARM (RCQ not valid or logical partition conflicting)
-20     H_SCQ_PARM (SCQ not valid or logical partition conflicting)
-21     H_EQ_PARM (EQ not valid or logical partition conflicting)
-22     H_RT_PARM (invalid resource type)
-23     H_ST_PARM (invalid service type)
-24     H_SIGT_PARM (invalid signalling type)
-25     H_TOKEN_PARM (invalid token)
-27     H_MLENGTH_PARM (invalid memory length)
-28     H_MEM_PARM (invalid memory I/O virtual address)
-29     H_MEM_ACCESS_PARM (invalid memory access control)
-30     H_ATTR_PARM (invalid attribute value)
-31     H_PORT_PARM (invalid port number)
-32     H_MCG_PARM (invalid multicast group)
-33     H_VL_PARM (invalid virtual lane)
-34     H_TSIZE_PARM (invalid trace size)
-35     H_TRACE_PARM (invalid trace buffer)
-36     H_TRACE_PARM (invalid trace buffer)
-37     H_MASK_PARM (invalid mask value)
-38     H_MCG_FULL (multicast attachments exceeded)
-39     H_ALIAS_EXIST (alias QP already defined)
-40     H_P_COUNTER (invalid counter specification)
-41     H_TABLE_FULL (resource page table full)
-42     H_ALT_TABLE (alternate table already exists / alternate page table not available)
-43     H_MR_CONDITION (invalid memory region condition)
-44     H_NOT_ENOUGH_RESOURCES (insufficient resources)
-45     H_R_STATE (invalid resource state condition or sequencing error)
-46     H_RESCINDED
-54     H_Aborted
-55     H_P2
-56     H_P3
-57     H_P4
-58     H_P5
-59     H_P6
-60     H_P7
-61     H_P8
-62     H_P9
-63     H_NOOP
-64     H_TOO_BIG
-65     Reserved
-66     Reserved
-67     H_UNSUPPORTED (Parameter value outside of the range supported by this implementation)
-68     H_OVERLAP (unsupported overlap among passed buffer areas)
-69     H_INTERRUPT (Interrupt specification is invalid)
-70     H_BAD_DATA (uncorrectable data error)
-71     H_NOT_ACTIVE (Not associated with an active operation)
-72     H_SG_LIST (A scatter/gather list element is invalid)
-73     H_OP_MODE (There is a conflict between the subcommand and the requested operation notification)
-74     H_COP_HW (co-processor hardware error)
-75     H_STATE (invalid state)
-76     H_RESERVED (a reserved value was specified)
-77     H_IN_USE (a specified resource is already in use)
-78 to -255     Reserved
-256 to -511    H_UNSUPPORTED_FLAG (An unsupported binary flag bit was specified. The returned value is -256 minus the bit position of the unsupported flag bit, where the high order flag bit is bit 0.)
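Two entries in the return code table carry arithmetic that a caller may want to decode: the H_LongBusyOrder codes (9900 to 9905) step the suggested wait by a factor of ten starting at 1 millisecond, and an H_UNSUPPORTED_FLAG value encodes the offending flag bit as -256 minus the bit position. The sketch below is one possible caller-side treatment; the function names, the `hcall` callable and the retry limit are assumptions made for illustration only.

```python
import time

H_BUSY = 1
H_LONG_BUSY_FIRST = 9900   # H_LongBusyOrder1msec
H_LONG_BUSY_LAST = 9905    # H_LongBusyOrder100sec


def retry_hint_ms(rc):
    """Return the suggested delay in milliseconds, or None if rc is not a busy code."""
    if rc == H_BUSY:
        return 0                                   # retry later, no hint given
    if H_LONG_BUSY_FIRST <= rc <= H_LONG_BUSY_LAST:
        return 10 ** (rc - H_LONG_BUSY_FIRST)      # 1 ms, 10 ms, ... 100 s
    return None


def unsupported_flag_bit(rc):
    """Return the flag bit position encoded in an H_UNSUPPORTED_FLAG code, else None."""
    if -511 <= rc <= -256:
        return -256 - rc
    return None


def call_with_retry(hcall, max_attempts=5, sleep=time.sleep):
    """Invoke a hypothetical zero-argument hcall wrapper, honouring busy hints."""
    rc = None
    for _ in range(max_attempts):
        rc = hcall()
        hint = retry_hint_ms(rc)
        if hint is None:
            return rc                              # success or a non-busy error
        sleep(hint / 1000.0)
    return rc                                      # still busy after max_attempts
```

As a consistency check, the value -312 that appears in the H_REMOVE semantics decodes to bit 56, which is the Compression flag in the flags field definition.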

Hypervisor Call Functions

Page Frame Table Access All hypervisor Page Frame Table (PFT) access routines are called using 64 bit linkage conventions and apply to all page sizes that the platform supports as specified by the “ibm,processor-page-sizes” property. (See for more details.) The Page actual size is encoded in the PFT entry per the architecture Book IIIs along with the segment base page size per the Book IVa. The hypervisor PFT access functions carefully update a given Page Table Entry (PTE) with at least 64 bit store operations since an invalid update sequence could result in machine checks. To guard against multiple conflicting allocations of a PTE that could result in a check stop condition, the hypervisor PTE allocation routine (H_ENTER) reserves the first two (high order) software PTE bits for use as PTE locks while the low order two software PTE bits are reserved for OS use (not used by firmware). If a firmware PTE bit is on, the OS is to assume that the PTE is in use, just as if the V bit were on. The hypervisor PFT access routines often execute the tlbie instruction, on certain platforms, this instruction may only be executed by one processor in a partition at a time, the hypervisor uses locks to assure this. The tlbie instruction flushes a specific translate lookaside buffer (TLB) entry from all processors participating in the protocol. All the processors participating in the tlbie protocol are defined as a translation domain. All processors of a given partition that are in a given translation domain share the same hardware PFT. Book III of the PA specifies the codes sequences needed to safely access the PFT, in its chapter titled “Storage Control Instructions and Table Updates”. These code sequences are part of this specification by reference. 
The hypervisor PFT access routines are in the critical performance path of the machine, therefore, extraordinary care must be given to their performance, including machine dependent coding, minimal run time checking, and code path length optimization. For performance reasons, all parameter linkage is through registers, and no indirect parameter linkage is allowed. This requires special glue code on the part of the caller to pick up the return parameters. The hypervisor PFT access routines modify the calling processor’s partition PFT on the calling node. On NUMA systems, if an LPAR partition spans multiple Central Electronics Complexes (CECs), the partition’s processors may be in separate translation domains. Each platform translation domain has a separate PFT. Therefore, the partition’s OS must modify each PFT individually. This is done either by making hcall() accesses specifying the NUMA CEC Cookie (which identifies the translation domain) in the high order 16 bits of the flags parameter (H_ENTER and H_READ only) or by issuing the hcall() from a processor within the translation domain as identified by the processor’s NUMA CEC Cookie field of the “ibm,pft-size” property. The PFT is preallocated based upon the value of the partition’s PFT_size configuration variable. This configuration variable is initialized to 4 PTEs per node local page frame and 2 PTEs per remote node page frame. The size of the PFT per node is communicated to the partition’s OS image via the “ibm,pft-size” property of the node. The value of the configuration variable PFT_size consists of two comma separated integers, the first is the number of hardware PFT entries to allocate per CEC local page, and the second is the number of hardware PFT entries to allocate per remote CEC page (if NUMA configured). 
These allocations are made at partition boot time based upon the initial partition memory allocation, based upon specific situations (such as low page table usage or future need for dynamic memory addition) the OS may wish to override the platform default values. R1--1. For the LPAR option: The platform must allocate the partition’s page frame table. The size of this table is determined by the PFT_size configuration variable in the OS image’s “common” NVRAM partition. R1--2. For the LPAR option: The platform must provide the “ibm,pft-size” property in the processor nodes of the device tree (children of type cpu of the /cpus node). Register Linkage (For hcall() tokens 0x04 - 0x18) On Call: R3 function call token R4 flags (see ) R5 Page Table Entry Index (PTEX) R6 Page Table Entry High word (PTEH) (on H_ENTER only) R7 Page Table Entry Low word (PTEL) (on H_ENTER only) On Return: R3 Status Word R4 chosen PTEX (from H_ENTER) / High Order Half of old PTE R5 Low Order Half of old PTE R6 Semantics checks for all hypervisor PTE access routines: Hypervisor checks that the caller was in privileged mode or H_Privilege return code. On NUMA platforms for the H_ENTER and H_READ calls only, the hypervisor checks that the NUMA CEC Cookie is within the range of values assigned to the partition else return H_Parameter. Hypervisor checks that the PTEX is zero or greater and less than the partition maximum, else H_Parameter return code. Hypervisor checks the logical address contained in any PTE to be entered into the PFT to insure that it is valid and then translates the logical address into the assigned physical address. When hypervisor returns the contents of a PTE, the contents of the RPN are usually architecturally undefined. It is expected that hypervisor implementations leave the contents of this field as it was read from the PTE since it cannot be used by the OS to directly access real memory. 
The exception to this rule is when the R-XLATE flag is specified to the H_READ hcall(), then the RPN in the PTE is reverse translated into the LPN prior to return. Logical addressing: LPAR adds another level of virtual address translation managed by the hypervisor. The OS is never allowed to use the physical address of its memory this includes System Memory, MMIO space, NVRAM etc. The OS sees System Memory as N regions of contiguous logical memory. Each logical region is mapped by the hypervisor into a corresponding block of contiguous physical memory on a specific node. All regions on a specific system are the same size though different systems with different amount of memory may have different region sizes since they are the quantum of memory allocation to partitions. That is, partitions are granted memory in region size chunks and if a partition’s OS gives up memory, it is in units of a full region. On NUMA platforms, groups of regions may be associated with groups of processors forming logical CECs for allocation and migration purposes. Logical addresses are divided into two fields, the logical region identifier and the region offset. The region offset is the low order bits needed to represent the region size. The logical region identifier are the remaining high order bits. Logical addresses start at zero. When control is initially passed to the OS from the platform, the first region is the single RMA. The first region has logical region identifier of zero. This first region is specified by the first address - length pair of the “reg” property of the /memory node of the OF device tree. Subsequent regions each have their own address - length pair. At initial program load time, the logical region identifiers are sequential starting at zero but over time, with dynamic memory reconfiguration, holes may appear in the partition’s address space. 
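The split of a logical address into its two fields can be sketched as follows. This assumes the region size is a power of two, which the text implies by describing the region offset as the low order bits needed to represent the region size; the function name is illustrative.

```python
def split_logical_address(logical_addr, region_size):
    """Return (logical region identifier, region offset) for a logical address.

    Assumption: region_size is a power of two, so the offset is exactly the
    low order log2(region_size) bits and the identifier is everything above.
    """
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region_size must be a power of two")
    if logical_addr < 0:
        raise ValueError("logical addresses start at zero")
    offset_bits = region_size.bit_length() - 1
    return logical_addr >> offset_bits, logical_addr & (region_size - 1)
```

With a 256 MB region size (0x10000000), the logical address 0x12345678 falls in logical region 1 at offset 0x2345678, and every address below 0x10000000 falls in region 0, the initial RMA.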
Logical to physical translation: This translation is based upon a simple indexed table per partition of the physical addresses associated with the start of each region (in logical region identifier order). At least two special values are recognized:

    The invalid value for those regions that do not have a physical mapping (so that there can be holes in the logical address map for various reasons such as memory expansion).
    The I/O region value, that calls for further checking against partition I/O address range allocations.

The logical region identifier is checked for being less than the maximum size, and then used to index the logical to physical translation table.

    If the physical region identifier is valid (certain values are reserved, say 0 and all F’s), then it replaces the logical region identifier in the PTE and the PTE access function continues.
    If the physical region identifier is the I/O region, then proceed to the I/O translation algorithm (implementation dependent based upon platform characteristics).
    If the physical region identifier is invalid, return H_Parameter.

R1--3. For the LPAR option: The OS must make no assumptions about the logical to physical mapping other than the low order bits.

R1--4. For the LPAR option: Each logical region must have its own address - length pair in the “reg” property of the OF /memory node.

R1--5. For the LPAR option: When control is initially passed to the OS from the platform, the first logical region (having logical region identifier 0) must be the region accessed when the OS operates with translate off.

R1--6. For the LPAR option: When control is initially passed to the OS from the platform, the size of the logical region must be equal to a real mode length size supported by the platform.

R1--7. For the LPAR option: Each logical region must start and end on a boundary of the largest page size that the logical region supports (see “ibm,dynamic-memory” and “ibm,lmb-page-sizes” in as well as for more details).

R1--8. For the LPAR option: The pages that contain the platform’s per processor interrupt management areas or any other device marked “used-by-rtas” must not be mapped into the partition virtual address space.

R1--9. For the LPAR option: Each logical region must support all page sizes presented in the “ibm,processor-page-sizes” property in that are less than or equal to the size of the logical region as specified by either the OF standard “reg” property of the logical region’s OF /memory node, or the “ibm,lmb-size” property of the logical region’s /ibm,dynamic-reconfiguration-memory node in .

Implementation Note: 32 bit versions of AIX only support 36 bit logical address memory spaces. Providing such a partition with a larger logical memory address space may cause OS failures.

Implementation Note: Requirement may be met by ensuring that all logical regions start and end on a boundary of the largest page size supported by the platform.
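The logical to physical translation steps described above can be sketched as a table lookup. The sentinel encodings (0 for the invalid value, all F’s for the I/O region value), the table layout and the `io_translate` callback are assumptions chosen for illustration; the architecture only says that certain values are reserved and leaves the I/O algorithm implementation dependent.

```python
H_SUCCESS = 0
H_PARAMETER = -4

INVALID_REGION = 0                 # assumed encoding of the "invalid" value
IO_REGION = 0xFFFFFFFFFFFFFFFF     # assumed encoding of the "I/O region" value


def translate(logical_addr, region_table, region_size, io_translate):
    """Translate a logical address; returns (status, physical address or None).

    region_table[i] holds the physical start address of logical region i,
    or one of the two sentinels above. io_translate is a hypothetical
    platform-specific callback used for the I/O region case.
    """
    if logical_addr < 0:
        return H_PARAMETER, None
    region_id, offset = divmod(logical_addr, region_size)
    if region_id >= len(region_table):         # identifier beyond maximum size
        return H_PARAMETER, None
    entry = region_table[region_id]
    if entry == INVALID_REGION:                # hole in the logical address map
        return H_PARAMETER, None
    if entry == IO_REGION:                     # defer to I/O range checking
        return io_translate(logical_addr)
    return H_SUCCESS, entry + offset           # low order bits are preserved
```

Only the region offset survives translation unchanged, which is why an OS cannot assume anything about the mapping beyond the low order bits.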

H_REMOVE This hcall is for invalidating an entry in the page table. The PTEX identifies a specific page table entry. If the PFO option is implemented an optional flag causes the hypervisor to compress the page contents to one or more data blocks after invalidating the page table entry given that a compression coprocessor is available and the page is small enough to be synchronously compressed. If the compression coprocessor is busy, or the page is too large, the compression can be subsequently performed using the H_COP_OP hcall() see . If the page contents are compressed, then a checksum may be appended by setting the checksum flag - if the compression flag is not set the checksum flag is ignored. Syntax: Parameters: flags: AVPN, andcond, and for the CMO option: CMO Option flags as defined in and for the PFO option the compression and checksum flags. PTEX (index of the PTE in the page table to be used) AVPN: Optional “Abbreviated Virtual Page Number” -- used as a check for the correct PTE When the AVPN flag is set, the contents of the AVPN parameter are compared to the first double word of the PTE (after bits 57-63 of the PTE have been masked). Note, the low order 7 bits are undefined and should be zero otherwise the likely result is a return code of H_Not_Found. When the andcond flag is set, the contents of the AVPN parameter are bit anded with the first double word of the PTE. If the result is non-zero the return code is H_Not_Found. out: For the PFO option, the output data block logical real address when the compression flag bit is on. outlen: For the PFO option, the length of the compression data block or compression data block descriptor list when the compression flag bit is on. Semantics: Check that the PTEX accesses within the PFT else return H_Parameter If the AVPN flag is set, and the AVPN parameter bits 0-56 do not match that of the specified PTE then return H_Not_Found. 
If the andcond flag is set, the AVPN parameter is bit anded with the first double word of the specified PTE, if the result is non-zero, then return H_Not_Found. The hypervisor Synchronizes the PTE specified by the PTEX and returns its value Use the architected “Deleting a Page Table Entry” sequence such that the first double word of the resultant PFT entry is all 0s. Use the proper tlbie instruction for the page size within a critical section protected by the proper lock (per large page bit in the specified PTE). The synchronized value of the old PTE value ends up in R4 and R5 for return to the caller. For the CMO option: set the page usage state per the CMO Option flags field of the flags parameter as defined in . For the PFO option: If the Compression flag is on: Check that the calling partition is authorized to use the compression co-processor else return H_Function. If the page is not “main store memory” then return H_UNSUPPORTED_FLAG TBD (value - 312) Check that the page size is <= the compression value in “ibm,max-sync-cop” else return H_Constrained. Build CRB for compression of the page size indicated in the PTE If the checksum flag is on command that a checksum be built Verify that the “out” parameter represents a valid logical real address within the caller’s partition else return H_P3 If the “outlen” parameter is non-negative verify that the logical real address of (out + outlen) is a valid logical real address within the same 4K page as the “out” parameter else return H_P4. If the “outlen” parameter is negative: Verify that the absolute value of outlen meet all of the follow else return H_P4: Is <= the value of “ibm,max-sg-len” Is an even multiple of 16 That out + the absolute value of outlen represents a valid logical real address within the same 4K page as the out parameter. 
Verify that each 16 byte scatter gather list entry meets all of the following, else return H_SG_LIST: Verify that the first 8 bytes represent a valid logical real address within the caller’s partition. Verify that the logical real address represented by the sum of the first 8 bytes and the second 8 bytes is a valid logical real address within the same 4K page as the first 8 bytes. For the Shared Logical Resource Option, if any of the memory represented by the out/outlen parameters has been rescinded then return H_RESCINDED. Fill in the destination DDE list from the converted out/outlen parameters. Issue icswx instruction to execute CRB. Check coprocessor busy - retry / return H_PARTIAL if execution time expired / return H_COP_HW if compressor is broken. Wait for coprocessor to complete. If compressor hardware error, return H_COP_HW. Check that the compressor had enough room to house the compressed image, else return H_TOO_BIG. Save compression block size in R6. Return H_Success.
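The out/outlen checks for the PFO option can be sketched as follows. This is a minimal illustration under stated assumptions, not the hypervisor's implementation: the function name, the is_valid_lra callback (standing in for the partition address check), the read_u64 callback (standing in for fetching list entries), and the use of strings for return codes are all hypothetical. The sketch takes the "same 4K page" wording literally.

```python
PAGE_4K = 4096

def same_4k_page(a, b):
    """True when two logical real addresses fall in the same 4K page."""
    return a // PAGE_4K == b // PAGE_4K

def validate_out(out, outlen, max_sg_len, is_valid_lra, read_u64):
    """Return 'OK' or the error name the text pairs with the failed check.

    is_valid_lra(addr) and read_u64(addr) are hypothetical callbacks that
    stand in for the partition address check and for memory access.
    """
    if not is_valid_lra(out):
        return "H_P3"
    if outlen >= 0:
        # Direct buffer: out + outlen must stay within the 4K page of out.
        end = out + outlen
        if not (is_valid_lra(end) and same_4k_page(out, end)):
            return "H_P4"
        return "OK"
    # Negative outlen: out addresses a list of 16 byte scatter gather entries.
    list_len = -outlen
    end = out + list_len
    if list_len > max_sg_len or list_len % 16 != 0:
        return "H_P4"
    if not (is_valid_lra(end) and same_4k_page(out, end)):
        return "H_P4"
    for entry in range(out, end, 16):
        addr = read_u64(entry)        # first 8 bytes: logical real address
        length = read_u64(entry + 8)  # second 8 bytes: length
        if not is_valid_lra(addr):
            return "H_SG_LIST"
        if not (is_valid_lra(addr + length)
                and same_4k_page(addr, addr + length)):
            return "H_SG_LIST"
    return "OK"
```

The same shape applies to the in/inlen checks of H_ENTER, with the parameter error codes shifted to H_P4 and H_P5.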

H_ENTER This hcall adds an entry into the page frame table. PTEH and PTEL contain the new entry. PTEX identifies either the page table entry group or the specific PTE where the entry is to be added, depending upon the setting of the Exact flag. If the Exact flag is off, the hypervisor selects the first free (invalid) PTE in the page table entry group. For pages with sizes less than or equal to 64 K, Flags further provide the option to zero the page, and provide two levels of programmed I-Cache coherence management before activating the page table mapping. This hcall returns the PTE index of the entered mapping. If the PFO option is implemented an optional compression flag causes the hypervisor to initialize the page from one or more compressed data blocks and optionally (checksum flag) check the end to end block data integrity prior to adding the entry to the page table. If the compression flag is not set the checksum flag is ignored. If the Memory Usage Instrumentation (MUI) option is implemented, flags allow for initializing MUI state for the page when the translation is entered. Syntax: Parameters: Flags CEC Cookie Zero Page: Zero the System Memory page in real mode before placing its mapping into the PTE. This flag is ignored for memory mapped I/O space pages; as an attempt to zero missing memory might result in a machine check or worse. This function should use a processor dependent algorithm optimized for maximum performance on the specific hardware. This usually is a sequence of dcbz instructions. Setting this flag for a page with a size larger than 64 K will result in return code of H_TOO_BIG. I-Cache-Invalidate: Issue an icbi etc. instruction sequence to manage the I-Cache coherency of the cachable page. This flag is ignored for memory mapped I/O pages. For use when the D-Cache is known to be clean, before placing its mapping into the PTE. Setting this flag for a page with a size larger than 64 K will result in return code of H_TOO_BIG. 
I-Cache-Synchronize: Issue dcbst and icbi, etc., instruction sequence to manage the I-Cache coherency of the cachable page. This flag is ignored for memory mapped I/O pages. For use when the D-Cache may contain modified data, before placing its mapping into the PTE. Setting this flag for a page with a size larger than 64 K will result in return code of H_TOO_BIG. Exact: Place the entry in the exact PTE specified by PTEX if it is empty else return H_PTEG_FULL. For the CMO option: CMO Option flags as defined in . For the MUI option: The HBA bits specify settings of implementation dependent PTE bits and associated MUI array entries for the page who’s translation is being entered. For the MUI option: The Affinity-Clear and Page-Age-Clear bits clear associated MUI array entries for the page who’s translation is being entered. PTEX (index of the first PTE in the page table entry group to be used for the PTE insertion) PTEH -- the high order 8 bytes of the page table entry. PTEL -- the low order 8 bytes of the page table entry. Semantics: The hypervisor checks that the logical page number is within the bounds of partition allocated memory resources, else returns H_Parameter. If the Shared Logical Resource option is implemented and the logical page number represents a page that has been rescinded by the owner, return H_RESCINDED. The hypervisor checks that the address boundary matches the setting of the input PTE’s large page bits; else return H_Parameter. The hypervisor checks that the page size described by the setting of the input PTE’s page size bits is less than or equal to the largest page size supported by the logical region that is being mapped; else return H_Parameter. The hypervisor checks that the WIMG bits within the PTE are appropriate for the physical page number else H_Parameter return. (For System Memory pages WIMG=0010, or, 1110 if the SAO option is enabled, and for IO pages WIMG=01**.) 
For pages with sizes greater than 64 K, the hypervisor checks that the Zero Page, I-Cache-Invalidate, and I-Cache-Synchronize bits of the Flags parameter are not set; else return H_TOO_BIG. Force off RS mode reserved PTEL bits (bit 1) as well as hypervisor reserved software bits (57 and 58) in PTEH. (Note: In addition, bits 52 and 53 are forced off on platforms that either do not contain an “ibm,processor-storage-keys” property, or contain an “ibm,processor-storage-keys” property with the value of zero in both cells. Bit 0 is forced off on platforms that do not have the “Support for the “110” value of the Page Protection (PP) bits” bit set to a value of 1 in the “ibm,pa-features” property.) If the Exact flag is off, set the low order 3 bits of the PTEX to zero (this ensures that the algorithm stays inside the partition’s PFT and is faster than a check and error code response). If the Zero Page flag is set, use an optimized routine to clear the page (usually a series of dcbz instructions). For the PFO option: if the compression flag is on then: Check that the calling partition is authorized to use the compression co-processor, else return H_Function. If the page is not “main store memory” then return H_UNSUPPORTED_FLAG. Build CRB for decompression. If the checksum flag is on, command that a checksum be verified. Validate the inlen/in parameters and build the source DDE: Verify that the “in” parameter represents a valid logical real address within the caller’s partition, else return H_P4. If the “inlen” parameter is non-negative, verify that the logical real address of (in + inlen) is a valid logical real address within the same 4K page as the “in” parameter, else return H_P5. If the “inlen” parameter is negative: Verify that the absolute value of inlen meets all of the following, else return H_P5: Is <= the value of “ibm,max-sg-len”. Is an even multiple of 16. That in + the absolute value of inlen represents a valid logical real address within the same 4K page as the in parameter.
Verify that each 16 byte scatter gather list entry meets all of the following, else return H_SG_LIST: Verify that the first 8 bytes represent a valid logical real address within the caller’s partition. Verify that the logical real address represented by the sum of the first 8 bytes and the second 8 bytes is a valid logical real address within the same 4K page as the first 8 bytes. Verify that the sum of all the scatter gather length fields (second 8 bytes of each 16 byte entry) is <= the decompression value in “ibm,max-sync-cop”, else return H_TOO_BIG. For the Shared Logical Resource Option, if any of the memory represented by the in/inlen parameters has been rescinded then return H_RESCINDED. Fill in the source DDE list from the converted in/inlen parameters. Build the destination DDE referencing the start of the PTE page with the length of the PTE page size. Issue icswx instruction to execute CRB. Check coprocessor busy - retry / return H_Busy if execution time exhausted / return H_Hardware if compressor is broken. Wait for coprocessor to complete. If compressor ran out of destination space, return H_TOO_BIG. Check that the decompression filled the full page, else return H_Aborted. If the checksum flag is on, check that the data is valid, else return H_BAD_DATA. If hardware error, return H_Hardware. If the I-Cache-Invalidate flag is set, issue icbi instructions for all of the page’s cache lines. If the I-Cache-Synchronize flag is set, issue dcbst and icbi instructions for all of the page’s cache lines. Implementations may need to issue a sync instruction to complete the coherency management of the I-Cache. The hypervisor selects a PTE within the page table entry group using the following.
Algorithm:
if Exact flag is on then set t to 0 else set t to 7
for i=0; i<=t; i++
    Combine page table base, PTEX and offset based on (i) into R3
    R8 <- ldarx PTEH(R3) /* prepare to take a lock on the PTE */
    if PTE is valid (R8 (bit 63) is set) then continue
    if PTE is locked (R8 (bit 57) is set) then continue
    set R8 (bit 57) /* prepare to lock PTE */
    PTEH(R3) <- stdcx R8 /* attempt to take lock */
    if stdcx failed continue
    goto insert
return H_PTEG_FULL
insert:
    use code sequence from PA Book III
    construct return PTEX (R4 <- (R3 - PFTbase) shifted down 4 places)
    For the CMO option: set the page usage state per the CMO Option flags field of the flags parameter as defined in .
    return H_Success
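The slot selection loop can be modelled in a few lines. The following is a single-threaded sketch only: the page table is represented as a hypothetical Python list of PTEH doublewords, the ldarx/stdcx. pair is replaced by a plain read and write, and the function name and string return codes are assumptions. Bit positions follow the text, where bit 63 is the least significant bit.

```python
PTE_VALID = 1 << 0  # PTEH bit 63 (least significant bit)
PTE_LOCK = 1 << 6   # PTEH bit 57 (hypervisor software lock bit)

def select_pte(pteh, pft_base, ptex, exact):
    """Pick a free PTE and lock it; returns (status, ptex of chosen slot).

    pteh is a hypothetical list holding the first doubleword of every PTE,
    indexed by PTEX. pft_base is the byte address of the table.
    """
    if not exact:
        ptex &= ~0x7  # start at the first PTE of the group
    t = 0 if exact else 7
    for i in range(t + 1):
        r3 = pft_base + 16 * (ptex + i)  # byte address of candidate PTE
        index = (r3 - pft_base) >> 4     # PTEX returned to the caller in R4
        r8 = pteh[index]
        if r8 & PTE_VALID:
            continue
        if r8 & PTE_LOCK:
            continue
        # Stand-in for the ldarx/stdcx. sequence; cannot fail in this model.
        pteh[index] = r8 | PTE_LOCK
        return "H_Success", index
    return "H_PTEG_FULL", None
```

With the Exact flag off, a PTEX of 13 is rounded down to 8, so the search covers entries 8 through 15 of the table.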

H_READ This hcall returns the contents of a specific PTE in registers R4 and R5. Syntax: Parameters: flags: CEC Cookie: Cross CEC PFT access. READ_4: Return 4 PTEs. R-XLATE: Include a valid logical page number in the PTE if the valid bit is set, else the contents of the logical page number field are undefined. For the CMO option: CMO Option flags as defined in . PTEX (index of the PTE in the page table to be used -- if the READ_4 flag is set, the low order two bits of the PTEX are forced to zero by the hypervisor to ensure that they are in the range of the PTEG; this is faster than checking.) Semantics: Check that the PTEX is within the defined range of the partition’s PFT, else return H_Parameter. If the READ_4 bit is clear, then load the specified PTE into R4 and R5; if the R-XLATE flag is set, then reverse translate the RPN field into the logical page number. Else clear the two low order bits of the PTEX (faster than checking them) and load the 4 PTEs starting at PTEX into R4 through R11; if the R-XLATE flag is set, then reverse translate the RPN fields into the logical page numbers. For the CMO option: set the page usage state per the CMO Option flags field of the flags parameter as defined in . Set H_Success in R3 and return.

H_CLEAR_MOD This hcall clears the modified bit in the specific PTE. The second double word of the old PTE is returned in R4. Syntax: Parameters: flags: For the CMO option: CMO Option flags as defined in . PTEX (index of the PTE in the page table to be used) Semantics: Check that the PTEX accesses within the PFT, else returns H_Parameter Check that the “V” bit is one, else return H_Not_Found. Fetch the low order double word of the PTE into R4. If the “C” bit is zero, then return H_Success. The hypervisor synchronizes the PTE specified by the PTEX, clears the mod bit, and returns its old value: Use the architected “Modifying a Page Table Entry General Case” sequence from PA Book III. Only PTE bits to be modified are: In double word 0 SW bit 57 and the V bit (63) In double word 1, C bit (56). Use the proper tlbie instruction for the page size (per large page flag within PTE) within a critical section protected by the proper lock. The second double word of the old PTE value ends up in R4. At the point where the new values are to be activated, use the old values with the “C” bit cleared. For the CMO option: set the page usage state per the CMO Option flags field of the flags parameter as defined in . Return H_Success

H_CLEAR_REF This hcall clears the reference bit in the specific PTE. The second double word of the old PTE is returned in R4. Syntax: Parameters: flags: For the CMO option: CMO Option flags as defined in . PTEX (index of the PTE in the page table to be used) Semantics: Check that the PTEX accesses within the PFT, else return H_Parameter. Check that the “V” bit is one, else return H_Not_Found. Only PTE bits to be modified are: In double word 1 the R bit (55) Use the architected “Resetting the Reference Bit” sequence from PA Book III with the original second double word of the PTE ending up in R4. For the CMO option: set the page usage state per the CMO Option flags field of the flags parameter as defined in . Return H_Success

H_PROTECT This hcall sets the page protect bits in the specific PTE. Syntax: Parameters: flags: AVPN, pp0 The pp0 portion of the flags parameter is ignored on platforms that do not have the “Support for the “110” value of the Page Protection (PP) bits” bit set to a value of 1 in the “ibm,pa-features” property. , pp1, pp2, key0-key4 The key0-key4 portion of the flags parameter is ignored on platforms that either do not contain an “ibm,processor-storage-keys” property, or contain an “ibm,processor-storage-keys” property with the value of zero in both cells. , n, and for the CMO option: CMO Option flags as defined in . PTEX (index of the PTE in the page table to be used) AVPN: Optional “Abbreviated Virtual Page Number” -- used as a check for the correct PTE when the AVPN flag is set. Semantics: Check that the PTEX accesses within the PFT, else return H_Parameter Check that the “V” bit is one, else return H_Not_Found. If the AVPN flag is set, and the AVPN parameter bits 0-56 do not match that of the specified PTE, then return H_Not_Found. The hypervisor synchronizes the PTE specified by the PTEX, sets the pp0 The pp0 bit is not modified on platforms that do not have the “Support for the “110” value of the Page Protection (PP) bits” bit set to a value of 1 in the “ibm,pa-features” property. , pp1, pp2, key0-key4 The key0 - key4 bits are not modified on platforms that either do not contain an “ibm,processor-storage-keys” property, or contain an “ibm,processor-storage-keys” property with the value of zero in both cells. , and n bits per the flags parameter. Only PTE bits to be modified are: In double word 0 SW bit 57 and the V bit (63) In double word 1 pp0, pp1, pp2, key0-key4, and n Use the architected “Modifying a Page Table Entry General Case” sequence. Use the proper tlbie instruction for the page size (per value in PTE) within a critical section protected by the proper lock. 
At the point where the new values are to be activated use the old values with the “R” bit cleared and the pp0, pp1, pp2, key0-key4, and n bits set as specified in the flags parameter. For the CMO option: set the page usage state per the CMO Option flags field of the flags parameter as defined in . Return H_Success

H_BULK_REMOVE This hcall is for invalidating up to four entries in the page table. The PTEX in the translation specifier high parameters identifies the specific page table entries. Prototype: Syntax: Translation specifiers: Each is 16 bytes long made up of two 8 byte double words; a translation specifier high and a translation specifier low. Translation Specifier High double word: First byte (0) is a control/status byte: High order two bits (0 and 1) are type code: 00 Unused -- if found stop processing and return H_PARAMETER 01 Request -- Processes As per H_REMOVE as modified by low order two control bits. 10 Response -- written by hypervisor as a return status from processing individual “request” translation specifier 11 End of String -- if found stop processing and return H_Success. Next two bits (2 and 3) are response code (in response to processing an individual “request” translation specifier (type code modified to 10)): 00 Success -- the specified translation was removed as per H_REMOVE with the PTE's RC bits in the next two status bits. 01 Not found -- the specified translation was not found as per H_REMOVE. 10 H_PARM -- one or more of the parameters of the specified translation were invalid per H_REMOVE -- processing of the bulk entries stops at this point and the hypervisor returns H_PARAMETER. 11 H_HW -- The hardware experienced an uncorrected error processing this translation specifier -- processing of the bulk entries stops at this point and the hypervisor returns H_HARDWARE. Next two bits (4 and 5) are the Reference/Change bits from the removed PTE (These bits are only valid if bits 0-3 are 1000): Low order two bits (6 and 7) are request modification flags: 00 absolute -- remove the specified PTEX entry unconditionally 01 andcond -- remove the specified PTEX entry as with the andcond flag of H_REMOVE 10 AVPN -- remove the specified PTEX entry as with the AVPN flag of H_REMOVE 11 not used -- if found stop processing and return H_PARAMETER. 
Bytes 1 through 7 are the PTEX (PFT byte offset divided by 16) Translation Specifier Low double word: Bytes 0 through 7 are the AVPN as per H_REMOVE Semantics: For each translation specifier, while the translation specifier is not “end of string”: Check that the PTEX accesses within the PFT else set H_PARM response status in the specific translation specifier high register and return H_Parameter If the AVPN flag is set, and the AVPN parameter bits 0-56 do not match that of the specified PTE then set response status Not found in the specific translation specifier high register, Continue. If the andcond flag is set, the AVPN parameter is bit anded with the first double word of the specified PTE (after bits 57-63 of the PTE have been masked), if the result is non-zero, then set response status Not found in the specific translation specifier high register, Continue. (Note the low order 7 bits of the AVPN parameter should be zero otherwise the likely result is a response status of Not found). The hypervisor Synchronizes the PTE specified by the PTEX. Use the architected “Deleting a Page Table Entry” sequence. Use the proper tlbie instruction for the page size within a critical section protected by the proper lock (per large page bit in the specified PTE). The synchronized value of the old PTE RC bits ends up in bits 4 and 5 of the individual translation specifier high register along with success response status. return H_Success
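The layout of the translation specifier high double word can be illustrated with a small encoder and decoder. This is a sketch, assuming the bit numbering used in the text (bit 0 is the high order bit); the function names and the dictionary keys are hypothetical and chosen only for the example.

```python
TYPE_UNUSED, TYPE_REQUEST, TYPE_RESPONSE, TYPE_END = 0b00, 0b01, 0b10, 0b11
FLAG_ABSOLUTE, FLAG_ANDCOND, FLAG_AVPN = 0b00, 0b01, 0b10
PTEX_MASK = (1 << 56) - 1  # bytes 1 through 7

def make_request(ptex, flags):
    """Build a 'request' translation specifier high double word."""
    control = (TYPE_REQUEST << 6) | (flags & 0x3)
    return (control << 56) | (ptex & PTEX_MASK)

def decode_ts_high(ts_high):
    """Split a translation specifier high double word into its fields."""
    control = (ts_high >> 56) & 0xFF
    return {
        "type": (control >> 6) & 0x3,      # bits 0 and 1
        "response": (control >> 4) & 0x3,  # bits 2 and 3
        "rc": (control >> 2) & 0x3,        # bits 4 and 5 (Reference/Change)
        "flags": control & 0x3,            # bits 6 and 7
        "ptex": ts_high & PTEX_MASK,
    }
```

A caller would check that the type field reads back as Response and that the response field is Success before trusting the rc field, matching the note that the Reference/Change bits are only valid when bits 0-3 are 1000.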

H_BLOCK_REMOVE This hcall removes up to eight sequential virtual page table entries. Some platforms that support this hcall() might remove fewer than 8 entries for a given actual page size / base page size combination as communicated by the “Block Invalidate Characteristics” system parameter (see ). The virtual pages are all within the same naturally aligned 8 page virtual address block and have the same page and segment size encodings. The AVA parameter, if used, covers the entire block of virtual page addresses. If another processor is currently accessing the page table entry, the entry is not removed. The availability of this hcall() might change after partition migration; the caller should be prepared for an H_Function return code. The PTEX field of the translation specifier parameters identifies the specific page table entries. Syntax: The AVA parameter is the 8 byte AVPN as per H_REMOVE. Each Translation Specifier is 8 bytes long:
H_BLOCK_REMOVE Translation Specifier (Byte 0 through Byte 7, bits 0 through 7 within each byte)
Fields in order from Byte 0: Control | 0 | Reserved | PTEX (PFT byte offset divided by 16)

H_BLOCK_REMOVE Control Byte Format
Control / Description
Type 00: Unused
Type 01: Request
    0  absolute -- remove the specified PTEX entry unconditionally
    1  AVPN -- remove the specified PTEX entry as with the AVPN flag of H_REMOVE
    Page State:
    0 0  Inhibit page usage state change
    0 1  Reserved
    1 0  For CMO option set page usage state to “Unused” if Success
    1 1  For CMO option set page usage state to “Loaned” if Success
Type 10: Response
    0 0 0  Success -- the specified translation was removed as per H_REMOVE with the PTE's RC bits in the next two status bits.
    0 0 1  Not found -- the specified translation was either not found as per H_REMOVE, Invalid (V bit = 0), or entry was “bolted” (PTE bit 59 = 1)
    0 1 0  H_PARM Parameter is invalid
    0 1 1  Inconsistent Page/Segment Size (does not match the L||LP and B fields of the block anchor Page Table Entry)
    1 0 0  Busy (The page table entry is being modified by another processor)
    1 0 1  Cross Boundary (The page table entry crosses an 8 page virtual address boundary)
    1 1 0  Beyond Capacity (The page table entry exceeded the number supported on this platform)
    1 1 1  The hardware experienced an uncorrected error processing this translation specifier -- processing of the bulk entries stops at this point and the hypervisor returns H_HARDWARE.
    R  Reference bit from the removed PTE (bit is only valid if bits 0-4 are 10000)
    C  Change bit from the removed PTE (bit is only valid if bits 0-4 are 10000)
Type 11: End of String -- if found stop processing and return the accumulated return code.

Semantics: Initialize return code to H_Success (overwritten if appropriate) Initialize “anchored” flag to false and PTECOUNT to zero. For each translation specifier, while the translation specifier type is not “end of string”: If the translation specifer type is not “Request” Return H_PARAMETER. Check that the PTEX accesses within the PFT else set H_PARM response status in the specific translation specifier high register set return code to H_PARTIAL and Continue. If the lock for the associated page table entry can not be immediately obtained, then set the TSn response code to “Busy”, set return code to H_PARTIAL and Continue. If the PTEX specified entry is either invalid (PTE V bit = 0) or “bolted” (PTE bit 59 = 1) then set response status “Not found” in the specific translation specifier high register, set return code to H_PARTIAL and Continue. Check that actual page size / base page size combination of the PTEX specified entry is supported for H_BLOCK_REMOVE as communicated by the “Block Invalidate Characteristics” system parameter else set H_PARM response status in the specific translation specifier high register set return code to H_PARTIAL and Continue. If the AVPN flag is set, and the AVPN parameter bits 0-56 do not match that of the specified PTE then set response status “Not found” in the specific translation specifier high register, set return code to H_PARTIAL and Continue. If NOT Anchored: then: Establish the block L||LP Establish the block segment size encoding For the CMO option: if the TS Control byte Page State bits are a reserved value then set H_PARM response status in the specific translation specifier high register, set return code to H_PARTIAL and Continue; else If the block segment encoding is an MPSS segment then set the page usage state for the large page per the CMO Page State bits of the TS Control byte; else set the page usage state per the CMO Page State bits of the TS specified page per the TS Control byte. 
Establish the block plus high order virtual address Establish the number of TLBs that the platform can invalidate in one operation from the associated page table entry Set the “anchored” flag to true; else: If the associated page table entry L||LP and segment size encoding does not match the established anchored values then set the TSn response code to “Inconsistent Page/Segment Size“, set return code to H_PARTIAL and Continue. If the associated page table entry high order virtual address of the 8 page block does not match the established anchored values then set the TSn response code to “Cross Boundary“, set return code to H_PARTIAL and Continue. If PTECOUNT is greater than the number of TLBs that the platform can invalidate in one operation, then set the TSn response code to “Beyond Capacity“, set return code to H_PARTIAL and Continue. For the CMO option: if the TS Control byte Page State bits are a reserved value then set H_PARM response status in the specific translation specifier high register, set return code to H_PARTIAL and Continue; else if the block segment size encoding is not MPSS then set the page usage state per the CMO Page State bits of the TS Control byte. Add the PTEX to the validated list of PTEX’s to be removed Increment PTECOUNT The hypervisor resets the valid bit in the PTEs specified by the validated list of PTEX’s to be removed. The hypervisor issues a single instance of the PTE Synchronization sequence outlined in the architecture Book IIIS under “Deleting a Page Table Entry” using the proper tlbie instruction for the page size within a critical section protected by the proper lock (per large page bit in the specified PTE) to cover all the PTEs specified by the validated list of PTEX’s to be removed. The synchronized value of the old PTE RC bits, for the PTEs specified by the validated list of PTEX’s to be removed, ends up in bits 5 and 6 of the individual translation specifier high register along with success response status. 
Release acquired page table entry locks. Return the accumulated return code and TS values.

Hash Page Table Resize Option The hash page table (HPT) for an operating system needs to be sized depending on the size of the partition's memory. The usual rule of thumb is that the HPT should be 1/64th of the size of memory (although Linux will typically work well with 1/128th or even less depending on available page sizes). An HPT which is too small will lead to poor performance, or even crashes, if the OS is unable to fit necessary bolted mappings into the table. An HPT which is too large wastes memory and leads to slower TLB misses due to increased cache misses on table walks. With the size of the HPT fixed at boot, a partition allowing memory reconfiguration must size the HPT according to the partition's maximum possible memory size. If the partition has a very large potential maximum memory size, but is unlikely to reach that in practice, this can lead to significant wastage of resources in the oversized hash table. By allowing a partition to change its HPT size at runtime, it can start with an HPT sized just for its initial memory, and change it if necessary when more memory is added dynamically. If the platform supports the Hash Page Table Resize Option, then it supports the two hcalls defined in this section, which allow an OS to request that its HPT should be resized. The resize operation is performed in two phases, the “prepare” phase and the “commit” phase. The prepare phase may take place concurrently with normal guest operation. The commit phase requires that the guest perform no concurrent updates or accesses to the HPT (which in practice requires no non-real mode memory accesses). During operation a partition may have a “Pending HPT”, a block of platform memory organized as a Power hash page table which may become the partition's HPT in future. The following data are associated with a Pending HPT: Does a Pending HPT currently exist? 
The Pending HPT's size Flags associated with the Pending HPT (this is for future extension, no flags are currently defined) Whether the Pending HPT is fully prepared or not In order to prevent a partition from tying up platform resources indefinitely with a Pending HPT, the platform is permitted to discard a Pending HPT at any time. Operating systems should be prepared to deal with a failure of either hcall because of this. The platform is permitted to start a partition with its HPT sized for the current memory allocation, rather than the maximum memory for the partition, provided that if the OS indicates via the ibm,client-architecture-support call that it does not support HPT resizing, the platform must then resize the HPT according to the partition's maximum memory, using a reconfiguration reboot if necessary.
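The sizing rule of thumb can be turned into a shift value for a resize request. The following is a sketch under stated assumptions: the function name is hypothetical, rounding the target up to the next power of two is a choice made for the example, and the clamp to the range 18 through 46 reflects the range accepted by the shift parameter of the resize hcalls.

```python
MIN_SHIFT = 18
MAX_SHIFT = 46

def hpt_shift_for(mem_bytes, ratio=64):
    """Suggest log2 of the HPT size for a partition with mem_bytes of memory.

    ratio=64 follows the 1/64th rule of thumb; ratio=128 models the smaller
    table that the text says often works well for Linux.
    """
    target = max(1, mem_bytes // ratio)
    shift = (target - 1).bit_length()  # ceil(log2(target))
    return min(MAX_SHIFT, max(MIN_SHIFT, shift))
```

An OS could recompute this value after a dynamic memory add and start a resize only when the result differs from the shift of its current HPT.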

H_RESIZE_HPT_PREPARE This hcall controls the prepare phase of HPT resizing. After successful completion of this hcall, the partition has a Pending HPT which can be made the partition's current HPT. Syntax: Parameters: flags: 0, as no flags are currently defined. shift: Log base 2 of the total size in bytes of the requested new HPT, either 0 (used to discard a Pending HPT) or else between 18 and 46.
Semantics:
Check that the partition is permitted to resize its HPT, else return H_AUTHORITY.
Check if there is a Pending HPT; if there is, then:
    If the Pending HPT size and flags match the parameters requested in this call, then:
        If the Pending HPT is not fully prepared, return H_LONG_BUSY_xxx with an estimate of the time remaining to complete preparation of the Pending HPT.
        If preparation of the Pending HPT has terminated due to two bolted HPTEs needing to occupy the same slot of the same HPTEG, then return H_PTEG_FULL.
        Else return H_SUCCESS.
    Else discard the Pending HPT and continue.
If the flags parameter is non-zero, then return H_PARAMETER.
If shift is zero, then return H_SUCCESS.
If (shift < 18) or (shift > 46), then return H_PARAMETER.
Check that the partition is permitted to have an HPT of 2^shift bytes; if not, return H_RESOURCE.
Create a new Pending HPT of size 2^shift bytes. Preparation of the new HPT may continue asynchronously.
If the Pending HPT is not fully prepared, return H_LONG_BUSY_xxx with an estimate of the time remaining to complete preparation. Else return H_SUCCESS.
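The decision flow of the prepare phase can be modelled as a small state machine. This is an illustrative sketch, not platform code: the Partition class, its fields, the three preparation states, and the string return codes are assumptions. The model also assumes that a newly created Pending HPT is never fully prepared at the moment of creation, so the creating call always reports H_LONG_BUSY_xxx.

```python
class Partition:
    """Hypothetical container for the Pending HPT state of one partition."""
    def __init__(self, may_resize=True, max_shift=46):
        self.may_resize = may_resize
        self.max_shift = max_shift  # largest HPT this partition may have
        self.pending = None         # None, or dict with shift/flags/state

def h_resize_hpt_prepare(part, flags, shift):
    if not part.may_resize:
        return "H_AUTHORITY"
    pending = part.pending
    if pending is not None:
        if pending["shift"] == shift and pending["flags"] == flags:
            # state is one of "preparing", "collision", "ready"
            if pending["state"] == "preparing":
                return "H_LONG_BUSY_xxx"
            if pending["state"] == "collision":
                return "H_PTEG_FULL"
            return "H_SUCCESS"
        part.pending = None  # mismatch: discard and continue
    if flags != 0:
        return "H_PARAMETER"
    if shift == 0:
        return "H_SUCCESS"   # request was only to discard a Pending HPT
    if shift < 18 or shift > 46:
        return "H_PARAMETER"
    if shift > part.max_shift:
        return "H_RESOURCE"
    part.pending = {"shift": shift, "flags": flags, "state": "preparing"}
    return "H_LONG_BUSY_xxx"
```

A guest would typically repeat the call with the same parameters until it stops returning a long busy code, and treat any other failure as a reason to abandon or restart the resize, since the platform may discard a Pending HPT at any time.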

H_RESIZE_HPT_COMMIT This hcall executes the commit phase of HPT resizing, making the Pending HPT the partition's current HPT. The caller must ensure that while it is executing, none of the partition's virtual CPU threads will update or access the HPT; that is, all threads must be executing in real mode, or stopped. Syntax: Parameters: flags: 0, as no flags are currently defined. shift: Log base 2 of the total size in bytes of the requested new HPT, between 18 and 46. Semantics: Check that the partition is permitted to resize its HPT, else return H_AUTHORITY. Check if there is a Pending HPT; if there is not, then return H_CLOSED. Check that the flags parameter is zero and the shift parameter matches the size of the Pending HPT; if not, then return H_CLOSED. Check that the Pending HPT is fully prepared; if not, return H_BUSY. If preparation of the Pending HPT was terminated due to finding two bolted HPTEs that need to occupy the same slot of the same HPTEG, then return H_PTEG_FULL. Ensure that all bolted HPTEs from the partition's existing HPT also exist, correctly hashed, in the Pending HPT. HPTEs transferred from the existing HPT must have the same slot within their HPTEG in the Pending HPT as they did in the existing HPT. If the Pending HPT is smaller than the existing HPT, it is possible that two bolted HPTEs that are in the same slot of two separate HPTEGs in the existing HPT need to be put into the same HPTEG in the Pending HPT. If this occurs, return H_PTEG_FULL. Discard the partition's existing HPT. Make the Pending HPT be the partition's current HPT. Mark the partition as having no Pending HPT. Return H_Success.

Translation Control Entry Access The Translation Control Entry (TCE) access hcall()s take as a parameter the Logical I/O Bus Number (LIOBN), which is the logical bus number value derived from the “ibm,dma-window” property associated with the particular IOA. For the format of the “ibm,dma-window” property, reference .

H_GET_TCE This hcall() returns the contents of a specified Translation Control Entry. Syntax: Parameters: LIOBN (Logical I/O Bus Number for TCE table to be accessed) IOBA (I/O Bus Address for indexing into the TCE table) Semantics: If the LIOBN, or IOBA are outside of the range of calling partition assigned values return H_PARAMETER. If the Shared Logical Resource option is implemented and the LIOBN, or IOBA represents a logical resource that has been rescinded by the owner, return H_RESCINDED. Load R4 with the specified TCE contents. If specified TCE’s Page Mapping and Control bits (see ) specify “Page Fault” then return H_Success Reverse translate the TCE’s RPN field into its logical page number If the logical page number is owned by the calling partition then replace the RPN field of R4 with the logical page number and return H_Success. Logically OR the contents of R4 with 0xFFFFFFFFFFFFF000 placing the result into R4. Return H_PARTIAL.
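The return paths of H_GET_TCE can be sketched as follows. Several details here are assumptions made for the example and are not taken from the text: the TCE table is a hypothetical Python list indexed by IOBA divided by 4 K, the Page Mapping and Control field is taken to be the two low order bits with zero meaning “Page Fault”, and ownership is modelled by a dictionary that maps real page numbers to the caller's logical page numbers.

```python
RPN_MASK = 0xFFFFFFFFFFFFF000
PMC_MASK = 0x3  # assumed position of the Page Mapping and Control bits

def h_get_tce(tce_table, ioba, phys_to_logical):
    """Return (status, R4) for a read of the TCE selected by ioba.

    phys_to_logical maps real page numbers owned by the caller to logical
    page numbers; pages owned by other partitions are absent.
    """
    index = ioba >> 12
    if not 0 <= index < len(tce_table):
        return "H_PARAMETER", None
    r4 = tce_table[index]
    if r4 & PMC_MASK == 0:
        return "H_Success", r4          # "Page Fault": nothing to translate
    lpn = phys_to_logical.get(r4 >> 12)  # reverse translate the RPN field
    if lpn is not None:
        return "H_Success", (lpn << 12) | (r4 & 0xFFF)
    # Page is not owned by the caller: hide the RPN by setting all its bits.
    return "H_PARTIAL", r4 | RPN_MASK
```

The H_PARTIAL path shows why the OR mask is used: the control bits in the low order 12 bits remain visible to the caller while the real page number of a page it does not own is never exposed.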

H_PUT_TCE This hcall() enters the mapping of a single 4 K page into the specified Translation Control Entry. Syntax: Semantics: If the LIOBN or IOBA parameters are outside of the range of calling partition assigned values, return H_PARAMETER. If the Shared Logical Resource option is implemented and the LIOBN or IOBA represents a logical resource that has been rescinded by the owner, return H_RESCINDED. If the Page Mapping and Control field of the TCE is not “Page Fault” (see ), then: if the logical address within the TCE parameter is outside of the range of calling partition assigned values, return H_PARAMETER; else translate the logical address within the TCE parameter into the corresponding physical real address. The hypervisor stores the resultant TCE value in the TCE table entry specified by the LIOBN and IOBA parameters, returning H_Success. (In the “Page Fault” case the RPN remains untranslated.) Software Note: The PA requires the OS to issue a sync instruction preceding the signalling of an IOA to start an I/O operation involving DMA, to guarantee the global visibility of both DMA and TCE data. This hcall() does not include a sync instruction to guarantee global visibility of TCE data and in no way diminishes the requirement for the OS to issue it.

H_STUFF_TCE This hcall() duplicates the mapping of a single 4 K page throughout a contiguous range of Translation Control Entries. It is thus useful for initializing and/or invalidating many entries. To retain interrupt responsiveness, this hcall() should be called with a count parameter of no more than 512; the LoPAR architecture enforces this restriction to aid in client code debug. Syntax: Semantics: If the LIOBN or IOBA is outside of the range of calling partition assigned values, return H_PARAMETER. If the Shared Logical Resource option is implemented and the LIOBN or IOBA represents a logical resource that has been rescinded by the owner, return H_RESCINDED. If the count parameter is greater than 512, then return H_P4. If the count parameter added to the TCE index specified by IOBA is outside of the range of the calling partition assigned values, return H_PARAMETER. If the Page Mapping and Control field of the TCE is not “Page Fault” (see ), then: if the logical address within the TCE parameter is outside of the range of calling partition assigned values, return H_PARAMETER; if the Shared Logical Resource option is implemented and the logical address’s page number represents a page that has been rescinded by the owner, return H_RESCINDED; else translate the logical address within the TCE parameter into the corresponding physical real address. The hypervisor stores the resultant TCE value in the TCE table entries specified by the LIOBN, IOBA and count parameters, returning H_Success. (In the “Page Fault” case the RPN remains untranslated.) Implementation Note: The PA requires the OS to issue a sync instruction preceding the signaling of an IOA to start an I/O operation involving DMA, to guarantee the global visibility of both DMA and TCE data. This hcall() does not include a sync instruction to guarantee global visibility of TCE data and in no way diminishes the requirement for the OS to issue it.
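
Because counts above 512 are rejected, a caller that needs to initialize or invalidate a larger range has to split it into several calls. The following Python sketch shows one way to compute the per-call (IOBA, count) pairs. The helper name is hypothetical, and the sketch assumes that consecutive TCEs map consecutive 4 K pages of I/O bus address space.

```python
TCE_PAGE_SIZE = 4096    # each TCE maps one 4 K page
MAX_STUFF_COUNT = 512   # counts above this are rejected with H_P4


def stuff_tce_chunks(ioba, total):
    """Yield (ioba, count) pairs covering `total` TCEs, at most 512 per call."""
    while total > 0:
        count = min(total, MAX_STUFF_COUNT)
        yield ioba, count
        ioba += count * TCE_PAGE_SIZE   # advance past the entries just covered
        total -= count
```

Each yielded pair would be passed to one H_STUFF_TCE call, together with the same LIOBN and TCE value.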

H_PUT_TCE_INDIRECT This hcall() enters the mapping of up to 512 4 K pages into the specified Translation Control Entry. The LIOBN parameter, if positive, is the cookie (LIOBN) of the specific TCE table to load. For the Multi-TCE Table (MTT) option, if the LIOBN parameter is negative, CNT = the absolute value of LIOBN (up to 128), and the first CNT 8 byte entries of the buffer referenced by the TCE parameter contain the TCE table cookies (LIOBNs) for the various TCE tables to load (up to a maximum of 128 TCE tables). Note: Users of the MTT option that are subject to partition migration should be prepared for the loss of support for the MTT option after partition migration. Syntax: Semantics:
/* Validate the input parameters */
If the LIOBN parameter is non-negative then do
    If the count parameter is > 512 then return H_Parameter.
    If the Shared Logical Resource option is implemented and the LIOBN parameter represents a TCE table that has been rescinded by the owner, return H_RESCINDED.
    If the LIOBN parameter represents a TCE table that is not valid for the calling partition, return H_Parameter.
    Liobns[0] = the LIOBN parameter.
    If the Shared Logical Resource option is implemented and any of the I/O bus address range represented by the IOBA parameter plus count pages within the TCE table represented by the LIOBN parameter represents a rescinded resource, return H_RESCINDED.
    If any of the I/O bus address range represented by the IOBA parameter plus count pages within the TCE table represented by the LIOBN parameter is not valid for the calling partition then return H_Parameter.
end
Else do
    If the MTT Option is not enabled return H_Function.
    If the LIOBN parameter < -128 then return H_Parameter.
    If the sum of the count parameter plus |LIOBN| is > 512 then return H_Parameter.
end
If the Shared Logical Resource option is implemented and the TCE parameter represents a logical page address of a page that has been rescinded by the owner, return H_RESCINDED.
If the TCE parameter represents the logical page address of a page that is not valid for the calling partition, return H_Parameter.
Copy the contents of the page referenced by the TCE parameter to a temporary hypervisor page (Temp) for validation without the potential for caller manipulation.
/* Validate the indirect parameters */
VAL = 0
If the LIOBN parameter is negative then do
    For CNT = 1, |LIOBN|, 1
        T = 8 byte entry Temp[VAL]
        If the Shared Logical Resource option is implemented and T as an LIOBN represents a TCE table that has been rescinded by the owner, return H_RESCINDED.
        If T as an LIOBN represents a TCE table that is not valid for the calling partition, return H_Parameter.
        Liobns[VAL++] = T.
        If the Shared Logical Resource option is implemented and any of the I/O bus address range represented by the IOBA parameter plus count pages within the TCE table represented by “T” as an LIOBN represents a rescinded resource, return H_RESCINDED.
        If any of the I/O bus address range represented by the IOBA parameter plus count pages within the TCE table represented by “T” as an LIOBN is not valid for the calling partition then return H_Parameter.
    loop
end
/* Translate the logical page addresses to physical */
For CNT = 1, count, 1
    T = 8 byte entry Temp[VAL++]
    If the Page Mapping and Control field of the 8 byte entry “T” is not “Page Fault” (see ) then do
        If the Shared Logical Resource option is implemented and the value of “T” as a logical address represents a page that has been rescinded by the owner, then return H_RESCINDED.
        If “T” as a logical address is outside of the range of calling partition assigned values then return H_PARAMETER.
        Translate the logical address within the TCE buffer entry into the corresponding physical real address.
        Temp[CNT - 1] = translated physical real address.
    end
loop
/* Fill the TCE table(s) */
If LIOBN parameter is negative then VAL = |LIOBN| else VAL = 1.
For TABS = 1, VAL, 1
    The TCE table to fill is that referenced by Liobns[VAL] as an LIOBN.
    INDEX = the page index within the TCE table represented by the IOBA parameter.
    For CNT = 1, count, 1
        TCE_TABLE[Liobns[VAL], INDEX++] = Temp[CNT-1]
    Loop
Loop
Return H_Success.
Implementation Note: The PA requires the OS to issue a sync instruction preceding the signaling of an IOA to start an I/O operation involving DMA, to guarantee the global visibility of both DMA and TCE data. This hcall() does not include a sync instruction to guarantee global visibility of TCE data and in no way diminishes the requirement for the OS to issue it.
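
The count and LIOBN limits in the first validation step can be sketched in Python as follows. The function name and the symbolic string return codes are assumptions made for the example. The 512 limit is consistent with a single 4 K buffer page holding 512 entries of 8 bytes each, which in the MTT case are shared between the TCE table cookies and the TCEs themselves.

```python
def check_put_tce_indirect_counts(liobn, count, mtt_enabled=True):
    """Model only the count/LIOBN range checks of H_PUT_TCE_INDIRECT."""
    if liobn >= 0:
        # Single TCE table: the whole buffer page may hold TCEs.
        return "H_Parameter" if count > 512 else "H_Success"
    # Negative LIOBN: the first |LIOBN| buffer entries are TCE table cookies.
    if not mtt_enabled:
        return "H_Function"
    if liobn < -128:
        return "H_Parameter"
    if count + abs(liobn) > 512:
        return "H_Parameter"
    return "H_Success"
```

The ownership and rescinded-resource checks on the tables, the I/O bus address range, and the buffer page are deliberately left out of this sketch.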

Processor Register Hypervisor Resource Access Certain processor registers are architecturally hypervisor resources; in the following cases the hypervisor provides controlled write access services.

H_SET_SPRG0 Syntax: Parameters: value: The value to be written into SPRG0. No parameter checking is done against this value.

H_SET_DABR Note: Implementations reporting compatibility to ISA versions less than 2.07 are required to implement this interface; however, this interface is being deprecated in favor of for newer implementations. Syntax: Semantics: If the platform does not implement the extended DABR facility then: Validate the value parameter, else return H_RESERVED_DABR and the value in the DABR is not changed: The DABR BT bit (Breakpoint Translation) is checked for a value of 1. Else (the platform does implement the extended DABR facility): Load the DABRX register with 0b0011. Place the value parameter into the DABR. Return H_Success.

H_PAGE_INIT Syntax: Parameters: flags: zero, copy, I-Cache-Invalidate, I-Cache-Synchronize, and for the CMO option: CMO Option flags as defined in . destination: The logical address of the start of the page to be initialized. source: The logical address of the start of the page used as the source on a page copy initialization. This parameter is only checked and used if the copy flag is set. Semantics: The logical addresses are checked; they must both point to the start of a 4 K system memory page associated with the partition, else return H_Parameter. If the Shared Logical Resource option is implemented and the source/destination logical page number represents a page that has been rescinded by the owner, return H_RESCINDED. If the zero flag is set, clear the destination page using a platform specific routine (usually a series of dcbz instructions). If the copy flag is set, execute a platform specific optimized copy of the full 4 K page from the source to the destination. If the I-Cache-Invalidate flag is set, issue icbi instructions for all of the page’s cache lines. If the I-Cache-Synchronize flag is set, issue dcbst and icbi instructions for all of the page’s cache lines. Implementations may need to issue a sync instruction to complete the coherency management of the I-Cache. For the CMO option: set the page usage state per the CMO Option flags field of the flags parameter as defined in . Return H_Success. Note: For the CMO option, the CMO option flags may be used to notify the platform of the page usage state of a page without regard to its hardware page table entry, or lack thereof, independent of any other option flags.

H_SET_XDABR Note: Implementations reporting compatibility to ISA versions less than 2.07 are required to implement this interface; however, this interface is being deprecated in favor of for newer implementations. This hcall() provides support for the extended Data Address Breakpoint facility. It sets the contents of the Data Address Breakpoint Register (DABR) and its companion Data Address Breakpoint Register Extension (DABRX). A principal advantage of the extended DABR facility is that it allows setting breakpoints for LPAR addresses that the hypervisor had to preclude using the previous facility. Syntax: Semantics: Validates the extended parameter else return H_Parameter: Reserved Bits (0-59) are zero. The HYP bit (61) is off. The rest of the PRIVM field (Bits 62-63) is one of those supported: 0b01 Problem State 0b10 Privileged non-hypervisor 0b11 Privileged or Problem State (Specifying neither Problem or Privileged state is not supported) Load the validated extended parameter into the DABRX Load the value parameter into the DABR Return H_Success.
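
The validation of the extended parameter can be sketched in Python as below. The sketch assumes the bit numbering used in the text, where bit 0 is the most significant bit of a 64-bit value and bit 63 the least significant; the function name and the symbolic string return codes are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1


def validate_xdabr_extended(extended):
    """Apply the three checks listed for the extended parameter."""
    extended &= MASK64
    if extended >> 4:                # reserved bits 0-59 must be zero
        return "H_Parameter"
    if extended & 0b0100:            # HYP bit (61) must be off
        return "H_Parameter"
    if (extended & 0b0011) == 0:     # PRIVM bits 62-63: 0b00 is not supported
        return "H_Parameter"
    return "H_Success"
```

Bit 60 is not covered by any of the listed checks, so the sketch accepts either value for it.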

H_SET_MODE This hcall() is used to set hypervisor processing resource mode registers such as breakpoints and watchpoints. The modes supported by the hardware processor are a function of the processor architectural level as reported in the “cpu-version” property. presents the valid parameter ranges for the architectural level reported in the “cpu-version” property and the LoPAR architecture level as reported in the “/chosen” property. Setting breakpoints: A breakpoint is set for a hardware thread. Should the hardware thread complete an instruction whose effective address matches that of the set breakpoint, a trace interrupt is signaled. When setting the breakpoint resource, the mflags and value2 parameters are zero. The value1 parameter is the effective address of the breakpoint to be set. Setting watchpoints: A watchpoint is set for a hardware thread. Should the hardware thread attempt to access within the specified double word range of the effective address specified by the value1 parameter, as qualified by the conditions specified in the value2 parameter, a Data Storage type interrupt is signaled. When setting the watchpoint resource, the mflags parameter is zero. The value1 parameter is the effective double word address of the watchpoint to be set. The value2 parameter specifies the qualifying conditions for the access; these are a subset of the POWER ISA conditions that are relevant within the context of a logical partition. This subset includes the MRD field, DW, DR, WT, WTI, PNH, and PRO bits. All other value2 fields are zero. Setting Interrupt Vector Location Modes: The Alternate Interrupt Location (AIL) Mode for the calling partition is set. Since this function has partition wide scope, it may take longer for the hypervisor to perform the function on all processors than is permissible during a synchronous call; therefore, the call might return long busy. In that case the caller should repeat the call with the same parameters after the specified time period until the H_SUCCESS return code is received. A call with different parameters indicates the beginning of a new partition wide mode setting. The desired AIL mode is encoded in the two low order mflags bits (all other mflags bits are 0) while both value1 and value2 parameters are zero. The POWER ISA requires that the setting of the LPCR ILE bit be the same in all partition processors when not in hypervisor mode; thus all partition processors need to be operating with MSR[EE] = 0 when changing LPCR ILE so that the OS can change the contents of the interrupt vectors prior to any interrupts being taken in the new mode. Syntax: Semantics:
H_SET_MODE Parameters per ISA Level
PAPR level | Supported Resource Values | Values Supported: mflags | Value 1 | Value 2 | Comments
2.7 | 1 | None | Breakpoint Address | None |
2.7 | 2 | None | Watchpoint Double Word Address | Watchpoint Qualifying Conditions |
2.7 | 3 | 0 | None | None | IR=DR=0 No offset
2.7 | 3 | 1 | None | None | Reserved
2.7 | 3 | 2 | None | None | IR=DR=1 offset 0x18000
2.7 | 3 | 3 | None | None | IR=DR=1 offset 0xC000 0000 0000 4000
2.7 | 3 | All Others | All Others | All Others | Reserved
2.8 | All Others | All Others | All Others | All Others | Reserved
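
As a reading aid for the AIL rows of the table, the following Python sketch maps an mflags value to the interrupt vector offset it selects. The helper name is hypothetical, and returning None for reserved encodings is a choice made for the example, not behavior defined by the architecture.

```python
# Offsets taken from the table rows for resource 3 (AIL); mflags 1 is reserved.
AIL_OFFSETS = {
    0: 0x0,                   # IR=DR=0, no offset
    2: 0x18000,               # IR=DR=1
    3: 0xC000000000004000,    # IR=DR=1
}


def ail_vector_offset(mflags):
    """Return the vector offset for an AIL request, or None if reserved."""
    if mflags & ~0b11:        # only the two low order mflags bits may be set
        return None
    return AIL_OFFSETS.get(mflags)
```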

R1--1. For implementations supporting POWER ISA level 2.07 and beyond: the platform must implement the H_SET_MODE hcall() per the syntax and semantics of section per the LoPAR level supported.

Implementation Dependent Optimizations All platforms contain implementation specific switches and values that affect the performance of the platform. The default settings for switches and values are tuned during platform development to achieve the desired performance characteristics across a wide range of workloads. However, the performance of specific workloads might be further optimized by adjusting some of these implementation specific switches and values when those workloads are being run. Others of these switches and values might have negative effects on other platform workloads, so those switches and values are protected from modification lest innocent partitions become victims of one of their neighbors. LoPAR version 2.8 and above provide the hcall()s defined below to set a subset of the implementation specific switches and adjust a subset of the tuning values within a range that has been proven to be safe. The caller is expected to understand the switch banks and resources implemented by the specific platform and the functionality of each individual switch and resource. Special consideration is required of the caller of these functions during partition migration and micro-checkpoint/failover operations since the underlying implementation might change. During these operations, the implementation dependent switches and values are set to their default values for the implementation that is receiving the partition. After a migration or failover event the availability of the Implementation Dependent Optimization hcall()s might change, along with the resources and/or switch banks that might be adjusted, and the supported values for those adjustments.

H_ADJUST_RESOURCE This hcall() is used to adjust the value of a given implementation dependent resource in contiguous unit steps between the minimum and maximum supported values. These steps are not necessarily uniform either in physical values set into the implementation dependent resource or the resultant effect they have on workload performance. Syntax: Semantics: If the value of the Resource parameter is zero then return H_PARAMETER. If the value of the Resource parameter is greater than the maximum defined value then return H_RESERVED. If the value of the Resource parameter is not supported on this implementation then return H_UNSUPPORTED. RC = H_Success. Current = current value of the Resource. If Current + Value > max supported value of Resource then { RC = H_Constrained ; Current = max supported value of Resource }. If Current + Value < min supported value of Resource then { RC = H_Constrained ; Current = min supported value of Resource }. Set Resource to Current. On return: R3: Contains Return Code (RC). R4: Contains the Resource value (Current) (Return codes H_Success & H_Constrained). R5: Contains the number of steps to minimum supported resource value. R6: Contains the number of steps to maximum supported resource value.
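
The clamping behavior can be sketched in Python as follows. The sketch assumes that, when neither bound is exceeded, the resource is set to Current + Value (the pseudocode leaves this step implicit), and that the step counts returned in R5 and R6 are the distances from the new value to the supported minimum and maximum. The function name and symbolic return codes are assumptions made for the example.

```python
def adjust_resource(current, value, lo, hi):
    """Model the clamping step; returns (rc, new_value, steps_to_min, steps_to_max)."""
    rc = "H_Success"
    new = current + value            # assumed unconstrained result
    if new > hi:
        rc, new = "H_Constrained", hi
    elif new < lo:
        rc, new = "H_Constrained", lo
    return rc, new, new - lo, hi - new
```

Under these assumptions, a call with a Value of zero leaves the resource unchanged and simply reports the current value and the remaining headroom in each direction.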

H_SET_SWITCHES This hcall() provides for the setting of an implementation dependent subset of the switches in an implementation dependent bank of switches. Syntax: Semantics: If the value of the Bank parameter is zero then return H_PARAMETER. If the value of the Bank parameter is greater than the maximum defined value then return H_RESERVED. If the value of the Bank parameter is not supported on this implementation then return H_UNSUPPORTED. If (Mask & not (Supported-bits-in[Bank]) ) then RC = H_Constrained else RC = H_Success. Turn on the bits in switches[Bank] that are ones in all three of Supported-bits-in[Bank], Mask, and Setting. Turn off the bits in switches[Bank] that are ones in both Supported-bits-in[Bank] and Mask but are zeros in Setting. On return: R3: Contains Return Code (RC). R4: Contains the Mask value representing all switches whose setting is supported for the bank. R5: Contains the Bank value (Return codes H_Success & H_Constrained).
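
The mask arithmetic can be sketched in Python for a single bank. The function name, the symbolic return codes, and the representation of a bank as a non-negative integer are assumptions made for the example.

```python
def set_switches(switches, supported, mask, setting):
    """Model one bank; returns (rc, new_switches, supported_mask)."""
    # Asking to change an unsupported switch is reported but not fatal.
    rc = "H_Constrained" if mask & ~supported else "H_Success"
    switches |= supported & mask & setting       # turn on requested ones
    switches &= ~(supported & mask & ~setting)   # turn off requested zeros
    return rc, switches, supported
```

Switches outside the supported set keep their previous value even when the mask selects them, which is why the first test-style call below leaves the top bit set: set_switches(0b1010, 0b0111, 0b1110, 0b0100) yields the bank value 0b1100 with H_Constrained.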

Debugger Support hcall()s The real mode debugger needs to be able to get to its async port and beyond the real mode limit register without turning on virtual address translation. The following hcall()s provide that capability.

H_LOGICAL_CI_LOAD Syntax: Parameters: size: The size of the cache inhibited load: byte = 1, half = 2, full = 4, double = 8. All other size values are illegal and return H_Parameter. addr: The logical address of the cache inhibited location to be read. The hypervisor checks that the address is within the range of addresses valid for the partition, on a boundary equal to the requested length, is not to the location BA+4 within an interrupt management area, and is mapped as cache inhibited (cache paradoxes are to be avoided), else H_Parameter. On successful return (H_Success), the read value is low order justified in register R4.

H_LOGICAL_CI_STORE Syntax: Parameters: size: The size of the cache inhibited store: byte = 1, half = 2, full = 4, double = 8. All other size values are illegal and return H_Parameter. addr: The logical address of the cache inhibited location to be written. The hypervisor checks that the address is within the range of addresses valid for the partition, on a boundary equal to the requested length, is not to the location BA+4 within an interrupt management area, and is mapped as cache inhibited (cache paradoxes are to be avoided). value: The value to be written is low order justified in register R6.
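
The size and alignment checks shared by H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE, and the meaning of "low order justified", can be sketched in Python as follows. The function names and the optional ima_base argument are assumptions made for the example; the partition range and cache-inhibited mapping checks are not modeled.

```python
VALID_CI_SIZES = (1, 2, 4, 8)   # byte, half, full, double


def ci_access_allowed(size, addr, ima_base=None):
    """Check size, natural alignment, and the BA+4 exclusion."""
    if size not in VALID_CI_SIZES:
        return False
    if addr % size:                  # must sit on a boundary equal to its length
        return False
    if ima_base is not None and addr == ima_base + 4:
        return False                 # BA+4 of an interrupt management area
    return True


def low_order_justify(value, size):
    """Keep only the low order `size` bytes, as the value appears in the register."""
    return value & ((1 << (8 * size)) - 1)
```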

Virtual Terminal Support This section has been moved to . Architecture and Implementation Note: The requirement to provide the “ibm,termno” property in the /rtas node, has been removed (it is now necessary to look for vty nodes and use their unit address from the “reg” property to get the same information). The “ibm,termno” property called for sequential terminal numbers, but with the use of unit addresses from the “reg” property, such is not the case.

Dump Support hcall()s To allow the OS to dump hypervisor data areas in support of field problem diagnosis the hcall-dump support function set contains the H_HYPERVISOR_DATA hcall(). This hcall() is enabled or disabled (default disabled) via the Hardware Management Console. If the hcall-dump function set is disabled an attempt to make a H_HYPERVISOR_DATA hcall() returns H_Function. When the function is enabled, the hcall-dump function set is specified in the “ibm,hypertas-functions” property. The requester calls repeatedly starting with a control value of zero getting back 64 bytes per call and setting the control parameter on the next call to the previous call’s return code until the hcall() returns H_Parameter indicating that all hypervisor data has been dumped. The precise meaning of the sequence of data is implementation dependent. The H_HYPERVISOR_DATA hcall() need only return data in the firmware working storage that is not contained in the PFT or TCE tables since the contents of these tables are available to the OS. Starting with LoPAR Version 2.8 PAPR platforms support the H_CLEAR_HPT hcall() independent of Client Architecture support negotiation.

H_HYPERVISOR_DATA Syntax: Parameters: control: A value passed to establish the progress of the dump. Semantics: If the control value is zero, the data returned is the first segment of the hypervisor’s working storage, with a non-negative return code. If the control value is equal to the return code of the last H_HYPERVISOR_DATA call, and the return code is non-negative, the data returned in R4 through R11 is the next sequential segment of the hypervisor’s working storage. The contents of R4 through R11 are undefined if the return code is negative. Implementation Note: It is expected that the control value is used by the H_HYPERVISOR_DATA routine as an offset into the hypervisor’s data area. For the expected implementation, the hypervisor checks the value of the control parameter to ensure that the resultant pointer is within the hypervisor’s data area, else it returns H_Parameter.
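
The calling sequence can be sketched in Python against a stand-in for the hcall. Everything named here is hypothetical: hcall is any callable that takes the control value and returns (return code, eight 64-bit register values), make_fake_hcall follows the "expected implementation" in which control is a byte offset, and H_PARAMETER is given the value -4 as an assumption for the example.

```python
import struct

H_PARAMETER = -4   # assumed numeric value, used here only as the end marker


def dump_hypervisor_data(hcall):
    """Collect 64 bytes per call until the hcall reports H_PARAMETER."""
    out = bytearray()
    control = 0                              # first call always uses zero
    while True:
        rc, regs = hcall(control)
        if rc < 0:
            if rc != H_PARAMETER:
                raise RuntimeError(f"unexpected return code {rc}")
            return bytes(out)
        out += struct.pack(">8Q", *regs)     # R4..R11, big-endian
        control = rc                         # next control = previous return code


def make_fake_hcall(data):
    """Stand-in hypervisor that treats control as an offset into `data`."""
    def hcall(control):
        if control < 0 or control >= len(data):
            return H_PARAMETER, None
        chunk = data[control:control + 64].ljust(64, b"\0")
        return control + 64, struct.unpack(">8Q", chunk)
    return hcall
```

With a 200-byte data area, the fake returns four segments and the dump is 256 bytes long, the last 56 of which are padding introduced by the stand-in.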

H_CLEAR_HPT This hcall() clears the hash page table (HPT) for a partition in preparation for a restart. The Virtual Real Mode Area and Partition Adjunct mappings are exempted. The performance class of this hcall() is “Terminal”; that is, it is allowed to take as long as it needs to perform the operation in a single call. However, it is also allowed to return H_CONTINUE, at which time the caller needs to make the call again until it receives H_Success, else the partition HPT might be left in an inconsistent state. Nevertheless, the reason for this hcall() is to optimize the performance of this function relative to a series of H_REMOVE calls; therefore, hypervisors are encouraged to perform portions of the function in parallel using as many partition processor threads as is practical. The hypervisor clears the partition’s HPT entries (sets them to invalid) except for those entries mapping the VRMA or a Partition Adjunct, performs a TLBIA on all partition processor threads, and returns H_Success on the calling thread. To avoid translation exceptions from attempts to access pages whose translations are being cleared, all OS processor threads should be operating with MSR[IR] = MSR[DR] = MSR[PR] = 0b0. Any attempt to use one of the HPT access hcall()s (See ) during the clearing process might result in an H_Busy return code, and/or the processor might be pressed into service clearing the HPT. Syntax: Semantics: Disable other HPT access hcall()s. For each HPT entry: if the entry does not map the VRMA or Partition Adjunct, clear the V bit. For each partition processor, perform a TLBIA. Enable other HPT access hcall()s. Return H_Success.
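
The caller's obligation to repeat the call on H_CONTINUE can be sketched in Python. The hcall argument is a hypothetical zero-argument callable standing in for H_CLEAR_HPT, make_fake_clear_hpt is a test double, and the return codes are symbolic strings rather than architected values.

```python
H_SUCCESS = "H_Success"
H_CONTINUE = "H_CONTINUE"


def clear_hpt(hcall):
    """Repeat the call until H_Success; returns the number of calls made."""
    calls = 0
    while True:
        rc = hcall()
        calls += 1
        if rc == H_SUCCESS:
            return calls
        if rc != H_CONTINUE:
            raise RuntimeError(f"unexpected return code {rc}")


def make_fake_clear_hpt(continues):
    """Stand-in that returns H_CONTINUE `continues` times, then H_Success."""
    remaining = [continues]

    def hcall():
        if remaining[0] > 0:
            remaining[0] -= 1
            return H_CONTINUE
        return H_SUCCESS
    return hcall
```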

Interrupt Support hcall()s below describes legacy vs exploitation differences related to the client architecture, device tree and hcalls. The symbol @ESB refers to a logical real address returned from the hcall() . The symbol @TIMA refers to the logical real address in the “reg” property of the External Interrupt Virtualization Node.
XIVE Legacy vs. Exploitation Mode Hypervisor Call Function Table
Legacy | Exploitation Mode
Client Architecture Support Option Vector 5 Byte 23 bits 0-1 undefined or 0b00. /chosen “ibm,architecture-vec” Byte 23 bits 0-1 undefined or 0b00. See ibm,architecture vector 5, byte 23 in for more details. | Client Architecture Support Option Vector 5 Byte 23 bits 0-1 undefined or 0b00. /chosen “ibm,architecture-vec” Byte 23 bits 0-1 value 0b01.
“ibm,get-xive” | hcall()
“ibm,set-xive” | hcall() hcall()
“ibm,int-off” | CI load double to @ESB + 0xD00. See XIVE specification for details on the CI operations.
“ibm,int-on” | CI load double to @ESB + 0xC00
“set-indicator” with indicator 9005 | Same
“get-sensor-state” with indicator 9005 | Same
H_EOI | CI load double to @ESB + 0xC00 if store EOI is not enabled. CI store double to @ESB + 0x400 if store EOI is enabled. CI store byte to @TIMA + 0x11 of pre-interrupt CPPR.
H_CPPR | CI store byte to @TIMA + 0x11 of new CPPR
H_IPI | CI store byte to @ESB + 0x00
H_IPOLL | CI load byte from @TIMA + 0x10
H_XIRR | CI load half word from @TIMA + 0x810
H_XIRR-X | Deprecated
H_VIO_SIGNAL | Same. The following are clarified in the hcall definition: A race between a VIO virtual adapter generating a new interrupt and a H_VIO_SIGNAL() or H_VIOCTL() hcall could have multiple outcomes. H_VIO_SIGNAL()/H_VIOCTL() wins and the interrupt does not occur, or the new interrupt wins and one device interrupt occurs after the H_VIO_SIGNAL()/H_VIOCTL() call returns. Interrupting events that occur while H_VIO_SIGNAL()/H_VIOCTL() has disabled interrupts will never generate an interrupt.
H_VIOCTL with sub functions: DISABLE_ALL_VIO_INTERRUPTS, DISABLE_VIO_INTERRUPT, ENABLE_VIO_INTERRUPT | Same

Hypervisor Call Functions Unique to XIVE Exploitation
Function Description | HCALL
Get the ESB addresses for a LISN |
Assign a target and priority to a LISN |
Get the target and priority assigned to a LISN |
Get the notification management page for a LISN |
Set/Reset an EQ for a target and priority |
Get the EQ set for a target and priority |
Set the OS reporting cache line pair for a target |
Get the OS reporting cache line pair for a target |
Load or store operation on the ESB page for a LISN |
Issue the requested sync |
Reset interrupt state to the initial state |

Injudicious values written to the interrupt source controller may affect innocent partitions. The following hcall()s monitor the architected functions.

H_EOI Software Implementation Note: Issuing more H_EOI calls than actual interrupts may cause undesirable behavior, including but not limited to lost interrupts and excessive phantom interrupts. Syntax: Parameters: xirr: The low order 32 bits is the value to be written into the calling processor’s interrupt management area’s external interrupt request register (xirr). Semantics: If the platform implements the Platform Reserved Interrupt Priority Level Option, and the priority field of the xirr parameter matches one of the reserved interrupt priorities, then return H_Resource. If the value of the xirr parameter is such that the low order 3 bytes (xisr) is one of the interrupt source values assigned to the partition, and the high order xirr byte (cppr) is equal or less favored than the current cppr contents, then the value is written into the calling processor’s xirr, causing the interrupt source controller to signal an “end of interrupt” (EOI) to the specified interrupt source logic; the hypervisor then returns H_Success or H_Hardware (if an unrecoverable hardware error occurred). If the xirr value is not legal, the hypervisor returns H_Parameter. If the Shared Logical Resource option is implemented and the xirr parameter represents a shared logical resource location that has been rescinded by the owner, return H_RESCINDED. If the partition is not in XIVE legacy mode, the hypervisor returns H_Function.
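
The split of the xirr parameter into its cppr and xisr fields, and the legality test described above, can be sketched in Python. The function names are hypothetical. The comparison assumes the priority convention stated for H_XIRR-X, where 0x00 is most favored and 0xFF least favored, so "equal or less favored" is a numerically greater-or-equal value.

```python
def split_xirr(xirr):
    """Return (cppr, xisr) from the low order 32 bits of xirr."""
    xirr &= 0xFFFFFFFF
    return xirr >> 24, xirr & 0x00FFFFFF


def eoi_xirr_is_legal(xirr, assigned_sources, current_cppr):
    """True if the xisr belongs to the partition and the cppr is not more favored."""
    cppr, xisr = split_xirr(xirr)
    return xisr in assigned_sources and cppr >= current_cppr
```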

H_CPPR Syntax: Parameters: cppr: The low order byte is the value to be written into the calling processor’s interrupt management area’s current processor priority register (cppr). Semantics: If the platform implements the Platform Reserved Interrupt Priority Level Option, and the priority field of the xirr parameter matches one of the reserved interrupt priorities then return H_Resource. The value of the cppr parameter is written into the calling processor’s cppr causing the interrupt source controller to reject any interrupt of equal or less favored priority. Then hypervisor returns H_Success or H_Hardware (if an unrecoverable hardware error occurred). If the partition is not in XIVE legacy mode, the Hypervisor returns H_Function.

H_IPI Syntax: Parameters: server#: The server number obtained from the “ibm,ppc-interrupt-server#s” property associated with the processor and/or thread to be interrupted. mfrr: The priority value of the inter-processor interrupt to be signaled. Semantics: If the platform implements the Platform Reserved Interrupt Priority Level Option, and the priority field of the xirr parameter matches one of the reserved interrupt priorities, then return H_Resource. If the value of the server# parameter specifies one of the processors in the calling processor’s partition, then the value in the low order byte of the mfrr parameter is written into the mfrr register (BA+12) of the processor’s interrupt management area, causing that interrupt source controller to signal an “inter-processor interrupt” (IPI) to the processor associated with the specified interrupt management area. The hypervisor then returns H_Success or H_Hardware (if an unrecoverable hardware error occurred). If the server# value is not legal, the hypervisor returns H_Parameter. If the Shared Logical Resource option is implemented and the server# parameter represents a shared logical resource location that has been rescinded by the owner, return H_RESCINDED. If the partition is not in XIVE legacy mode, the hypervisor returns H_Function.

H_IPOLL Syntax: Parameters: server#: The server number obtained from the “ibm,ppc-interrupt-server#s” property associated with the processor and/or thread to be interrupted. Semantics: If the value of the server# parameter specifies one of the processors in the calling processor’s partition, then the hypervisor reads the 4 byte contents of the processor’s interrupt management area port at offset BA+0 into the low order 4 bytes of register R4 and the one byte of the mfrr (BA+12) into the low order byte of R5. Reading these addresses has no side effects and is used to poll for pending interrupts. The hypervisor then returns H_Success or H_Hardware (if an unrecoverable hardware error occurred). If the server# value is not legal, the hypervisor returns H_Parameter. If the Shared Logical Resource option is implemented and the server# parameter represents a shared logical resource location that has been rescinded by the owner, return H_RESCINDED. If the partition is not in XIVE legacy mode, the hypervisor returns H_Function.

H_XIRR / H_XIRR-X These hcall()s provide the same base function; that is, they return the interrupt source number associated with the external interrupt. H_XIRR-X further supplies the time stamp of the interrupt. Legacy implementations implement only H_XIRR, returning H_Function for a call to H_XIRR-X. POWER8 implementations also implement H_XIRR-X. Syntax: Parameters: H_XIRR: no input parameters defined. H_XIRR-X: cppr: the internal current processor priority of the calling virtual processor. Valid values are in the range 0x00 (most favored) to 0xFF (least favored), less those values specified by the “ibm,plat-res-int-priorities” property in the root node. Semantics: The hypervisor reads the 4 byte contents of the processor’s interrupt management area port at offset BA+4 into the low order 4 bytes of register R4. Reading this address has the side effect of accepting the interrupt and raising the current processor priority to that of the accepted interrupt. Place the timestamp when the hypervisor first received the interrupt into R5. The hypervisor then returns H_Success or H_Hardware (if an unrecoverable hardware error occurred). If the partition is not in XIVE legacy mode, the hypervisor returns H_Function.

H_INT_GET_SOURCE_INFO The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical real address of the MMIO page through which the Event State Buffer entry associated with the value of the “lisn” parameter is managed. The initial state of the ESB PQ bits will be the architected off value of 0b01. The “lisn” parameter can come from several different properties or hcalls. For example, the “lisn” parameter value for I/O adapters is passed to a partition through the “interrupts” and “interrupt-map” properties in the device tree node describing the I/O adapter. Alternatively, for inter processor interrupts, the “lisn” parameter is a value the OS chooses from a range of LISNs from the “ibm,xive-lisn-ranges” property. While for platform accelerators, the “lisn” parameter is a value returned by the allocating hcall(), H_ALLOCATE_VAS_WINDOW. Depending upon the specific Logical Interrupt Source, there might be either one or two page addresses assigned to the Logical Interrupt Source as indicated by the returned values of this hcall(). The hcall() returns four values in addition to the return code. The first value is the logical real address of the full function page address, which allows both trigger and reset functions. The second value is either the logical real address of the trigger only page, or the reserved value -1 (all ones). The value of -1 indicates that the Event Source Buffer does not have a trigger only page. See the XIVE specification for more details on the full function page versus the trigger only page. 
Syntax: Parameters: “flags”: Bits 0-63: Reserved “lisn” is per “interrupts”, “interrupt-map”, or “ibm,xive-lisn-ranges” properties, or as returned by the ibm,query-interrupt-source-number RTAS call, or as returned by the H_ALLOCATE_VAS_WINDOW hcall Return Values: R4: “flags”: Bits 0-59: Reserved Bit 60: ESB hcall: ESB hcall==1, hcall() H_INT_ESB must be used for Event State Buffer management Bit 61: LSI: LSI==1, the interrupt associated with the “lisn” is an LSI (Level Sensitive Interrupt); LSI==0, the interrupt associated with the “lisn” is an MSI (Message Signaled Interrupt) Bit 62: Trigger: Trigger==1, the full function page supports trigger Bit 63: Store EOI Supported R5: Logical real address of the full function Event State Buffer management page, -1 if the ESB hcall flag is set to 1. R6: Logical real address of the trigger only Event State Buffer management page, or -1 if the ESB hcall flag is set to 1. R7: Power of 2 page size for the ESB management pages returned in R5 and R6. For example, a 4K page size is represented by the value 12 (4K = 2^12). The minimum page size is 4K. Semantics: Verify that no reserved flag bits are on else return H_Parameter. Verify that a H_INT_RESET is not in progress else return H_State. Validate the “lisn” parameter per the list of interrupt sources allocated to the calling partition else return H_P2. Load R4 with the return flags, setting the reserved bits to 0. Load R5 with the logical real address of the full function Event State Buffer management page. If the associated Event State Buffer has two management pages defined, load the logical real address of the trigger only page into R6, else load R6 with -1. Load R7 with the power of 2 page size of the ESB management pages. Return H_Success.
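The return values of H_INT_GET_SOURCE_INFO can be unpacked as in the following minimal C sketch. It assumes the usual Power convention that bit 0 is the most significant bit of a 64-bit register; the structure and function names (esb_source_info, decode_source_info, has_trigger_only_page) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: IBM bit numbering, bit 0 is the most significant bit. */
static uint64_t ppc_bit(unsigned bit)
{
    assert(bit <= 63);
    return UINT64_C(1) << (63 - bit);
}

/* Hypothetical container for the values returned in R4..R7. */
struct esb_source_info {
    bool     use_esb_hcall;      /* R4 bit 60: H_INT_ESB must be used      */
    bool     is_lsi;             /* R4 bit 61: LSI (1) or MSI (0)          */
    bool     full_page_triggers; /* R4 bit 62: full function page triggers */
    bool     store_eoi;          /* R4 bit 63: Store EOI supported         */
    uint64_t full_page_lra;      /* R5                                     */
    uint64_t trigger_page_lra;   /* R6, all ones (-1) if not present       */
    uint64_t page_size;          /* 2^R7 bytes                             */
};

static struct esb_source_info decode_source_info(uint64_t r4, uint64_t r5,
                                                 uint64_t r6, uint64_t r7)
{
    struct esb_source_info info;

    /* The text above gives 4K (2^12) as the minimum ESB page size. */
    assert(r7 >= 12 && r7 <= 63);

    info.use_esb_hcall      = (r4 & ppc_bit(60)) != 0;
    info.is_lsi             = (r4 & ppc_bit(61)) != 0;
    info.full_page_triggers = (r4 & ppc_bit(62)) != 0;
    info.store_eoi          = (r4 & ppc_bit(63)) != 0;
    info.full_page_lra      = r5;
    info.trigger_page_lra   = r6;
    info.page_size          = UINT64_C(1) << r7;
    return info;
}

/* The reserved value -1 (all ones) means there is no trigger only page. */
static bool has_trigger_only_page(const struct esb_source_info *info)
{
    return info->trigger_page_lra != UINT64_MAX;
}
```

A caller would pass the four registers returned by the hcall() and test use_esb_hcall before attempting to map either page, since both addresses are -1 when that flag is set.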

H_INT_SET_SOURCE_CONFIG The H_INT_SET_SOURCE_CONFIG hcall() is used to assign a Logical Interrupt Source to a target. The Logical Interrupt Source is designated with the “lisn” parameter and the target is designated with the “target” and “priority” parameters. Upon return from the hcall(), no additional interrupts will be directed to the old EQ. The old EQ should be investigated for interrupts that occurred prior to or during the hcall(). Syntax: Parameters: “flags”: Bits 0-61: Reserved Bit 62: setEisn: setEisn==1, set the “eisn” in the EA Bit 63: M: M==1 masks the interrupt source in the hardware interrupt control structure, as defined in Section 3.7 "Processing an EAS" in the XIVE Specification. An interrupt masked by this mechanism will be dropped, but its source state bits will still be set. There is no race-free way of unmasking and restoring the source. Thus this should only be used for interrupts that are also masked at the source, and only in cases where the interrupt is not meant to be used for a long time, for example because no valid target exists for it. “lisn” is per “interrupts”, “interrupt-map”, or “ibm,xive-lisn-ranges” properties, or as returned by the ibm,query-interrupt-source-number RTAS call, or as returned by the H_ALLOCATE_VAS_WINDOW hcall “target” is per “ibm,ppc-interrupt-server#s” or “ibm,ppc-interrupt-gserver#s” “priority” is a valid priority not in “ibm,plat-res-int-priorities” “eisn” is the guest EISN associated with the “lisn” Return Values: None Semantics: Verify that no reserved flag bits are on else return H_Parameter. Verify that a H_INT_RESET is not in progress else return H_State. Validate the “lisn” parameter per the list of interrupt sources allocated to the calling partition else return H_P2. If “priority” is not 0xFF: Validate the “target” parameter per the list of threads allocated to the calling partition else return H_P3. 
If the partition thread count is greater than the hardware thread count, validate the “target” has a corresponding hardware thread else return H_Not_Available. Validate the “priority” parameter is a valid priority and not listed in the “ibm,plat-res-int-priorities” property else return H_P4. Fill the Event Assignment Structure associated with “lisn” with: Block and Event Notification Descriptor Table Index associated with the “target”/“priority” pair. Set the “M” bit to the value of the flags “M” bit. If setEisn==1, store “eisn”. Issue syncs required to ensure all in-flight interrupts are complete. Else (“priority” is 0xFF) reset the Event Assignment Structure associated with “lisn” by: issuing syncs required to ensure all in-flight interrupts are complete, invalidating the Block and End Notification Descriptor Table Index, and resetting the eisn. Return H_Success.
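As an illustration of the “flags” encoding for H_INT_SET_SOURCE_CONFIG, the following C sketch builds the flags word and performs the reserved-bit check that the semantics describe. It assumes bit 0 is the most significant bit of the 64-bit register; the helper names and the XIVE_PRIORITY_DISABLED constant are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: IBM bit numbering, bit 0 is the most significant bit. */
static uint64_t ppc_bit(unsigned bit)
{
    assert(bit <= 63);
    return UINT64_C(1) << (63 - bit);
}

/* Hypothetical name: a priority of 0xFF selects the reset path above. */
#define XIVE_PRIORITY_DISABLED 0xFFu

/* Build the "flags" input: bit 62 is setEisn, bit 63 is M (mask). */
static uint64_t set_source_config_flags(bool set_eisn, bool mask)
{
    uint64_t flags = 0;

    if (set_eisn)
        flags |= ppc_bit(62);
    if (mask)
        flags |= ppc_bit(63);
    return flags;
}

/* Mirror of the first semantic check: bits 0-61 must all be zero. */
static bool source_config_flags_valid(uint64_t flags)
{
    return (flags & ~(ppc_bit(62) | ppc_bit(63))) == 0;
}
```

A caller that wants to retarget a source and install a new EISN would pass set_source_config_flags(true, false); passing XIVE_PRIORITY_DISABLED as the priority instead resets the assignment.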

H_INT_GET_SOURCE_CONFIG The H_INT_GET_SOURCE_CONFIG hcall() is used to determine which target/priority pair is assigned to the specified Logical Interrupt Source. Syntax: Parameters: “flags”: bits 0-63 Reserved “lisn” is per “interrupts”, “interrupt-map”, or “ibm,xive-lisn-ranges” properties, or as returned by the ibm,query-interrupt-source-number RTAS call, or as returned by the H_ALLOCATE_VAS_WINDOW hcall Return Values: R4: Target to which the specified Logical Interrupt Source is assigned, else this is undefined. R5: Priority to which the specified Logical Interrupt Source is assigned, else this is set to 0xFF (disabled). R6: EISN for the specified Logical Interrupt Source (this will be equivalent to the LISN if not changed by H_INT_SET_SOURCE_CONFIG). Semantics: Verify that no reserved flag bits are on else return H_Parameter. Verify that a H_INT_RESET is not in progress else return H_State. Validate the “lisn” parameter per the list of interrupt sources allocated to the calling partition else return H_P2. Load R4 with the target associated with the “lisn” parameter. Load R5 with the priority associated with the “lisn” parameter. Load R6 with the EISN associated with the “lisn” parameter. Return H_Success.

H_INT_GET_QUEUE_INFO The H_INT_GET_QUEUE_INFO hcall() is used to get the logical real address of the notification management page associated with the specified target and priority. Syntax: Parameters: “flags”: bits 0-63 Reserved “target” is per “ibm,ppc-interrupt-server#s” or “ibm,ppc-interrupt-gserver#s” “priority” is a valid priority not in the “ibm,plat-res-int-priorities” Return Values: R4: Logical real address of notification page R5: Power of 2 page size of the notification page Semantics: Verify that no reserved flag bits are on else return H_Parameter. Verify that a H_INT_RESET is not in progress else return H_State. Validate the “target” parameter per the list of threads allocated to the calling partition else return H_P2. If the partition thread count is greater than the hardware thread count, validate the “target” has a corresponding hardware thread else return H_Not_Available. Validate the “priority” parameter is a valid priority and not listed in the “ibm,plat-res-int-priorities” property else return H_P3. Load R4 with the ESn page from the Event Notification Descriptor Table associated with “target” and “priority”. Load R5 with the page size of the ESn page. Return H_Success.

H_INT_SET_QUEUE_CONFIG The H_INT_SET_QUEUE_CONFIG hcall() is used to set or reset an EQ for a given “target” and “priority”. It is also used to set the notification config associated with the EQ. An EQ size of 0 is used to reset the EQ config for a given target and priority. If resetting the EQ config, the END associated with the given “target” and “priority” will be changed to disable queuing. Upon return from the hcall(), no additional interrupts will be directed to the old EQ (if one was set). The old EQ (if one was set) should be investigated for interrupts that occurred prior to or during the hcall(). Syntax: Parameters: “flags”: Bits 0-62: Reserved Bit 63: Unconditional Notify (n) per the XIVE spec “target”: is per “ibm,ppc-interrupt-server#s” or “ibm,ppc-interrupt-gserver#s” “priority”: is a valid priority not in the “ibm,plat-res-int-priorities” “eventQueue”: The logical real address of the start of the EQ “eventQueueSize”: The power of 2 EQ size per “ibm,xive-eq-sizes” Return Values: None Semantics: Verify that no reserved flag bits are on else return H_Parameter. Verify that a H_INT_RESET is not in progress else return H_State. Validate the “target” parameter per the list of threads allocated to the calling partition else return H_P2. If the partition thread count is greater than the hardware thread count, validate the “target” has a corresponding hardware thread else return H_Not_Available. Validate the “priority” parameter is a valid priority and not listed in the “ibm,plat-res-int-priorities” property else return H_P3. Validate the “eventQueueSize” parameter per “ibm,xive-eq-sizes”, else return H_P5. Validate that if “eventQueueSize” is not 0 then the calling partition owns the logical real address in “eventQueue” for the length of “eventQueueSize” else return H_P4. If “Unconditional Notify” = 0, notification is conditioned by the notification page from H_INT_GET_QUEUE_INFO. 
If the “eventQueueSize” is not 0 then: The memory pointed to by “eventQueue” must be zeroed by the OS. The generation bit for the EQ will start at 1. The EQ page offset counter will start at 0. The EQ config will be set to “eventQueue” and “eventQueueSize”. If the “eventQueueSize” is 0 then: The EQ config will be reset. Queuing of interrupts will be disabled. Issue syncs required to ensure all in-flight interrupts are complete. Return H_Success.
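The initial EQ state described above (zeroed memory, generation bit starting at 1, offset counter starting at 0) lets the OS tell new entries from stale ones without a separate producer index. The following C sketch models an OS-side consumer under assumptions that are not stated in this section and should be checked against the XIVE specification: each EQ entry is a 4-byte big-endian word, its most significant bit is the generation bit, its low 31 bits carry the event data, and the generation bit flips each time the queue wraps. All names are hypothetical, and a real consumer would also need the appropriate memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical OS-side view of one event queue. */
struct event_queue {
    const uint8_t *base;         /* EQ memory, zeroed before registration */
    uint32_t       entries;      /* number of 4-byte entries, power of 2  */
    uint32_t       index;        /* next entry to examine, starts at 0    */
    uint32_t       expected_gen; /* generation bit of valid entries       */
};

/* size_shift is the power of 2 EQ size in bytes (the "eventQueueSize"). */
static void eq_init(struct event_queue *eq, const uint8_t *base,
                    unsigned size_shift)
{
    assert(size_shift >= 2 && size_shift < 32);
    eq->base         = base;
    eq->entries      = UINT32_C(1) << (size_shift - 2);
    eq->index        = 0;
    eq->expected_gen = 1;   /* the generation bit for the EQ starts at 1 */
}

/* Assumption: EQ entries are stored big-endian. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns true and stores the 31-bit payload if a new entry is present. */
static bool eq_pop(struct event_queue *eq, uint32_t *data)
{
    uint32_t word = load_be32(eq->base + 4u * eq->index);

    if ((word >> 31) != eq->expected_gen)
        return false;               /* entry not yet written this pass */

    *data = word & 0x7FFFFFFFu;
    eq->index = (eq->index + 1) & (eq->entries - 1);
    if (eq->index == 0)
        eq->expected_gen ^= 1;      /* queue wrapped: generation flips */
    return true;
}
```

Because the queue memory is zeroed and the first pass uses generation 1, an untouched entry can never be mistaken for a valid one, which is why the semantics require the OS to zero the memory before the call.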

H_INT_GET_QUEUE_CONFIG The H_INT_GET_QUEUE_CONFIG hcall() is used to get an EQ and the EQ size for a given target and priority. If requested via the “Debug” flag, this will also return the current generation value and event queue offset. Syntax: Parameters: “flags”: Bits 0-62: Reserved Bit 63: Debug: Return debug data “target”: is per “ibm,ppc-interrupt-server#s” or “ibm,ppc-interrupt-gserver#s” “priority”: is a valid priority not in the “ibm,plat-res-int-priorities” Return Values: R4: “flags”: Bits 0-61: Reserved Bit 62: The value of Event Queue Generation Number (g) per the XIVE spec if “Debug” = 1 Bit 63: The value of Unconditional Notify (n) per the XIVE spec R5: The logical real address of the start of the EQ R6: The power of 2 EQ size per “ibm,xive-eq-sizes” R7: The value of Event Queue Offset Counter per XIVE spec if “Debug” = 1, else 0 Semantics: Verify that no reserved flag bits are on else return H_Parameter. Verify that a H_INT_RESET is not in progress else return H_State. Validate the “target” parameter per the list of threads allocated to the calling partition else return H_P2. If the partition thread count is greater than the hardware thread count, validate the “target” has a corresponding hardware thread else return H_Not_Available. Validate the “priority” parameter is a valid priority and not listed in the “ibm,plat-res-int-priorities” property else return H_P3. Load R4 with the return flags, setting the reserved bits to 0. Load R5 with the logical real address of the EQ associated with “target” and “priority”. Set to -1 if no EQ has been specified. Load R6 with the size of the EQ associated with the “target” and “priority”. Set to 0 if no EQ has been specified. If “Debug” = 1: Load the event queue generation number into the return flags. Load R7 with the event queue offset counter. Use the appropriate hardware facility to get an atomic view of the generation number and offset counter. Return H_Success.

H_INT_SET_OS_REPORTING_LINE The H_INT_SET_OS_REPORTING_LINE hcall() is used to set the reporting cache line pair for the calling thread. The reporting cache lines will contain the OS interrupt context when the OS issues a CI store byte to @TIMA+0xC10 to acknowledge the OS interrupt. The reporting cache lines can be reset by inputting -1 in “reportingLine”. Issuing the CI store byte without reporting cache lines registered will result in the data not being accessible to the OS. Syntax: Parameters: “flags”: bits 0-63 Reserved “reportingLine”: The logical real address of the reporting cache line pair Return Values: None Semantics: Verify that no reserved flag bits are on else return H_Parameter. Verify that a H_INT_RESET is not in progress else return H_State. If the partition thread count is greater than the hardware thread count, validate the “target” has a corresponding hardware thread else return H_Not_Available. If the “reportingLine” is not -1: Validate the calling partition owns the logical real address in “reportingLine” for two cache lines else return H_P2. Validate that the “reportingLine” is cache aligned, else return H_P2. Set the “reportingLine” in the NVT associated with the input “target”. If the “reportingLine” is -1: Reset the NVT’s reporting line. Return H_Success.

H_INT_GET_OS_REPORTING_LINE The H_INT_GET_OS_REPORTING_LINE hcall() is used to get the logical real address of the reporting cache line pair set for the input “target”. If no reporting cache line pair has been set, -1 is returned. Syntax: Parameters: “flags”: bits 0-63 Reserved “target”: is per “ibm,ppc-interrupt-server#s” Return Values: R4: The logical real address of the reporting line if set, else -1 Semantics: Verify that no reserved flag bits are on else return H_Parameter. Verify that a H_INT_RESET is not in progress else return H_State. Validate the “target” parameter per the list of threads allocated the calling partition else return H_P2. Validate the thread indicated by “target” is online else return H_Not_Available. If the partition thread count is greater than the hardware thread count, validate the “target” has a corresponding hardware thread else return H_Not_Available. Load R4 with the logical real address of the reporting line associated with “target”. Load R4 with -1 if no reporting line has been set. Return H_Success.

H_INT_ESB Syntax: Parameters: “flags”: bits 0-62: Reserved bit 63: Store: Store=1, store operation, else load operation “lisn” is per “interrupts”, “interrupt-map”, or “ibm,xive-lisn-ranges” properties, or as returned by the ibm,query-interrupt-source-number RTAS call, or as returned by the H_ALLOCATE_VAS_WINDOW hcall “esbOffset” is the offset into the ESB management page for the load or store operation “storeData” is the data to write for a store operation Return Values: R4: The value of the load if load operation, else -1 Semantics: Verify that no reserved flag bits are on else return H_Parameter. Verify that a H_INT_RESET is not in progress else return H_State. Validate the “lisn” parameter per the list of interrupt sources allocated to the calling partition else return H_P2. Validate the “esbOffset” parameter is valid per the XIVE Spec else return H_P3. If bit 63 of flags is 0 Issue the load operation to the “esbOffset” of the ESB management page associated with “lisn”. Load R4 with the results of the load operation. If bit 63 of flags is 1 Issue the store operation to the “esbOffset” of the ESB management page associated with “lisn”, storing “storeData”. Load R4 with -1. Return H_Success.

H_INT_SYNC The H_INT_SYNC hcall() is used to issue hardware syncs that will ensure any in flight events for the input lisn are in the event queue. Syntax: Parameters: “flags”: bits 0-63 Reserved “lisn” is per “interrupts”, “interrupt-map”, or “ibm,xive-lisn-ranges” properties, or as returned by the ibm,query-interrupt-source-number RTAS call, or as returned by the H_ALLOCATE_VAS_WINDOW hcall Return Values: None Semantics: Verify that no reserved flag bits are on else return H_Parameter. Verify that a H_INT_RESET is not in progress else return H_State. Validate the “lisn” parameter per the list of interrupt sources allocated to the calling partition else return H_P2. Perform the appropriate hardware syncs to ensure any in flight events for the input “lisn” are in the event queue. Return H_Success.

H_INT_RESET The H_INT_RESET hcall() is used to reset all of the partition’s interrupt exploitation structures to their initial state. This means losing all interrupt state previously set via H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG. Syntax: Parameters: “flags”: bits 0-63 Reserved Return Values: None Semantics: Verify that no reserved flag bits are on else return H_Parameter. Block all other exploitation hcalls (they all will return H_State if called while H_INT_RESET is in progress). Verify that no other threads are currently in the middle of an H_INT_RESET for this partition else return H_State. Reset the following: All EAs. All ESB states. All ENDs, specifically including clearing out each END’s EQ pointer and size. All OS Reporting Lines. Unblock all other exploitation hcalls when finished. Return H_Success.

Memory Migration Support hcall()s To assist an OS in memory migration, the following hcall() is provided. During the migration process, it is the responsibility of the OS to not change the DMA mappings referenced by the translations buffer (for example by using the H_GET_TCE, H_PUT_TCE hcall()s, or other DMA mapping hcall()s). Failure of the OS to serialize such DMA mapping access may result in undesirable DMA mappings within the caller’s partition (but not outside of the caller’s partition). Further, it is the responsibility of the OS to serialize calls to the H_MIGRATE_DMA service relative to the logical bus numbers referenced. Failure of the OS to serialize relative to the logical bus numbers may result in DMA data corruption within the caller’s partition. On certain implementations, DMA read operations targeting the old page may still be in process for some time after the H_MIGRATE_DMA call returns; this requires that the OS not reuse or modify the data within the old page until the worst case DMA read access time has expired. The “ibm,dma-delay-time” property (see ) gives the OS this implementation dependent delay value. Failure to observe this delay time may result in data corruption as seen by the caller’s I/O adapter(s). R1--1. For the LPAR option supporting the hcall-migrate function set: The platform must supply the “ibm,dma-delay-time” property under the /rtas node of the device tree. Memory pages may be simultaneously mapped by multiple DMA agents, with different translation table formats and operation characteristics. The H_MIGRATE_DMA hcall() atomically performs the memory migration process so that the new page contains the old page contents (as updated by any DMA write operations allowed during migration), with all DMA mappings and engines directed to access the new page. The entries in the mapping list contain the logical bus number associated with the mapping and the I/O address of the mapping. 
From these two data, the hcall() associates the using DMA agent, that agent’s DMA control procedures, the specific mapping table and mapping table entry. R1--2. For the LPAR option supporting the hcall-migrate function set: The platform must support migration of pages mapped for DMA using any of the platform supported DMA agents. R1--3. For the LPAR option supporting the hcall-migrate function set: All the platform’s DMA agents must support mechanisms that enable the platform to meet the syntax, semantics and requirements set forth in section 14.5.4.8.1. Implementation Note: The minimal hardware mechanisms to support the hcall-migrate function set are to quiesce DMA operation, flush outstanding data to their targets (both reads and writes), modify their DMA mapping and re-enable operation utilizing said modified DMA mapping without introducing unrecoverable operational failures. Provision for the hardware to direct DMA write operations to both old and new pages provides a significantly more robust implementation. It is the intent of this architecture to have all memory in the platform have the capability to be migrated. However, on the rare implementation that cannot meet that intent, the “ibm,no-h-migrate-dma” property may be provided in memory nodes for which H_MIGRATE_DMA cannot be implemented. R1--4. For the LPAR option supporting the hcall-migrate function set: If a memory node cannot support H_MIGRATE_DMA, then that memory node must contain the “ibm,no-h-migrate-dma” property. For the I/O Super Page option the I/O page size is an attribute of the specified LIOBN (I/O pages mapped by a given LIOBN are a uniform size), also the syntax and semantics of H_MIGRATE_DMA are extended to allow migration of I/O pages that are larger than 4K bytes and have more than 256 xlates translation entries. Specifying more than 256 translation entries requires a sequence of calls to H_MIGRATE_DMA with the same “newpage” address. 
Making a call in the sequence with a length parameter of zero terminates the operation; should this termination happen after the start of the physical migration, the resulting state of the calling partition’s memory is unpredictable. Failure to make a continuing call in the sequence for more than one second aborts the operation; again the resulting state of the calling partition’s memory is unpredictable. The introduction of super pages introduces the case where portions of the super page may be I/O mapped and thus require the use of H_MIGRATE_DMA to move the logical super page from one physical page to another even though the super page as a whole may not be I/O mapped. To handle this case, the LIOBN value of 0xFFFFFFFF is reserved to allow the specification, within a translation entry (passed to H_MIGRATE_DMA via the xlates parameter), of a super page that is not currently I/O mapped. In this case, the normally reserved byte at xlates entry offset 4 is used to specify the power of two size of the super page. R1--5. For the I/O Super Page option: the platform must support the setting by the client of byte 3 bit 0 of the ibm,architecture.vec 5 as input to the ibm,client-architecture-support method.

H_MIGRATE_DMA Syntax: Parameters: newpage (The Logical address of the new page to be the target of the TCE translations) xlates (The Logical address of a list of translations against the target page the format of this list is: List starts on a page (4 K) boundary. Contains up to 256 translation entries: First 4 bytes of a translation entry is the logical bus number as from either the: “ibm,dma-window” property or the reserved LIOBN 0xFFFFFFFF. Next 12 bytes of a translation entry is the logical bus offset (I/O bus address). The format of the I/O bus address is dependent upon the DMA agent: For 32 bit PCI, the high order 8 bytes are reserved with the low order 4 bytes containing a 4 K aligned address (low order 12 bits =zero). For 64 bit PCI, the high order 4 bytes are reserved with the low order 8bytes containing a 4 K aligned address (low order 12 bits =zero). For the I/O Super Page option the very first translation entry passed is for the largest I/O page to be migrated by this sequence of calls; else all translation entries are for the single 4K byte logical page being migrated. The first translation entry may either be a current I/O mapping for the largest I/O page that the caller wishes to migrate, or the first translation entry may use the reserved LIOBN number of 0xFFFFFFFF, with the next byte indicating the page size as 2**N where N is the numeric value of the byte at offset 4 into the translation entry with the low order 8 bytes of the translation entry being the logical real address of the start of the page to be migrated (the low order N bits = zero). length (Number of entries in translation list is less than or equal to 256) If the total number of translation entries in the xlates list is less than or equal to 256 then the “length” parameter is the number of translation entries. 
For the I/O Super Page option and specifying more than 256 translation entries, the client makes a series of calls, each passing 256 translation entries with the “length” parameter being the negative of the total number of translation entries yet to be passed, until there are less than or equal to 256 remaining; then, for the final call in the initiating sequence, the “length” parameter is positive as above. Semantics: For the I/O Super Page option: determine if a migration operation is in process for this “newpage” address: Then: If the previous hcall() for the migration operation was more than 1 second ago, return H_Aborted. If the length parameter value is zero then abort the migration operation and return H_TERM. If the length parameter value is not the next expected in the sequence return H_P3. Record the new xlates. If the length parameter is less than zero return H_CONTINUE. Else: If the number of outstanding operations is more than an implementation specific number as communicated in the “ibm,vec-5” property then return H_Resource. If the length parameter is less than zero, initiate a new migration operation for the “newpage” address. (Note resources for the operation may be allocated at this point and freed when the operation terminates either normally, in error, or via timeout. Implementations may, in unusual cases, use a busy return code to wait for the release of resources from an imminently completing operation.) The first xlate entry specifies the length and starting address of the page to be migrated; if this specification is invalid (unsupported length, the address is invalid for the partition, or not aligned to the length) return H_MEM_PARM. If the operation specifies more than an implementation specific number of xlates as communicated in the “ibm,vec-5” property then return H_Resource. Check that the page to be migrated can be migrated, else H_PARAMETER. 
Check that the newpage is within the allocated logical page range of the calling partition and the address is aligned to the I/O page size of the first translation entry passed else H_PARAMETER. If the Shared Logical Resource option is implemented and the newpage parameter represents a shared logical resource location that has been rescinded by the owner, return H_RESCINDED. The contents of the xlates buffer are checked. This may be done as each entry is used, or it may be done prior to starting the operation. If the former, then partial processing must be backed out in the case of a detected parameter error. If the latter, then the translation entries must be copied into an area that is not accessible by the calling OS to prevent parameter corruption after they have been verified. The OS perceived reentrancy of the function is not diminished if this option is chosen. The xlates buffer starts on a 4 K boundary within the partition’s logical address range else H_PARAMETER. The length parameter is between (for the I/O Super Page option: the negative of the maximum number of xlate entries supported as indicated in the “ibm,architecture-vec-5” property of the /chosen device tree node else 1) and 256 else H_PARAMETER. For the I/O Super Page option: the length of the physical page to be migrated is the length of the I/O page of the first translation entry; else the length of the physical page to be migrated is 4K bytes. Each translation originally references the same physical page, or a portion thereof, else H_PARAMETER. Each logical bus offset is within the allocated range of the calling partition else H_PARAMETER. If the Shared Logical Resource option is implemented and the logical bus offset represents a shared logical resource location that has been rescinded by the owner, return H_RESCINDED. Check the logical bus number: Is allocated to the calling partition else H_PARAMETER. 
Or: If the Shared Logical Resource option is implemented and the logical bus number represents a shared logical resource location that has been rescinded by the owner, return H_RESCINDED. For the I/O Super Page option: if the LIOBN implies a larger page size than that specified by the first translation entry for this migrate operation, place the index of the translation entry (0-255) into register R4 and return H_PGSB_PARM. If the LIOBN referenced an unsupported DMA agent, place the index of the translation entry (0-255) into register R4 and return H_Function. If the logical bus number is not supported, return H_PARAMETER. Note: The following is written from the perspective of a PCI DMA agent; other DMA agents may require a different sequence of operations to achieve equivalent results. The hypervisor disables arbitration for the IOA(s) associated with the translation entries. (In some cases, where multiple IOAs share a given TCE range, arbitration must be disabled for multiple IOAs. The firmware assigned the bus address ranges to each IOA, so it knows which IOAs correspond to which translation.) The hypervisor waits for outstanding DMA write activity to complete. (This is accomplished by doing a load from an appropriate register in the bridge(s) closest to the IOA; when the load completes (dependency on load data is satisfied) all DMA write activity has completed.) The hypervisor copies the contents of the 4 K page originally accessed by the TCE(s) to the page referenced by the newpage value. The hypervisor translates the logical address within the newpage parameter and stores the resultant value in the TCE table entries specified by the translation entries. The hypervisor executes a sync operation to ensure that the new TCE data is visible. The hypervisor enables arbitration on the IOA(s) associated with the translation entries and returns H_Success. Implementation Notes: The firmware should be written to minimize the arbitration disable time. 
The old page should be read into cache (possibly using the data cache touch operations) prior to disabling the arbitration. Implementation dependent algorithms can significantly improve the page copy time. The firmware does not have to serialize this hcall() with other hcall()s as long as it updates the TCE using atomic eight (8) byte write operations. However, if the OS does not serialize this call with H_PUT_TCE to the same TCE, and with other H_MIGRATE_DMA calls to the same IOA(s), the calling LPAR’s DMA buffers could be corrupted. To minimize the effect of such unsupported DMA agents, the platform designer should isolate such agents on their own bus with their own “ibm,dma-window” property specification.
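The 16-byte translation entry layout described in the H_MIGRATE_DMA parameters can be sketched in C as follows. The sketch assumes the entries are stored big-endian in memory, which this section does not state explicitly; the constant and function names are hypothetical. It covers the 64 bit PCI form and the reserved-LIOBN form used for a super page that is not I/O mapped.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XLATE_ENTRY_SIZE     16u          /* 4-byte LIOBN + 12-byte offset */
#define XLATE_MAX_PER_CALL   256u         /* entries per xlates list       */
#define LIOBN_NOT_IO_MAPPED  0xFFFFFFFFu  /* reserved LIOBN from the text  */

/* Assumption: multi-byte fields in the entry are big-endian. */
static void store_be32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (24 - 8 * i));
}

static void store_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* 64 bit PCI entry: bytes 0-3 LIOBN, bytes 4-7 reserved (zero),
 * bytes 8-15 the 4 K aligned I/O bus address. */
static void xlate_pack_pci64(uint8_t entry[XLATE_ENTRY_SIZE],
                             uint32_t liobn, uint64_t ioba)
{
    assert((ioba & 0xFFFu) == 0);         /* low order 12 bits = zero */
    memset(entry, 0, XLATE_ENTRY_SIZE);
    store_be32(entry, liobn);
    store_be64(entry + 8, ioba);
}

/* First entry for a super page that is not I/O mapped: reserved LIOBN,
 * byte 4 holds N (page size 2^N), low order 8 bytes hold the logical
 * real address of the page start (low order N bits = zero). */
static void xlate_pack_unmapped_super_page(uint8_t entry[XLATE_ENTRY_SIZE],
                                           unsigned page_shift, uint64_t lra)
{
    assert(page_shift >= 12 && page_shift < 64);
    assert((lra & ((UINT64_C(1) << page_shift) - 1)) == 0);
    memset(entry, 0, XLATE_ENTRY_SIZE);
    store_be32(entry, LIOBN_NOT_IO_MAPPED);
    entry[4] = (uint8_t)page_shift;
    store_be64(entry + 8, lra);
}
```

An OS would fill a 4 K aligned buffer with up to XLATE_MAX_PER_CALL such entries and pass the buffer’s logical address as the xlates parameter.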

Performance Monitor Support hcall()s

H_PERFMON To manage the Performance Monitor Function: Syntax: Parameters: mode-set: Platform specific modes to be set by this call. mode-reset: Platform specific modes to be reset by this call. Semantics: Check the mode-set bit(s) for platform specific validity else H_PARAMETER. Check the mode-reset bit(s) for platform specific validity else H_PARAMETER. If any mode-set bits are on, activate the corresponding mode(s) if logically capable, else H_RESOURCE. If any mode-reset bits are on, deactivate the corresponding mode(s) if logically capable, else H_RESOURCE. Place the current state of the platform specific modes in R4 and return H_Success. Defined Perfmon mode bits: bit 0: 1 = Enable Perfmon. bit 1: 0 = Low threshold granularity, 1 = High threshold granularity.
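The mode-set and mode-reset parameters of H_PERFMON can be modeled with the short C sketch below. It assumes bit 0 is the most significant bit of the 64-bit register, and it applies mode-set before mode-reset, following the order of the semantics; the macro and function names are hypothetical, and the function is only a model of the mode word returned in R4, not of the hypervisor itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: IBM bit numbering, bit 0 is the most significant bit. */
static uint64_t ppc_bit(unsigned bit)
{
    assert(bit <= 63);
    return UINT64_C(1) << (63 - bit);
}

/* Hypothetical names for the two defined Perfmon mode bits. */
#define PERFMON_ENABLE          ppc_bit(0)  /* 1 = Enable Perfmon        */
#define PERFMON_HIGH_THRESHOLD  ppc_bit(1)  /* 1 = high, 0 = low         */
                                            /* threshold granularity     */

/* Model of the resulting mode word: activate the mode-set bits first,
 * then deactivate the mode-reset bits. */
static uint64_t perfmon_apply(uint64_t current, uint64_t mode_set,
                              uint64_t mode_reset)
{
    return (current | mode_set) & ~mode_reset;
}
```

For example, enabling the performance monitor with low threshold granularity corresponds to mode-set = PERFMON_ENABLE and mode-reset = PERFMON_HIGH_THRESHOLD.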

H_GET_DMA_XLATES_LIMITED
This hcall returns the I/O bus address of the first entry defined for the specified LIOBN, and the corresponding logical address, that translates into the range beginning with the Start logical address and less than the End logical address. The search is limited to the range of I/O bus addresses specified by the SIOBA and EIOBA parameters.
R1--1. For the LRDR Option: The platform must implement the H_GET_DMA_XLATES_LIMITED hcall() per the syntax and semantics specified in section .
R1--2. For the LRDR Option: The platform must present the “ibm,h-get-dma-xlates-limited-supported” property in all PCI host bridge OpenFirmware nodes for which the H_GET_DMA_XLATES_LIMITED hcall() is supported for all child LIOBNs.
Syntax:
Parameters:
Register R4: Logical I/O Bus Number (LIOBN). Bits 0-31 are reserved and set to zero. Bits 32-63 contain a 32-bit unsigned binary integer that identifies a translation which may have one or more entries that translate to a page within a range specified by the Start and End logical addresses.
Register R5: Start Logical Address (SLA)
Register R6: End Logical Address (ELA)
Register R7: Start I/O Bus Address (SIOBA) of the translation specified by the LIOBN. The SIOBA register may specify a special value of -1 or a starting IOBA.
Register R8: End I/O Bus Address (EIOBA) of the translation specified by the LIOBN. The EIOBA register may specify a special value of -1 or an ending IOBA.
Semantics:
Check that the specified LIOBN is supported and allocated to the calling logical partition, else H_PARAMETER.
Check that the specified start logical address (SLA) is within the allocated range of the calling logical partition, and is designated on a 4 K-byte boundary, else H_P2.
Check that the End logical address (ELA) minus 4K is within the allocated range of the calling logical partition, and is designated on a 4 K-byte boundary, else H_P3.
(May point no further than one page beyond the maximum partition logical real address in order to stay within the partition yet include the last partition page in the range of the test.)
Check that the specified starting logical address (SLA) is less than the specified ending logical address (ELA), else H_P2.
Check that the page specified by the logical addresses within the specified range is within the allocated range of the calling logical partition and the address is 4 K-byte aligned else H_P2.
Check the content of SIOBA: If a value other than -1 is specified, check that the specified start I/O bus address (SIOBA) is not outside of the range of IOBAs for the specified LIOBN, else H_P4. If the SIOBA specifies a value of -1, the hypervisor starts the search at the lowest IOBA in the translation table, otherwise the search starts at the address specified by the SIOBA.
Check the content of EIOBA: If a value other than -1 is specified, check that the specified ending I/O bus address (EIOBA) is not outside of the range of IOBAs for the specified LIOBN, else H_P5. If the EIOBA specifies a value of -1, the hypervisor ends the search at the highest IOBA in the translation table, otherwise the search ends at the address specified by the EIOBA.
Outputs:
Place the I/O bus address and corresponding logical address into the respective registers:
Register R4: I/O Bus Address (IOBA). This register contains a 64-bit unsigned binary integer that specifies the I/O bus address of the page within the specified logical address range for the specified LIOBN. The IOBA is returned when the hcall() completes with either H_PARTIAL, H_PAGE_REGISTERED, or H_IN_PROGRESS return codes.
Register R5: Corresponding Logical Address (CLA). This register contains a 64-bit unsigned binary integer that designates the logical address of a page within the specified range that corresponds to the I/O bus address.
If the hcall() completes with H_IN_PROGRESS return code, the corresponding logical address (CLA) is not returned. When the hcall() completes with H_PARTIAL or H_PAGE_REGISTERED return code: The I/O bus address (IOBA) and corresponding logical address (CLA) are returned. When the hcall() completes with H_PAGE_REGISTERED return code: The I/O bus address (IOBA) is for the final page of the translation table for the specified LIOBN as limited by the EIOBA parameter. When the hcall() completes with H_IN_PROGRESS return code: The current IOBA being searched against the specified range is returned, but the corresponding logical address is not returned. The hcall can be reissued by specifying the IOBA as the starting IOBA without incrementing the IOBA by the resource page size. Firmware Implementation Notes: When the H_GET_DMA_XLATES_LIMITED hcall() is issued, the hypervisor searches the translation table designated by the specified LIOBN, from the entry for SIOBA through the entry for EIOBA in IOBA order, for the entries that translate to a page within a given range of logical addresses. If an entry is found, the hcall() completes with the H_PAGE_REGISTERED return code if the page found is the last entry in the translation table, or the H_PARTIAL return code for all other pages, and the IOBA with the corresponding logical address are returned in output registers R4 and R5 respectively. The hypervisor searches the translation table in IOBA order, and proceeds in that order until an entry that translates to a physical address within the specified range of logical addresses is found, in which case, the hcall() completes with H_PARTIAL or H_PAGE_REGISTERED return code, or H_Success, if the end of the translation table, as specified by the EIOBA parameter, is reached. 
Software Implementation Notes:
When the hcall() completes with H_PARTIAL return code, the stored IOBA is incremented by the page size of the resource corresponding to the specified LIOBN, and then specified as the starting I/O bus address on a subsequent call, where the hypervisor then proceeds with the search until the end of the translation table, specified by the EIOBA parameter, is reached. The caller can accumulate a full list of the IOBAs for the specified LIOBN that translate into the specified range of logical addresses, which then forms part of the xlate translation entries specified as an input to the H_MIGRATE_DMA function.
When the hcall() completes with H_PAGE_REGISTERED return code, this indicates that the page is contained in the specified range of logical addresses, and that it is the last page of the translation table, so the search for that LIOBN is complete.
If a value other than -1 is specified in the starting I/O bus address register, the program should check that the specified SIOBA value is not the same as the returned IOBA.
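The caller-side loop described in the software implementation notes can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `fake_hcall` is a toy model of the hypervisor search, its `budget` parameter exists only so that H_IN_PROGRESS can occur in the simulation, `collect_iobas` is a hypothetical helper name, and a 4 K resource page size is assumed.

```python
from functools import partial

PAGE = 4096  # assumed page size of the resource behind the LIOBN


def fake_hcall(table, sla, ela, sioba, eioba, budget=2):
    """Toy search over `table`, a dict mapping IOBA -> logical address.

    Returns (return_code, ioba, cla). `budget` caps the entries examined
    per call; it is an artifact of this sketch.
    """
    iobas = sorted(table)
    start = iobas[0] if sioba == -1 else sioba
    end = iobas[-1] if eioba == -1 else eioba
    in_range = [i for i in iobas if start <= i <= end]
    examined = 0
    for ioba in in_range:
        if examined == budget:
            return "H_IN_PROGRESS", ioba, None      # CLA is not returned
        examined += 1
        if sla <= table[ioba] < ela:
            last = ioba == in_range[-1]
            return ("H_PAGE_REGISTERED" if last else "H_PARTIAL"), ioba, table[ioba]
    return "H_Success", None, None


def collect_iobas(hcall, sla, ela, sioba=-1, eioba=-1, page=PAGE):
    """Accumulate every (IOBA, CLA) pair that translates into [sla, ela)."""
    found = []
    while True:
        rc, ioba, cla = hcall(sla, ela, sioba, eioba)
        if rc == "H_IN_PROGRESS":
            sioba = ioba                # reissue without incrementing
        elif rc == "H_PARTIAL":
            found.append((ioba, cla))
            sioba = ioba + page         # resume after the reported page
        elif rc == "H_PAGE_REGISTERED":
            found.append((ioba, cla))   # last page of the table: search complete
            return found
        elif rc == "H_Success":
            return found                # end of table reached, nothing more found
        else:
            raise RuntimeError(rc)


table = {0x0000: 0x10000, 0x1000: 0x50000, 0x2000: 0x11000,
         0x3000: 0x60000, 0x4000: 0x12000}
pairs = collect_iobas(partial(fake_hcall, table), 0x10000, 0x13000)
```

The accumulated pairs are the kind of list that the text says can then be supplied as translation entries to H_MIGRATE_DMA.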

RTAS Requirements
RTAS function as specified in this architecture is still required for LoPAR LPAR partitions. RTAS is instantiated via an OF client interface call. RTAS operates without memory translation; therefore, the OS should instantiate it within the RMA. However, the OF client interface does not enforce this limitation. The RTAS calling sequences remain unchanged. However, in LPAR configurations RTAS code is implemented differently than in non-LPAR systems. LPAR RTAS has a part which is replicated in each partition, and since RTAS has the capability to manipulate hardware system resources, RTAS has a part which is implemented in the hypervisor. In the hypervisor, there is a check of the RTAS parameters for validity before execution. Therefore, the function of the partition replicated RTAS call is to marshal the arguments and make the required hidden hcall()s to the hypervisor. In a non-LPAR system, RTAS calls are assumed to be made with valid parameters. This cannot be assumed with LPAR. The LPAR RTAS operates by all the rules of non-LPAR RTAS relative to it running real, with real mode pointers to arguments and the same serialization requirement relative to a single partition. However, the hypervisor may not assume that the caller is following these serialization rules. Failure on the part of the OS to properly serialize is allowed to cause unpredictable results within the scope of the calling partition but may not affect the proper operation of other platform partitions.
The following is a list of RTAS functions that are not defined or implemented when the LPAR option is active: restart-rtas
R1--1. For the LPAR option: The platform must implement the PowerPC External Interrupt option.
R1--2. For the LPAR option: The Firmware must initialize each processor’s interrupt management area’s CPPR to the most favored level and its MFRR to the least favored level before passing control of the processor to the OS.
R1--3. For the LPAR option: The RTAS rules of serialization of RTAS calls must only apply to a partition and not to the system.
R1--4. For the LPAR option: The hypervisor cannot trust the RTAS calls to have no errors, therefore, the hypervisor must check a partition’s RTAS call parameters for validity.
R1--5. For the LPAR option: RTAS must be instantiated within the RMA of partition storage.
R1--6. For the LPAR option: RTAS arguments must be within the RMA of partition storage unless specifically specified in the RTAS call definition.
R1--7. For the LPAR option: If one or more hcalls fail due to hardware error (return status -1), the platform must make available, prior to the completion of the next boot sequence, via an event-scan/check-exception, an error log indicating the hardware FRU responsible for such failures. Due to the asynchronous nature of error analysis, there is not a direct correlation between the log and a specific failing hcall(); indeed, the error log may precede the failing hcall().
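One of the validity checks implied by the requirements above (RTAS arguments must lie within the RMA) can be sketched as a simple range test. This is a hedged illustration only: the function names are hypothetical, the RMA is assumed to span partition logical addresses 0 through `rma_size - 1`, and the argument buffer is assumed to consist of a token, an input count, an output count, and then one 4-byte cell per input and output.

```python
def within_rma(addr, length, rma_size):
    """True if [addr, addr + length) lies inside the assumed RMA range."""
    return addr >= 0 and length >= 0 and addr + length <= rma_size


def check_rtas_call(args_addr, nargs, nret, rma_size, cell=4):
    """Validate the location of an RTAS argument buffer before acting on it."""
    # Assumed layout: token, nargs, nret, then nargs inputs and nret outputs.
    length = (3 + nargs + nret) * cell
    return within_rma(args_addr, length, rma_size)
```

A hypervisor would perform many further checks per call; the point of the sketch is only that no address supplied by the partition is used before it has been bounds-checked.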

OF Requirements
The hypervisor is initialized and configured prior to the loading of OF into the partition and boot of any client program (OS) in the partition by OF. The NVRAM data base that describes the platform’s partitioning is used to trigger the loading and initialization of the hypervisor. When Logical Partitioning is enabled, a copy of OF code is loaded into each partition where it builds the per partition device tree within the partition’s RAM. The per partition device tree contains only entries for platform components actually assigned to or used by the partition. The invocation of the subset of the OF Client interface specified below appears the same to the OS image regardless of the state of the LPAR option.
A model of the boot sequence is as follows:
Support processor runs chip tests and configures the CPU chips.
The support processor loads the boot ROM image into System Memory along with the configuration information: POST code, Initialization Firmware, Hardware configuration reporting structures, OF, Hypervisor, RTAS.
The boot ROM executes POST and Initialization Firmware.
Processor initialization code synchronizes the time bases of all platform processors to a small value (approaching zero).
Initialization Firmware accesses the NVRAM Partition Database to determine if the LPAR option is enabled.
Initialization Firmware initializes the hypervisor.
The hypervisor configures itself using the hardware configuration reporting structures.
The hypervisor configures the various partitions with resources as required by the NVRAM Partition Database.
The hypervisor loads a copy of OF into each partition, passing to OF a resource reporting structure known as the NACA/PACA.
OF notices in the NACA/PACA that a specific partition table is specified.
OF scans the configuration and walks the buses to build the partition device tree.
OF requests the specific partition table from the NVRAM Partition Database.
OF loads RTAS into the partition’s memory.
OF pulls in the configuration variables from the partition’s NVRAM area and uses them to determine the partition’s boot device.
OF then loads the client program and starts executing it with one of the partition’s processors.
The client program notices that it is running on an LPAR capable machine but does not have the hypervisor bit on in the MSR, so it must use hcall() routines for its PFT and TCE accesses. The presence of the “ibm,hypertas-functions” property is a duplicate indication of LPAR mode.
R1--1. For the LPAR option: The OF code state must be retained after all partitions are initialized pending future boot requests.
R1--2. For the LPAR option: The OF code must recognize that logical partitioning is required as opposed to a non-LPARed system.
R1--3. For the LPAR option: The OF must generate the device tree for the partition within the partition’s RAM.
R1--4. For the LPAR combined with Dynamic Reconfiguration option: The “interrupt-ranges” property for any reported interrupt source controller must report all possible interrupt source numbers.
R1--5. For the LPAR option: The OF device tree for a partition must include in the root node, the “ibm,partition-no” property.
R1--6. For the LPAR option: The OF device tree for an LPAR capable model not running in a partition must include in the root node, the “ibm,partition-no” property when the default partition number for the first partition created is not 1.
R1--7. For the LPAR option: The “ibm,partition-no” property value must be an integer in the range of 1 to 2^20-1.
R1--8. For the LPAR option: The OF device tree for a partition must include in the root node, the “ibm,partition-name” property.
R1--9. For the LPAR option: When the platform does not provide a partition manager and the one and only partition in the system owns all the partition visible system resources, then the default value of the “ibm,partition-name” property must be the content of the SE keyword (as displayed in the same form as the root node “system-id” property) with a hyphen added between the plant of manufacture and sequence number.
R1--10. For the LPAR option: The nodes of the OF device tree for a partition that represent platform resources that are not explicitly allocated for the control of the platform’s OS image must be marked “used-by-rtas”. This includes, but is not limited to, memory controllers and IO bridges that are a part of the platform’s infrastructure common to more than one partition and commonly represented in the OF device tree. It does not include read only resources such as environmental sensors.
R1--11. For the LPAR option: The OF must, at the OS’s request, load the required RTAS into the partition’s real addressable memory region.
R1--12. For the LPAR option: The OF must use the partition’s segment of the NVRAM to establish the partition’s boot device and configuration variables.
R1--13. For the LPAR option: The OF must load the client program and choose the partition’s processor on which to begin execution.
Note: It is the responsibility of the client program to recognize whether or not to use LPAR page management.
R1--14. For the LPAR option: The platform must initialize the time base of the first processor to a small (approaching zero) value prior to turning over control of the processor to a client program.
R1--15. Reserved
R1--16. For the LPAR option: The OF Client Interface must restrict access to only resources contained within the calling partition’s version of the device tree.
R1--17. For the LPAR option: The OF Client Interface must prevent the calls of one partition’s client program from interfering with the operation of another partition’s client program.
R1--18. For the LPAR option: The OF Client Interface must restrict its supported calls and methods to those specified in .
R1--19. For the LPAR option: Any hidden hcall()s which firmware may use to implement OF functions must check their parameters to ensure compliance with all of the architecturally mandated OF requirements.
R1--20. For the LPAR option: The OF Client Interface functions “start-cpu” and “resume-cpu” must restrict their operation to processors assigned to the calling Client’s partition.
OF Client Interface Functions Supported under the LPAR Option:
test, canon, child, finddevice, getprop, getproplen, instance-to-package, instance-to-path, nextprop, package-to-path, parent, peer, setprop, call-method, test-method, close, open, read, seek, write, claim, release, boot, enter, exit, start-cpu, milliseconds, size(/chosen/nvram), get-time, instantiate-rtas
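Two of the client interface restrictions above (a fixed set of supported functions, and processor operations limited to the caller's partition) amount to membership tests. The sketch below is illustrative only: the helper names are hypothetical, `canon` is spelled as in IEEE 1275, and `size` stands for the entry listed as size(/chosen/nvram).

```python
# Function names taken from the list of supported functions above.
SUPPORTED_SERVICES = frozenset({
    "test", "canon", "child", "finddevice", "getprop", "getproplen",
    "instance-to-package", "instance-to-path", "nextprop", "package-to-path",
    "parent", "peer", "setprop", "call-method", "test-method", "close",
    "open", "read", "seek", "write", "claim", "release", "boot", "enter",
    "exit", "start-cpu", "milliseconds", "size", "get-time",
    "instantiate-rtas",
})


def service_allowed(name):
    """True if the client interface function may be invoked under LPAR."""
    return name in SUPPORTED_SERVICES


def may_start_cpu(partition_cpus, cpu):
    """A processor may be started only if it is assigned to the caller's partition."""
    return cpu in partition_cpus
```

For example, a request for a function that is absent from the list, or a request to start a processor owned by another partition, would be refused before any platform state is touched.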

NVRAM Requirements
The NVRAM is divided into multiple partitions each containing different categories of data similar to files in a file system (these NVRAM partitions are not to be confused with LPAR partitions). Each NVRAM partition is structured with a self identifying header followed by its partition unique data. Many of these NVRAM partitions contain data only relevant to the platform firmware, while others contain data that either is for OS image use from boot to boot or is used to communicate operational parameters from the OS image to the platform. The platform firmware on LPAR supporting platforms structures the NVRAM as per . Each LPAR partition is assigned a region of NVRAM space. This includes space for LPAR partition specific configuration variables as well as the minimum 4 K space reserved for the OS image. The hypervisor restricts access for the LPAR partition, through logical address translation and range checking, to its assigned NVRAM region. Other regions of NVRAM are reserved for firmware use including, for instance, information about how the system should be partitioned.
LPAR NVRAM Map
Real Address Range | Per Partition NVRAM access routine rtas call Logical Address Range (outside of legal range return 0x00 and discard write data) | Contents
0x00 to F-1 | NA | Firmware only partitions (Signatures 0x00 to 0x6F)
F to (F-1+P) | 0x00 to P | Per LPAR partition copies of supported NVRAM partitions with signatures 0x70 to 0x7F
(F+P) to (F-1+2P) | 0x00 to P | Per LPAR partition copies of supported NVRAM partitions with signatures 0x70 to 0x7F
... | |
(F+(P*(n-1))) to ((F-1)+ nP) | 0x00 to P | Per LPAR partition copies of supported NVRAM partitions with signatures 0x70 to 0x7F

NVRAM partitions on LPAR platforms
Visible to | Partition Signatures | Partition Name | Comments
Only to the Platform firmware | 0x00 - 0x6F | |
Only to Platform firmware and the OS image running in the owning LPAR Partition. The read and write NVRAM RTAS routines | 0x70 | Common | This partition is duplicated per partition.
 | 0x7F | 0x7777777777777777-77777777 | This partition is duplicated per partition and is at least 4 KB long when the OS is first installed.

R1--1. For the LPAR option: Platform OF must locate configuration variables that the OS must manipulate to select options as to how the specific OS image interfaces or relates to the platform in the partition’s “system” partition signature (0x70) named “common”, specifically none may be located in the “OF” signature (0x50). R1--2. For the LPAR option: The NVRAM region assigned to an LPAR partition must contain, after any platform required NVRAM partitions have been allocated, a free space partition a minimum of 4 KB long prior to the installation of the partition’s OS image.
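The address arithmetic in the LPAR NVRAM Map can be written out directly. The following sketch is a toy model under stated assumptions: `PartitionedNvram` and its method names are hypothetical, F is the size of the firmware-only area, P is the per-partition region size, partitions are numbered from 1, and legal logical offsets are taken to be 0 through P-1. Accesses outside the legal range read as 0x00 and discard written data, as the map specifies.

```python
class PartitionedNvram:
    """Toy model of NVRAM split into a firmware area and n LPAR regions."""

    def __init__(self, firmware_size, partition_size, partitions):
        self.F = firmware_size
        self.P = partition_size
        self.n = partitions
        self.cells = bytearray(firmware_size + partition_size * partitions)

    def real_address(self, partition, offset):
        """Map (1-based partition, logical offset) to a real address, or None."""
        if not 1 <= partition <= self.n or not 0 <= offset < self.P:
            return None
        return self.F + self.P * (partition - 1) + offset

    def fetch(self, partition, offset, length):
        out = bytearray()
        for i in range(offset, offset + length):
            real = self.real_address(partition, i)
            out.append(self.cells[real] if real is not None else 0x00)
        return bytes(out)

    def store(self, partition, offset, data):
        for i, byte in enumerate(data, start=offset):
            real = self.real_address(partition, i)
            if real is not None:          # out-of-range bytes are discarded
                self.cells[real] = byte
```

Because every access goes through `real_address`, a partition cannot reach the firmware-only area or a neighboring partition's region, regardless of the offset it supplies.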

Administrative Application Communication Requirements
The platform needs to communicate with an administrative application (outside of the scope of LoPAR) to manage the platform resources. The administrative application may run in an external computer such as a Hardware Management Console, or it may be integrated into a service partition. Many system facilities are not dedicated to an LPAR partition but are managed through the HMC and the administrative application.
R1--1. For the LPAR option: The platform must provide a communications means to the administrative application.
R1--2. For the LPAR option: The platform must respond to messages received from the administrative application.

RTAS Access to Hypervisor Virtualized Resources
All allocatable platform resources are always assigned to a partition. There always exists a dummy partition that is never active. Resources assigned to partitions that are inactive may be reassigned to other partitions by mechanisms implemented in the Hardware Management Console.
R1--1. For the LPAR option: The nvram-fetch RTAS call must restrict access to only the LPAR partition’s assigned “OS”, “System” and “Error Log” NVRAM partitions.
R1--2. For the LPAR option: The nvram-store RTAS call must restrict access to only the LPAR partition’s assigned “OS”, “System” and “Error Log” NVRAM partitions.
R1--3. For the LPAR option: The get-time-of-day RTAS call must return the LPAR partition’s specific time of day clock value.
R1--4. For the LPAR option: The set-time-of-day RTAS call must set the LPAR partition’s specific time of day clock value.
Firmware Implementation Note: The model implementation keeps time of day on a partition basis. What is really changed is the offset from the hardware TOD clock, which is not normally written (it is only written if for some reason it is approaching its maximum value, such as after a battery failure).
R1--5. For the LPAR option: The event-scan RTAS call must report global events to each LPAR partition and LPAR partition local events only to the affected LPAR partition.
R1--6. For the LPAR option: The check-exception RTAS call must report global events to each LPAR partition and LPAR partition local events only to the affected LPAR partition.
R1--7. For the LPAR option: The rtas-last-error RTAS call must report only RTAS errors affecting the calling LPAR partition.
R1--8. For the LPAR option: The ibm,read-pci-config RTAS calls must restrict access to only IOAs assigned to the calling LPAR partition, and if the configuration address is not available to the caller, must return a status of Success with all ones as the output value.
R1--9. For the LPAR option: The ibm,write-pci-config RTAS calls must restrict access to only IOAs assigned to the calling LPAR partition, and if the configuration address is not available to the caller, must be ignored and must return a status of Success.
R1--10. For the LPAR option: The ibm,write-pci-config RTAS calls must prevent changing of the firmware assigned interrupt message number on IOAs configured to use message signaled interrupts.
R1--11. For the LPAR option: The platform must virtualize the display-character RTAS call such that the operator can distinguish and selectively read messages from each partition without interference with messages from other partitions.
R1--12. For the LPAR option: The set-indicator RTAS call must restrict access to only indicators assigned to the calling LPAR partition.
R1--13. For the LPAR option: The effects of the system-reboot RTAS call must be restricted to only the calling LPAR partition.
Firmware Implementation Note: One standard OS response to a machine check is to reboot, expecting the firmware to reset any error conditions, such as those in the I/O sub-system. When the I/O sub-system, or parts thereof, are shared among multiple partitions, the platform cannot allow the boot of one partition to prevent another partition from detecting that it was also affected by an I/O error.
R1--14. For the LPAR option: The platform must deliver machine check and other event notifications to all affected partitions before initiating recovery operations such as rebooting and resetting hardware fault isolation circuits.
R1--15. For the LPAR option: The start-cpu RTAS call must be restricted to only the processors assigned to the calling LPAR partition.
R1--16. For the LPAR option: The query-cpu-stopped-state RTAS call must be restricted to only the processors assigned to the calling LPAR partition.
R1--17. For the LPAR option: The power-off and ibm,power-off-ups RTAS calls must deactivate the calling partition and not power off the platform if other partitions remain active.
R1--18. For the LPAR option: The set-time-for-power-on RTAS call must activate the platform when the partition requesting the earliest activation time is to be activated.
R1--19. For the LPAR option: The ibm,os-term RTAS call must adjust support processor surveillance to account for the termination of the LPAR partition’s OS.
R1--20. For the LPAR option: The ibm,set-xive RTAS call must restrict access to only interrupt sources assigned to the calling LPAR partition by silently failing if the interrupt source is not owned by the calling partition (return success without modifying the state of the unowned interrupt logic).
R1--21. For the LPAR option: The ibm,set-xive RTAS call must restrict the written queue values to only interrupt processors assigned to the calling LPAR partition.
R1--22. For the LPAR option: The ibm,get-xive RTAS call must restrict access to only interrupt sources assigned to the calling LPAR partition by silently failing if the interrupt source is not owned by the calling partition (return success with the least favored interrupt level, the interrupt server number is undefined -- possibly all ones).
R1--23. For the LPAR option: The ibm,int-on RTAS call must restrict access to only interrupt sources assigned to the calling LPAR partition by silently failing if the interrupt source is not owned by the calling partition (return success without modifying the state of the unowned interrupt logic).
R1--24. For the LPAR option: The ibm,int-off RTAS call must restrict access to only interrupt sources assigned to the calling LPAR partition by silently failing if the interrupt source is not owned by the calling partition (return success without modifying the state of the unowned interrupt logic).
R1--25. For the LPAR option: The ibm,configure-connector RTAS call must restrict access to only Dynamic Reconfiguration Connectors assigned to the calling LPAR partition.
R1--26. For the LPAR option: The platform must either define or virtualize the power domains used by the set-power-level RTAS call such that power level settings do not affect other partitions.
R1--27. For the LPAR option: The set-power-level and get-power-level RTAS calls must restrict access to only power domains assigned to the calling partition.
R1--28. For the LPAR option: The platform must restrict the availability of the ibm,exti2c RTAS call to at most one partition (like any IOA slot).
R1--29. For the LPAR option: The ibm,set-eeh-option RTAS call must restrict access to only IOAs assigned to the calling partition.
R1--30. For the LPAR option: The ibm,set-slot-reset RTAS call must restrict access to only IOAs assigned to the calling partition.
R1--31. For the LPAR option: The ibm,read-slot-reset-state2 RTAS call must restrict access to only IOAs assigned to the calling partition.
R1--32. For the LPAR option: The ibm,configure-bridge RTAS call must restrict access to only configuration addresses assigned to the calling partition.
R1--33. For the LPAR option: The ibm,set-eeh-option RTAS call must restrict access to only IOAs assigned to the calling partition.
R1--34. For the LPAR option: The platform must restrict the ibm,open-errinjct, ibm,close-errinjct, and ibm,errinjct RTAS calls, as well as the errinjct properties, to be available on at most one partition as defined by a platform wide firmware configuration variable.
R1--35. For the LPAR option: Any hidden hcall()s which firmware may use to implement RTAS functions must check their parameters to ensure compliance with all of the architecturally mandated RTAS requirements.
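The "silently fail" pattern required for the PCI configuration calls above can be sketched as an ownership filter. This is a minimal model, not platform firmware: `PciConfigFilter`, its dictionaries, and the string status value are hypothetical, and real configuration addresses and register widths are simplified.

```python
class PciConfigFilter:
    """Toy per-partition view of PCI configuration space."""

    def __init__(self, owner_of, config_space):
        self.owner_of = owner_of      # {config_addr: owning partition id}
        self.space = config_space     # {config_addr: register value}

    def read(self, partition, config_addr, size):
        mask = (1 << (8 * size)) - 1
        if self.owner_of.get(config_addr) != partition:
            return "Success", mask    # not available to caller: all ones
        return "Success", self.space[config_addr] & mask

    def write(self, partition, config_addr, size, value):
        mask = (1 << (8 * size)) - 1
        if self.owner_of.get(config_addr) == partition:
            self.space[config_addr] = value & mask
        return "Success"              # writes to unowned addresses are ignored
```

Returning Success with all ones mimics what a probe of an empty slot would see, so a partition scanning configuration space cannot distinguish another partition's IOA from an absent device.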

Shared Processor LPAR Option
The Shared Processor LPAR (SPLPAR) option allows the hypervisor to generate multiple virtual processors by time slicing a single physical processor. These multiple virtual processors may be assigned to one or more OS images. There are two primary customer advantages to SPLPAR over the standard LPAR. Most obviously, the assigned processing capacity of the partition can scale downwards to allow for more OS images to be supported on a single platform. The second customer advantage is that a SPLPAR platform can achieve higher processor utilization by providing partitions that can use extra processing capacity with the spare capacity ceded from other partitions. This allows the customer to take advantage of the variable nature of the instantaneous load on any one OS image to achieve an increase in the average utilization of the platform’s capacity. While the peak capacity (directly related to the platform cost) stays constant, the customer may see a significant improvement in the average capacity among all the platform’s workloads. However, since the peak capacity cannot be physically exceeded, the customer may experience a wider variance in performance when exercising the SPLPAR option.
In principle, the OS images running on the virtual processors of an SPLPAR platform need not be aware that they are sharing their physical processors. In practice, however, they experience significantly better performance if they make a few optimizations. Specifically, performance improves if the OS images cede their virtual processor to the platform when they are idle, and confer their processor to the holder of a spin lock for which the virtual processor must wait. Another significant change due to SPLPAR is that there may not be a fixed relationship between a virtual processor and the physical processor that actualizes it.
In those cases, such physical information as location codes are undefined, affinity and associativity values are indistinguishable, relationships to secondary caches are meaningless, and any attempt by an OS to characterize the quality of its processor (such as running diagnostics or performance comparisons to other virtual processors) provides unreliable results. OF entities that represent physical characteristics of a virtual processor that do not remain fixed take on altered definitions/requirements in an SPLPAR environment. To provide input to the capacity planning and quality of service tools, the hypervisor reports certain statistics to an OS. These include the minimum processor capacity that the OS can expect (the OS may cede any unused capacity back to the platform), the maximum processor capacity that the platform grants to the OS, the portion of spare capacity (up to the maximum) that the platform grants to the OS, and the maximum latency to a dispatch via an hcall(). The OS image optionally registers a data area (VPA) for each virtual processor using the H_REGISTER_VPA hcall(). The hypervisor maintains a variable, within the data area, that is incremented each time the virtual processor is dispatched/preempted, such that the dispatch variable is always even when the virtual processor is dispatched and always odd when it is not dispatched. The architectural intent for the usage of the dispatch count variable is described below in the paragraph devoted to conferring the processor. Additionally, this hcall() may register a trace buffer which the OS may activate to gain detailed information about virtual processor preemption and dispatching. Both the VPA and the trace log buffer contain statistics on how long the virtual processor has waited (not been dispatched on a physical processor).
Architecturally, the virtual processor wait time is divided into three intervals:
1. The time that the virtual processor waited to become logically ready to run again, for example: the time needed to resolve a fault; the time needed to process a hypervisor preemption; the time until a wake up event after voluntarily relinquishing the physical processor.
2. The time spent waiting after interval 1 until virtual processor capacity was available. Shared processor partitions are granted a quantum of virtual processor capacity (execution time) each dispatch wheel rotation; thus if the partition has used its capacity, the ready to run virtual processor has to wait until the next quantum is granted.
3. The time spent waiting after interval 2 until the virtual processor was dispatched on a physical processor. This arises from the fact that multiple ready to run virtual processors with virtual processor capacity may be competing for a single physical processor.
Two other performance statistics are available via hcall()s: the Processor Utilization Register and the Pool Idle Count, returned by the H_PURR and H_PIC hcall()s respectively. These two statistics are counts in the same units as counted by the processor time base. Like the time base, the PUR and PIC are 64 bit values that are set to a numerically low value during system initialization. The difference between their values at the end and beginning of monitored operations provides data on virtual processor performance. The value of the PUR is a count of processor cycles used by the calling virtual processor. The PUR count is intended to provide an indication to the partition software of the computation load supported by the virtual processor. SPLPAR virtual processors are created by dispatching the virtual processor’s architectural state on one of the physical processors from a pool of physical processors.
The value of the PIC is the summation of the physical processor pool idle cycles, that is, the number of time base counts when the pool could not dispatch a virtual processor. The PIC count is intended to provide an indication to platform management software of the pool capacity to perform more work. A well behaved OS image normally cedes its virtual processor to the platform using the H_CEDE hcall() after it determines that it currently has run out of useful work. The H_CEDE hcall() gives up the virtual processor until either an external interrupt (including decrementer and Inter-Processor Interrupt) occurs or another one of the partition’s processors executes an H_PROD hcall() (see below). Note that the decrementer appears to continue to run during the time that the virtual processor is ceded to the platform. The H_CEDE hcall() always returns to the next instruction; however, prior to executing the next instruction, any pending interrupt is taken. To simulate atomic testing for work, the H_CEDE call may be called with interrupts disabled; however, the H_CEDE call activates the virtual processor’s MSR[EE] bit to avoid going into a wait state with interrupts masked. A multi-processor OS uses two methods to initiate work on one processor from another. In both cases the requesting processor places a unit of work structure on a queue, and then either signals the serving processor via an Inter-Processor Interrupt to service the work queue, or waits until the serving processor polls the work queue. The former method translates directly to the SPLPAR environment; the second method may experience significant performance degradation if the serving processor has ceded. To provide a solution to this performance problem, the SPLPAR option provides the H_PROD hcall(). The H_PROD hcall() takes as a parameter the virtual processor number of the serving processor. Waking a potentially ceded or ceding processor is subject to many race conditions.
The semantic of the H_PROD hcall() attempts to minimize these race conditions. First, the H_CEDE and H_PROD hcall()s serialize on each other per target virtual processor. Second, the H_PROD firmware sets a per virtual processor memory bit before attempting to determine if the target virtual processor is preempted. If the processor is not preempted, the H_PROD hcall() immediately returns; otherwise the processor is dispatched and the memory bit is reset. If the processor was dispatched and the virtual processor subsequently does an H_CEDE operation, the H_CEDE hcall() checks the virtual processor’s memory bit and, if set, resets the bit and returns immediately (not ceding the physical processor to another virtual processor). An OS might choose to always do an H_PROD after an enqueue to a polled queue, or it might qualify making the H_PROD hcall() with a status bit set by the target processor when it decides to cede its virtual processor. Locking in an SPLPAR environment presents a problem for multi-programming OSs, in that the virtual processor that is holding a lock may have been preempted. In that case, spinning while waiting for the lock simply wastes time: the lock holder is in no position to release the lock because it needs processor cycles and cannot get them for some period of time, while the spinner is using up processor cycles waiting for the lock. The condition is known as a live lock; however, it eventually resolves itself. The SPLPAR optimization to alleviate this problem is to have the waiting virtual processor “confer” its processor cycles to the lock holder’s virtual processor until the lock holder has had a chance to run another dispatch time slice.
As with the cede/prod pair of functions above, the confer function is subject to timing window races between the waiting processor determining that the lock holder has been preempted and execution of the H_CONFER hcall(), during which time the originally holding virtual processor may have been dispatched, released the lock, and ceded the processor. To manage this situation, the H_CONFER hcall() takes two parameters: the first specifies the virtual processor(s) that are to receive the cycles, and the second (valid only when a single processor is specified) represents the dispatch count of the holding virtual processor, atomically captured when the waiting processor decided to confer its cycles to the holding processor. The semantic of H_CONFER checks the processor parameter for validity; if it is the “all processors” code, processing proceeds per the description below. If the processor parameter refers to a valid virtual processor owned by the calling virtual processor’s partition, that is not dispatched, that has not conferred its cycles to all other processors, and whose current dispatch count matches that of the second parameter, the time remaining from the calling processor’s time slice is conferred to the specified virtual processor. If the first parameter of H_CONFER specifies the “all processors” code, then it marks the calling virtual processor to confer all its cycles until all of the partition’s virtual processors that have not ceded or conferred their cycles have had a chance to run a dispatch time slice. The “all processors” version may be viewed as having the hypervisor record the dispatch counts for all the other platform processors in the calling virtual processor’s hypervisor owned “confer structure”; then, prior to any subsequent dispatch of the calling processor, if the confer structure is not clear, the hypervisor does the equivalent of removing one entry from the confer structure and calling H_CONFER for the specific virtual processor.
If the specific virtual processor confer is rejected (because the virtual processor is running, ceded, conferred, or the dispatch count does not match), then the next entry is tried until the confer structure is clear before the originally calling virtual processor is re-dispatched. Virtual processors may migrate among the physical processor pool from one dispatch cycle to the next. OF device tree properties that relate the characteristics of the specific physical processor, such as location codes and other vital product data, cannot be consistent and are not reported in the nodes of type cpu if the partition is running in SPLPAR mode. Most processor characteristics properties, such as time base increment rate, are consistent for all processors in the system, physical and virtual, and so are still reported via their standard properties. Additionally, nodes of type L2 are not present in the tree, since the secondary caches are shared with other virtual processors, making optimizations based upon their characteristics impossible. The Processor Identification Register (PIR) should not be accessed by the OS, since the OS may get different readings from cycle to cycle. Instead, the virtual processor number (the number from the “ibm,ppc-interrupt-server#s” property, contained in the nodes of type cpu, associated with this virtual processor) is used as the processor number to be passed as a parameter to RTAS and hcall() routines for managing interrupts, etc. Software Note: When the client program (OS) first gets control during the boot sequence, the virtual processor number of the single processor that is operational is identified by the /chosen node of the device tree. The cpu nodes list the other virtual processors that the first processor may start. These are started one at a time, giving the virtual processor number as an input parameter to the call.
As each processor starts, it starts executing a program that picks up its virtual processor number from a memory structure initialized by the processor that called the start-cpu function. The newly started processor then records the location of its per processor memory structure (where it saves its virtual processor number) in one of the SPRG registers.
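The H_CEDE/H_PROD semantic described earlier in this section can be illustrated with a small model. The following Python sketch is not hypervisor code; the class and function names are hypothetical, and the model collapses "preempted" and "ceded" into a single `ceded` flag. It only shows how the per virtual processor memory bit prevents a prod from being lost when it races with a cede.

```python
class VirtualProcessor:
    """Toy model of the per virtual processor state used by cede/prod."""

    def __init__(self):
        self.prod_bit = False  # the "per virtual processor memory bit"
        self.ceded = False     # True while the processor is not running


def h_prod(target):
    """Model of H_PROD: set the bit first, then wake the target if needed."""
    target.prod_bit = True
    if target.ceded:
        # Target was not running: dispatch it and reset the bit.
        target.ceded = False
        target.prod_bit = False
    # Otherwise return immediately and leave the bit set for a later cede.


def h_cede(vp):
    """Model of H_CEDE: returns True only if the processor really ceded."""
    if vp.prod_bit:
        # A prod arrived before the cede: consume it and return at once.
        vp.prod_bit = False
        return False
    vp.ceded = True
    return True
```

In this model, a prod issued while the target is still running is remembered, so the target's next cede returns immediately and the newly queued work is not left waiting.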

Virtual Processor Areas

The per processor areas are registered with the H_REGISTER_VPA hcall(), which takes three parameters. The first parameter is a flags field that specifies the specific sub function to be performed. The second is the virtual processor number of one of the processors owned by the calling virtual processor’s partition for which the area is being registered. The third parameter is the logical address, within the calling virtual processor’s partition, of the contiguous logically addressed storage area to be registered. Registered areas are aligned on a cache line (L1) size boundary, may not span an LMB boundary, and for the CMO option may not span an entitlement granule boundary. The length of the area is provided to the hcall() starting at byte offset 4 of the area being registered. The H_REGISTER_VPA hcall() registers various types of areas and, after verifying the parameters, initializes the structure’s variables.

Per Virtual Processor Area: This area contains shared processor operating parameters as defined in . A shared processor LPAR aware OS registers this area early in its initialization. The other types of virtual processor areas can only be registered after the Per Virtual Processor Area has been successfully registered. The minimum length of the Per Virtual Processor Area is 640 bytes and the structure may not span a 4096 byte boundary.

Dispatch Trace Log Buffer: This area is optionally registered by OSs that desire to collect performance measurement data relative to their shared processor dispatching. The minimum size of this area is 48 bytes while the maximum is 4 KB. See for more details.

SLB Shadow Buffer: This area is optionally registered by OSs that support the SLB-shadow function set. The structure may not span a 4096 byte boundary.
This function set allows the hypervisor to significantly reduce the overhead associated with virtual processor dispatch in a shared processor LPAR environment, and to provide enhanced recovery from SLB hardware errors. See for more details. Software Note: Registering, deregistering, or changing the value of a variable in one of the Virtual Processor Areas for a different virtual processor (i.e. changing a value in the VPA of processor A from processor B) may be problematic. In no case is partition integrity compromised, but results may be imprecise if such a change is made during the virtual processor preempt/dispatch window. If the owning processor is started, registration or deregistration should only be done by the owning processor; if the processor is stopped, registration or deregistration can safely be done by other processors. Also, for example, changing the number of persistent SLB Shadow Buffer entries causes uncertainty in the number of currently valid SLB entries in that virtual processor. In some cases, such as turning dispatch tracing on and off, such uncertainty may be acceptable.
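The placement rules for a Per Virtual Processor Area (cache line alignment, 640 byte minimum length, no spanning of a 4096 byte boundary) can be checked before the area is registered. The sketch below is illustrative only: the function name is hypothetical, the 128 byte cache line default is an assumption (the real value comes from the platform), and the LMB and entitlement granule checks are omitted because their sizes are platform specific.

```python
VPA_MIN_LENGTH = 640   # minimum Per Virtual Processor Area length (from the text)
BOUNDARY = 4096        # the area may not span a 4096 byte boundary


def vpa_registration_problems(addr, length, cache_line=128):
    """Return a list of rule violations for a candidate VPA placement.

    cache_line=128 is an assumed L1 line size, not a value from this
    architecture; pass the platform's real value.
    """
    problems = []
    if addr % cache_line != 0:
        problems.append("not cache-line aligned")
    if length < VPA_MIN_LENGTH:
        problems.append("shorter than 640 bytes")
    # First and last byte must fall within the same 4096 byte block.
    if length > 0 and addr // BOUNDARY != (addr + length - 1) // BOUNDARY:
        problems.append("spans a 4096 byte boundary")
    return problems
```

An empty list means the candidate address and length satisfy the three rules checked here; it does not guarantee that the hcall() itself succeeds.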

Per Virtual Processor Area

Table: Per Virtual Processor Area
Byte Offset | Length in Bytes | Variable Description
0x00 | 4 | Descriptor: This field is supplied for OS identification use; it may be set to any value that may be useful (such as a pattern that may be identified in a dump) or it may be left uninitialized. Historic values include: 0xD397D781
0x04 | 2 | (unsigned) Size: The size of the registered structure (640)
0x6 - 0x17 | 18 | Reserved for Firmware Use
0x18 - 0x1B | 4 | Physical Processor FRU ID
0x1C - 0x1F | 4 | Physical Processor on FRU ID
0x20 - 0x57 | 56 | Reserved for Firmware Use
0x58 - 0x5F | 8 | Virtual processor home node associativity changes counters (changes in the 8 most important associativity levels)
0x60 - 0xAF | 80 | Reserved for Firmware Use
0xB0 | 1 | Cede Latency Specifier
0xB1 | 1 | Maintain EBB registers: =0 architected state of the event based branch facility may be discarded at any time, =1 architected state of the event based branch facility must be maintained, all other values are reserved
0xB2 | 6 | Reserved for LoPAR Expansion
0xB8 | 1 | Dispatch Trace Log Enable Mask: (Note this entry is valid only if a Dispatch Trace Log Buffer has been registered). A Trace Log Entry is created when the virtual processor is dispatched following its preemption for an enabled cause. =0 no dispatch trace logging; Bit 7 =1 Trace voluntary (OS initiated) virtual processor waits; Bit 6 =1 Trace time slice preempts; Bit 5 =1 Trace virtual partition memory page faults. All other values are reserved
0xB9 | 1 | Bits 0-6 Reserved; Bit 7 = 0 -- Dedicated processor cycle donation disabled; Bit 7 = 1 -- Dedicated processor cycle donation enabled.
0xBA | 1 | Maintain FPRs: =0 architected state of floating point registers may be discarded at any time, =1 architected state of floating point registers must be maintained, all other values are reserved. Note: When set in conjunction with offset 0xFF the 128 bit VSX space is saved on processors supporting the VSX option (2.06 and beyond).
0xBB | 1 | Maintain PMCs: =0 architected state of performance monitor counters may be discarded at any time, =1 architected state of performance monitor counters must be maintained, all other values are reserved
0xBC - 0xD7 | 28 | Reserved for Firmware Use
0xD8 - 0xDF | 8 | Any non-zero value is taken by the firmware to be the OS’s estimate, in PURR units, of the cumulative number of cycles that it has consumed on this virtual processor, while idle, since it was initialized.
0xE0 - 0xFB | 28 | Reserved for Firmware Use
0xFC | 2 | (unsigned) Maintain #SLBs: This number of Segment Lookaside Buffer Registers (up to the platform implementation maximum) are maintained; all others (up to the platform implementation maximum) may be discarded at any time. The value 0xFFFF maintains all SLBs
0xFE | 1 | Idle: =0 The OS is busy on this processor, =1 The OS is idle on this processor. All other values are reserved
0xFF | 1 | Maintain VMX state: =0 architected state of the processor’s VMX facility may be discarded at any time, =1 architected state of the processor’s VMX facility must be maintained. All other values are reserved. Note: When set in conjunction with offset 0xBA the 128 bit VSX space is saved on processors supporting the VSX option (2.06 and beyond).
0x100 | 4 | (unsigned) Virtual Processor Dispatch Counter: (Even when virtual processor is dispatched, odd when it is preempted/ceded/conferred)
0x104 | 4 | (unsigned) Virtual Processor Dispatch Dispersion Accumulator: Incremented on each virtual processor dispatch if the physical processor differs from that of the last dispatch.
0x108 | 8 | (unsigned) Virtual Processor Virtual Partition Memory Fault Counter: Incremented on each Virtual Partition Memory page fault.
0x110 | 8 | (unsigned) Virtual Processor Virtual Partition Memory Fault Time Accumulator: Time, in Time Base units, that the virtual processor has been blocked waiting for the resolution of Virtual Partition Memory page faults.
0x118 - 0x11F | 8 | Unsigned accumulation of PURR cycles expropriated by the hypervisor when VPA byte offset 0xFE = 1
0x120 - 0x127 | 8 | Unsigned accumulation of SPURR cycles expropriated by the hypervisor when VPA byte offset 0xFE = 1
0x128 - 0x12F | 8 | Unsigned accumulation of PURR cycles expropriated by the hypervisor when VPA byte offset 0xFE = 0
0x130 - 0x137 | 8 | Unsigned accumulation of SPURR cycles expropriated by the hypervisor when VPA byte offset 0xFE = 0
0x138 - 0x13F | 8 | Unsigned accumulation of PURR cycles donated to the processor pool when VPA byte offset 0xFE = 1
0x140 - 0x147 | 8 | Unsigned accumulation of SPURR cycles donated to the processor pool when VPA byte offset 0xFE = 1
0x148 - 0x14F | 8 | Unsigned accumulation of PURR cycles donated to the processor pool when VPA byte offset 0xFE = 0
0x150 - 0x157 | 8 | Unsigned accumulation of SPURR cycles donated to the processor pool when VPA byte offset 0xFE = 0
0x158 - 0x15F | 8 | Accumulated virtual processor wait interval 3 timebase cycles. (waiting for physical processor availability)
0x160 - 0x167 | 8 | Accumulated virtual processor wait interval 2 timebase cycles. (waiting for virtual processor capacity)
0x168 - 0x16F | 8 | Accumulated virtual processor wait interval 1 timebase cycles. (waiting for virtual processor ready to run)
0x170 - 0x177 | 8 | Reserved for Firmware Use
0x178 - 0x17F | 8 | Reserved for Firmware Use
0x180 - 0x183 | 4 | For the CMO option: The OS may report in this field, as a hint to the hypervisor, the accumulated number, since the virtual processor was started, of ‘page in’ operations initiated for pages that were previously swapped out.
0x184 - 0x187 | 4 | Reserved for Firmware Use
0x188 - 0x18F | 8 | Reserved for Firmware Use
0x190 - 0x197 | 8 | Reserved for Firmware Use
0x198 - 0x217 | 128 | Reserved for Firmware Use
0x218 - 0x21F | 8 | Dispatch Trace Log buffer index counter.
0x220 - 0x27F | 96 | Reserved for Firmware Use
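As an illustration of the layout above, the following Python sketch decodes a few fields from a byte snapshot of a Per Virtual Processor Area. The offsets come from the table; the function names, the returned dictionary keys, and the big-endian byte order are assumptions made for this example.

```python
import struct

# Byte offsets from the Per Virtual Processor Area table.
OFF_SIZE = 0x04             # 2 bytes, unsigned
OFF_IDLE = 0xFE             # 1 byte
OFF_DISPATCH_COUNT = 0x100  # 4 bytes, unsigned
OFF_WAIT_INTERVALS = 0x158  # three 8 byte counters: interval 3, 2, 1


def read_vpa(vpa):
    """Decode selected VPA fields (big-endian byte order is assumed)."""
    (size,) = struct.unpack_from(">H", vpa, OFF_SIZE)
    (idle,) = struct.unpack_from(">B", vpa, OFF_IDLE)
    (count,) = struct.unpack_from(">I", vpa, OFF_DISPATCH_COUNT)
    wait3, wait2, wait1 = struct.unpack_from(">QQQ", vpa, OFF_WAIT_INTERVALS)
    return {
        "size": size,
        "idle": idle,
        "dispatch_count": count,
        "wait_interval_1": wait1,  # waiting to become ready to run
        "wait_interval_2": wait2,  # waiting for virtual processor capacity
        "wait_interval_3": wait3,  # waiting for a physical processor
    }


def is_dispatched(dispatch_count):
    """The dispatch counter is even while the virtual processor is dispatched."""
    return dispatch_count % 2 == 0
```

A lock waiter could, for example, read the holder's dispatch counter, call `is_dispatched()`, and only consider conferring its cycles when the result is false, passing the captured counter value along with the request.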

R1--1. For the SPLPAR option: If the OS registers a Per Virtual Processor Area, it must correspond to the format specified in .

Dispatch Trace Log Buffer

The optional virtual processor dispatch trace log buffer is a circularly managed buffer containing individual 48 byte entries, with the first entry starting at byte offset 0. Therefore, the 4 byte registration size field is overwritten by the first Trace Log Buffer entry. (Note that the hypervisor rounds down the dispatch trace log buffer length to a multiple of 48 bytes and wraps when reaching that boundary.) A VPA location (byte offset 0x218) contains the index counter that the hypervisor increments each time that it makes a dispatch trace log entry, such that it always indicates the next entry to be filled. The low order bits (modulo the buffer length divided by 48) of the counter provide the index of the next entry to be filled; therefore, the buffer wraps every (buffer length divided by 48) entries, while the high order counter bits indicate how many buffer wraps have occurred. Prior to enabling dispatch trace logging, the OS should initialize the VPA index counter to the value of 0. The format of dispatch trace log buffer entries is given in . The architectural intent is that OS trace tools keep a shadow index counter into the log buffer of the next entry to be filled by the hypervisor. Prior to making an entry of their own, such tools compare their index counters with that of the hypervisor from the VPA; if they are equal, no preempts/dispatches have occurred since the last OS trace hook. If the two index counters are not equal, then the OS trace tool processes the intermediate time stamps into the OS’s trace log, updating its dispatch trace log buffer index until all have been processed; then the new trace entry is added to the OS’s trace log. Note that, because of races, the processor may be preempted just prior to the OS trace tool adding the new trace log entry. To handle this case, the OS trace tool can examine the dispatch trace log buffer index immediately after adding the new trace log entry and, if needed, adjust its own trace log.
In the extremely unlikely event that the two counters are off by trace buffer length divided by forty eight or more counts, the OS trace tool can detect that a dispatch trace log buffer overflow has occurred, and trace data has been lost.

Table: Dispatch Trace Log Buffer Entry
Byte Offset | Length in Bytes | Variable Description
0x0 | 1 | Reason Code for the virtual processor dispatch:
  0: The virtual processor was dispatched at the external interrupt vector location to handle an IOA interrupt, Virtual interrupt, or interprocessor interrupt.
  1: The virtual processor was dispatched to handle firmware internal events.
  2: The virtual processor was dispatched at the next sequential instruction due to an H_PROD call by another partition processor.
  3: The virtual processor was dispatched at the DECR interrupt vector due to a decrementer interrupt.
  4: The processor was dispatched at location specified in load module (boot) or at the system reset interrupt vector. (virtual yellow button).
  5: The virtual processor was dispatched to handle firmware internal events.
  6: The virtual processor was dispatched at the next sequential instruction to use cycles conferred from another partition processor.
  7: The virtual processor was dispatched at the next sequential instruction for its entitled time slice.
  8: The virtual processor was dispatched at the faulting instruction following a virtual partition memory page fault.
  10: The virtual processor was dispatched at the privileged doorbell interrupt vector location to handle a privileged doorbell interrupt.
0x1 | 1 | Reason Code for virtual processor preemption:
  0: Not used (for compatibility with earlier versions of the facility)
  1: Firmware internal event
  2: Virtual processor called H_CEDE
  3: Virtual processor called H_CONFER
  4: Virtual processor reached the end of its timeslice (HDEC)
  5: Partition Migration/Hibernation page fault
  6: Virtual memory page fault
0x2 - 0x3 | 2 | Processor index of the physical processor actualizing the thread on this dispatch.
0x4 - 0x7 | 4 | Time Base Delta between enqueued to dispatcher and actual dispatch on a physical processor
0x8 - 0xB | 4 | Time Base Delta between ready to run and enqueue to dispatcher
0xC - 0xF | 4 | Time Base Delta between waiting and ready to run (preempt/fault resolution time)
0x10 - 0x17 | 8 | Time Base Value at the time of dispatch/wait
0x18 - 0x1F | 8 | For virtual processor preemption reason codes 5 & 6: Logical real address of faulting page; else reserved.
0x20 - 0x27 | 8 | SRR0: At the time of preempt/wait
0x28 - 0x2F | 8 | SRR1: At the time of preempt/wait
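The entry layout and the shadow index scheme described in this section can be sketched as follows. This Python example is illustrative only: the field names and helper functions are hypothetical, big-endian byte order is assumed, and wraparound of the 64 bit index counter itself is ignored.

```python
import struct

ENTRY_SIZE = 48
# 1+1+2 bytes, three 4 byte deltas, four 8 byte values = 48 bytes.
ENTRY_FORMAT = ">BBHIIIQQQQ"
FIELDS = (
    "dispatch_reason", "preempt_reason", "physical_processor_index",
    "tb_enqueue_to_dispatch", "tb_ready_to_enqueue", "tb_wait_to_ready",
    "timebase", "fault_address", "srr0", "srr1",
)


def parse_entry(buf, slot):
    """Decode the 48 byte trace log entry stored in the given slot."""
    values = struct.unpack_from(ENTRY_FORMAT, buf, slot * ENTRY_SIZE)
    return dict(zip(FIELDS, values))


def pending_slots(shadow_index, hv_index, buffer_length):
    """Slots written by the hypervisor since the OS last looked.

    Returns None when the counters differ by (buffer length / 48) or
    more, which the text treats as a detected overflow.
    """
    entries = buffer_length // ENTRY_SIZE  # hypervisor rounds the length down
    if hv_index - shadow_index >= entries:
        return None
    return [i % entries for i in range(shadow_index, hv_index)]
```

An OS trace tool would call `pending_slots()` with its shadow counter and the counter read from the VPA, process each returned slot with `parse_entry()`, and then advance its shadow counter to the hypervisor's value.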

R1--1. For the SPLPAR option: If the OS registers a Dispatch Trace Log Buffer, it must correspond to the format specified in .

SLB Shadow Buffer

On platforms supporting the SLB-Buffer function set, the OS may optionally register an SLB shadow buffer area. When the OS takes this option, it allows the hypervisor to optimize the saving of SLB entries, thus reducing overhead and providing more processor capacity for the OS, and also allows the platform to recover from certain SLB hardware faults. When the OS registers an SLB shadow buffer for its virtual processor, the processor’s SLB is architecturally divided into three categories relative to their durability as depicted in .

Figure: Processor SLB relationship to the OS registered VPA and SLB Shadow Buffer. (Figure note: OS may dynamically change M and N, for (N+1)*16 <= Length of SLB Shadow Buffer.)

Each category of SLB entries consists of 0-n contiguous SLBs.

Persistent Entries: The first N SLBs (starting at SLB index 0, with N specified by the numeric content of the first 4 bytes of the registered SLB Shadow Buffer) are maintained persistent across all virtual processor dispatches unless an unrecovered SLB error is noted. The OS maintains a shadow copy of those SLB entries in the registered SLB shadow buffer. The OS sizes its SLB Shadow buffer for the largest number of persistent entries it can ever maintain. If the OS registers an SLB Shadow buffer, the hypervisor does not save the contents of the Persistent entries on virtual processor preempt, cede, or confer. The OS should minimally record as persistent the entries it needs to handle its SLB fault interrupts to fill in required Volatile (and potentially Transient) entries.

Volatile Entries: The next M-N SLBs (beginning at the next higher SLB index after the last Persistent entry up through the entry specified by the “maintain#SLBs” parameter of the VPA) may disappear. The OS needs to be prepared to recover these entries via SLB fault interrupts. For performance optimization, the hypervisor normally maintains the state of these entries across H_DECR interrupts and most hcall()s; they may be lost on H_CEDE calls.

Transient Entries: The platform makes no attempt to maintain the state of these entries and they may be lost at any time.

The OS may dynamically change the number of Persistent entries by atomically changing the value of the 4 byte parameter at SLB Shadow Buffer offset 0.
The hypervisor does not explicitly check the value of this parameter, however, the hypervisor limits the number of SLBs that it attempts to load from the shadow buffer to the lesser of the maximum number of SLB entries implemented by the platform, or the maximum number of entries containable in the SLB Shadow buffer length when it was registered. R1--1. For the SPLPAR option: If the OS registers an SLB Shadow Buffer, it must correspond to the format specified in .
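The three durability categories and the hypervisor's load limit can be sketched as below. This Python example rests on assumptions: the function names are hypothetical, and the 16 byte entry size with a 16 byte header is inferred from the "(N+1)*16 <= Length of SLB Shadow Buffer" note rather than stated as a layout in this section.

```python
SLB_ENTRY_BYTES = 16  # inferred from the (N+1)*16 <= Length relation


def persistent_entries_loaded(requested_n, registered_length, platform_slbs):
    """Number of entries the hypervisor would attempt to load.

    Limited to the lesser of the platform's SLB count and the number of
    entries that fit in the buffer length given at registration time.
    """
    containable = max(registered_length // SLB_ENTRY_BYTES - 1, 0)
    return min(requested_n, platform_slbs, containable)


def slb_category(index, n_persistent, m_maintained):
    """Classify an SLB index as persistent, volatile, or transient."""
    if index < n_persistent:
        return "persistent"
    if index < m_maintained:
        return "volatile"
    return "transient"
```

Here `n_persistent` corresponds to N (the 4 byte value at SLB Shadow Buffer offset 0) and `m_maintained` to M (the "Maintain #SLBs" value in the VPA).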

Shared Processor LPAR OF Extensions

Shared Processor LPAR Function Sets in “ibm,hypertas-functions”

hcall-splpar
hcall-pic
SLB-Buffer

Device Tree Variances

If an SPLPAR implementation does not maintain a fixed relationship between the virtual processor that it reports to the OS image in the OF device tree properties and the physical processor that it uses to actualize the virtual processor, then OF entities that imply a fixed physical relationship are not reported. These may include those listed in .

Table: OF Variances due to SPLPAR
Entity | Variance to standard definition
“ibm,loc-code” property | If the physical relationship between virtual processors and physical processors is not constant, this property is omitted from the virtual processor’s node. If missing, the OS should not run diagnostics on the virtual processor.
“l2-cache” property | If the physical relationship between virtual processors and physical processors is not constant, the secondary cache characteristics are not relevant and this property is omitted from the virtual processor’s node.
Nodes named l2-cache | If the physical relationship between virtual processors and physical processors is not constant, the secondary cache characteristics are not relevant and this node is omitted from the partition’s device tree.
“ibm,associativity” property | If the physical relationship between virtual processors and physical processors is not constant, the “ibm,associativity” property reflects the same domain for all virtual processors actualized by a given physical processor pool. Note that even though the associativity of virtual processors may be indistinguishable, the associativity among other platform resources may be relevant.

R1--1. For the SPLPAR option: If the platform does not maintain a fixed relationship between its virtual processors and the physical processors that actualize them, then the platform must vary the device tree elements as outlined in .

Shared Processor LPAR Hypervisor Extensions

Virtual Processor Preempt/Dispatch A new virtual processor is dispatched on a physical processor when one of the following conditions happens: The physical processor is idle and a virtual processor was made ready to run (interrupt or prod) If the subfunction is a Register VPA or a Deregister VPA or SLB shadow buffer, verify that the proc-no parameter references an offline virtual proc or that the proc-no parameter matches the current virtual processor making the hcall, else return H_STATE The old virtual processor exhausted its time slice (HDECR interrupt). The old virtual processor ceded/conferred its cycles. When one of the above conditions occurs, the hypervisor, by default, records all the virtual processor architected state including the Time Base and Decrementer values and sets the hypervisor timer services to wake the virtual processor per the setting of the decrementer. The virtual processor’s Processor Utilization Register value for this dispatch is computed. The VPA’s dispatch count is incremented (such that the result is odd). Then the hypervisor selects a new virtual processor to dispatch on the physical processor using an implementation dependent algorithm having the following characteristics given in priority order: The virtual processor is “ready to run” (has not ceded/conferred its cycles or exhausted its time slice). Ready to run virtual processors are dispatched prior to waiting in excess of their maximum specified latency. Of the non-latency critical virtual processors ready to run, select the virtual processor that is most likely to have its working set in the physical processor’s cache or for other reasons runs most efficiently on the physical processor. If no virtual processor is “ready to run” at this time, start accumulating the Pool Idle Count (PIC) of the total number of idle processor cycles in the physical processor pool. 
Optionally, flags in the VPA may be set by the OS to indicate to the hypervisor that selected architected state of the virtual processor need not be maintained (that is, the contents of these architected facilities may be lost at any time without notice). The hypervisor may then optimize its preempt/dispatch routines accordingly. Refer to and the SLB Shadow Buffer description for the definition of these flags and values. The hypervisor modifies any such OS settable and readable processor state that is not explicitly saved and restored on a virtual processor dispatch so as to prevent a covert channel between partitions. When the virtual processor is dispatched, the virtual processor’s “prod” bit is reset and the saved architected state of the virtual processor is restored from that saved when the virtual processor was preempted, ceded, or conferred, except for the time base, which retains the current value of the physical processor, and the decrementer, which is reduced from the state saved value by the current Time Base value minus the saved Time Base value. The hypervisor sets up for computing the PUR value increment for the dispatch. At this time, the hypervisor increments the virtual processor’s VPA dispatch count (such that the value is even). The hypervisor checks the VPA’s dispatch log flag; if it is set, the hypervisor creates a pair of log entries in the dispatch log and stores the circular buffer index in the first buffer entry. If the virtual processor was signaled with an interrupt condition and the physical interrupt has been reset, then the hypervisor adjusts the virtual processor architected state to reflect that of a physical processor taking the same interrupt prior to executing the next sequential instruction, and execution starts with the first instruction in the appropriate interrupt vector.
If no interrupt has been signaled to the virtual processor or the physical interrupt is still active, then execution starts at the next sequential instruction following the instruction as noted by the hypervisor when the virtual processor ceded, conferred, or was preempted. The Platform allocates processor capacity to a partition’s virtual processors using the architectural metaphor of a “dispatch wheel” with a fixed implementation dependent rotation period. Each virtual processor receives a time slice each rotation of the dispatch wheel. The length of the time slice is determined by a number of parameters. The OS image has direct control, within constraints, over three of these parameters (number of virtual processors, Entitled Processor Capacity Percentage, Variable Processor Capacity Weight). The constraints are determined by partition and partition aggregate configurations that are outside the scope of this architecture. For reference, partition definitions provide the initial settings of these parameters while the aggregation configurations provide the constraints (including the degenerate case where an aggregation encapsulates only a single member LPAR). Entitled Processor Capacity Percentage: The percentage of a physical processor that the hypervisor guarantees to be available to the partition’s virtual processors (distributed in a uniform manner among the partition’s virtual processors -- thus the number of virtual processors affects the time slice size) each dispatch cycle. Capacity ceded or conferred from one partition virtual processor extends the time slices offered to other partition processors. Capacity ceded or conferred after all of the partition’s virtual processors have been dispatched is added to the variable capacity kitty. The initial, minimum, and maximum constraint values of this parameter are determined by the partition configuration definition.
The H_SET_PPP hcall() allows the OS image to set this parameter within the constraints imposed by the partition configuration definition minimum and maximums plus constraints imposed by partition aggregation. Variable Processor Capacity Weight: The unitless factor that the hypervisor uses to assign processor capacity in addition to the Entitled Processor Capacity Percentage. This factor may take the values 0 to 255. A virtual processor’s time slice may be extended to allow it to use capacity unused by other partitions, or not needed to meet the Entitled Processor Capacity Percentage of the active partitions. A partition is offered a portion of this variable capacity kitty equal to: (Variable Processor Capacity Weight for the partition) / (summation of Variable Processor Capacity Weights for all competing partitions). The initial value of this parameter is determined by the partition configuration definition. The H_SET_PPP hcall() allows the OS image to set this parameter within the constraints imposed by the partition configuration definition maximum. Certain partition definitions may not allow any variable processor capacity allocation. Unallocated Processor Capacity Percentage: The amount of processor capacity that is currently available within the constraints of the LPAR's current environment for allocation to Entitled Processor Capacity Percentage. Race conditions may change the current environment before a request for this capacity can be performed, resulting in a constrained return from such a request. Unallocated Variable Processor Capacity Weight: The amount of variable processor capacity weight that is currently available within the constraints of the LPAR's current environment for allocation to the partition's variable processor capacity weight. Race conditions may change the current environment before a request for this capacity can be performed, resulting in a constrained return from such a request. 
System Parameters readable via the ibm,get-system-parameter RTAS call (see ) communicate a variety of configuration and constraint parameters, some of which are determined by the partition definition. By means that are beyond the scope of this architecture, various partitions may be organized into aggregations, for example “LPAR groups”, for the purposes of load balancing. These aggregations may impose constraints such as: “The summation of the minimum available capacity for all virtual processors supported by the LPAR group cannot exceed 100% of the group’s configured capacity”.

R1--1. For the SPLPAR option: The platform must dispatch each of the partition's virtual processors each dispatch cycle unless prevented by the semantics of the H_CONFER hcall().

R1--2. For the SPLPAR option: The summation of the processing capacity that the platform dispatches to the virtual processors of each partition must be at least equal to that partition's Entitled Processor Capacity Percentage unless prevented by the semantics of the H_CONFER and H_CEDE hcall()s.

R1--3. For the SPLPAR option: The processing capacity that the platform dispatches to each of the partition's virtual processors must be substantially equal unless prevented by the semantics of the H_CONFER and H_CEDE hcall()s.

R1--4. For the SPLPAR option: The platform must distribute processor capacity allocated to SPLPAR virtual processor actualization not consumed due to Requirements , , and to partitions in strict accordance with the definition of Variable Processor Capacity Weight unless prevented by the LPAR's definition (capped) or the semantics of the H_CONFER and H_CEDE hcall()s.

Note: A value of 0 for a Variable Processor Capacity Weight effectively caps the partition at its Entitled Processor Capacity Percentage value.

R1--5. For the SPLPAR option on platforms: The platform must increment the counters in VPA offsets 0x158-0x16F per their definitions in .

R1--6.
For the SPLPAR option on platforms: To maintain compatibility across partition migration and firmware version levels, the OS must be prepared for platform implementations that do not increment VPA offsets 0x158-0x16F.
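The variable capacity formula above (a partition's weight divided by the sum of the weights of all competing partitions) can be worked through numerically. The following is a minimal sketch only; the function name, the partition names, and the choice to express the kitty as a plain number are assumptions made for illustration and are not defined by this architecture.

```python
def variable_capacity_shares(kitty, weights):
    """Split the variable capacity kitty in proportion to each weight.

    kitty   -- unused capacity to distribute (any unit; illustrative)
    weights -- mapping of partition name to Variable Processor Capacity
               Weight (0..255 per the text above)
    """
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name} must be in 0..255")
    total = sum(weights.values())
    if total == 0:
        # Every competing partition is effectively capped (weight 0).
        return {name: 0.0 for name in weights}
    return {name: kitty * weight / total for name, weight in weights.items()}


# Hypothetical partitions competing for 60 units of unused capacity.
shares = variable_capacity_shares(60.0, {"lpar_a": 128, "lpar_b": 64, "lpar_c": 0})
print(shares)
```

In this example lpar_c has a weight of 0 and so receives none of the kitty, which matches the note that a weight of 0 effectively caps the partition at its entitlement.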

H_REGISTER_VPA

Register Virtual Processor Areas (these include the parameter area known as the VPA, the Dispatch Trace Log Buffer, and, if the SLB-Buffer function set is supported, the SLB Shadow Buffer). Note: if the caller makes multiple registration requests for a given per virtual processor area for a given virtual processor, the last registration wins. If the same memory area is registered for multiple processors, the area contents are unpredictable; however, LPAR isolation is not compromised. The syntax of the H_REGISTER_VPA hcall() is given below.

Syntax:

Semantics:

Verify that the flags parameter is a supported value else return H_Parameter. (That is, the subfunction field (Bits 16-23) is one of the values supported by this call, and optionally that all other bits are zero. Callers should not set any bits other than those specifically defined; however, implementations are not required to check the value of bits outside of the subfunction field.)
Verify that the proc-no parameter references a virtual processor owned by the calling virtual processor’s partition else return H_Parameter.
If the subfunction is a register, verify that the addr parameter is an L1 cache line aligned logical address within the memory owned by the calling virtual processor’s partition else return H_Parameter.
If the Shared Logical Resource option is implemented and the addr parameter represents a shared logical resource location that has been rescinded by the owner, return H_RESCINDED.
Case on subfunction in flags parameter:

Register VPA:
Verify that the size field (2 bytes) at offset 0x4 is at least 640 bytes else return H_Parameter.
Verify that the entire structure (per the size field and vpa) does not span a 4096 byte boundary else return H_Parameter.
Record the specified processor’s vpa logical address for access by other SPLPAR hypervisor functions.
Initialize the contents of the area per .
Return H_Success.

Register Dispatch Trace Log Buffer:
Verify that the size field (4 bytes) at offset 0x4 is at least 48 bytes else return H_Parameter.
For the CMO option, verify that the entire structure (per the size field and vpa parameter) does not span a memory entitlement granule boundary else return H_MLENGTH_PARM.
Verify that a VPA has been registered for the specified virtual processor else return H_RESOURCE.
Initialize the specified processor’s preempt/dispatch trace log buffer pointers and index.
Return H_Success.

Register SLB Shadow Buffer (if SLB-Buffer function set is supported):
Verify that the size field (4 bytes) at offset 0x4 is at least 8 bytes and that the entire structure (per the size and vpa parameters) does not span a 4096 byte boundary else return H_Parameter.
Verify that a VPA has been registered for the specified virtual processor else return H_RESOURCE.
Initialize the specified processor’s SLB Shadow buffer pointers and set the maximum persistent SLB restore index to the lesser of the maximum number of processor SLBs or the maximum number of entries in the registered SLB Shadow buffer.
Return H_Success.

Deregister VPA:
Verify that a Dispatch Trace Log buffer is not registered for the specified processor else return H_RESOURCE.
Verify that an SLB Shadow buffer is not registered for the specified processor else return H_RESOURCE.
Clear any partition memory pointer to the specified processor’s VPA (note no check is made that a valid VPA registration exists).
Return H_Success.

Deregister Dispatch Trace Log Buffer:
Clear any partition memory and/or hypervisor pointer to the specified processor’s Dispatch Trace Buffer (note no check is made that a valid Dispatch Trace Buffer registration exists).
Return H_Success.

Deregister SLB Shadow Buffer (if SLB-Buffer function set is supported):
Clear any hypervisor pointer(s) to the specified processor’s SLB Shadow buffer (note no check is made that a valid SLB Shadow buffer registration exists).
Zero the hypervisor’s maximum persistent SLB restore index for the specified processor.
Return H_Success.

Else:
Return H_Function.

R1--1. For the SPLPAR option: The platform must implement the H_REGISTER_VPA hcall() following the syntax and semantics of .

R1--2. For the SPLPAR plus SLB Shadow Buffer options: The platform must register, and deregister the optional SLB Shadow buffer per the syntax and semantics of .

R1--3. For the SPLPAR plus SLB Shadow Buffer options: The platform must make persistent the SLB entries recorded by the OS within the SLB Shadow buffer as described in .
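The address and size checks for the Register VPA subfunction can be sketched as follows. This is an illustrative model only, not hypervisor code: the return codes are represented as symbolic strings because their numeric values are not given in this section, the 128 byte cache line size is an assumed example value, and the partition ownership check on addr is omitted.

```python
H_SUCCESS = "H_Success"      # symbolic stand-ins; numeric values not shown here
H_PARAMETER = "H_Parameter"

ASSUMED_CACHE_LINE = 128     # assumption: L1 cache line size in bytes
BOUNDARY = 4096              # structure must not span a 4096 byte boundary
MIN_VPA_SIZE = 640           # minimum value of the size field at offset 0x4


def check_register_vpa(addr, size, cache_line=ASSUMED_CACHE_LINE):
    """Model the Register VPA alignment, size, and boundary checks."""
    if addr % cache_line != 0:
        return H_PARAMETER               # not L1 cache line aligned
    if size < MIN_VPA_SIZE:
        return H_PARAMETER               # size field too small
    first_block = addr // BOUNDARY
    last_block = (addr + size - 1) // BOUNDARY
    if first_block != last_block:
        return H_PARAMETER               # structure spans a 4096 byte boundary
    return H_SUCCESS


print(check_register_vpa(0x1000, 640))   # aligned, large enough, one block
print(check_register_vpa(0x1F00, 640))   # aligned, but crosses into next block
```

An OS that allocates each VPA at a 4096 byte aligned address with a size of at most 4096 bytes satisfies both the alignment and the boundary check under these assumptions.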

H_CEDE

The architectural intent of this hcall() is to have the virtual processor, which has no useful work to do, enter a wait state, ceding its processor capacity to other virtual processors until some useful work appears, signaled either through an interrupt or a prod hcall(). To help the caller reduce race conditions, this call may be made with interrupts disabled, but the semantics of the hcall() enable the virtual processor’s interrupts so that it may always receive wake up interrupt signals. As a hint to the hypervisor, the cede latency specifier indicates how long the OS can tolerate the latency to an H_PROD hcall() or interrupt; this may affect how the hypervisor chooses to use or even power down the actualizing physical processor in the meantime.

Software Note: The floating point registers may not be preserved by this call if the “Maintain FPRs” field of the VPA =0, see .

Syntax:

Semantics:

Enable the virtual processor’s MSR[EE] bit as if it was on at the time of the call.
Serialize for the virtual processor’s control structure with H_PROD.
If the virtual processor’s “prod” bit is set, then:
Reset the virtual processor’s “prod” bit.
Release the virtual processor’s control structure.
Return H_Success.
Record all the virtual processor architected state including the Time Base and Decrementer values.
Set hypervisor timer services to wake the virtual processor per the setting of the decrementer.
Mark the virtual processor as non-dispatchable until the processor is the target of an interrupt (system reset, external including decrementer or IPI) or PROD.
Cede the time remaining in the virtual processor’s time slice preferentially to the virtual processor’s partition.
Release the virtual processor’s control structure.
Dispatch some other virtual processor.
Return H_Success.

R1--1. For the SPLPAR option: The platform must implement the H_CEDE hcall() following the syntax and semantics of .

H_CONFER

The architectural intent of this hcall() is to confer the caller's processor capacity to the holder of a lock or the initiator of an event that the caller is waiting upon. If the caller knows the identity of the lock holder, then the holder’s virtual processor number is supplied as a parameter; if the caller does not know the identity of the lock holder, then the “all processors” value of the proc parameter is specified. If the caller is conferring to the initiator of an event, the proc parameter value is that of the calling processor. This call may be made with interrupts enabled or disabled. This call provides a reduced “kill set” of volatile registers: GPRs r0 and r4-r13 are preserved.

Software Note: The floating point registers may not be preserved by this call if the “Maintain FPRs” field of the VPA =0, see .

Syntax:

Semantics:

Validate the proc number else return H_Parameter. Valid Values:
-1 (all partition processors)
0 through N: one of the processor numbers of the calling processor's partition
The calling processor's number forces a confer until the calling processor is PRODed
If the proc number is for a single processor and the single processor is not the calling processor, then:
If the dispatch parameter is not equal to the specified processor’s hypervisor copy of the dispatch number or the hypervisor copy of the dispatch number is even, then return H_Success.
If the target processor has conferred its cycles to all others, then return H_Success.
Firmware Implementation Note: If one were to confer to a processor that had conferred to all, then a deadlock could occur; however, there are valid cases with nested locks where this could happen, therefore the hypervisor call silently ignores the confer.
Record all the virtual processor architected state including the Time Base and Decrementer values.
If the MSR[EE] bit is on, set hypervisor timer services to wake the virtual processor per the setting of the decrementer.
Mark the virtual processor as non-dispatchable until one of the following:
System reset interrupt.
The MSR[EE] bit is on and the virtual processor is the target of an external interrupt (including decrementer or IPI).
The virtual processor is the target of a PROD operation.
The specified target processor (or all partition processors if the proc parameter value is a minus 1) has had the opportunity of a dispatch cycle.
Confer the time remaining in the virtual processor’s time slice to the virtual processor’s partition.
Dispatch the/a partition target virtual processor.
Return H_Success.

R1--1. For the SPLPAR option: The platform must implement the H_CONFER hcall() following the syntax and semantics of .

R1--2. For the SPLPAR option: The platform must implement the H_CONFER hcall() such that the only GPR that is modified by the call is r3.
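The early-return conditions for a confer aimed at a single target processor can be summarized in a small predicate. This is a sketch of the decision logic only; the function and argument names are hypothetical, and the meaning attached to an odd or even dispatch number is not stated in this section, so the code simply encodes the test as written.

```python
def confer_is_ignored(dispatch_param, hv_dispatch_number, target_conferred_to_all):
    """Return True when H_CONFER to a single target returns H_Success
    immediately without conferring (illustrative model)."""
    if dispatch_param != hv_dispatch_number:
        return True      # caller's view of the target's dispatch number is stale
    if hv_dispatch_number % 2 == 0:
        return True      # hypervisor copy of the dispatch number is even
    if target_conferred_to_all:
        return True      # avoids the deadlock described in the firmware note
    return False


print(confer_is_ignored(7, 7, False))   # confer proceeds
print(confer_is_ignored(7, 8, False))   # stale dispatch number, ignored
```

The caller is expected to sample the target's dispatch number before issuing the call, so a mismatch indicates that the target's state changed in the interim and the confer is no longer useful.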

H_PROD Awakens the specific processor. This call provides a reduced “kill set” of volatile registers, GPRs r0 and r4-r13 are preserved. Syntax: Semantics: Verify that the target virtual processor specified by the proc parameter is owned by the calling virtual processor’s partition. Serialize for the Target Virtual Processor’s control structure with H_CEDE. Set “prod” bit in the target virtual processor’s control structure. If the target virtual processor is not ready to run, mark the target virtual processor ready to run. Release the target virtual processor’s control structure. Return H_Success. R1--1. For the SPLPAR option: The platform must implement the H_PROD hcall() following the syntax and semantics of . R1--2. For the SPLPAR option: The platform must implement the H_PROD hcall() such that the only GPR that is modified by the call is r3.
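The way the “prod” bit prevents a lost wake-up between H_PROD and H_CEDE can be illustrated with a toy model. The sketch below is not hypervisor code: the class, its fields, and the string results are hypothetical, and a Python lock stands in for the serialization on the virtual processor's control structure.

```python
import threading


class VirtualProcessor:
    """Toy stand-in for a virtual processor control structure."""

    def __init__(self):
        self.lock = threading.Lock()   # models the H_CEDE/H_PROD serialization
        self.prod = False              # the "prod" bit
        self.dispatchable = True       # ready-to-run state


def h_prod(vp):
    with vp.lock:
        vp.prod = True                 # set the "prod" bit
        if not vp.dispatchable:
            vp.dispatchable = True     # mark the target ready to run
    return "H_Success"


def h_cede(vp):
    with vp.lock:
        if vp.prod:
            vp.prod = False            # consume the pending prod
            return "returned immediately"
        vp.dispatchable = False        # wait for an interrupt or PROD
        return "ceded"


vp = VirtualProcessor()
h_prod(vp)              # prod arrives before the target cedes
print(h_cede(vp))       # the pending prod makes the cede return at once
print(h_cede(vp))       # no pending prod, so the processor cedes
```

Because a prod that arrives before the cede is remembered in the bit, the OS can check for work, find none, and then cede without a window in which a wake-up is missed.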

H_GET_PPP

This hcall() returns the partition’s performance parameters. The parameters are packed into registers:

Register R4 contains the Entitled Processor Capacity Percentage for the partition. In the case of a dedicated processor partition this value is 100 * the number of processors owned by the partition.

Register R5 contains the Unallocated Processor Capacity Percentage for the calling partition’s aggregation.

Register R6 contains the aggregation numbers of up to 4 levels of aggregations of which the partition may be a member:
Bytes 0-1: Reserved for future aggregation definition, and set to zero -- in the future this field may be given meaning.
Bytes 2-3: Reserved for future aggregation definition, and set to zero -- in the future this field may be given meaning.
Bytes 4-5: 16 bit binary representation of the “Group Number”.
Bytes 6-7: 16 bit binary representation of the “Pool Number”. In the case of a dedicated processor partition the “Pool Number” is not applicable, which is represented by the code 0xFFFF.

Register R7 contains the platform resource capacities:
Byte 0: Reserved for future platform resource capacity definition, set to zero -- in the future this field may be given meaning.
Byte 1: A bit field representing the capping mode of the partition’s virtual processor(s):
Bits 0-6: Reserved, and set to zero -- in the future these bits may be given meaning as new capping modes are defined.
Bit 7: The partition’s virtual processor(s) are capped at their Entitled Processor Capacity Percentage. In the case of dedicated processors this bit is set.
Byte 2: Variable Processor Capacity Weight. In the case of a dedicated processor partition this value is 0x00.
Byte 3: Unallocated Variable Processor Capacity Weight for the calling partition’s aggregation.
Bytes 4-5: 16 bit unsigned binary representation of the number of processors active in the caller’s Processor Pool. In the case of a dedicated processor partition this value is 0x00.
Bytes 6-7: 16 bit binary representation of the number of processors active on the platform.

When the value of the “ibm,partition-performance-parameters-level” (see ) is >=1, register R8 contains the processor virtualization resource allocations. In the case of a dedicated processor partition R8 contains 0:
Bytes 0-1: 16 bit unsigned binary representation of the number of physical platform processors allocated to processor virtualization.
Bytes 2-4: 24 bit unsigned binary representation of the maximum processor capacity percentage that is available to the partition's pool.
Bytes 5-7: 24 bit unsigned binary representation of the entitled processor capacity percentage available to the partition's pool.

Syntax:

Semantics:

Place the partition’s performance parameters for the calling virtual processor’s partition into the respective registers:
R4: The calling partition’s Entitled Processor Capacity Percentage.
R5: The calling partition’s aggregation’s Unallocated Processor Capacity Percentage.
R6: The aggregation numbers.
R7: The platform resource capacities.
R8: When “ibm,partition-performance-parameters-level” is >= 1 in the device tree, R8 is loaded with the processor virtualization resource allocations.
Return H_Success.

R1--1. For the SPLPAR option: The platform must implement the H_GET_PPP hcall() following the syntax and semantics of .
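The packed layout of R6, R7 and R8 can be unpacked with a few shifts and masks. The following sketch assumes the usual big-endian numbering of the architecture, in which byte 0 is the most significant byte of the 64 bit register and bit 7 of a byte is its least significant bit; the function and key names are hypothetical.

```python
def field(reg, first_byte, last_byte):
    """Extract bytes first_byte..last_byte of a 64 bit register,
    numbering byte 0 as the most significant byte (assumption)."""
    width_bits = 8 * (last_byte - first_byte + 1)
    shift = 8 * (7 - last_byte)
    return (reg >> shift) & ((1 << width_bits) - 1)


def decode_ppp(r6, r7, r8=None):
    """Decode the H_GET_PPP output registers into named values."""
    out = {
        "group_number": field(r6, 4, 5),
        "pool_number": field(r6, 6, 7),           # 0xFFFF: dedicated partition
        "capped": bool(field(r7, 1, 1) & 0x01),   # bit 7 of byte 1
        "weight": field(r7, 2, 2),
        "unallocated_weight": field(r7, 3, 3),
        "pool_processors": field(r7, 4, 5),
        "platform_processors": field(r7, 6, 7),
    }
    if r8 is not None:   # only meaningful when the parameters level is >= 1
        out["virtualization_processors"] = field(r8, 0, 1)
        out["pool_max_capacity"] = field(r8, 2, 4)
        out["pool_entitled_capacity"] = field(r8, 5, 7)
    return out


# Made-up register values for illustration only.
ppp = decode_ppp(0x0000000000020001, 0x0001801000080010, 0x0010000640000320)
print(ppp)
```

The caller passes r8 only when the device tree reports a parameters level of at least 1, since the register is otherwise not defined by this call.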

H_SET_PPP

This hcall() allows the partition to modify its entitled processor capacity percentage and variable processor capacity weight within limits. If one or both request parameters exceed the constraints of the calling LPAR’s environment, the hypervisor limits the set value to the constrained value and returns H_Constrained. The H_GET_PPP call may be used to determine the actual current operational values. Because the hypervisor constrains the actual values, the calling partition does not need special authority to make the H_SET_PPP hcall(). See for definitions of these values.

Syntax:

Semantics:

Verify that the variable processor capacity weight is between 0 and 255 else return H_Parameter.
Verify that the capacities specified are within the constraints of the partition:
If yes, atomically set the partition’s entitled and variable capacity per the request and return H_Success.
If not, set the partition’s entitled and variable capacity as constrained by the partition’s configuration and return H_Constrained.

Firmware Implementation Note: If the dispatch algorithm requires that the summation of variable capacities be updated, it is atomically updated with the set of the partition’s weight.

R1--1. For the SPLPAR option: The platform must implement and make available to selected partitions, the H_SET_PPP hcall() following the syntax and semantics of .
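The constrain-and-report behavior of H_SET_PPP can be sketched as a clamp. This is an illustrative model under stated assumptions: the limit names (min_entitled, max_entitled, max_weight) are hypothetical stand-ins for the partition and aggregation constraints, return codes are shown as symbolic strings, and atomicity is not modeled.

```python
def h_set_ppp(entitled, weight, limits):
    """Model H_SET_PPP: return (status, (entitled, weight) actually set)."""
    if not 0 <= weight <= 255:
        return "H_Parameter", None
    # Clamp each request to the constraints of the calling LPAR's environment.
    set_entitled = min(max(entitled, limits["min_entitled"]), limits["max_entitled"])
    set_weight = min(weight, limits["max_weight"])
    if (set_entitled, set_weight) == (entitled, weight):
        return "H_Success", (set_entitled, set_weight)
    return "H_Constrained", (set_entitled, set_weight)


# Hypothetical constraints for one partition.
limits = {"min_entitled": 10, "max_entitled": 200, "max_weight": 128}
print(h_set_ppp(150, 64, limits))    # within limits
print(h_set_ppp(400, 200, limits))   # both values constrained
```

After an H_Constrained result the OS would typically follow up with H_GET_PPP to learn the values that were actually applied, as the description above suggests.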

H_PURR

The Processor Utilization of Resources Register (PURR) is compatibly read through the H_PURR hcall(). In those implementations running on processors that do not implement the register in hardware, firmware simulates the function. On platforms that present the property “ibm,rks-hcalls” with bit 2 set (see ), this call provides a reduced “kill set” of volatile registers: GPRs r0 and r5-r13 are preserved.

Syntax:

Semantics:

If the platform presents the “ibm,rks-hcalls” property with bit 2 set, then honor a kill set of volatile registers r3 and r4.
Compute the PURR value for the calling virtual processor up to the current point in time and place it in R4.
Return H_Success.

R1--1. For the SPLPAR option: The platform must implement the H_PURR hcall() following the syntax and semantics of .

H_POLL_PENDING

Certain implementations of the hypervisor steal processor cycles to perform administrative functions in the background. The purpose of the H_POLL_PENDING hcall() is to provide an OS, running atop such an implementation, with a hint of pending work so that it may more intelligently manage use of platform resources. The use of this call by an OS is totally optional since such an implementation also uses hardware mechanisms to ensure that the required cycles can be transparently stolen. It is assumed that the caller of H_POLL_PENDING is idle. If all threads of the processor are idle (as indicated by the idle flag at byte offset 0xFE of ), the hypervisor may choose to perform a background administrative task. The hypervisor returns H_PENDING if there is pending administrative work, at the time of the call, that it could dispatch to the calling processor if the calling processor were ceded; if there is no such pending work, the return code is H_Success. Due to race conditions, this pending work may have grown or disappeared by the time the calling OS makes a subsequent H_CEDE call. There is NO architectural guarantee that ceding a processor exempts a virtual processor from preemption for a given period of time. That may indeed be the characteristic of a given implementation, but cannot be expected from all future implementations.

Syntax:

Semantics:

Return H_PENDING if there is work pending that could be dispatched to the calling processor if it were ceded, else return H_Success.

R1--1. For the SPLPAR option: The platform must implement the H_POLL_PENDING hcall() following the syntax and semantics of .

Pool Idle Count Function Set The hcall-pic function set may be configured via the partition definition in none or any number of partitions as the weights administrative policy dictates.

H_PIC

Syntax:

Semantics:

Verify that calling partition has the authority to make the call else return H_Authority.
Compute the PIC value for the processor pool implementing the calling virtual processor up to the current point in time and place into R4.
Place the number of processors in the caller’s processor pool in R5.
When the value of the “ibm,partition-performance-parameters-level” (see ) is >=1 then:
Place the summation of time base ticks for all platform processors, allocated to the caller's processor pool, into register R6.
Place the summation of all PURR ticks accumulated by all dispatched (not idle) platform processor threads, allocated to the caller's processor pool, into register R7.
Place the summation of all SPURR ticks accumulated by all dispatched (not idle) platform processor threads, allocated to the caller's processor pool, into register R8. (Machines that do not have a SPURR mechanism are assumed to run at a constant speed, at which time the PURR value is substituted.)
Place the caller's processor pool ID into low order two bytes of register R9 (high order 6 bytes are reserved - set to 0x000000).
If the calling partition has the authority to monitor total processor virtualization then:
Place the summation of time base ticks for all platform physical processors, allocated to processor virtualization, in register R10.
Place the summation of all PURR ticks accumulated by all dispatched (not idle) platform physical processor threads, allocated to processor virtualization, in register R11.
Place the summation of all SPURR ticks accumulated by all dispatched (not idle) platform physical processor threads, allocated to processor virtualization, in register R12.
Else load R10, R11 and R12 with -1.
Return H_Success.

R1--1. For the SPLPAR option: The platform must implement and make available to selected partitions, the H_PIC hcall() following the syntax and semantics of .

Thread Join Option

H_JOIN The H_JOIN hcall() performs the equivalent of a H_CONFER (proc=self) hcall() (see ) unless called by the sole unjoined (active) processor thread, at which time the H_JOIN hcall() returns H_CONTINUE. H_JOIN is intended to establish a single threaded operating environment within a partition; to prevent external interrupts from complicating this environment, H_JOIN returns “bad_mode” if called with the processor MSR[EE] bit set to 1. Joined (inactive) threads are activated by H_PROD (see ) which starts execution at the instruction following the hcall; or a system reset non-maskable interrupt which appears to interrupt between the hcall and the instruction following the hcall. Syntax: Semantics If MSR.EE=1 return bad_mode. If other processor threads are active in the calling partition, then emulate H_CONFER (proc=self) Else return H_CONTINUE. R1--1. For the Thread Join option: The platform must implement the H_JOIN hcall() following the syntax and semantics of . R1--2. For the Thread Join option: The platform must implement the hcall-join and hcall-splpar function sets. R1--3. For the Thread Join option: The platform must support the H_PROD hcall even if the partition is operating in dedicated processor mode.

Virtual Processor Home Node Option (VPHN)

The SPLPAR option allows the platform to dispatch virtual processors on physical processors that, due to the variable nature of work loads, are temporarily free, thus improving the utilization of computing resources. However, SPLPAR implies inconsistent mapping of virtual to physical processors, defeating resource allocation software that attempts to optimize performance on platforms that implement the NUMA option. To bridge the gap between these two options, the VPHN option maintains a substantially consistent mapping of a given virtual processor to a physical processor or set of processors within a given associativity domain. Thus the OS can, when allocating computing resources, take advantage of this statistically consistent mapping to improve processing performance.

VPHN mappings are substantially consistent but not static. For any given dispatch cycle, a best effort is made to dispatch the virtual processor on a physical processor within a targeted associativity domain (the virtual processor's home node). However, if processing capacity within the home node is not available, some other physical processor is assigned to meet the processing capacity entitlement. From time to time, to optimize the total platform performance, it may be necessary for the platform to change the home node of a given virtual processor.

To enable the OS to determine the associativity domain of the home node of a virtual processor, platforms implementing the VPHN option provide the H_HOME_NODE_ASSOCIATIVITY hcall(). The presence of the hcall-vphn function set in the “ibm,hypertas-functions” property indicates that the platform supports the VPHN option. The OS should be prepared for the support of the VPHN option to change with functions such as partition migration, after which a call to H_HOME_NODE_ASSOCIATIVITY may end with a return code of H_FUNCTION.
Additionally, the VPHN option defines a VPA field that the OS can poll to determine if the associativity domain of the home node has changed. When the home node associativity domain changes, the OS might choose to call the H_HOME_NODE_ASSOCIATIVITY hcall() and adjust its resource allocations accordingly. R1--1. For the Virtual Processor Home Node option: The platform must support the H_HOME_NODE_ASSOCIATIVITY hcall() per the syntax and semantics specified in section . R1--2. For the Virtual Processor Home Node option: For the OS to operate properly across such functions as partition migration, the OS must be prepared for the target platform to not support the Virtual Processor Home Node option. R1--3. For the Virtual Processor Home Node option: The platform must support the “virtual processor home node associativity changes counters” field in the VPA per section . R1--4. For the Virtual Processor Home Node option: The platform must support the “Form 1” of the “ibm,associativity-reference-points” property per . The client program may call H_HOME_NODE_ASSOCIATIVITY hcall() with a valid identifier input parameter (such as from the device tree or from the ibm,configure-connector RTAS call) even if the corresponding virtual processor has not been started so that the client program can allocate resources optimally with respect to the to be started virtual processor.

H_HOME_NODE_ASSOCIATIVITY

The H_HOME_NODE_ASSOCIATIVITY hcall() returns the associativity domain designation associated with the identifier input parameter. The client program may call H_HOME_NODE_ASSOCIATIVITY hcall() with a valid identifier input parameter (such as from the device tree or from the ibm,configure-connector RTAS call) even if the corresponding virtual processor has not been started so that the client program can allocate resources optimally with respect to the to be started virtual processor.

Syntax:

Parameters:

Input:
flags: Note: this parameter does not share format with the flags parameter of the Page Frame Table Access hcall()s. Defined Values:
0x0 Invalid
0x1 id parameter is as proc-no parameter of H_REGISTER_VPA hcall()
0x2 id parameter is as processor index from byte offsets 0x2-0x3 of a trace log buffer entry
All other values reserved.
id: processor identifier per the form indicated by the flags parameter.

Output:
R3: return code
R4-R9: associativity domain identifier list of the specified processor’s home node. Only the “primary” connection (as would be reported in the first string of the “ibm,associativity” property) is reported. The associativity domain numbers are reported in the sequence they would appear in the “ibm,associativity” property, starting from the high order bytes of R4 proceeding toward the low order bytes of R9. Each of the registers R4-R9 is divided into 4 fields each 2 bytes long. The high order bit of each 2 byte field is a length specifier:
1: The associativity domain number is contained in the low order 15 bits of the field.
0: The associativity domain number is contained in the low order 15 bits of the current field concatenated with the 16 bits of the next sequential field.
All low order fields not required to specify the associativity domain identifier list contain the reserved value of all ones.

Semantics:

Verify that the “flags” parameter is valid else return H_Parameter.
Verify that the “id” parameter is valid for the “flags” and the partition else return H_P2. Pack the associativity domain identifiers for the home node associated with the “id” parameter starting with the highest level reported in the “ibm,associativity” property in the high order field of R4. All remaining fields through the low order field of R9 are filled with 0xFFFFFFFF. Return H_Success.
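The packing of the associativity domain list into R4-R9 can be decoded along the following lines. This is a sketch under the encoding described above (2 byte fields, a high order length-specifier bit, and all ones marking unused fields); the function name is hypothetical and the register values in the example are made up.

```python
def unpack_associativity(regs):
    """Decode the R4-R9 output of H_HOME_NODE_ASSOCIATIVITY into a list
    of associativity domain numbers (illustrative model)."""
    # Split each 64 bit register into four 2 byte fields, high order first.
    fields = []
    for reg in regs:
        for shift in (48, 32, 16, 0):
            fields.append((reg >> shift) & 0xFFFF)

    domains = []
    i = 0
    while i < len(fields):
        value = fields[i]
        if value == 0xFFFF:
            break                                  # reserved: end of the list
        if value & 0x8000:
            domains.append(value & 0x7FFF)         # 15 bit domain number
            i += 1
        else:
            if i + 1 >= len(fields):
                raise ValueError("long field has no following field")
            # 15 bits of this field concatenated with the next 16 bits.
            domains.append((value << 16) | fields[i + 1])
            i += 2
    return domains


ALL_ONES = 0xFFFFFFFFFFFFFFFF
regs = [0x8001800200030004] + [ALL_ONES] * 5
print(unpack_associativity(regs))
```

In the example, the first two fields carry short domain numbers and the third and fourth fields together carry one long domain number, after which the all-ones fill ends the list.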

VPA Home Node Associativity Changes Counters For the VPHN option, the platform maintains within each VPA the Virtual Processor Home Node Associativity Change Counters field. See Table . This eight (8) byte field is maintained as 8 one byte long counters. The number of counters that are supported is implementation dependent up to 8, and corresponds to the entries in the form 1 of the “ibm,associativity-reference-points” property. If the platform implements fewer than 8 associativity reference points, only the corresponding low offset counters within the field are used and the remaining high offset counters within the field are unused. Should the associativity of the home node of the virtual processor change, for each changed associativity level that corresponds to a level reported in the “ibm,associativity-reference-points” property, the corresponding counter in the Virtual Processor Home Node Associativity Change Counters field is incremented.
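An OS that polls the change counters could compare two snapshots of the eight byte field to learn which associativity levels changed. The sketch below is illustrative only: the function name is hypothetical, and it assumes that counter index i corresponds to entry i of the “ibm,associativity-reference-points” property and that the caller knows how many reference points the platform supports.

```python
def changed_reference_points(before, after, supported=8):
    """Return the indexes of counters that differ between two snapshots
    of the 8 byte Home Node Associativity Change Counters field."""
    if len(before) != 8 or len(after) != 8:
        raise ValueError("the field is 8 one-byte counters")
    if not 0 <= supported <= 8:
        raise ValueError("at most 8 counters are supported")
    # Only the low offset counters are used when fewer than 8 are supported.
    return [i for i in range(supported) if before[i] != after[i]]


old = bytes([3, 0, 7, 0, 0, 0, 0, 0])
new = bytes([4, 0, 7, 0, 0, 0, 0, 0])
print(changed_reference_points(old, new))
```

Since each counter is one byte long, a comparison of this kind presumably cannot distinguish an unchanged counter from one that advanced by a multiple of 256 between polls, so frequent polling is the safer choice.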

Virtualizing Partition Memory This section describes the various high level functions that are enabled by the virtualization of the logical real memory of a partition. In principle, virtualization of partition memory can be totally transparent to the partition software; however, partition software that is migration aware can cooperate with the platform to achieve higher performance, and enhanced functionality.

Partition Migration/Hibernation Virtualizing partition memory allows a partition to be moved via migration or hibernation. In the case of partition migration from one platform to another, the source and destination platforms cooperate to minimize the time that the partition is non-responsive; the goal is to be non-responsive no more than a few seconds. In the case of hibernation, the intent is to put the partition to sleep for an extended period; during this time the partition state is stored on secondary storage for later restoration. R1--1. For the Partition Migration and Partition Hibernation options: The platform must implement the Partition Suspension option (See ). R1--2. For the Partition Migration and Partition Hibernation options: The platform must implement the VASI option (See ). R1--3. For the Partition Migration and Partition Hibernation options: The platform must implement the Update OF Tree option. R1--4. For the Partition Migration and Partition Hibernation options: The platform must implement the Version 6 Extensions of Event Log Format for all reported events (See ). R1--5. For the Partition Migration and Partition Hibernation options: The platform must prevent the migration/hibernation of partitions that own dedicated platform resources in addition to processors and memory, this includes physical I/O resources, the BSR facility, physical indicators and sensors (virtualized I/O, indicators (such as tone) and sensors (such as EPOW) are allowed). R1--6. For the Partition Migration and Partition Hibernation options: The platform must implement the Client Vterm option. R1--7. For the Partition Migration and Partition Hibernation options: The platform “timebase-frequency” must be 512 MHz. +/- 50 parts per million. R1--8. For the Partition Migration and Partition Hibernation options: The platform must present the “ibm,nominal-tbf” property (See ) with the value of 512 MHz. R1--9. 
For the Partition Suspension option: The platform must present the properties from , as specified by , to a partition.

R1--10. For the Partition Suspension option: The presence and value of all properties in must not change while a partition is suspended except for those properties described by .

Properties Related to the Partition Suspension Option

Property Name                           Requirement
“ibm,estimate-precision”                Shall be present and shall contain the “fre”, “fres”, “frsqrte”, and “frsqrtes” instruction mnemonics.
“ibm,processor-page-sizes”              Shall be present.
“reservation-granule-size”              Shall be present.
“cache-unified”                         Shall be present if the cache is physically or logically unified and thus does not require the architected instruction sequence for data cache stores to appear in the instruction cache (See “Instruction Storage” section of Book II of PA); else shall not be present.
“i-cache-size”                          Shall be present.
“d-cache-size”                          Shall be present.
“i-cache-line-size”                     Shall be present.
“d-cache-line-size”                     Shall be present.
“i-cache-block-size”                    Shall be present.
“d-cache-block-size”                    Shall be present.
“i-cache-sets”                          Shall be present.
“d-cache-sets”                          Shall be present.
“timebase-frequency”                    Shall be present if the timebase frequency can fit into the “timebase-frequency” property; else shall not be present.
“ibm,extended-timebase-frequency”       Shall be present if the timebase frequency cannot fit into the “timebase-frequency” property; else shall not be present.
“slb-size”                              Shall be present.
“cpu-version”                           Shall be present.
“ibm,ppc-interrupt-server#s”            Shall be present.
“l2-cache”                              Shall be present if another level of cache exists; else shall not be present.
“ibm,vmx”                               Shall be present if VMX is present for the partition; else shall not be present.
“clock-frequency”                       Shall be present if the processor frequency can fit into the “clock-frequency” property; else shall not be present.
“ibm,extended-clock-frequency”          Shall be present if the processor frequency cannot fit into the “clock-frequency” property; else shall not be present.
“ibm,processor-storage-keys”            Shall be present.
“ibm,processor-vadd-size”               Shall be present.
“ibm,processor-segment-sizes”           Shall be present.
“ibm,segment-page-sizes”                Shall be present.
“64-bit”                                Shall be present.
“ibm,dfp”                               Shall be present if DFP is present for the partition; else shall not be present.
“ibm,purr”                              Shall be present if a PURR is present; else shall not be present.
“performance-monitor”                   Shall be present if a Performance Monitor is present; else shall not be present.
“32-64-bridge”                          Shall be present.
“external-control”                      Shall not be present.
“general-purpose”                       Shall be present.
“graphics”                              Shall be present.
“ibm,platform-hardware-notification”    Shall be present.
“603-translation”                       Shall not be present.
“603-power-management”                  Shall not be present.
“tlb-size”                              Shall be present.
“tlb-sets”                              Shall be present.
“tlb-split”                             Shall be present.
“d-tlb-size”                            Shall be present.
“d-tlb-sets”                            Shall be present.
“i-tlb-size”                            Shall be present.
“i-tlb-sets”                            Shall be present.
“64-bit-virtual-address”                Shall not be present.
“bus-frequency”                         Shall be present if the bus frequency can fit into the “bus-frequency” property; else shall not be present.
“ibm,extended-bus-frequency”            Shall be present if the bus frequency cannot fit into the “bus-frequency” property; else shall not be present.
“ibm,spurr”                             Shall be present if an SPURR is present; else shall not be present.
“name”                                  Shall be present.
“device_type”                           Shall be present.
“reg”                                   Shall be present.
“status”                                Shall be present.
“ibm,pa-features”                       Shall be present.
“ibm,negotiated-pa-features”            Shall be present.
“ibm,ppc-interrupt-gserver#s”           Shall be present.
“ibm,tbu40-offset”                      Shall be present.
“ibm,pi-features”                       Shall be present.
“ibm,pa-optimizations”                  Shall be present.

Note on : The values of the properties in Table shall be consistent with implementation and design of the processor and the platform upon boot as well as before and after partition suspension. Programming Note: The “cpu-version” property may contain a logical processor version value. Therefore, code designed to handle processor errata should read the “ibm,platform-hardware-notification” property of the root node to obtain the physical processor version numbers allowed in the platform.

Virtualizing the Real Mode Area PA requires implementations to provide a Real Mode Area of memory that is accessed when not in hypervisor state (either MSR[HV] = 0, or MSR[HV] = 1 and MSR[PR] = 1) and the OS address translation mechanism is disabled (MSR[IR] = 0 or MSR[DR] = 0). PA provides mechanisms to allow the RMA to consist of discontiguous pages of selectable sizes. Such an RMA is known as a virtualized RMA. The H_VRMASD hcall() allows the OS to change the characteristics of the mappings the address translation mechanism uses to access a virtualized RMA.

H_VRMASD The caller may need to invoke the H_VRMASD hcall() multiple times for it to return with a return code of H_Success. Upon receiving a return code of H_LongBusyOrder10mSec, the caller should attempt to invoke H_VRMASD in 10 mSec with the same Page_Size_Code value used on the previous H_VRMASD hcall(). Invoking H_VRMASD with a different Page_Size_Code value indicates that the caller wants to transition to the Page_Size_Code value of the most recent H_VRMASD call. When changing the page size used to map the VRMA using the H_VRMASD hcall(), the caller is responsible for establishing HPT entries for any potential real mode accesses prior to calling H_VRMASD with a new value of Page_Size_Code, and maintaining any HPT entries for the old value of Page_Size_Code until the hcall() returns H_Success. R1--1. For the VRMA option: The platform must include the “ibm,vrma-page-sizes” property (See ) in the /cpu node. R1--2. For the VRMA option: The platform must implement the H_VRMASD hcall() following the syntax and semantics of . R1--3. For the VRMA option: In order to prevent a storage exception, the calling partition must establish page table mappings for the Real Mode Area using entries with a page size corresponding to the new Page_Size_Code value prior to making an H_VRMASD hcall() and must maintain the old page table mappings using the page size corresponding to the old Page_Size_Code value until the H_VRMASD hcall() returns H_Success. Syntax: Parameters: Page_Size_Code: A supported VRMASD field value. Supported VRMASD field values are described by the “ibm,vrma-page-sizes” property. Semantics: Verify that the Page_Size_Code parameter corresponds to a supported VRMASD field value; else return H_Parameter. 
If the Real Mode Area page size specified by the Page_Size_Code parameter does not match the operating RMA page size of the partition, then set the operating RMA page size of the partition to the value specified by the Page_Size_Code parameter and initiate the transition of the operating RMA page size of all active processing threads to the value specified by the Page_Size_Code parameter. If all active threads have transitioned to the partition operating RMA page size, then return H_Success; else return H_LongBusyOrder10mSec.
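The caller-side retry behavior described for H_VRMASD can be sketched as follows. This is a minimal illustration, not the platform interface: `FakeHypervisor`, `set_vrma_page_size`, the `Rc` enumeration, and the one-thread-per-call transition model are all hypothetical, and the numeric encodings of the return codes are deliberately omitted. A real caller must also establish HPT entries for the new page size before the first call and keep the old entries until H_Success is returned.

```python
import time
from enum import Enum


class Rc(Enum):
    # Symbolic return codes only; numeric encodings are intentionally not modeled.
    H_SUCCESS = "H_Success"
    H_PARAMETER = "H_Parameter"
    H_LONG_BUSY_ORDER_10_MSEC = "H_LongBusyOrder10mSec"


class FakeHypervisor:
    """Hypothetical stand-in for the platform (assumption, not a real API)."""

    def __init__(self, supported_codes, active_threads):
        self.supported = set(supported_codes)
        self.threads = [None] * active_threads  # per-thread operating RMA page size
        self.operating = None                   # partition operating RMA page size

    def h_vrmasd(self, page_size_code):
        if page_size_code not in self.supported:
            return Rc.H_PARAMETER
        if page_size_code != self.operating:
            self.operating = page_size_code     # start (or redirect) the transition
        # Model assumption: one more thread finishes its transition per invocation.
        for i, size in enumerate(self.threads):
            if size != self.operating:
                self.threads[i] = self.operating
                break
        if all(size == self.operating for size in self.threads):
            return Rc.H_SUCCESS
        return Rc.H_LONG_BUSY_ORDER_10_MSEC


def set_vrma_page_size(hv, page_size_code, sleep=time.sleep, max_tries=100):
    """Retry with the same Page_Size_Code until H_Success or an error is returned."""
    for _ in range(max_tries):
        rc = hv.h_vrmasd(page_size_code)
        if rc is not Rc.H_LONG_BUSY_ORDER_10_MSEC:
            return rc
        sleep(0.010)  # try again in about 10 ms
    raise TimeoutError("H_VRMASD did not complete")
```

The `max_tries` bound is a choice made for the sketch; the text above does not specify a retry limit.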

Cooperative Memory Over-commitment Option (CMO) The over-commitment of logical memory is accomplished by the platform reassigning pages of memory among the partitions to create the appearance of more memory than is actually present. This is commonly known as paging. While paging can, in certain cases, be accomplished transparently, significantly better memory utilization and platform performance can be achieved with cooperation from the partition OS. CMO introduces the following LoPAR terms: Expropriation: The act of the platform disassociating a physical page from a logical page. Subvention: The act of the platform associating a physical page with a logical page. Loaned Memory: Logical real memory that a partition lends to the hypervisor for reuse. The partition should not gratuitously access loaned memory as such accesses are likely to experience a significant delay. Memory entitlement: The amount of memory that the platform guarantees that the partition is able to I/O map at any given time. R1--1. For the CMO option: The partition must be running under the SPLPAR option. R1--2. For the CMO option: The platform must transparently (except for time delays) handle all effects of any memory expropriation that it may introduce unless the CMO option is explicitly enabled by the setting of architecture.vec option vector 5 byte 4 bit 0 (See for details). The CMO option consists of the following LoPAR extensions: Define ibm,architecture.vec-5 option Byte 4 bit 0 as “Client supports cooperative logical real memory over-commitment”. Define page usage states to assist the platform in selecting good victim pages and mechanisms to set such states. 
Extend the syntax and semantics defined for the I/O mapping hcall()s Return codes (H_LongBusyOrder1msec, H_LongBusyOrder10msec, and H_NOT_ENOUGH_RESOURCES) Return parameter extension for memory entitlement management Define a simulated Special Uncorrectable memory Error machine check for the case where a page can not be restored due to an error. R1--3. For the CMO option: The architected interface syntax and semantics of all LoPAR hcall()s and RTAS calls except as explicitly modified per the CMO option architecture must remain invariant when operating in CMO mode; any accommodation to memory over-commitment by these firmware functions (potentially any function that takes a logical real address as an input parameter) is handled transparently. Note: Requirement specifically applies to the debugger support hcall()s. For maximum performance benefit, an OS that indicates via the ibm,client-architecture-support interface that it supports the CMO option will strive to maintain in the “loaned” state (See ), the amount of logical memory indicated by the value returned in R9 from the H_GET_MPP hcall (See ), as well as provide page usage state information via the interfaces defined in and . The Extended Cooperative Memory Over-commitment Option (XCMO) provides additional features to manage page coalescing. These features are activated via setting architecture.vec vector 5 byte 4 bit 1 to the value of 1 in the ibm,client-architecture-support interface. Given that the platform supports the XCMO option, the CC flag for page frame table Accesses see and the H_GET_MPP_X hcall() see may be used by the OS. An OS might understand that a given page is a great candidate for page coalescing perhaps because the page contains OS and or common library code which is likely to be duplicated in other partitions; if so it might choose to set the Coalesce Candidate (CC) flag in the page table access or H_PAGE_INIT hcall()s as a hint to the hypervisor. 
Should a given logical page be mapped multiple times with conflicting Coalesce Candidate hints, the value in the last mapping made takes precedence. For a variety of reasons outside the scope of LoPAR, a platform supporting the XCMO option for a given platform might not actually perform page coalescing. If this is the case, the first return value from the H_GET_MPP_X hcall() see is the reserved value zero. R1--4. For the XCMO Option: The platform must implement the CMO Option. R1--5. Reserved for Compatibility For the XCMO Option: The platform must implement the CC (Coalesce Candidate) flag bit see . R1--6. For the XCMO Option: The platform must implement the H_GET_MPP_X hcall() see . R1--7. For the XCMO Option with the Partition Migration and Partition Hibernation options: to ensure proper operation after partition migration or hibernation, the OS must stop setting the CC flag bit see and stop calling the H_GET_MPP_X hcall() see prior to calling ibm,suspend-me RTAS and not do so again until after the OS has determined that the XCMO option is supported on the destination platform.

CMO Background (Informative) The following information is provided to be informative of the architectural intent. Implementations may vary, but should make a best effort to achieve the goals described. Ideally, the hypervisor does not expropriate any logical memory pages that it must later read in from disk; this is based upon the belief that the OS is in a better position to determine its working set relative to the available memory and page its memory than the hypervisor thus, when possible, the OS pager should be used. The ideal is approximated, since it cannot be achieved in all cases. The “Overage” is defined as the amount of logical address space that cannot be backed by the physical main storage pool. The overage is equal to the summation of the logical address space for all partitions using a given VRM main storage pool (the main storage that the hypervisor uses to back logical memory pages for a set of partitions) plus the high water mark of the hypervisor free page list (the free list high water mark is some implementation dependent ratio of the pool size) less the size of the VRM main storage pool. If the summation of the space freed by page coalescing and page donation is equal to the overage, in the steady state the hypervisor need not page. In reality the system is seldom, if ever, in the steady state, but with the free list pages the hypervisor has enough buffer space to take up most of the transient cases. Page coalescing is a transparent operation where in the hypervisor detects duplicate pages, directs all user reads to a single copy and may reclaim the other duplicate physical memory pages. Should the page owner change a coalesced page the hypervisor needs to transparently provide the page owner with a unique copy of the page. Read only pages are more likely to remain identical for a longer period of time and are thus better coalescing candidates. 
To set the value for the partition's page donation, the algorithm needs to be “fair” and responsive to the partition's “weight” so that more important work can be helped along. To be “fair”, the donation needs to be somewhat proportional to the partition's size since donating x pages is likely to cause greater pain to a small partition than a large one; yet the reason for “weight” is to cause greater pain to certain partitions relative to others. Thus the initial donation for a partition is set at the partition's logical address space size as a percentage of the total pool logical space subscription times the overage. Each implementation dependent time interval (say single digit seconds or so), the hypervisor randomly selects 100 pages from each partition and monitors how many of them were accessed during the next interval. This, after normalization to account for partition CPU utilization relative to its recent maximum, becomes an estimate of the partition's page utilization. It is expected that a partition with higher page utilization has a higher page fault rate and a lower percentage of its working set resident -- thus experiences more pain from VRM. The page utilization method described above may over estimate memory pressure in certain cases; specifically it may be slow to realize that the partition has gone idle. An idle partition reduces its CPU utilization which after normalization makes it appear that the partition memory pressure has risen rather than lowered. For this reason, the results of the page utilization method is further compared with the OS reported count of faults against pages that were previously swapped out as reported in offsets 0x180 - 0x183 of the VPA for each of the partition processors. The partition fault count when normalized with respect to processor cycles allows comparisons among the reported values from other partitions. 
Since the partition fault count is OS reported, and thus cannot be trusted, it cannot be the primary value used to determine page allocation. However, a misreported statistic is likely to be high, so the memory pressure estimate derived from the OS reported fault counts can be used to reduce (but not increase) the partition memory allocation. Note that since the hint might not be reported by a given OS, a filter should be put in place to detect that the OS is not reporting faults and to substitute appropriate default values. This initial donation is then modified over time to force the pain of higher page utilization upon lower weight partitions based upon comparing the following ratios:
A: The average partition page utilization over the last interval of all partitions in the pool / the partition's page utilization over the last interval.
B: The partition's weight / the average partition weight of all partitions in the pool.
If A > B: Increase the partition's donation by 1/256 of the partition's logical address space (limited to the partition’s logical address size).
If A < B: Reduce the partition's donation by 1/256 of the partition's logical address space provided that the summation of all donations >= Overage.
The hypervisor maintains a per partition count of loaned pages (incremented when a page is removed from the PFT with a “loaned” state and decremented when/if the page state is changed); thus it can keep track of how well a partition is doing against the donation request that has been made of it. Partitions that do not respond to donation requests need to have their pages stolen to make up the difference. Pages that are “unused” or “loaned” are automatically applied to the free list. “Loaned” pages are expected to raise the partition's free list low water mark so that the OS only reclaims them in a transient situation, which will then result in the OS paging out some of its own virtual memory to restore the total donation in the steady state.
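The overage, initial donation, and A/B adjustment rules can be illustrated with a small model. This is a sketch of the informative description only: the `Partition` fields, the function names, the use of pages as the unit, and the clamping of the overage at zero are assumptions, and the utilization and weight values are taken as given rather than measured.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    logical_size: int    # logical address space, in pages (unit is an assumption)
    weight: int          # partition weight
    utilization: float   # estimated page utilization over the last interval
    donation: int = 0    # pages the partition is asked to donate


def overage(parts, pool_size, free_high_water):
    """Logical space that the pool cannot back (clamped at zero in this sketch)."""
    total = sum(p.logical_size for p in parts)
    return max(0, total + free_high_water - pool_size)


def set_initial_donations(parts, over):
    """Initial donation: the partition's share of logical space times the overage."""
    total = sum(p.logical_size for p in parts)
    for p in parts:
        p.donation = over * p.logical_size // total


def adjust_donations(parts, over):
    """One interval of the A/B comparison, stepping by 1/256 of logical space."""
    avg_util = sum(p.utilization for p in parts) / len(parts)
    avg_weight = sum(p.weight for p in parts) / len(parts)
    for p in parts:
        step = p.logical_size // 256
        a = avg_util / p.utilization   # assumes utilization > 0
        b = p.weight / avg_weight      # assumes average weight > 0
        if a > b:
            p.donation = min(p.logical_size, p.donation + step)
        elif a < b:
            reduced = max(0, p.donation - step)
            total = sum(q.donation for q in parts)
            # Reduce only if the sum of all donations still covers the overage.
            if total - p.donation + reduced >= over:
                p.donation = reduced
```

With two equally sized, equally utilized partitions of weights 128 and 64, the lower weight partition's donation rises first, which then leaves room for the higher weight partition's donation to fall on the next interval.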
When the platform free list gets to the low water mark, pages are expropriated starting with the partition that has the greatest percentage discrepancy between its loaned plus expropriated count and its donation tax. The algorithm used is implementation dependent. The following is given for reference and is loosely based upon the AIX method. For this algorithm, pages that are newly restored are marked as “referenced” and all “unused” pages have already been harvested. Step through the partition logical address space until either the hypervisor free list has gotten to its high water mark or the partition has been taxed to its donation:
If the page is I/O mapped and not expropriatable, continue to the next page.
If this is the first pass through the address space on this harvest, and the page is marked critical, continue to the next page.
If the page is marked “referenced”, clear the reference bit and continue to the next page.
If the page is backed in the VPM paging space and not modified since then, expropriate the page and continue to the next page.
Queue the page to be copied into the VPM paging space.
Thus partitions that keep up with their page donations seldom, if ever, experience a hypervisor page in delay. Those that do not keep up will not get a free lunch and will be taxed up to the value of their assigned donation, with the real possibility that they will experience the pain of hypervisor page in delays.
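The reference harvest pass can be sketched as below. The `Page` fields and the `harvest_pass` function are hypothetical; the single `need` count stands in for both stopping conditions (free list at its high water mark, or partition taxed to its donation), and every I/O mapped page is treated as not expropriatable, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Page:
    io_mapped: bool = False
    critical: bool = False
    referenced: bool = False
    backed_clean: bool = False   # copy exists in paging space, unmodified since
    expropriated: bool = False
    queued: bool = False         # queued for copy-out to paging space


def harvest_pass(pages, first_pass, need):
    """One pass over a partition's pages; returns the number expropriated."""
    taken = 0
    for page in pages:
        if taken >= need:
            break
        if page.expropriated or page.queued:
            continue             # bookkeeping for this sketch only
        if page.io_mapped:
            continue             # simplification: I/O mapped means not expropriatable
        if first_pass and page.critical:
            continue
        if page.referenced:
            page.referenced = False
            continue
        if page.backed_clean:
            page.expropriated = True
            taken += 1
            continue
        page.queued = True       # must be copied out before it can be taken
    return taken
```

A page that is skipped on the first pass only because it was referenced or critical becomes eligible on a later pass, which is how the description gives recently used and critical pages a grace period.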

CMO Page Usage States The CMO option defines a number of page states that are set by the cooperating OS using the flags parameter of the HPT hcall()s. The platform uses these page states to estimate the overhead associated with expropriating the specific page. Note: that the first two definitions below represent base background page states; the 3rd definition is the foreground state of I/O mapped which is acquired as result of an I/O mapping hcall (such as H_PUT_TCE); and the last two are caller specifiable state modifiers/extended semantics of the base states. Unused The page contains no information that is needed in the future, its contents need not be maintained by the platform, normally set only when the page is unmapped. Expropriation of “Unused” pages should be a low overhead operation. However, the OS is likely to reuse these pages which means that a clean free page will have to be assigned to the corresponding logical address. Active The page retains data that the OS has no reasonable way to regenerate. This is the state traditionally assumed by the OS when mapping a page. “Active” pages should be expropriated only as a last resort since they must be paged out and paged back in on a subsequent access. I/O Mapped The page is mapped for access by another agent. This state is the side effect of registration and/or I/O mapping functions. The page returns to its background state automatically when unmapped or deregistered. Pages in the I/O Mapped state normally may not be expropriated since they are potentially the target of physical DMA operations. Critical The page is critical to the performance of the OS, and the hypervisor should avoid expropriating such pages while other pages are available. Expropriating pages marked “Critical” may result in the OS being unable to meet its performance goals. 
Loaned The page contains no information and the OS warrants that it will not gratuitously access this page such that the hypervisor may expect to use it for an extended period of time. When the OS does access the page, it is likely that the access will result in a subvention delay. Expropriating pages in the “Loaned” state should result in the lowest overhead. R1--1. For the CMO option: The platform must at partition boot initialize the page usage state of all platform pages to “Active”. R1--2. For the CMO option: The platform must preserve data in pages that are in the “Active” state. R1--3. For the CMO option: When the OS accesses a page in the “Unused” state, the platform must present either the preserved page data or all zeros. R1--4. For the CMO option: When the OS specifies as input to an I/O mapping or the H_MIGRATE_DMA hcall() a page in either the “Unused” or “Loaned” states, the platform must upgrade the page’s background page state to “Active”.

Setting CMO Page Usage States using HPT hcall() flags Parameter The CMO option defines additional flags parameter combinations for the HPT hcall()s that take a flags parameter. Turning on flags bit 28 activates the changing of page state. Leaving bit 28 at the legacy value of zero maintains the page state setting, thus allowing legacy code to operate unmodified with all pages remaining in the initialized “Active” state. R1--1. For the CMO option: The platform must extend the syntax and semantics of the HPT access hcall()s that take a flags parameter, see , to set the page usage state of the specified page per . HPT hcall()s extended with CMO flags hcall

CMO Page Usage State flags Definition

Flag bit 28 | Flag bits 29 - 30 | Flag bit 31 | Comments
0 | Don’t Care | Don’t Care | Inhibit Page State Change
1 | 00 | 0 | Set page state to Active
1 | 00 | 1 | Set page state to Active Critical
1 | 01 | both 0 and 1 | Reserved
1 | 10 | both 0 and 1 | Reserved
1 | 11 | 0 | Set page state to Unused
1 | 11 | 1 | Set page state to Unused Loaned
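As an illustration of the flag combinations defined for the CMO option, the following sketch decodes flags bits 28 through 31. It assumes a 64-bit flags value in which bit 0 is the most significant bit; the helper names `flag_bit` and `decode_page_state` are hypothetical.

```python
def flag_bit(n):
    """Mask for flags bit n, assuming a 64-bit value with bit 0 as the MSB."""
    return 1 << (63 - n)


def decode_page_state(flags):
    """Return the CMO page usage state request encoded in flags bits 28-31."""
    if not flags & flag_bit(28):
        return "Inhibit Page State Change"
    state = (bool(flags & flag_bit(29)) << 1) | bool(flags & flag_bit(30))
    bit31 = bool(flags & flag_bit(31))
    if state == 0b00:
        return "Active Critical" if bit31 else "Active"
    if state == 0b11:
        return "Unused Loaned" if bit31 else "Unused"
    return "Reserved"  # 01 and 10 are reserved by the CMO option definition
```

Legacy callers that leave bit 28 at zero always decode to "Inhibit Page State Change", regardless of the other three bits.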

Setting CMO Page Usage States with H_BULK_REMOVE R1--1. For the CMO option: The platform must extend the syntax and semantics of the H_BULK_REMOVE hcall (see ) to set the page usage state of the specified pages per .

H_BULK_REMOVE Translation Specifier control/status Byte Extended Definition for CMO Option

Bits 0 - 1 (type code) | Bits 2 - 3 | Bits 4 - 5 | Bits 6 - 7 | Meaning
0 0 | r0 r0 | r0 r0 | r0 r0 | Unused
0 1 | page state | r0 r0 | req. mod. | Request
1 0 | return code | R C | r r | Response
1 1 | Reserved (to be zero) | | | End of String

Request modifier (req. mod.):
0 0 | Absolute
0 1 | andcon
1 0 | APVN
1 1 | not used

Request page state:
0 0 | Inhibit page usage state change
0 1 | Reserved
1 0 | For CMO option set page usage state to “Unused” if Success
1 1 | For CMO option set page usage state to “Loaned” if Success

Response return code:
0 0 | Success (followed by R C r r)
0 1 | Not Found (followed by r r)
1 0 | H_PARM
1 1 | H_HW

Legend: R=Reference Bit, C=Change Bit, r=reserved ignore, r0=reserved to be zero
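A sketch of decoding the control/status byte follows. It assumes bit 0 is the most significant bit of the byte and that the fields occupy the bit positions implied by the column order of the table (type code in bits 0-1, page state or return code in bits 2-3, R and C in bits 4-5, request modifier in bits 6-7). The function name and dictionary keys are hypothetical.

```python
TYPE_CODES = {0b00: "Unused", 0b01: "Request", 0b10: "Response", 0b11: "End of String"}
REQUEST_MODIFIERS = {0b00: "Absolute", 0b01: "andcon", 0b10: "APVN", 0b11: "not used"}
PAGE_STATES = {0b00: "Inhibit", 0b01: "Reserved", 0b10: "Unused", 0b11: "Loaned"}
RETURN_CODES = {0b00: "Success", 0b01: "Not Found", 0b10: "H_PARM", 0b11: "H_HW"}


def decode_control_byte(byte):
    """Decode one control/status byte; bit 0 is taken to be the MSB."""
    kind = TYPE_CODES[(byte >> 6) & 0b11]
    out = {"type": kind}
    if kind == "Request":
        out["page_state"] = PAGE_STATES[(byte >> 4) & 0b11]   # bits 2-3
        out["modifier"] = REQUEST_MODIFIERS[byte & 0b11]      # bits 6-7
    elif kind == "Response":
        out["return_code"] = RETURN_CODES[(byte >> 4) & 0b11]  # bits 2-3
        if out["return_code"] == "Success":
            out["R"] = (byte >> 3) & 1                         # bit 4
            out["C"] = (byte >> 2) & 1                         # bit 5
    return out
```

For example, a request byte of 0x71 asks for the "andcon" modifier and, under the CMO option, for the page to be set to "Loaned" if the removal succeeds.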

CMO Extensions for I/O Mapping Hcall()s If an OS were to map an excessive amount of its memory for potential physical DMA access, little of its memory would be left for paging; conversely, if the OS was totally prevented from I/O mapping its memory, it could not do I/O operations. The CMO option introduces the concept of memory entitlement. The partition’s memory entitlement is the amount of memory that the platform guarantees that the partition is able to I/O map at any given time. A given page may be mapped multiple times through different LIOBNs yet it only counts once against the partition’s I/O mapping memory entitlement. The syntax of certain I/O mapping hcall()s is extended to return the change in the partition’s I/O mapped memory total. The entitlement is intended to be used to ensure forward progress of I/O operations. R1--1. For the CMO option: When the partition is operating in CMO mode, the platform must extend the syntax and semantics of the I/O mapping hcall()s specified in as per the specifications in and . I/O Mapping hcall()s Modified by the CMO Option. hcall() Base Definition on H_PUT_TCE H_STUFF_TCE H_PUT_TCE_INDIRECT

Note: The I/O mapping hcalls H_PUT_RTCE and H_PUT_RTCE_INDIRECT do not change the number of pages that are I/O mapped since they simply create copies of the I/O mappings that already exist.

CMO I/O Mapping Extended Return Codes R1--1. For the CMO option: The platform must ensure that the DMA agent operating through the I/O mappings established by the hcall()s specified in can appear to successfully access the associated page data of any expropriated page referenced by the input parameters of the hcall() prior to returning the code H_Success. R1--2. For the CMO option: The platform must either extend the return code set for the hcall()s specified in to include H_LongBusyOrder1msec and/or H_LongBusyOrder10msec or transparently suspend the calling virtual processor for cases where the function is delayed pending the restoration of an expropriated page. R1--3. For the CMO option: The platform must extend the return code set for the hcall()s specified in to include H_NOT_ENOUGH_RESOURCES for cases where the function would cause more memory to be I/O mapped than the caller is entitled to I/O map and the platform is incapable of honoring the request.

CMO I/O Mapping Extended Return Parameter The syntax and semantics of the hcall()s in are extended when the partition is operating in CMO mode by returning in register R4 the change in the partition’s total number of I/O mapped memory bytes due to the execution of the hcall(). The number may be positive (an increase in the amount of memory mapped), negative (a decrease), or zero (the page was, and remains, mapped for I/O access by another agent). R1--1. For the CMO option: The platform must extend the syntax and semantics of the hcall()s specified in when operating in CMO mode, to return in register R4 the change to the total number of bytes that were I/O mapped due to the hcall().
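The entitlement accounting and the R4 return value can be modeled roughly as follows. This is an illustrative sketch: the class and method names, the 4096-byte page size, and the policy of always refusing a mapping that would exceed the entitlement are assumptions (the requirement permits the platform to honor such a request when it is able to). The value returned in R4 on failure is not specified above and is shown as zero here.

```python
H_SUCCESS = "H_Success"
H_NOT_ENOUGH_RESOURCES = "H_NOT_ENOUGH_RESOURCES"
PAGE_BYTES = 4096  # assumed I/O page size for this sketch


class IoMapAccounting:
    """Tracks I/O mapped pages; a page counts once however often it is mapped."""

    def __init__(self, entitlement_bytes):
        self.entitlement = entitlement_bytes
        self.map_counts = {}  # logical page -> number of live I/O mappings

    @property
    def mapped_bytes(self):
        return len(self.map_counts) * PAGE_BYTES

    def map_page(self, page):
        """Returns (return code, R4 delta in bytes)."""
        if page not in self.map_counts:
            if self.mapped_bytes + PAGE_BYTES > self.entitlement:
                return H_NOT_ENOUGH_RESOURCES, 0
            self.map_counts[page] = 0
            delta = PAGE_BYTES
        else:
            delta = 0  # already mapped through another LIOBN or entry
        self.map_counts[page] += 1
        return H_SUCCESS, delta

    def unmap_page(self, page):
        """Returns (return code, R4 delta in bytes); page must be mapped."""
        self.map_counts[page] -= 1
        if self.map_counts[page] == 0:
            del self.map_counts[page]
            return H_SUCCESS, -PAGE_BYTES
        return H_SUCCESS, 0
```

An OS can sum the returned deltas to keep its own running total of I/O mapped bytes without querying the platform on every operation.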

H_SET_MPP This hcall() sets, within limits, the partition’s memory performance parameters. If the request parameter exceeds the constraint of the calling LPAR’s environment, the hypervisor limits the value set to the constrained value and returns H_Constrained. The memory weight is architecturally constrained to be within the range of 0-255. Syntax: Semantics: Verify that the memory performance parameters specified are within the constraints of the partition: If yes, atomically set the partition’s memory performance parameters per the request and return H_Success. If not, set the partition’s memory performance parameters as constrained by the partition’s configuration and return H_Constrained. R1--1. For the CMO option: The platform must initially set the partition memory performance parameters to their configured maximums at partition boot time. R1--2. For the CMO option: The platform must implement the H_SET_MPP hcall() following the syntax and semantics of . R1--3. For the CMO option: The platform must constrain the partition memory weight to the range 0-255.
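The constrain-and-report behavior of H_SET_MPP can be sketched as below. The parameter list is an assumption, since the syntax is not reproduced above: this sketch supposes an entitlement and a memory weight, held in a hypothetical `partition` dictionary with configured maximums.

```python
def h_set_mpp(partition, entitlement, weight):
    """Set memory performance parameters, limiting them to the partition's constraints."""
    max_entitlement = partition["max_entitlement"]
    max_weight = min(255, partition["max_weight"])   # weight is limited to 0-255
    new_entitlement = min(max(entitlement, 0), max_entitlement)
    new_weight = min(max(weight, 0), max_weight)
    partition["entitlement"] = new_entitlement
    partition["weight"] = new_weight
    if (new_entitlement, new_weight) == (entitlement, weight):
        return "H_Success"
    return "H_Constrained"
```

In both outcomes the parameters are set; H_Constrained only tells the caller that the values in effect differ from those requested.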

H_GET_MPP This hcall() reports the partition’s memory performance parameters. The returned parameters are packed into registers.

Command Overview Syntax: Semantics: Place the partition’s memory performance parameters for the calling virtual processor’s partition into the respective registers: R4: The number of bytes of main storage that the calling partition is entitled to I/O map. In the case of a dedicated memory partition this shall be the size of the partition’s logical address space. R5: The number of bytes of main storage that the calling partition has I/O mapped. In the case of a dedicated memory partition this is not applicable which is represented by the code -1. R6: The calling partition’s virtual partition memory aggregation identifier numbers, up to 4 levels: Bytes 0-1: Reserved for future aggregation definition, and set to zero -- in the future this field may be given meaning. Bytes 2-3: Reserved for future aggregation definition, and set to zero -- in the future this field may be given meaning. Bytes 4-5: 16 bit binary representation of the “Group Number”. Bytes 6-7: 16 bit binary representation of the “Pool Number”. In the case of a dedicated memory partition the “Pool Number” is not applicable which is represented by the code 0xFFFF. R7: Collection of short memory performance parameters for the calling partition: Byte 0: Memory weight (0-255). In the case of a dedicated processor partition this is not applicable which is represented by the code 0. Byte 1: Unallocated memory weight for the calling partition’s aggregation. Bytes 2-7: Unallocated I/O mapping entitlement for the calling partition’s aggregation divided by 4096. R8: The calling partition’s memory pool main storage size in bytes. In the case of a dedicated processor partition this is not applicable which is represented by the code -1. 
R9: The signed difference between the number of bytes of logical storage that are currently on loan from the calling partition and the partition’s overage allotment (a positive number indicates a request to the partition to loan the indicated number of bytes else they will be expropriated as needed). R10: The number of bytes of main storage that is backing the partition logical address space. In the case of a dedicated processor partition this is the size of the partition’s logical address space. Return H_Success. R1--1. For the CMO option: The platform must implement the H_GET_MPP hcall() following the syntax and semantics of .
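The packed fields returned in R6 and R7 can be separated with shifts and masks. The sketch below assumes each register is a 64-bit value in which byte 0 is the most significant byte; the function name and dictionary keys are hypothetical.

```python
def unpack_mpp(r6, r7):
    """Split R6 and R7 (64-bit values, byte 0 most significant) into fields."""
    return {
        "group_number": (r6 >> 16) & 0xFFFF,       # R6 bytes 4-5
        "pool_number": r6 & 0xFFFF,                # R6 bytes 6-7 (0xFFFF: not applicable)
        "memory_weight": (r7 >> 56) & 0xFF,        # R7 byte 0
        "unallocated_weight": (r7 >> 48) & 0xFF,   # R7 byte 1
        # R7 bytes 2-7 hold the entitlement divided by 4096; scale back to bytes.
        "unallocated_entitlement_bytes": (r7 & 0xFFFFFFFFFFFF) * 4096,
    }
```

Callers should mask out the reserved bytes 0-3 of R6 rather than assume they are zero, since the text notes that these fields may be given meaning in the future.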

H_GET_MPP_X This hcall() provides additional information over and above (not duplicating) that which is returned by the H_GET_MPP hcall(). The syntax of this hcall() is specifically designed to be seamlessly extensible and version to version compatible, from the view of both the caller and the called, on an invocation by invocation basis. To this end, all return registers (R3 (return code) through R10) are defined from the outset; some are defined as reserved and are set to zero upon return by the hcall(). The caller is explicitly prohibited from assuming that any reserved register contains the value zero, so that there will be no incompatibility with future versions of the hcall() that return non-zero values in those registers. New definitions for returned values will define the value zero to indicate a benign or unreported setting. Syntax: Semantics: Place the partition’s extended memory performance parameters for the calling virtual processor’s partition into the respective registers:
R4: The number of bytes of the calling partition’s logical real memory coalesced because they contained duplicated data.
R5: If the calling partition is authorized to see pool wide statistics (set by means that are beyond the scope of LoPAR) then the number of bytes of logical real memory coalesced because they contained duplicated data in the calling partition’s memory pool; else set to zero.
R6: If the calling partition is authorized to see pool wide statistics (set by means that are beyond the scope of LoPAR) then PURR cycles consumed to coalesce data; else set to zero.
R7: If the calling partition is authorized to see pool wide statistics (set by means that are beyond the scope of LoPAR) then SPURR cycles consumed to coalesce data; else set to zero.
R8: If the calling partition is authorized to see pool wide statistics (set by means that are beyond the scope of LoPAR) then the total number of the calling partition’s memory pool bytes currently in use backing the pool's partition logical memory (this value represents the net usage after any and all savings from deduplication or any other future means the hypervisor may employ); else set to zero.
R9: Reserved; shall be set to zero; shall not be read by the caller.
R10: Reserved; shall be set to zero; shall not be read by the caller.
Return H_Success.
R1--1. For the XCMO option: If the platform coalesces memory pages that contain duplicated data it must implement the H_GET_MPP_X hcall() following the syntax and semantics of . R1--2. For the XCMO option: The caller must be prepared for H_GET_MPP_X to return H_Function or to have a return parameter that was previously non-zero be consistently returned with the value zero if the caller wishes to operate properly in a partition migration or fail-over environment.

Restoration Failure Interrupt R1--1. For the CMO option: When the platform experiences an unrecoverable error restoring the association of a physical page with an expropriated logical page following an attempted access of the expropriated page by the partition, the platform must signal a Machine Check Interrupt by returning to the partition’s interrupt vector at location 0x0200. Note the subsequent firmware assisted NMI and check exception processing returns a VPM SUE error log (See ).

H_MO_PERF This hcall() applies an artificial memory over-commitment to the specified pool while monitoring the pool performance for overload, removing the applied over-commitment if an overload trigger point is reached. The overload trigger point is designed to double as a dead man switch, eventually ending the over-commitment condition should the experiment terminate ungracefully. Only the partition that is authorized to run platform diagnostics is authorized to make this call. Syntax: Semantics: This description is based upon the architectural model of , and must be adjusted to achieve the intent for the specific implementation.
Validate that the caller has the required authority; else return H_AUTHORITY.
Validate that the pool parameter references an active memory pool; else return H_Parameter.
Raise the pool’s free list low water mark above its base value by the signed amount in the mem parameter. (The result is constrained to not less than the base low water mark value and no more than the amount of memory in the pool.)
Change the permissible pool memory low event counter by the signed value of the lows parameter.
Return in R4 the accumulated rise in the pool’s free list low water mark above its base value.
Return in R5 the current value of the permissible pool memory low event counter.
On each subsequent low memory event (page allocation where the free list is at or below the low water mark), the permissible pool memory low event counter is decremented. Should the counter ever reach zero, the pool’s free list low water mark is returned to its base value.
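The low water mark and dead man switch behavior can be modeled as follows. This is a sketch of the architectural model only: the `MemoryPool` class and its field names are hypothetical, the authority and pool validation steps are omitted, and the counter is not decremented below zero, which is an assumption.

```python
class MemoryPool:
    """Hypothetical model of a pool under an H_MO_PERF experiment."""

    def __init__(self, size, base_low_water):
        self.size = size
        self.base_low_water = base_low_water
        self.low_water = base_low_water
        self.permitted_lows = 0   # permissible pool memory low event counter

    def h_mo_perf(self, mem, lows):
        """Apply the signed mem and lows values; returns (R4, R5)."""
        raised = self.low_water + mem
        # Constrain to [base low water mark, amount of memory in the pool].
        self.low_water = max(self.base_low_water, min(self.size, raised))
        self.permitted_lows += lows
        return self.low_water - self.base_low_water, self.permitted_lows

    def low_memory_event(self):
        """Called when a page allocation finds the free list at or below the mark."""
        if self.permitted_lows > 0:
            self.permitted_lows -= 1
            if self.permitted_lows == 0:
                self.low_water = self.base_low_water  # dead man switch fires
```

If the diagnostic partition stops replenishing the counter, the accumulated low memory events eventually return the pool to its base low water mark without further intervention.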

Expropriation/Subvention Notification Option The Expropriation/Subvention Notification (ESN) sub-option of the CMO option allows implementing platforms to notify supporting OSs of delays due to their access of an expropriated VPM page (such as would be experienced during a “page in” operation). With an expropriation notification, the OS may block the affected process and dispatch another rather than having the platform block the virtual processor that happened to be running the affected process. An expropriation notification is paired with a subsequent subvention notification, signaled when the original access succeeds. Additionally, new page states allow the OS to indicate pages that it can restore itself. This relieves the platform of the burden of making copies of those pages when they are expropriated, and potentially sidesteps the “double paging problem”, wherein the platform pages in a page in response to a touch operation in preparation for an OS page-in, only to have the OS immediately discard the page data without looking at it. The ESN option includes the following LoPAR extensions:
Define augmented CMO page states
Define the per partition Subvention Notification Structure (SNS)
Define H_REG_SNS hcall() to register the SNS
Define Expropriation Notification field definitions within the VPA
Define expropriation and subvention event interrupts.
R1--1. For the ESN option: The partition must be running under the CMO option.
R1--2. For the ESN option: The platform must ignore/disable all other ESN option functions and features unless the OS has successfully registered the Subvention Notification Structure via the H_REG_SNS hcall. See for details.

ESN Augmentation of CMO Page Usage States The ESN option augments the set of page states defined by the CMO option that are set by the cooperating OS using the flags parameter of the HPT hcall()s. The platform uses these page states to estimate the overhead associated with expropriating the specific page.
Active - An expropriation notification on this type of page allows the OS to put the using process to sleep until the page is restored, as signaled by a corresponding subvention notification, at which time the affected instruction is retried.
Expendable - The page retains data that the OS can regenerate, for example, a text page that is backed up on disk; usually the page is mapped read only. A reflected expropriation notification on this type of page requires the OS to restore the page; thus the platform presents somewhat different interrupt status from that used by an Active page. Expropriating an “Expendable” page should result in lower overhead than expropriating an “Active” page since the contents need not be paged out before the page is reused. An Expendable page that is Bolted, while not illegal, has to be treated as an “Active” page since an access to a Bolted page may not result in an expropriation notification.
Latent - The page contains data that the OS can regenerate unless the contents have been modified, at which time the page state appears to be “Active”; this is similar to “Expendable” for pages mapped read/write. For example, a page of a mapped file. Expropriation Notification is like “Active” or “Expendable” above.
Loaned - When the OS does access the page, it is likely that the access will result in an expropriation notification.
ESN Augmentation of CMO Page Usage State flags Definition
Flag bit 28    Flag bits 29 - 30    Flag bit 31    Comments
1              01                   0              Set page state to Latent
1              01                   1              Set page state to Latent Critical
1              10                   0              Set page state to Expendable
1              10                   1              Set page state to Expendable Critical
Note: If Expropriation Notification is disabled, or the bolted bit (HPT bit 59) is set to 1, the page state is set to Active (Active Critical if flag bit 31=1).

Expropriation Notification Under the ESN option, notice of an attempt to access an expropriated page is given when the Expropriation Interrupt is enabled in the virtual processor VPA. Additionally, the virtual processor VPA Expropriation Correlation Number and Expropriation Flags fields are set to allow the affected program to determine when the access may succeed and whether the program needs to restore data to the Subvened page; see details in . Once the VPA has been updated, the platform presents an Expropriation Fault interrupt to the affected virtual processor; see details in .

ESN VPA Fields R1--1. For the ESN option: The platform must support the VPA field definitions of , , and .
VPA Byte Offset 0xB9 (bit numbers 0 through 7)
Reserved bits: Reserved (0)
Dedicated processor cycle donation bit: 0 = Dedicated processor cycle donation inhibited, 1 = Dedicated processor cycle donation enabled
Expropriation interrupt bit: 0 = Expropriation interrupt disabled, 1 = Expropriation interrupt enabled

Firmware Written VPA Starting at Byte Offset 0x178 (bytes 0x178 0x179 0x17A 0x17B 0x17C 0x17D 0x17E 0x17F) Fields: Reserved for firmware locks; Reserved; Expropriation Correlation Number Field; Expropriation Flags -- See .

Note: The Expropriation Flags and Expropriation Correlation Number fields are volatile with respect to Expropriation Notifications; thus they should be saved by the OS before executing any instruction that may access unbolted pages.
Expropriation Flags at VPA Byte Offset 0x17D
Bits 0 - 6: Reserved (0)
Bit 7: 0 = The Subvened page data is/will be zero; 1 = The Subvened page data will be restored.
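As a minimal sketch of how an OS might read the Expropriation Flags byte, the following treats the VPA as a byte buffer. It assumes the big-endian bit numbering used throughout (bit 0 is the most significant bit, so bit 7 of a byte is its least significant bit); the function name and the buffer representation are illustrative only.

```python
EXPROPRIATION_FLAGS_OFFSET = 0x17D   # VPA byte offset given in the text
RESTORE_BIT = 0x01                   # bit 7 of the byte, assuming bit 0 is the MSB

def subvened_page_will_be_restored(vpa: bytes) -> bool:
    """True if the platform restores the page data, False if it is/will be zero."""
    return bool(vpa[EXPROPRIATION_FLAGS_OFFSET] & RESTORE_BIT)

def snapshot_expropriation_flags(vpa: bytes) -> int:
    """Copy the volatile flags byte before touching any unbolted page."""
    return vpa[EXPROPRIATION_FLAGS_OFFSET]
```

Taking the snapshot first matters because a later expropriation notification may overwrite the field.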

Expropriation Interrupt When the platform is running with real memory over-commitment, eventually a partition virtual processor will access a stolen page. The transparent solution is to block the virtual processor until the platform has restored the page. By enabling the Expropriation Interrupt via the Expropriation Interrupt Enable field of the VPA (see ) the cooperating OS indicates that it is prepared to make use of its virtual processors for other purposes during the page restoration and/or restore the contents of “expendable” and unmodified “latent” pages. R1--1. For the ESN option: When the partition accesses an expropriated page and either the page was bolted (PTE bit 59=1) or the Expropriation Interrupt Enable bit of the affected virtual processor’s VPA is off see , then the platform must recover the page transparently without an Expropriation Interrupt. R1--2. For the ESN option: When the partition accesses an expropriated page and the summation of the partition’s in use subvention event queue entries plus outstanding subvention events is equal to or greater than the size of the partition’s subvention event queue, the platform must recover the page prior to issuing any associated Expropriation Interrupt. Note: Requirement prevents the overflow of the subvention queue. R1--3. For the ESN option: When the partition accesses an expropriated “Unused” or “Expendable” page, the platform must, unless prevented by , set bit 7 of the affected processor’s Expropriation Flags VPA byte (see ) to 0b0; else the platform must set the bit to 0b1. R1--4. For the ESN option: When the partition accesses an expropriated page and the platform associates a physical page with the logical page prior to returning control to the affected virtual processor, the platform must, unless prevented by , set the Expropriation Correlation Number field of the affected virtual processor’s VPA to 0x0000 (see ). R1--5. 
For the ESN option: When the partition accesses an expropriated page, the platform is not prevented by , does not associate a physical page with the logical page prior to returning control to the affected virtual processor, and the restoration of the logical page has NOT previously been reported to the OS with an expropriation notification, the platform must set the Expropriation Correlation Number field of the VPA to a non-zero value that is unique among all outstanding recovering pages for the affected partition. R1--6. For the ESN option: When the partition accesses an expropriated page, the platform is not prevented by , does not associate a physical page with the logical page prior to returning control to the affected virtual processor, and the restoration of the logical page has previously been reported to the OS with an expropriation notification, the platform must set the Expropriation Correlation Number field of the VPA to the same value as was supplied with the previous expropriation notification event associated with the outstanding recovering page for the affected partition. R1--7. For the ESN option: When the partition performs an instruction fetch from an expropriated page, the platform must, unless prevented by , signal the affected virtual processor with an Expropriation Interrupt by returning to the affected virtual processor’s interrupt vector at location 0x0400 with the processor’s MSR, SRR0 and SRR1 registers set as if the instruction fetch had experienced a translation fault type of Instruction Storage Interrupt, except that SRR1 bit 46 (Trap) is set to a one. R1--8. 
For the ESN option: When the partition performs a load or store instruction that accesses an expropriated page, the platform must, unless prevented by , signal the affected virtual processor with an Expropriation Interrupt by returning to the affected virtual processor’s interrupt vector at location 0x0300 with the processor’s MSR, DSISR, DAR, SRR0 and SRR1 registers set as if the storage access had experienced a translation fault type of Data Storage Interrupt except that SRR1 bit 46 (Trap) is set to a one.
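The decisions in the Expropriation Interrupt requirements above can be summarized in a short model. This is a sketch, not platform code: the function names, the string labels, and the assumption that the queue size is counted in entries are all illustrative.

```python
def expropriation_policy(bolted, interrupt_enabled, eq_size, in_use, outstanding):
    """Decide how an access to an expropriated page is handled."""
    if bolted or not interrupt_enabled:
        # Bolted page or Expropriation Interrupt Enable off: no interrupt.
        return "recover-transparently"
    if in_use + outstanding >= eq_size:
        # Prevents overflow of the subvention event queue.
        return "recover-before-any-interrupt"
    return "interrupt-may-precede-recovery"

def interrupt_vector(access):
    """Vector used for the Expropriation Interrupt, by access type."""
    if access == "ifetch":
        return 0x0400            # presented like an Instruction Storage Interrupt
    if access in ("load", "store"):
        return 0x0300            # presented like a Data Storage Interrupt
    raise ValueError("unknown access type: %r" % access)

def expropriation_flag_bit7(page_state):
    """Bit 7 of the Expropriation Flags byte for the accessed page."""
    return 0 if page_state in ("Unused", "Expendable") else 1
```

In both interrupt cases the text also requires SRR1 bit 46 (Trap) to be set to one, which is how the OS can tell an expropriation from an ordinary translation fault.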

ESN Subvention Event Notification ESN uses an event queue within the Subvention Notification Structure (SNS) to notify the OS of page subvention operations. Subvention events have a two-byte SNS-EQ entry which has the value of the expropriation correlation number from the associated expropriation notification event. R1--1. For the ESN option: The platform must implement the structures, syntax and semantics described in , , and .

SNS Memory Area R1--1. For the ESN option: The platform must support the 4K byte aligned SNS not spanning its page boundary defined by .
Subvention Notification Structure
Access: Written by OS, Read by Hypervisor
    Offset 0x00: Notification Control. Bit 0: Notification Trigger; Bits 1-7: Reserved
Access: Written by Hypervisor, Read by OS
    Offset 0x01: Event Queue State. Bit 0: 0 = Operational, 1 = Overflow; Bits 1-7: Reserved
Access: Set to non-zero by Hypervisor, Read and cleared to zero by OS
    Offset 0x02-0x03: First SNS EQ Entry
    . . .
    Offset (SNS Length - 2) - (SNS Length - 1): Last SNS EQ Entry

SNS Registration (H_REG_SNS) Syntax: Semantics:
If the Address parameter is -1 then deregister any previously registered SNS for the partition, disable ESN functions and return H_Success. (Care is required on the part of the OS not to create any Restoration Paradox Failures prior to registering a new SNS. See for details.)
If the Shared Logical Resource option is implemented and the Address parameter represents a shared logical resource that has been rescinded, then return H_RESCINDED.
If the Address parameter is not 4K aligned in the valid logical address space of the caller, then return H_Parameter.
If the Length parameter is less than 256 or the Address plus Length spans the page boundary of the page containing the starting logical address, then return H_P2.
Register the SNS structure for the calling partition by saving the partition specific information:
    Record the SNS starting address
    Record the SNS ending address
    Record the next EQ entry to fill address (SNS starting address +2)
    Set the SNS interrupt toggle = SNS Notification Trigger
    Set the SNS Event Queue State to “Operational”
Return:
    R3: H_Success
    R4: Value to be passed in the “unit address” parameter of the H_VIO_SIGNAL hcall() to enable/disable the virtual interrupt associated with the transition of the SNS from empty to non-empty.
    R5: Interrupt source number associated with the SNS empty to non-empty virtual interrupt.
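The parameter checks in the H_REG_SNS semantics can be sketched as follows. The model is illustrative: `is_valid_logical` is a hypothetical stand-in for the platform's logical address validation, the H_RESCINDED case is omitted, and the returned dictionary is just one way to represent the recorded registration state.

```python
PAGE_SIZE = 4096

def h_reg_sns(address, length, is_valid_logical=lambda addr: True):
    """Return (status, registration) for a model H_REG_SNS call."""
    if address == -1:
        return ("H_Success", None)            # deregister, ESN functions disabled
    if address % PAGE_SIZE != 0 or not is_valid_logical(address):
        return ("H_Parameter", None)
    # The start is page aligned, so the SNS spans a boundary if it exceeds one page.
    if length < 256 or length > PAGE_SIZE:
        return ("H_P2", None)
    registration = {
        "start": address,
        "length": length,
        "next_fill": address + 2,             # first EQ entry follows the two control bytes
    }
    return ("H_Success", registration)
```

For example, `h_reg_sns(0x1000, 4096)` registers a full-page SNS, while a length of 8192 at the same address is rejected with H_P2.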

SNS Event Processing The following sequence is used by the platform to post an SNS event. The SNS-EQ used corresponds to the EEN event type. This sequence refers to fields described in .
If the SNS EQ overflow state is set, exit. /* An EQ overflow drops all new events until software recovers the EQ */
Using atomic update protocol, store the event identifier into the location indicated by the SNS next EQ entry to fill pointer if the original contents of the location were zero; else set the associated EQ overflow state and exit. /* The value of zero is reserved for an unused entry -- an EQ overflow drops the new event */
Increment the SNS next EQ entry to fill pointer by the size of the EQ entry (2) modulo the size of the EQ. /* Adjust fill pointer */
If the SNS interrupt toggle = SNS Notification Trigger then exit. /* Exit on no event queue transition */
Invert the SNS interrupt toggle. /* Remember event queue transition */
If the SNS interrupt is enabled, signal a virtual interrupt to the associated partition. /* Signal transition when enabled */
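The posting sequence above can be modeled on a byte buffer. This sketch is single threaded, so the atomic update protocol is not modeled. The `SNS` class and the returned labels are hypothetical, and bit 0 of each control byte is assumed to be the most significant bit (mask 0x80). The model also accepts lengths below the architected minimum of 256 so that small cases are easy to trace.

```python
class SNS:
    """Byte 0: Notification Control, byte 1: Event Queue State, bytes 2..: EQ."""
    def __init__(self, length):
        self.mem = bytearray(length)
        self.next_fill = 2                  # offset of the next EQ entry to fill
        self.toggle = self.mem[0] >> 7      # toggle = Notification Trigger at registration
        self.irq_enabled = True
        self.interrupts = 0

def post_event(sns, correlation):
    """Post one subvention event (a non-zero two-byte correlation number)."""
    if sns.mem[1] & 0x80:                   # overflow state already set
        return "dropped"
    off = sns.next_fill
    if sns.mem[off:off + 2] != b"\x00\x00": # entry still in use by the OS
        sns.mem[1] |= 0x80
        return "overflow"
    sns.mem[off:off + 2] = correlation.to_bytes(2, "big")
    eq_size = len(sns.mem) - 2
    sns.next_fill = 2 + (off - 2 + 2) % eq_size
    if sns.toggle == sns.mem[0] >> 7:       # no event queue transition
        return "posted"
    sns.toggle ^= 1                         # remember the transition
    if sns.irq_enabled:
        sns.interrupts += 1
        return "posted+interrupt"
    return "posted"
```

In this reading the OS arms a notification by inverting the Notification Trigger bit; the next posted event then inverts the toggle to match and signals once.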

ESN Interrupts The ESN option may generate several interrupts to the partition OS. Defined in this section are those in addition to the Expropriation Notification interrupts defined above.

Subvention Notification Queue Transition Interrupt R1--1. For the ESN option: When the platform has restored the association of a physical page with the logical page that caused an Expropriation Notification interrupt with a non-zero Expropriation Correlation Number, the platform must post the corresponding Expropriation Correlation Number to the Subvention Event Queue (see ).

Restoration Paradox Failure Restoration Paradox Failures result in an unrecoverable memory failure machine check. R1--1. For the ESN option: When the platform finds that Expropriation Notification has been disabled after it has discarded the contents of an “Expendable” page, it must treat any access to such a page as an unrecoverable error restoring the association of a physical page with the expropriated logical page.

Virtual Partition Memory Pool Statistics Function Set The hcall-vpm-pstat function set may be configured via the partition definition in none or any number of partitions as the VPM administrative policy dictates.

H_VPM_PSTAT This hcall() returns statistics on the physical shared memory pool. Since these statistics can be manipulated by the processing of a single partition, there is a risk of creating a covert channel through this call. To mitigate this risk, the call is contained in a separate function set that can be protected by authorization methods outside the scope of LoPAR. Syntax: Parameters:
Input: None
Output:
    R4: Total VM Pool Page Faults
    R5: Total Page Fault Wait Time (Time Base Ticks)
    R7: Total Pool Physical Memory
    R8: Total Pool Physical Memory that is I/O mapped
    R9: Total Logical Real Memory that is Virtualized by the VM Pool
Semantics:
Verify that the calling partition has the authority to make the call; else return H_Authority.
Report the statistics for the memory pool used to instantiate the virtual real memory of the calling partition.
Place in R4 the summation of the virtual partition memory page faults against the memory pool since the initialization of the pool.
Place in R5 the summation of timebase ticks spent waiting for the page faults indicated in R4.
Place in R6 the total amount of physical memory in the memory pool.
Place in R7 the summation of the entitlement of all active partitions served by the memory pool.
Place in R8 the summation of the I/O mapped memory of all active partitions served by the memory pool.
Place in R9 the summation of the logical real memory of all active partitions served by the memory pool.
Return H_Success.

Logical Partition Control Modes Selected logical partition control modes may be modified by the client program.

Secondary Page Table Entry Group (PTEG) Search The page table search algorithm, described by the , consists of searching for a Page Table Entry (PTE) in up to two PTEGs. The first PTEG searched is the “primary PTEG”. If a PTE match does not occur in the primary PTEG, the hardware may search the “secondary PTEG”. If a PTE match is not found in the searched PTEGs, the hardware signals a translation exception. Code is not required to place any PTEs in secondary PTEGs. Therefore, if a PTE match does not occur in a primary PTEG there is no need for the hardware to search a secondary PTEG to determine that a search has failed. The “Secondary Page Table Entry Group” bit of “ibm,client-architecture-support” allows code to indicate that there is no need to search secondary PTEGs to determine that a PTE search has failed.

Memory Table Translation Option Exploitation Starting with platforms built upon POWER processors supporting ISA level 3.0, the platform supports the In-Memory Table Translation option. This option allows the memory management unit to perform effective to physical address translation based on a single tree of in-memory translation tables, rooted by a single physical memory address pointer. The option also supports two-level radix tree page tables as well as traditional POWER hash page tables. As initially configured, partitions that use hash page tables run with legacy Segment Lookaside Buffers (LPCR [UPRT] = FALSE). To fully exploit the In-Memory Table Translation option, the hash page table client program registers a process table from its own memory, which sets (LPCR [UPRT] = TRUE). On the other hand, radix page table client programs need to register a process table before they turn on address translation. Each guest partition in the system may register, within the tree of in-memory translation tables, its own table (process table) which controls translation of its process effective addresses to guest virtual / real. The process table is then used by the nest memory management unit for nest accelerator and CAPI attached device accesses, and optionally for processor memory management unit translations. Additionally, the platform might support the client program directly invalidating cached process table translation data (when the client program modifies the in-memory table). If the platform does not support the client program directly issuing process table cache invalidate instructions, then the client program must use the set of in-memory table cache invalidate hcall()s in sections and . Note: The CAS option vector processing associated with the In-Memory Table Translation option (vector 5 byte 23) carries a special semantic.

H_CLEAN_SLB The Segment Lookaside Buffers (SLB) are a software managed coherency cache of the per process segment table. The client may directly issue instructions to clear the SLB on the issuing processor; however, clearing entries on other processors or the nest memory management unit requires hypervisor assistance. The H_CLEAN_SLB hcall() provides the client program with the means for clearing SLB contents that might be stale. Through the flags parameter, the platform provides options as to the scope of the entries that are cleaned. These include:
Clean the nest MMU SLBs of all entries associated with a specified caller process (esid parameter is set to zero).
Clean all platform SLBs of a specific ESID for a specified caller process (flags parameter C and B fields specify SLB Class and Size respectively).
Syntax: Semantics:
If a reserved flags bit == TRUE return H_Parameter
If flags[62] == flags[63] return H_Parameter
If flags[63] and esid <> 0 return H_P3
Validate that the calling partition is not mounting a denial of service attack else return H_LongBusyOrder1mSec.
Perform the following sequence:
    ptesync
    If flags[62] then slbieg with RS = pid parameter || caller’s LPID, RB= esid || C || 0b0 || B || 0b0 || 0x000000
    If flags[63] then slbiag with RS = pid parameter || caller’s LPID
    eieio
    slbsync
    ptesync
Return H_SUCCESS.
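The flags checks in the H_CLEAN_SLB semantics rely on big-endian bit numbering, where flags[63] is the least significant bit of the 64-bit parameter. A minimal sketch of the two mode checks follows; the reserved-bit check and the denial of service check are omitted because the text does not give the positions of the reserved, C, and B fields. The helper names are illustrative.

```python
def flag_bit(value, n, width=64):
    """Bit n of value, numbering from 0 at the most significant bit."""
    return (value >> (width - 1 - n)) & 1

def h_clean_slb_check(flags, esid):
    """Model only the flags[62]/flags[63] validation steps."""
    if flag_bit(flags, 62) == flag_bit(flags, 63):
        return "H_Parameter"     # exactly one of the two modes must be selected
    if flag_bit(flags, 63) and esid != 0:
        return "H_P3"            # slbiag mode takes no ESID
    return "H_SUCCESS"
```

With this numbering, `flags = 2` selects the per-ESID `slbieg` path and `flags = 1` selects the `slbiag` path.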

H_INVALIDATE_PID The H_INVALIDATE_PID hcall() invalidates any system translation lookaside buffer entries from the caller’s specified (pid parameter) process table entry. Syntax: Semantics:
If flags [0:61] <> 0 return H_Parameter
Validate that the calling partition is not mounting a denial of service attack else return H_LongBusyOrder1mSec.
RB = 0x400 /* Invalidation Selector (IS) = 01 (Invalidate matching PID.) */
If flags[62] then RB = RB + RB /* Invalidation Selector (IS) = 10 (Invalidate matching LPID.) */
Perform the following sequence:
    ptesync
    tlbie (RIC=2, PRS=1, R=flags[63]), RS=pid||caller’s_LPID, RB
    eieio
    tlbsync
    ptesync
Return H_SUCCESS.
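The RB computation in the H_INVALIDATE_PID semantics is small enough to check by hand. The sketch below models only the flags validation and the resulting `tlbie` operands; the function name and the tuple it returns are illustrative, and flags[63] is again taken to be the least significant bit.

```python
def h_invalidate_pid_operands(flags):
    """Return (status, RB, R) as derived from the flags parameter."""
    if flags >> 2:                    # any of flags[0:61] set
        return ("H_Parameter", None, None)
    rb = 0x400                        # IS = 01: invalidate matching PID
    if flags & 0b10:                  # flags[62]
        rb = rb + rb                  # 0x800, IS = 10: invalidate matching LPID
    r_field = flags & 0b01            # flags[63] is passed as the tlbie R field
    return ("H_SUCCESS", rb, r_field)
```

Doubling RB shifts the Invalidation Selector from 0b01 to 0b10, which is why the text writes the step as `RB + RB`.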

H_REGISTER_PROCESS_TABLE This hcall() is used by the client program to manage its virtual address translation mode, including registration of its process table. The calling program needs to be prepared for the change in address translation that is being requested; for instance, the calling program might choose to be running with relocation off and with all other processors either spinning with relocation off or in the stopped state. The caller might need to invoke the H_REGISTER_PROCESS_TABLE hcall() multiple times for it to return with a return code of H_Success. Upon receiving a return code of H_LongBusyOrder10mSec, the caller should attempt to invoke H_REGISTER_PROCESS_TABLE in 10mSec with the same parameter values used on the previous H_REGISTER_PROCESS_TABLE hcall(). Invoking H_REGISTER_PROCESS_TABLE with different parameter values indicates that the caller wants to transition to the parameter values of the most recent H_REGISTER_PROCESS_TABLE call. The platform may implement a subset of the functions implied by the flags parameter definition below. This subset is reported in the value of byte 23 of the “ibm,architecture-vec-5” property of the /chosen node. A request for an unimplemented function results in an H_Parameter return code.
Syntax: 0b11 then the following parameters shall */
    /* be = 0 else */
    uint64 base, /* Base address of the process table */
    /* For flags 61 = 0 the VSID number of a one terabyte */
    /* segment (right justified in the register) */
    /* For flags 61 = 1 the 4K aligned guest real address */
    uint64 page_size, /* For flags 61 = 0 Size of the pages within the table */
    /* encoded as per the L||LP device tree encoding */
    /* else = 0 */
    uint64 table_size); /* Size of the process table in bytes */
    /* Encoded as the integer */
    /* (log2 (total table length in bytes)) - 12 */
    /* (table_size <= 24) */
Semantics: Validate that no reserved flags parameter bits are TRUE and that the defined bits setting is supported else RETURN H_Parameter. 
If “flags” indicate change to process table (flags[59] is TRUE) then:
    If “flags” indicate deregistration (flags[58] is FALSE) then set Partition_Table[calling-partition,word_2] to a platform dependent benign value;
    else, based upon the mode specified in “flags[61-62]”:
        Validate “base” parameter else RETURN H_P2
        Validate “page_size” parameter relative to platform support else RETURN H_P3
        Validate (0 <= “table_size” parameter <= 24) else RETURN H_P4
        Set Partition_Table[calling-partition,word_2] to the value specified by the “flags[61-62]”, “base”, “page_size”, and “table_size” parameters.
    Endif
Endif
If “flags” indicate HPT/SLB mode (flags[61-62] is 0b00) then set LPCR[UPRT] to FALSE else set LPCR[UPRT] to TRUE
Input parameters:
The “flags” parameter communicates the desired operation.
The “base” parameter specifies the VSID number of a one terabyte segment (right justified in the register).
The “page_size” parameter specifies the size of the pages within the table encoded as per the L||LP encoding used by the HPT hcalls that is presented in the page size info in the device tree.
The “table_size” parameter specifies the total size of the process table encoded as the integer (log2 (total table length in bytes)) - 12 (table_size <= 24).
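Two caller-side details of H_REGISTER_PROCESS_TABLE lend themselves to a short sketch: the “table_size” encoding and the retry on H_LongBusyOrder10mSec. The code below is illustrative only. The `hcall` argument is a hypothetical callable standing in for whatever mechanism the client uses to issue the hcall, and the power-of-two requirement is an inference from the log2 encoding.

```python
import time

def encode_table_size(total_bytes):
    """Encode a process table length as log2(bytes) - 12, limited to 0..24."""
    if total_bytes < 4096 or total_bytes & (total_bytes - 1):
        raise ValueError("table length must be a power of two of at least 4096 bytes")
    encoded = total_bytes.bit_length() - 1 - 12
    if encoded > 24:
        raise ValueError("table_size encoding must not exceed 24")
    return encoded

def register_process_table(hcall, flags, base, page_size, table_size, sleep=time.sleep):
    """Repeat the call with identical parameters while the platform reports busy."""
    while True:
        rc = hcall(flags, base, page_size, table_size)
        if rc != "H_LongBusyOrder10mSec":
            return rc
        sleep(0.010)                  # retry in 10 mSec, as the text suggests
```

A 64 KB table, for instance, is 2**16 bytes and therefore encodes as 4.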

Partition Energy Management Option (PEM) This section describes the functional interfaces that are available to help the partition operating system optimize trade-offs between energy consumption and performance requirements.

Long Term Processor Cede To enable the hypervisor to effectively reduce the power draw from unused partition processors, the concept of cede wakeup latency is introduced with the Partition Energy Management Option. A one byte cede latency specifier VPA field communicates the maximum latency class that the OS can tolerate on wakeup from H_CEDE. In general, the longer the wakeup latency, the greater the savings that can be made in power drawn by the processor during a cede operation. However, due to implementation restrictions, the platform might be unable to take full advantage of the latency that the OS can tolerate; thus the cede latency specifier is considered a hint to the platform rather than a command. The platform must not exceed the latency state specified by the OS. Calling H_CEDE with the value of the cede latency specifier set to zero denotes classic H_CEDE behavior. Calling H_CEDE with the value of the cede latency specifier set greater than zero allows the processor timer facility to be disabled (so as not to cause gratuitous wake-ups; the use of H_PROD, or other external interrupt, is required to wake the processor in this case). An external interrupt might not wake the ceded processor at some of the higher (above the value 1) cede latency specifier settings. Platforms that implement cede latency specifier settings greater than the value of 1 implement the cede latency settings system parameter (see ). The hypervisor is then free to take energy management actions with this hint in mind. R1--1. For the PEM option: The platform must honor the OS set cede latency specifier value per the definition of . R1--2. For the PEM option: The platform must map any OS set cede latency specifier value into one of its implemented values that does not exceed the latency class set by the OS. R1--3. For the PEM option: The platform must implement the cede latency specifier values of 0 and 1 per . R1--4. 
For the PEM option: If the platform implements cede latency specifier values greater than 1 it must implement the cede latency settings values sequentially without holes. R1--5. For the PEM option: If the platform implements cede latency specifier values greater than 1 each sequential cede latency settings value must represent a cede wake up latency not less than its predecessor, and no less restrictive than its predecessor. R1--6. For the PEM option: If the platform implements cede latency specifier values greater than 1 it must implement the cede latency settings system parameter see .
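Because implemented cede latency specifier values are sequential without holes and their wakeup latencies never decrease, the mapping required of the platform can be modeled very simply. The sketch below is an illustration, not a definition: `max_implemented` is a hypothetical name for the highest value a given platform implements, and a real platform remains free to choose a lower value since the specifier is a hint.

```python
def map_cede_latency(os_specifier, max_implemented=1):
    """Largest implemented specifier value that does not exceed the OS value."""
    if not 0 <= os_specifier <= 255:
        raise ValueError("the cede latency specifier is a one byte field")
    if max_implemented < 1:
        raise ValueError("values 0 and 1 are always implemented")
    return min(os_specifier, max_implemented)
```

For example, an OS value of 5 on a platform implementing values 0 through 3 maps to 3, and on a platform implementing only 0 and 1 maps to 1.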

H_GET_EM_PARMS This hcall() returns the partition’s energy management parameters. The return parameters are packed into registers.
Programming Note: On platforms that implement the partition migration option, after partition migration:
The support for this hcall() might change; the caller should be prepared to receive an H_Function return code indicating the platform does not implement this hcall().
Fields that were defined as “reserved” might contain data; calling code should be tested to ensure that it ignores fields defined as “reserved” at the time of its design, and that it operates properly when encountering “zeroed” defined fields that indicate that the field does not contain useful data.
Implementation Note: To aid the testing of calling code, implementations would do well to include debug tools that seed reserved return fields with random data.
Syntax: Parameters: (on return)
Status Codes (bit offset within 2 byte field):
Bits 0:5 Reserved (zero)
Bits 6:8 Energy Management major code:
    0b000: Non-floor modes: Bits 9:15 Energy Management minor code:
        0x00: The energy management policy for this aggregation level is not specified.
        0x01: Maximum Performance (Energy Management enabled - performance may exceed nominal)
        0x02: Nominal Performance (Energy Management Disabled)
        0x03: Static Power Saving Mode
        0x04: Deterministic Performance (Energy Management enabled - consistent performance on a given workload independent of environmental factors and component variances)
        0x05 - 0x7F Reserved
    0b001: Dynamic Power Management: Bits 9:15 Performance floor as a percentage of nominal (0% - 100%).
    0b010:0b111 Reserved
Implementation Note: Status Code Fields are determined by means outside the scope of LoPAR. Platform designs may define a hierarchy of aggregations in which lower levels by default inherit the energy management policy of their parent. 
Bytes 0:3 four byte Power Draw Status/Limit for the platform
    Bit 0: Power Draw Limit is hard/soft: 0 = Soft, 1 = Hard
    Bits 1:7 Reserved.
    Bits 8:31 unsigned binary Power Draw Limit in units of 0.1 watts
The total processor energy consumed by the calling partition since boot in Joules times 2**-16. The value zero indicates that the platform does not support reporting this parameter.
The total memory energy consumed by the calling partition since boot in Joules times 2**-16. The value zero indicates that the platform does not support reporting this parameter.
The total I/O energy consumed by the calling partition since boot in Joules times 2**-16. The value zero indicates that the platform does not support reporting this parameter.
Semantics: Place the partition’s performance parameters for the calling virtual processor’s partition into the respective registers:
    R4: Energy Management Status Codes
    R5: Power Draw Limits (Platform and Group)
    R6: Power Draw Limits (Pool and Partition)
    R7: Partition Processor Energy Consumption
    R8: Partition Memory Energy Consumption
    R9: Partition I/O Energy Consumption
Return H_Success.
R1--1. For the PEM option: The platform must implement the H_GET_EM_PARMS hcall() following the syntax and semantics of .
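The field encodings returned by H_GET_EM_PARMS can be decoded as sketched below. The functions operate on individual fields that have already been extracted; how several fields are packed into R4 through R6 is not modeled here. Bit 0 is assumed to be the most significant bit of each field, and the function names and returned tuples are illustrative.

```python
MINOR_CODES = {
    0x00: "not specified",
    0x01: "Maximum Performance",
    0x02: "Nominal Performance",
    0x03: "Static Power Saving Mode",
    0x04: "Deterministic Performance",
}

def decode_status_code(field16):
    """Split a 2 byte status code into (major code, meaning of bits 9:15)."""
    major = (field16 >> 7) & 0x7          # bits 6:8
    low7 = field16 & 0x7F                 # bits 9:15
    if major == 0b000:
        return (major, MINOR_CODES.get(low7, "reserved"))
    if major == 0b001:
        return (major, low7)              # performance floor, percent of nominal
    return (major, "reserved")

def decode_power_draw(field32):
    """Return (is_hard_limit, limit_in_watts) from a 4 byte status/limit field."""
    is_hard = bool(field32 >> 31)         # bit 0
    return (is_hard, (field32 & 0xFFFFFF) / 10)   # bits 8:31, 0.1 watt units

def decode_energy(raw):
    """Convert a 2**-16 Joule count to Joules; zero means not reported."""
    return None if raw == 0 else raw / 65536
```

Treating a zero energy value as "not supported" rather than "no energy used" follows the parameter descriptions above.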

H_BEST_ENERGY This hcall() returns a hint to the caller as to the probable impact toward the goal of minimal platform energy consumption for a given level of computing capacity that would result from releasing or activating various computing resources. The returned value is a unitless priority: the lower the returned value, the more likely the goal will be achieved. The accuracy of the returned hint is implementation dependent, and is subject to change based upon actions of other partitions; thus the implementation can only provide a “best effort” to be “substantially correct”. Implementation dependent support for this hcall() and supported resource codes might change during partition suspension, as in partition hibernation or migration; the client program should be coded to gracefully handle H_Function, H_UNSUPPORTED, and H_UNSUPPORTED_FLAG return codes. H_BEST_ENERGY may be used in one of two modes, “inquiry” or “ordered”, specified by the setting of bit 54 of the eflags parameter. It is intended that ordered mode be used when the client program is largely indifferent to the specific resource instance to be released or activated. In ordered mode, H_BEST_ENERGY returns a list of resource instances, ordered from the best toward the worst choice to release/activate to achieve minimal energy consumption, starting with an initial resource instance in the ordered list (if the specified initial resource is the reserved value zero, the returned list starts with the resource having the greatest probability of minimizing energy consumption). It is intended that inquiry mode be used when the client program wishes to compare the energy advantage of making a resource selection from among a set of candidate resource instances. In inquiry mode, H_BEST_ENERGY returns the unitless priority of releasing/acquiring each of the specified resource instances. 
It is expected that in the vast majority of cases, the client code will receive data on a sufficient number of resource instances in one H_BEST_ENERGY call to make its activate/release decision; however, in those rare cases where more information is needed, a series of H_BEST_ENERGY calls can be made to accumulate information on an arbitrary number of computing resource instances. Platforms may optionally support “buffered ordered” return data mode. If the platform supports “buffered ordered” return data mode, a “b” suffix appears at the end of the list that terminates the hcall-best-energy-1 function set entry. If the “buffered ordered” return data mode is supported the caller may specify the “B” bit in the eflags parameter and supply in P3 the logical address of a 4K byte aligned return buffer. The probable effects of a given resource instance selection might vary depending upon the intention of the client program to take other actions. These other actions include the ability to reactivate a released resource within a given time latency and number of resources the client program intends to activate/release as a group. The eflags parameter to H_BEST_ENERGY contains fields that convey hints to the platform of the client program intentions in these areas; implementations might take these hints into consideration as appropriate. The high order four (4) bytes of the eflags parameter contain the unsigned required reactivation latency in time base ticks (the reserved value of all zeros indicates an unspecified reactivation latency). Calling H_BEST_ENERGY with the eflags “refresh” flag (bit 54) equal to a one causes the hypervisor to compute the relative unitless priority value (1 being the best to activate/release with increasing numbers being poorer choices from the perspective of potential energy savings) for each instance of the specified resource that is owned by the calling partition. 
If the hypervisor can not distinguish a substantially different estimate for the various resource instances the call returns H_Not_Available. If the “refresh” flag is equal to a zero, the list as previously computed is used. Care should be exercised when using the non-refresh version to ensure that the state of the partition’s owned resource list has been initialized at some point and has not changed due to resource instance activation/release (including dynamic reconfiguration) activities by other partition threads else the results of the H_BEST_ENERGY call are unpredictable (ranging from inaccurate prediction values up to and including error code responses). The return values for H_BEST_ENERGY are passed in registers. Following standard convention, the return code is in R3. Register R4 contains the response count. If the call is made in “inquiry” mode the response count equals the number of non-zero requested resource instance entries in the call. If the call is made in “ordered” mode, the response count contains the number of entries in the ordered list from the first entry returned until the worst choice entry. If the response count is <= 8 (512 for ordered buffer mode) then the response count also indicates how many resource instances are being reported by this call, if the response count is >8 (512 for ordered buffer mode) then this call reports eight (512 for ordered buffer mode) resource instances. Each response consists of three fields: bytes 0 -- 2 are reserved, byte 3 contains the unitless priority for selecting the indicated resource instance, and bytes 4 -- 7 contain the resource instance identifier value corresponding to that passed in the “ibm,my-drc-index” property. In order to represent more accurately the significance of certain priority values relative to others, the platform might leave holes in the ranges of reported priority values. 
As an example, there may be a gap of several priority numbers between the value associated with a resource that can be powered down and one that can only be placed in an intermediate energy mode, and yet another gap to a resource that represents a necessary but not sufficient condition for reducing energy consumption. Syntax:
Parameters: (on entry) (on return)
  R3: Return code
  R4: Response Count. Values <= 8 indicate the number of returned values in registers starting with R5. The contents of registers after the last returned value, as indicated by the Response Count, are undefined.
  R5 -- R12:
    Bytes 0-2: Reserved
    Byte 3: 1 - 255, the unitless priority value relative to lowest total energy consumption for selecting the corresponding resource ID.
    Bytes 4-7: Resource instance ID to be used as input to dynamic reconfiguration RTAS calls, as would the value presented in the “ibm,my-drc-index” property.
Semantics:
  If the resource code in the eflags parameter is not supported, return H_UNSUPPORTED.
  If other binary eflags values are not valid, return H_UNSUPPORTED_FLAG with the specific value being (-256 - the bit position of the highest order unsupported bit that is a one).
  If the eflags parameter “refresh” bit is zero and the list has not been refreshed since the last return of H_Not_Available, return H_Not_Available.
  If the eflags parameter “refresh” bit is a one then:
    If energy estimates for the partition owned resources are substantially indistinguishable, return H_Not_Available.
    Assign a priority value to each resource of the type specified in the resource code owned by the calling partition, relative to the probable effect that selecting the specified resource to activate/release (per eflags code) within the specified latency requirements would have on achieving minimal platform energy consumption (1 being the best, increasing values being worse; implementations may choose to use an implementation dependent subset of the available values).
    Order the specified resources owned by the calling partition starting with those having a priority value of 1, setting the resource pointer to reference that starting resource.
  If the eflags parameter bit 54 is a one (“ordered”) then:
    If P2 == 0 then set pointer to the best resource in the ordered list
    Else if P2 <> the drc-index of one of the resources in the ordered list then return H_P2
    Else set pointer to the resource corresponding to P2.
    Set R4 to the number of resources in the ordered list from the pointer to the end.
    If eflags “B” bit == 0b0 then /* this assumes that the ordered buffer option is supported */:
      If R4 > 8 set count to 8 else set count to R4.
      Load “count” registers starting with R5 with the priority value and resource IDs of the “count” resource instances from the ordered list, starting with the resource instance referenced by “pointer”.
    Else:
      If P3 does not contain the 4K aligned logical address of a calling partition memory page then return H_P3.
      If R4 > 512 set count to 512 else set count to R4.
      Load “count” 8 byte memory fields starting with the logical address in P3 with the priority value and resource IDs of the “count” resource instances from the ordered list, starting with the resource instance referenced by “pointer”.
    Return H_SUCCESS.
  Else /* “inquiry” mode */:
    Set R4 to zero.
    For each input parameter P2 -- P9, or until the input parameter is zero:
      If the input parameter Px <> the drc-index of one of the resources in the ordered list then return H_Px.
      Fill in byte 3 of the register containing Px with the priority value of the resource instance corresponding to the drc-index (bytes 4 -- 7) of the register.
      Increment R4.
    Return H_SUCCESS.
R1--1. For the PEM option: The platform must implement the H_BEST_ENERGY hcall() following the syntax and semantics of .
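The response format and count rules above can be illustrated with a small decoding sketch. It is not part of the architecture; the function names are hypothetical, and it assumes that byte 0 is the most significant byte of each 64-bit register.

```python
def decode_response(reg):
    """Split one 64-bit response register into (priority, resource_id).

    Assumes byte 0 is the most significant byte: bytes 0-2 are reserved,
    byte 3 is the unitless priority, bytes 4-7 are the resource instance ID.
    """
    reg &= 0xFFFFFFFFFFFFFFFF
    priority = (reg >> 32) & 0xFF
    resource_id = reg & 0xFFFFFFFF
    return priority, resource_id


def reported_count(response_count, buffered=False):
    """Number of entries actually reported by one call (8, or 512 buffered)."""
    limit = 512 if buffered else 8
    return min(response_count, limit)


def unsupported_flag_code(bit_position):
    """H_UNSUPPORTED_FLAG value for the highest order unsupported one bit."""
    return -256 - bit_position


def decode_registers(response_count, registers):
    """Decode the valid entries of R5..R12; later registers are undefined."""
    n = reported_count(response_count)
    return [decode_response(r) for r in registers[:n]]
```

A caller that receives a response count larger than the per-call limit would issue further calls to accumulate the remaining entries, as described in the text.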

Platform Facilities This section documents the hypervisor interfaces to optional platform facilities such as special purpose coprocessors.

H_RANDOM If the platform supports a random number generator platform facility, the “ibm,hypertas-functions” property of the /rtas node contains the function set specification “hcall-random” and the following hcall() is supported. Syntax:

Co-Processor Facilities If the platform supports a co-processor platform facility, the “ibm,hypertas-functions” property of the /rtas node contains the function set specification “hcall-cop” and the following hcall()s are supported. For asynchronous coprocessor operations the caller may either specify an interrupt source number to signal at completion, or poll the completion code in the CSB. The hypervisor and caller need to take into account the processor storage models, using explicit memory synchronization to ensure that the rest of the return data from the operation is visible prior to setting the CSB completion code, and that any operation data that might have been fetched prior to the setting of the CSB completion code is discarded. Note: The H_MIGRATE_DMA hcall() does not handle data pages subject to co-processor access; it is the caller’s responsibility to make sure that outstanding co-processor operations do not target pages that are being migrated by H_MIGRATE_DMA.

H_COP_OP The architectural intent of this hcall() is to initiate a co-processor operation. Co-processor operations may complete with either synchronous or asynchronous notification. In synchronous notification, all platform resources associated with the operation are allocated and released between the call to H_COP_OP and the subsequent return. In asynchronous notification, operation associated platform resources may remain allocated after the return from H_COP_OP, but are subsequently recovered prior to setting the completion code in the CSB. For the partition migration option, no asynchronous notification operation may be outstanding at the time the partition is suspended. Syntax:
Flags:
  Reserved (bits 0 -- 38)
  “Rc” (bit 39): On Asymmetric Encryption operations the “Rc” bit indicates that the high order 16 bits of the “in” parameter contain the “Rc” field specifying the encoded operand length, while the remainder of the “in” and “inlen” parameter bits are reserved and should be 0b0.
  Notification of Operation (bits 40 -- 41):
    00 Synchronous: In this mode the hypervisor synchronously waits for the coprocessor operation to complete. To preserve interrupt service times of the caller and quality of service for other callers, the length of synchronous operations is restricted (see inlen parameter).
    01 Reserved
    10 Asynchronous: In this mode the hypervisor starts the coprocessor operation and returns to the caller. The caller may poll for operation completion in the CSB.
    11 Async Notify: In this mode the hypervisor starts the coprocessor operation as with the Asynchronous notification above; however, the operation is flagged to generate a completion interrupt to the interrupt source number given in the “ibm,copint” property. When the interrupt is signaled the caller may check the operation completion status in the CSB.
  Interrupt descriptor index for Async Notify (bits 42 -- 55)
  FC field (bits 56 -- 63): The FC field is the co-processor name specific function code.
Resource identifier (bits 32 -- 63), as from the “ibm,resource-id” property.
in/inlen and out/outlen parameters:
  If the *len parameter is non-negative, the respective in/out parameter is the logical real address of the start of the respective buffer. The starting address plus the associated length may not extend beyond the bounds of a 4K page owned by the calling partition. For synchronous notification operations, the parameter values may not exceed an implementation specified maximum; in some cases these are communicated by the values of the “ibm,max-sync-cop” property of the device tree node that represents the co-processor to the partition.
  If the *len parameter is negative, the respective in/out parameter is the logical real address of the start of a scatter/gather list that describes the buffer, with a length equal to the absolute value of the *len parameter. The starting address of the scatter/gather list plus the associated length may not extend beyond the bounds of a 4K page owned by the calling partition. Further, the scatter/gather list shall be a multiple of 16 bytes in length, not to exceed the value of the “ibm,max-sg-len” property of the device tree node that represents the coprocessor to the partition. Each 16 byte entry in the scatter/gather list consists of an 8 byte logical real address of the start of the respective buffer segment, followed by the 8 byte length of that segment. The starting address plus the associated length may not extend beyond the bounds of a 4K page owned by the calling partition. For synchronous notification operations, the summation of the buffer segment lengths for the in scatter/gather list may be limited; in some cases these limitations are communicated by the value of the “ibm,max-sync-cop” property of the device tree node that represents the coprocessor to the partition.
csbcpb: logical real address of the 4K naturally aligned memory block used to house the co-processor status block and FC field dependent co-processor parameter block.
Output parameters on return: R3 contains the standard hcall() return code. If the return code is H_Success then the contents of the 4K naturally aligned page specified by the csbcpb parameter are filled from the hypervisor csb and cpb, with addresses converted from real to calling partition logical real.
Semantics:
  The hypervisor checks that the resource identifier parameter is valid for the calling partition, else returns H_RH_PARM.
  The hypervisor checks that, for the coprocessor type specified by the validated resource identifier parameter, there are no non-zero reserved bits within the function expansion field of the flags parameter, else returns H_UNSUPPORTED_FLAG for the highest order non-zero unsupported flag.
  If the operation notification is asynchronous, check that there are sufficient resources to initiate and track the operation, else return H_Resource.
  The hypervisor checks that the flag parameter notification field is not a reserved value and that the FC field is valid for the specified coprocessor type, else returns H_ST_PARM.
  If the notification field is “synchronous” the hypervisor checks that the FC field is valid for synchronous operations, else returns H_OP_MODE.
  The hypervisor builds the CRB CCW field per the coprocessor type specified by the validated resource identifier parameter, and by copying the coprocessor type defined number of FC field bits from the low order flags parameter FC field to the corresponding low order bits of CCW byte 3.
  If the resource ID is an asymmetric encryption then (if the flags parameter “Rc” bit is on then check the high order 16 bits of the “in” parameter for a valid “Rc” encoding and transfer them to the CRB starting at byte 16, else return H_P2), else validate the inlen/in parameters and build the source DDE:
    Verify that the “in” parameter represents a valid logical real address within the caller’s partition, else return H_P2.
    If the “inlen” parameter is non-negative:
      Verify that the logical real address of (in + inlen) is a valid logical real address within the same 4K page as the “in” parameter, else return H_P3.
      If the operation notification is synchronous, verify that the combination of parameter values requests a sufficiently short operation for synchronous operation, else return H_TOO_BIG.
    If the “inlen” parameter is negative:
      Verify that the absolute value of inlen meets all of the following, else return H_P3:
        Is <= the value of “ibm,max-sg-len”.
        Is an even multiple of 16.
        That in + the absolute value of inlen represents a valid logical real address within the same 4K caller partition page as the in parameter.
      Verify that each 16 byte scatter/gather list entry meets all of the following, else return H_SG_LIST:
        Verify that the first 8 bytes represent a valid logical real address within the caller’s partition.
        Verify that the logical real address represented by the sum of the first 8 bytes and the second 8 bytes is a valid logical real address within the same 4K byte page as the first 8 bytes.
      If the operation notification is synchronous, verify that the sum of all the scatter/gather length fields (second 8 bytes of each 16 byte entry) requests a sufficiently short operation for synchronous operation, else return H_TOO_BIG.
    For the Shared Logical Resource Option, if any of the memory represented by the in/inlen parameters has been rescinded then return H_RESCINDED.
    Fill in the source DDE list from the converted in/inlen parameters.
  Validate the outlen/out parameters and build the target DDE:
    Verify that the “out” parameter represents a valid logical real address within the caller’s partition, else return H_P4.
    If the “outlen” parameter is non-negative, verify that the logical real address of (out + outlen) is a valid logical real address within the same 4K page as the “out” parameter and, for symmetric cryptography operations, that outlen >= inlen, else return H_P5.
    If the “outlen” parameter is negative:
      Verify that the absolute value of outlen meets all of the following, else return H_P5:
        Is <= the value of “ibm,max-sg-len”.
        Is an even multiple of 16.
        That out + the absolute value of outlen represents a valid logical real address within the same 4K caller partition page as the out parameter.
      Verify that each 16 byte scatter/gather list entry meets all of the following, else return H_SG_LIST:
        Verify that the first 8 bytes represent a valid logical real address within the caller’s partition.
        Verify that the logical real address represented by the sum of the first 8 bytes and the second 8 bytes is a valid logical real address within the same 4K page as the first 8 bytes.
      Accumulate the sum of the second 8 bytes of each scatter/gather list entry.
      Verify that, for symmetric cryptography operations, the accumulated sum of the second 8 bytes of each scatter/gather list entry is >= the input data length, else return H_P5.
    For the Shared Logical Resource Option, if any of the memory represented by the out/outlen parameters has been rescinded then return H_RESCINDED.
    Fill in the destination DDE list from the converted out/outlen parameters.
  If the operation notification is asynchronous then verify that the input and output buffers do not overlap, else return H_OVERLAP (this makes the operations transparently restartable).
  Check that the csbcpb parameter is page aligned within the calling address space of the calling partition, else return H_P6.
  If the operation specifies a CPB and the specified CPB is invalid for the operation then return H_ST_PARM.
  Set the CRB CSB address field and C bit to indicate a valid CCB.
  If the operation notification is asynchronous notify, then:
    Check that the flags parameter interrupt index value is within the defined range for the validated rid and is not currently in use for another outstanding COP operation, else return H_INTERRUPT.
    Set the CRB CM field to command a completion interrupt.
    Set the job id field in the Co-processor Completion Block to command the signaling via the interrupt source number contained in the interrupt specifier indicated by the interrupt index value.
  For the CMO option, if the number of entitlement granules pinned for this operation causes the partition memory entitlement to be exhausted then return H_NOT_ENOUGH_RESOURCES; else pin and record the entitlement granules used by this operation, and increment the partition consumed memory entitlement for the number of entitlement granules pinned for this operation.
  Set the completion code field in the passed (via csbcpb parameter) CSB to invalid (it is subsequently set to valid at the end of the operation, just after the rest of the contents of the 4K naturally aligned page specified by the csbcpb parameter are filled).
  Issue icswx. On a busy response to icswx, perform an implementation dependent (may be null) retry after backoff based upon some usage equality/priority mechanisms, else return H_Busy.
  If the operation notification is asynchronous then return H_Success.
  Wait for completion posting in the CSB (CSB valid bit = 1).
  The contents of the 4K naturally aligned page specified by the csbcpb parameter are filled from the hypervisor csb and cpb, with addresses converted from real to calling partition logical real.
  Return H_Success.
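A caller can mirror some of the parameter rules above before issuing H_COP_OP. The following is a minimal sketch under stated assumptions: bit 0 is taken to be the most significant bit of the 64-bit flags parameter, all function names are hypothetical, and max_sg_len stands for the value read from the “ibm,max-sg-len” property. Ownership of the pages by the calling partition cannot be checked this way and is left to the hypervisor.

```python
PAGE_SIZE = 4096  # buffers and descriptors must stay inside one 4K page

# Notification field encodings (flags bits 40 -- 41); 0b01 is reserved.
SYNC, ASYNC, ASYNC_NOTIFY = 0b00, 0b10, 0b11


def field(value, first_bit, last_bit):
    """Place value in bits first_bit..last_bit of a 64-bit word (bit 0 = MSB)."""
    width = last_bit - first_bit + 1
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in field")
    return value << (63 - last_bit)


def cop_flags(fc, notify=SYNC, int_index=0, rc=False):
    """Assemble the flags parameter; reserved bits 0 -- 38 stay zero."""
    if notify == 0b01:
        raise ValueError("notification encoding 01 is reserved")
    return (field(int(rc), 39, 39) | field(notify, 40, 41)
            | field(int_index, 42, 55) | field(fc, 56, 63))


def within_one_page(addr, length):
    """True if addr and addr + length fall in the same 4K page."""
    return length >= 0 and addr // PAGE_SIZE == (addr + length) // PAGE_SIZE


def sg_descriptor_ok(addr, length, max_sg_len):
    """Check a negative *len parameter that describes a scatter/gather list."""
    size = -length
    return (length < 0 and size <= max_sg_len and size % 16 == 0
            and within_one_page(addr, size))


def sg_entries_ok(entries):
    """entries: iterable of (segment_address, segment_length) pairs."""
    return all(within_one_page(a, n) for a, n in entries)
```

The page check follows the wording of the semantics literally (the address in + inlen itself must lie in the same page), so a buffer that ends exactly on a page boundary is rejected by this sketch.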

H_STOP_COP_OP The architectural intent of this hcall() is to terminate a previously initiated co-processor operation. Syntax: Semantics: Check the rid parameter for validity for the caller else return H_RH_PARM If any reserved flags parameter bits are non zero then return H_Parameter. Check the csbcpb parameter for pointing within the caller’s partition and 4K aligned else return H_P3 For the shared logical resource option if the csbcpb parameter references a rescinded shared logical resource then return H_RESCINDED If the csbcpb parameter is not associated with an outstanding coprocessor operation then return H_NOT_ACTIVE. Send a kill operation to the coprocessor handling the outstanding operation Wait for the outstanding kill operation to complete. For the CMO option, unpin any entitlement granules still pinned for this operation and decrement the consumed partition memory entitlement for the number of entitlement granules pinned for this operation. Return H_Success.

Memory Usage Instrumentation Option (MUI) The MUI Option enables the platform to generate statistics on page reference affinity, age, access rate, and reference history pattern. Client programs can query and act upon this information in order to guide decisions for improving memory utilization and placement in a system. The MUI Option consists of a number of distinct per page measures as described in .
Memory Usage Instrumentation Measures
Name (Abbreviation): Description
  Reference History Bit Array (HBA): Measures dormancy patterns. A list of bits, each representing a time interval; if the bit is a one, the page was referenced during the corresponding interval.
  Page Table Entry Update Time (PUT): The timestamp of the last HBA interval during which the corresponding page experienced a TLB miss during a reference.
  Access Count Array (ACA): The count of the corresponding page accesses, used to compute the page access rate.
  Page Age Counter (PAC): A saturating count representing the length of time since the page statistics were reset.
  Page Age Granule (PAG): The update cycle period of the Page Age Counter in seconds.
  Affinity Log Array (ALA): A sampled list of affinity domains that have accessed the page.
  Affinity Log Sample Period (ALSP): The period that each Affinity Log Array entry represents in seconds.

Client programs monitor and manage the MUI state through the extensions to the H_ENTER (see ), H_RETURN_PAGEINFO, H_MEMSTAT_CTRL, H_RESET_MEMSTATS, and H_BULK_READ_HBA hcall()s. The data returned by the H_RETURN_PAGEINFO and H_BULK_READ_HBA hcall()s generally reflect actual values. However, should a logical page be faulted back into a partition when Active Memory Sharing (AMS) is in use, its MUI state is cleared/set to a fixed value. MUI state is also subject to loss during events that move physical memory, such as dynamic reconfiguration, partition mobility, and fail-over. The bits for managing the MUI behavior of pages are detailed below in .
MUI Option Flags in Page Frame Table Access flags Detailed Description
Bit(s), Name: Encoding and Description
  43:44, Reference History Bit Array (HBA):
    0b00 Disable HBA updates
    0b01 Enable HBA updates, set PUT to previous time, but do not change current HBA content (as in adding an alias to a logical real page)
    0b10 Enable HBA updates, set PUT to previous time, and set the HBA to the configured Initial Setting.
    0b11 Reserved: return H_UNSUPPORTED_FLAG
  45, Affinity-Clear: Resets Affinity Log Array when entering a PTE.
  46, Page-Age-Clear: Resets Page Age Counter when entering a PTE.

The parameters to Memory Usage Instrumentation hcall()s that specify a given logical page or page range take the form of an index into the partition’s logical real memory space as if it were a set of 4K pages, the logical page index being the logical real address of the starting byte of the 4K page, right shifted 12 bits. It is expected that the MUI function will evolve over time, as will the syntax of the MUI hcalls(). The following requirements ensure forward compatibility over this expected evolution. R1--1. For the MUI option: the platform must for all MUI hcall()s fill all reserved return parameter registers with all zeros. R1--2. For the MUI option: to avoid future incompatibility, the caller of MUI hcall()s must ignore the contents of all reserved return parameter registers. R1--3. For the MUI option: to avoid future incompatibility, the caller of MUI hcall()s must fill all reserved input parameter fields with zeros.
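The logical page index defined above is a plain right shift of the logical real address. A minimal sketch, with hypothetical helper names:

```python
PAGE_SHIFT = 12  # MUI hcall()s treat logical real memory as a set of 4K pages


def logical_page_index(logical_real_addr):
    """Index of the 4K page that contains the given logical real address."""
    return logical_real_addr >> PAGE_SHIFT


def page_start_address(lpx):
    """Logical real address of the first byte of the page with index lpx."""
    return lpx << PAGE_SHIFT
```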

H_MEMSTAT_CTRL This hcall() configures Memory Usage Instrumentation, and returns the current configuration. Note: supplying a value of all zeros returns the current configuration settings of the Memory Usage Instrumentation facility. Syntax:
Parameters:
  Input:
    R4: flags
  Output:
    R3: Return code
    R4: Config
    R5::R10: Reserved
Memory Usage Instrumentation Configuration encoding (flags/config)
  Word 0:
    Bits 0 -- 15: Reserved
    Bits 16 -- 23: flags = Reserved; config = PAG
    Bits 24 -- 31: flags = Reserved; config = ALSP
  Word 1:
    Bits 0 -- 11: Reserved
    Bits 12 -- 13: ALA
    Bits 14 -- 15: ACA
    Bits 16 -- 17: HBA
    Bits 18 -- 19: Reserved
    Bits 20 -- 25: Initial Setting
    Bits 26 -- 31: HUC

Encoded values for Reference History Bit Array (HBA) (flags/config)
Bit 0, Bit 1: Comment
  0 0: No operation. Do not change the HBA configuration or update cycle time, simply return current settings.
  0 1: Disable HBA updates.
  1 0: Enable an HBA update on the first TLB miss per HBA update cycle, and set the HBA per the configured Initial Setting. (Note: If the platform does not support this setting, H_MEMSTAT_CTRL returns H_RT_PARM, and the value of the HBA field in the returned memory instrumentation configuration “out” parameter is 0b00.)
  1 1: Enable an HBA update on the first access* per HBA update cycle and force a TLB miss per HBA update cycle, and set the HBA per the configured Initial Setting. (Note: If the platform does not support this setting, H_MEMSTAT_CTRL returns H_RT_PARM, and the value of the HBA field in the returned memory instrumentation configuration “out” parameter is 0b00.)

* Implementation Note: This may be approximated by performing a TLBIA once per HBA update cycle, thus forcing a TLB miss on the first subsequent page access.
The Initial Setting field (flags/config) is a 6 bit field that defines the number of high order HBA bits that are preloaded to a 1 when the HBA is initialized (for instance when the page is assigned a new virtual address through H_ENTER). This field allows the software to bias the page statistics so that the page will not be chosen as a victim before it can establish its own usage statistics.
The HBA Update Cycle field (HUC) (flags/config) is a 6 bit field that defines the update cycle period in microseconds multiplied by the power of two specified in the 6 bit field. The range of supported HUC values is given in the “ibm,mui-ranges” property. Note: If the platform does not support this setting, or the supplied value, H_MEMSTAT_CTRL returns H_RT_PARM, and the value of the HUC field in the returned memory instrumentation configuration “config” parameter is the reserved value 0b111111.
Encoded values for Access Count Array (ACA) (flags/config)
Bit 0, Bit 1: Comment
  0 0: No operation. Do not change, simply return current setting of ACA configuration.
  0 1: Disable ACA updates. Note: This setting will prevent the partition from seeing ACA data; however, the platform may still accumulate such data for other purposes.
  1 0: Enable ACA updates. Note: If the platform does not support this setting, H_MEMSTAT_CTRL returns H_RT_PARM and the value of the ACA field in the returned memory instrumentation configuration “config” parameter is 0b00. On enable the platform is not required to initialize the counters except to preclude a covert channel, as in the case of page reassignment between partitions.
  1 1: Reserved. Note: If the caller supplies this value, H_MEMSTAT_CTRL returns H_RT_PARM and the value of the ACA field in the returned memory instrumentation configuration “config” parameter is 0b11.

Encoded values for Affinity Log Array (ALA) (flags/config)
Bit 0, Bit 1: Comment
  0 0: No operation. Do not change, simply return current setting of ALA configuration.
  0 1: Disable ALA updates. Note: This setting will prevent the partition from seeing ALA data; however, the platform may still accumulate such data for other purposes.
  1 0: Enable ALA updates. Note: If the platform does not support this setting, H_MEMSTAT_CTRL returns H_RT_PARM and the value of the ALA field in the returned memory instrumentation configuration “config” parameter is 0b00. On enable the platform is not required to initialize the counters except to preclude a covert channel, as in the case of page reassignment between partitions.
  1 1: Reserved. Note: If the caller supplies this value, H_MEMSTAT_CTRL returns H_RT_PARM and the value of the ALA field in the returned memory instrumentation configuration “config” parameter is 0b11.

The Page Age Granule (PAG) and Affinity Log Sample Period (ALSP) fields (config only): are 8 bit fields that define these periods in seconds. Note: On input these are reserved fields, any value other than all zeros causes H_MEMSTAT_CTRL to return H_UNSUPPORTED_FLAG. Semantics: Returns H_UNSUPPORTED_FLAG if a “flags” parameter reserved bit is non-zero (contents of R4 are undefined). Returns H_RT_PARM if a defined field of the “flags” parameter is either not supported or invalid along with the current memory usage instrumentation configuration settings in register R4. Otherwise sets requested memory usage instrumentation configuration, returns H_Success along with the current memory usage instrumentation configuration settings in register R4.
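The flags/config encoding can be exercised with a small packing sketch. The bit positions used here are an assumption: they are read off the flattened layout table for this hcall() (word 0 holding PAG and ALSP in its low order 16 bits, word 1 holding ALA, ACA, HBA, Initial Setting, and HUC), with bit 0 taken as the most significant of 64 bits. The names FIELDS, encode_flags, and decode_config are hypothetical.

```python
# name: (first_bit, last_bit) within the 64-bit parameter, bit 0 = MSB.
# Positions are assumed from the layout table, not confirmed elsewhere.
FIELDS = {
    "PAG": (16, 23), "ALSP": (24, 31),   # valid in config (output) only
    "ALA": (44, 45), "ACA": (46, 47), "HBA": (48, 49),
    "INIT": (52, 57), "HUC": (58, 63),
}

# Two-bit control encodings shared by the HBA, ACA, and ALA fields.
NO_OP, DISABLE, ENABLE = 0b00, 0b01, 0b10


def put(name, value):
    """Shift value into the named field, rejecting values that do not fit."""
    first, last = FIELDS[name]
    if not 0 <= value < (1 << (last - first + 1)):
        raise ValueError("value does not fit in field " + name)
    return value << (63 - last)


def encode_flags(hba=NO_OP, aca=NO_OP, ala=NO_OP, initial_setting=0, huc=0):
    """Build the input flags; PAG and ALSP are reserved on input and stay 0."""
    return (put("HBA", hba) | put("ACA", aca) | put("ALA", ala)
            | put("INIT", initial_setting) | put("HUC", huc))


def decode_config(config):
    """Split the returned config value into its named fields."""
    return {name: (config >> (63 - last)) & ((1 << (last - first + 1)) - 1)
            for name, (first, last) in FIELDS.items()}
```

Calling encode_flags() with no arguments yields the all-zeros value that only queries the current configuration.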

H_RESET_MEMSTATS Resets page age, affinity log, and/or PUT/HBA for up to 6 logical real pages as specified by the logical page index parameters. Syntax: Parameters: Input: flags: bits 0::60: Reserved bit 61: HBA (History Bit Array): Set Bit Array to configured Initial Setting. bit 62: PAC (Page Age Counter) bit 63: ALA (Affinity Log Array) lpx1::lpx6: Logical page index(s) to be used Output: R3: Return code R4::R10: Reserved Semantics: If (Flags AND HBA) and Reference History Bit Array not enabled, returns H_Not_Available. If (Flags AND ALA) and the Affinity Log Array not enabled, returns H_Not_Available. If (Flags AND PAA) and reference rate array not enabled, returns H_Not_Available. If (Flags AND ACA) and reference rate array not enabled, returns H_Not_Available. Return H_Success.

H_RETURN_PAGEINFO The H_RETURN_PAGEINFO hcall() returns the page usage information for a range of logical pages or a list of specific logical pages. The results are returned in a 4K aligned buffer, the data for each logical page occupying one 32 byte record; all other contents of the buffer are volatile and undefined. The range of logical pages is restricted to a single LMB boundary (the list of specific logical pages does not have this restriction). If this restriction is violated, the H_RETURN_PAGEINFO hcall() returns H_P4. The RETURN of a page range might take longer than is allowable for a single call, or it might select more pages than can be contained in the return buffer. Should either of these cases happen, the H_RETURN_PAGEINFO hcall() returns H_CONTINUE, along with the current results and the value of the lpx0 parameter for continuing the RETURN on a subsequent call from the termination point of the prior call. The caller may specify a filter to further qualify the pages for which data is returned. For the CMO option, if the logical page has been paged out by the platform, the page is filtered out of the returned data. Syntax:
Parameters:
  Input:
    flags:
H_MEMSTAT_CTRL Flag Layout
  Word 0: bits 0 -- 1: Ar; bits 2 -- 31: Accesses / Second
  Word 1: bits 0 -- 30: RESERVED; bit 31: R

H_MEMSTAT_CTRL Flag Sub-field Layout
Word, Byte, Bit: Name and Comment
  Word 0, Byte 0, Bits 0-1: Ar (Access Rate Filter Control)
    0b00 Do not filter based upon Access Rate (ignore flags bits 2-31)
    0b01 Reserved (Returns H_Parameter if this condition value is set)
    0b10 Select if page access per second is greater than the value in bits 2::31. Note: if the specified value is greater than the MUI facilities’ maximum access rate capacity per the “ibm,mui-ranges” property, this condition will not be met.
    0b11 Reserved (Returns H_Parameter if this condition value is set)
  Word 0, Byte 0, Bits 2-7 and Bytes 1-3, all bits: Accesses per second
  Word 1, Bytes 0-2, all bits and Byte 3, Bits 0-6: Reserved (Returns H_Parameter if this condition value is set)
  Word 1, Byte 3, Bit 7: R
    0: Returns page usage data for up to 5 pages specified one each in parameters lpx0::lpx4
    1: Returns page usage data for the range from the logical page in parameter lpx0 through lpx1

    buffer: 4kB-aligned output buffer. Implementation Note: This buffer is output only and may be initialized with dcbz instructions to preclude memory error handling.
    lpx0: The logical page number index for the first page.
    lpx1: For the page range option, the last logical page number index. For the list option, if the value is not the list terminator value (-1), continue to return page usage data for the logical page number specified, else terminate the call.
    lpx1::lpx4: For the list option, if the value is not the list terminator value (-1), continue to return page usage data for the logical page number specified, else terminate the call.
  Output:
    R3: Return Code
    R4: Number of matching entry records in destination buffer
    R5: For the range RETURN option, if the return code is H_CONTINUE this value is the value of the lpx0 parameter for a subsequent call to continue the RETURN.
    R6::R10: Reserved
Buffer Record Format
Field Name, Byte Offset: Description
  lpx, 0::7: The logical page number index of the selected page.
  access_rate, 8::15: Accesses per second.
  affinity_log, 16::23: Eight single byte integers, one for each of the past eight ALSP intervals. A non-zero integer value represents one of the masters that referenced the page during the period. A zero value indicates that valid data is not available for the represented period. The associativity list for the referencing master is found in the “ibm,muiassociativity-mapping” property.
  Reserved, 24::29
  flags, 30:
    Bit 0 is a 1 if the page is AMS paged out (CMO option), else 0
    Bit 1 is a 1 if the access_rate value is valid, else 0
    Bit 2 is a 1 if the affinity_log value is valid, else 0
    Bit 3 is a 1 if the age value is valid, else 0
    Bits 4::7 Reserved
  age, 31: Page age in units of the Page Age Granule.
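The 32 byte buffer record can be unpacked as sketched below. This is an illustration only: it assumes big-endian byte order and that bit 0 of the flags byte is its most significant bit; the names RECORD and parse_records are hypothetical. Fields whose validity bit is zero are reported as None.

```python
import struct

# lpx (8 bytes), access_rate (8), affinity_log (8), 6 reserved bytes,
# flags (1), age (1): 32 bytes in total, big-endian assumed.
RECORD = struct.Struct(">QQ8s6xBB")

PAGED_OUT, RATE_VALID, LOG_VALID, AGE_VALID = 0x80, 0x40, 0x20, 0x10


def parse_records(buffer, count):
    """Decode the first `count` records (R4) from the 4K return buffer."""
    records = []
    for i in range(count):
        lpx, rate, log, flags, age = RECORD.unpack_from(buffer, i * RECORD.size)
        records.append({
            "lpx": lpx,
            "paged_out": bool(flags & PAGED_OUT),
            "access_rate": rate if flags & RATE_VALID else None,
            # One entry per past ALSP interval; 0 means no valid data.
            "affinity_log": list(log) if flags & LOG_VALID else None,
            "age": age if flags & AGE_VALID else None,
        })
    return records
```

Only the first R4 records are decoded because all other contents of the buffer are undefined.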

Semantics: If any reserved flags bits are set return H_Parameter. Returns H_Function if function does not exist. Returns H_P2 if buffer is not 4k aligned or if the logical page number is outside of the caller’s range. count = 0; /* number of entries matching in dest */ lpx1) then return H_Success else return H_CONTINUE; }

H_BULK_READ_HBA H_BULK_READ_HBA returns the Reference History Bit Array entry for multiple pages in one call. The returned values present the reference history for up to 64 HUC intervals. Starting with the high order bit representing the most recent HUC interval, the returned value contains a one in the corresponding bit position if the page was accessed during that interval. Due to hardware limitations, the system might not have data for a full 64 intervals; in that case low order bits are zero filled. Syntax: Parameters: Input: flags: bits 0::63: Reserved lpx1::lpx6: The logical page number index to be used to index into appropriate position in HBA. HBA value for a given LPX is returned in the same argument register. Output: R3: Return code R4: Reserved R5: HBA corresponding to lpx1 R6: HBA corresponding to lpx2 R7: HBA corresponding to lpx3 R8: HBA corresponding to lpx4 R9: HBA corresponding to lpx5 R10: HBA corresponding to lpx6 Semantics: If Reference History is not enabled, returns H_Function. If a reserved flags bit is set return appropriate H_UNSUPPORTED_FLAG value. Return H_Success.
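A returned HBA value can be interpreted as sketched below, taking the high order bit as the most recent HUC interval as described above. The function name is hypothetical.

```python
def accessed_intervals(hba):
    """Return the interval ages (0 = most recent HUC interval) in which
    the page was accessed, given one 64-bit HBA value."""
    if not 0 <= hba < (1 << 64):
        raise ValueError("HBA value must fit in 64 bits")
    return [age for age in range(64) if hba & (1 << (63 - age))]
```

Because missing history is zero filled in the low order bits, a zero bit at a high age cannot be distinguished from "no data" by this decoding alone.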

Coherent Platform Facilities This section documents the hypervisor interface to optional coherent platform facilities. If the platform supports a coherent platform facility, the “ibm,hypertas-functions” property of the /rtas node contains the function set specification “hcall-ca” and the following hcall()s are supported.

H_ATTACH_CA_PROCESS The architectural intent of this hcall is to attach a process element to a coherent platform function. The process element describes the environment in which a coherent platform function will operate for a given workload. Syntax:
Process Token Format
Bytes 0-3: Platform firmware use (opaque to OS)
Bytes 4-7: CAIA process element index (128 byte offset into Scheduled Process Area)
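The process token format can be split as in the following sketch. It treats bytes 0-3 as the upper half of the 64-bit token, and it reads "128 byte offset" as meaning that the index counts 128 byte units. Both readings are assumptions, and the function names are illustrative.

```python
def split_process_token(token):
    """Split a 64-bit process token into (firmware_part, element_index)."""
    firmware_part = (token >> 32) & 0xFFFFFFFF   # bytes 0-3, opaque to the OS
    element_index = token & 0xFFFFFFFF           # bytes 4-7
    return firmware_part, element_index


def spa_byte_offset(element_index):
    """Byte offset into the Scheduled Process Area, assuming 128 byte units."""
    return element_index * 128
```

The OS should not interpret the firmware part; it is extracted here only to show where it sits.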

Parameters: uint64 unit-address: Unit Address per the device tree “reg” property, element 0, of the coherent platform function uint64 process-element-struct: Logical real address of the process element structure. This memory must remain pinned and unchanged throughout the duration of the H_ATTACH_CA_PROCESS call. All fields in the structure have big-endian byte ordering and MSB 0 bit ordering. uint64 continue-token: Used to continue a process attach if H_Busy is returned. Set to zero on first call. If H_Busy is returned then call again but use the value returned in R4 from the previous call as the value of continue-token. Semantics: Verify that coherent platform facilities are licensed to be used, else return H_Authority. Verify that the unit-address parameter is valid else return H_Parameter. Verify the process-element-struct parameter: Verify that the process-element-struct is 8 byte aligned and does not cross a 4096 byte boundary, else return H_Parameter. Verify that the Process Element structure version is supported, else return H_Parameter. Verify that if the isPrivilegedProcess bit is set in the process element, that the coherent platform function is allowed (via “ibm,privileged-function” OF property), else return H_Authority. Verify that if the aurpValid bit is set to 1, that the coherent platform function supports AUR (via “ibm,supports-aur” OF property), else return H_Parameter. Verify that if the csrpValid bit is set to 1, that the coherent platform function supports CSRP (via “ibm,supports-csrp” OF property), else return H_Parameter. Verify there is adequate space to attach the process for the coherent platform function else return H_Resource. Verify that the coherent platform function is in a state that allows attaching of new processes and if necessary has been downloaded via H_DOWNLOAD_CA_FUNCTION, else return H_State.
Verify that the sum of the pslVirtualIsn and application virtual ISN values is greater than or equal to the “ibm,min-ints-per-process” property for the coherent-platform-function and less than or equal to the “ibm,max-ints-per-process” property for the coherent-platform-function and the attaching of this process does not violate the “ibm,max-ints” property for the coherent-platform-function, else return H_Parameter. Verify that the pslVirtualIsn is not already in use by another coherent platform function under the coherent platform facility, else return H_Resource. Validate that the application virtual ISN values are valid and not in use by another coherent platform function under the coherent platform facility, and do not collide with the specified pslVirtualIsn, else return H_Resource. Application virtual ISN values are calculated by adding the base virtual ISN value found in the interrupt-ranges property of the parent coherent platform facility node to the relative offset (zero-based) of a bit in the bitmap that is set to 1. These values are programmed into the corresponding CAIA process element structure. Verify that all the virtual interrupts can be mapped into the CAIA process element, else return H_Resource. It may be possible to attempt to attach the process after detaching existing processes. Verify that the virtual interrupts provided will fit into the process element entry, else return H_Parameter. Disable the virtual interrupts provided in the process-element-struct. The partition must use ibm,set-xive (with priority less than 0xFF) to enable the virtual interrupt source after H_ATTACH_CA_PROCESS completes successfully. Select a process element to use for the coherent platform function and perform the procedure to attach a process as defined by the CAIA. During this procedure, H_Busy or H_LongBusy will be returned if hcall time limits are exceeded.
Once the process element is attached as defined by the CAIA, return H_Success, R4 contains the process token and if “ibm,process-mmio” is set to 1, R5 is the MMIO address, R6 is the MMIO length. Following a reset of the coherent platform facility or coherent platform function, platform firmware guarantees that the upper 4 byte portion of the returned process token will be different than it was for any process token returned since the previous reset.
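The application virtual ISN calculation described in the semantics above (base virtual ISN plus the zero-based offset of each set bit) can be sketched as follows. The sketch assumes the bitmap is a byte string with MSB 0 bit ordering, so offset 0 is the most significant bit of the first byte. The function names are hypothetical.

```python
def application_virtual_isns(base_isn, bitmap):
    """Return the virtual ISNs selected by the bits set to 1 in `bitmap`."""
    isns = []
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (0x80 >> bit):          # MSB 0 ordering (assumption)
                isns.append(base_isn + byte_index * 8 + bit)
    return isns


def collides_with_psl(psl_virtual_isn, app_isns):
    """True if an application ISN collides with the pslVirtualIsn."""
    return psl_virtual_isn in app_isns
```

An OS might run a check of this kind before the hcall to avoid a predictable H_Resource return; whether an ISN is in use by another function can only be decided by platform firmware.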

H_DETACH_CA_PROCESS The architectural intent of this hcall is to detach a process element from a coherent platform function. This hcall will remove the workload or process element that was attached using H_ATTACH_CA_PROCESS. Syntax: Parameters: uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform function uint64 process-token: process identifier token for the attached process returned in R4 on H_Success return from H_ATTACH_CA_PROCESS call. uint64 continue-token: Used to continue a process detach if H_Busy is returned. Set to zero on first call. If H_Busy is returned then call again but use the value returned in R4 from the previous call as the value of continue-token. Semantics: Verify that coherent platform facilities are licensed to be used, else return H_Authority. Verify that the unit-address parameter is valid else return H_Parameter. Verify that the process-token is currently an attached process to the coherent platform function, else return H_Parameter. Verify that the coherent platform function is in an error state that allows detaching processes, else return H_State. If “ibm,process-mmio” is set to 1, verify that there are no existing mappings in the page table for the process MMIO space, else return H_Resource. If the process is not completed or suspended, the process is terminated using the process terminate procedure in the CAIA. During this process the platform can return H_Busy or H_LongBusy and the OS is responsible for calling back until a non-busy return code is returned. Remove the process from the coherent platform function process element list according to the process remove procedure defined in the CAIA. During this process the platform can return H_Busy or H_LongBusy and the OS is responsible for calling back until a non-busy return code is returned. Invalidation of the SLB and TLB for the process being detached is performed.
During this process the platform can return H_Busy or H_LongBusy and the OS is responsible for calling back until a non-busy return code is returned. If the hardware encounters an error while detaching the process, H_Hardware is returned. H_Success is returned.
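The continue-token protocol used by the attach and detach hcall()s can be sketched as a generic retry loop. The hcall is modeled here as a caller-supplied function returning (return code, R4). The RC enumeration, its members and the call_until_done name are illustrative stand-ins; the real numeric return codes are deliberately not reproduced.

```python
from enum import Enum, auto


class RC(Enum):
    """Symbolic return codes; numeric values here are placeholders."""
    H_SUCCESS = auto()
    H_BUSY = auto()
    H_LONG_BUSY = auto()
    H_HARDWARE = auto()


def call_until_done(hcall, *args, max_calls=1000):
    """Repeat an hcall until it returns a non-busy code.

    The continue-token is zero on the first call and afterwards is the
    R4 value returned by the previous busy call.
    """
    token = 0
    for _ in range(max_calls):
        rc, r4 = hcall(*args, token)
        if rc not in (RC.H_BUSY, RC.H_LONG_BUSY):
            return rc, r4
        token = r4
    raise TimeoutError("hcall still busy after max_calls attempts")
```

A real OS would normally yield or sleep before retrying after an H_LongBusy return; the sketch omits that delay for brevity.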

H_CONTROL_CA_FUNCTION This H_CONTROL_CA_FUNCTION hypervisor call allows the partition to manipulate or query certain coherent platform function behaviors. Syntax: Parameters: uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform facility uint64 operation: operation to perform to the coherent platform facility. Valid values are: Reset: operation = 1, perform a reset to the coherent platform function, this is used when the partition needs to reset the coherent platform function to a clean state. All attached processes and state are cleared by firmware after this reset. Suspend Process: operation = 2, suspend a process from being executed Resume Process: operation = 3, resume a process to be executed Read Error State: operation = 4, read the error state of the coherent platform function Get Error Buffer: operation = 5, collect the AFU error buffer for the coherent platform function. Get Function Configuration Record: operation = 6, collect configuration record for the coherent platform function Get Function Download Status: operation = 7, query to return download status of a programmable coherent platform function. Terminate Process: operation = 8, terminate the process before completion Collect VPD: operation = 9, collect VPD for the coherent platform function. Get Function Error Interrupts: operation = 11, read the function-wide error data based on an interrupt from “ibm,function-error-interrupt” Acknowledge Function Error Interrupts: operation = 12, acknowledge function-wide error data based on an interrupt from “ibm,function-error-interrupt” Get Error Log: operation = 13, retrieve the Platform Log ID (PLID) of an error log containing error data for the coherent platform function. This is used after a Temporary Unavailable or Permanently Unavailable Error State. uint64 parameter1: parameter 1 for operations, meaning changes based on the operation. 
uint64 parameter2: parameter 2 for operations, meaning changes based on the operation. uint64 parameter3: parameter 3 for operations, meaning changes based on the operation. uint64 parameter4: parameter 4 for operations, meaning changes based on the operation. uint64 continue-token: Used to continue an operation if H_Busy is returned. Set to zero on first call. If H_Busy is returned then call again but use the value returned in R4 from the previous call as the value of continue-token.
Operation Parameters
Reset: None
Suspend Process: Parameter1 = process-token as returned from H_ATTACH_CA_PROCESS when process was attached.
Resume Process: Parameter1 = process-token as returned from H_ATTACH_CA_PROCESS when process was attached.
Read Error State: None
Get Error Buffer: Parameter1 = byte offset into error buffer to retrieve, valid values are between 0 and (ibm,error-buffer-size – 1) Parameter2 = 4K aligned real address of error buffer, to be filled in Parameter3 = length of error buffer, valid values are 4K or less
Get Function Configuration Record: Parameter1 = # of configuration record to retrieve, valid values are between 0 and (ibm,#config-records – 1) Parameter2 = byte offset into configuration record to retrieve, valid values are between 0 and (ibm,config-record-size – 1) Parameter3 = 4K aligned real address of configuration record buffer, to be filled in Parameter4 = length of configuration buffer, valid values are 4K or less
Get Function Download Status: None
Terminate Process: Parameter1 = process-token as returned from H_ATTACH_CA_PROCESS when process was attached.
Collect VPD: Parameter1 = # of VPD record to retrieve, valid values are between 0 and (ibm,#config-records – 1) Parameter2 = 4K naturally aligned real buffer containing scatter/gather list entries. All fields in the scatter/gather list have big-endian byte ordering.
Parameter3 = number of entries in the scatter/gather list, valid values are between 0 and 256 Get Function Error Interrupts None Acknowledge Function Error Interrupts Parameter1 = value to write to the function-wide error interrupt register Get Error Log None Semantics: Verify that coherent platform facilities are licensed to be used, else return H_Authority. Verify that the unit-address parameter is valid else return H_Parameter. If operation is Reset: If coherent platform function is in Temporarily Unavailable or Permanently Unavailable error state or is already performing a reset, return H_State. If partition is not allowed to perform a Reset (“ibm,privileged-function” property is 0 or not present), return H_Authority. If coherent platform function has “ibm,process-mmio” property set to 1 and partition has any page table mappings existing for the function, return H_Resource. If coherent platform function is in Normal error state, set to Disabled error state. Terminate and remove all process elements that were attached via H_ATTACH_CA_PROCESS. If the termination takes longer than is allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy are returned. If allowed, perform a reset (disable AFU, PSL suspend, PSL purge, TLB invalidate, SLB invalidate) of the coherent platform function using CAIA procedures. If the reset takes longer than is allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy are returned. After the reset, if the coherent-platform-function has the “ibm,programmable” property set to 1, a download is required via H_DOWNLOAD_CA_FUNCTION. The Get Function Download Status operation can be used to query the download state. If the coherent-platform-function does not have the “ibm,programmable” property or it is set to 0, the AFU is enabled. If the reset fails while communicating with the hardware, return H_Hardware. Reset the error log data for the Get Error Log operation. 
Set coherent platform function Error State to Normal and return H_Success. If operation is Suspend Process: If the coherent platform function is not in a Normal Error State, return H_State. If the coherent platform function does not support suspending processes, return H_Function. If the process associated with the process token cannot be found, return H_Parameter. If the process is not able to be suspended or is already suspended, return H_State. The process associated with the process-token is suspended via the procedure defined in the CAIA. If the suspend takes longer than is allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy is returned. If the Suspend Process procedure encounters a hardware failure, return H_Hardware. Return H_Success. If operation is Resume Process: If the coherent platform function is not in a Normal Error State, return H_State. If the coherent platform function does not support resuming processes, return H_Function. If the process associated with the process token cannot be found, return H_Parameter. If the process is not suspended or resume is not possible, return H_State. The process associated with the process-token is resumed via the procedure defined in the CAIA. If the resume takes longer than is allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy is returned. If a hardware error occurs during the Resume Process operation, return H_Hardware. Return H_Success. If operation is Read Error State: Platform firmware checks the error state of the coherent platform function. If already in an error state, H_Success is returned and R4 contains the error state. Platform firmware checks for errors on the coherent platform function. If errors exist, error recovery is entered and H_Success is returned and R4 contains the error state. If operation is Get Error Buffer: If parameter2 does not describe a valid 4K aligned real address, return H_Parameter.
If parameter3 is greater than 4K, return H_Parameter. If parameter1 plus parameter3 is greater than or equal to “ibm,error-buffer-size”, return H_Parameter. If the coherent platform function is in a Temporarily Unavailable or Permanently Unavailable state, return H_State. Platform firmware collects the error data buffer from the AFU descriptor associated with the coherent platform function and places it in the partition buffer described by parameter2 and parameter3. If the Get Error Buffer operation exceeds the time allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy is returned. If the error buffer cannot be read from the hardware due to a hardware problem, return H_Hardware. Return H_Success. If operation is Get Function Configuration Record: If parameter1 does not describe a valid configuration record number, return H_Parameter. If parameter3 does not describe a valid 4K aligned real address, return H_Parameter. If parameter4 is greater than 4K, return H_Parameter. If parameter2 plus parameter4 is greater than or equal to “ibm,config-record-size”, return H_Parameter. If the coherent platform function is not in a Normal Error State, return H_State. If platform firmware cannot retrieve the configuration record from the coherent platform function, return H_Function. If platform firmware cannot retrieve the configuration record because the coherent platform function is not in a downloaded state, return H_NOT_AVAILABLE. Platform firmware collects the configuration record from the coherent platform function and places it in the partition buffer described by parameter3 and parameter4. The data is stored as a byte stream; the first byte in the buffer corresponds to byte 0 of the configuration record. If the Get Function Configuration Record operation exceeds the time allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy is returned.
If the configuration record cannot be read from the hardware due to a hardware problem, return H_Hardware. Return H_Success. If operation is Get Function Download Status: If coherent platform function does not support download, return H_Function. If the partition does not have the authority to get download status (“ibm,privileged-function” property is 0 or not present), return H_Authority. If the coherent platform function is not in a Normal or Disabled Error State, return H_State. Platform firmware returns the download status in R4, where 0 = no-download-found and 1 = download-found. Return H_Success. If operation is Terminate Process: If the coherent platform function is not in a Normal Error State, return H_State. If the coherent platform function does not support terminating processes, return H_Function. If the process associated with the process token cannot be found, return H_Parameter. If the process has already completed, return H_State. The process associated with the process-token is terminated via the procedure defined in the CAIA. If the attempt to terminate the process takes longer than is allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy is returned. If a hardware error occurs during the Terminate Process operation, return H_Hardware. Return H_Success. If the operation is Collect VPD: If parameter1 does not describe a valid VPD record number, return H_Parameter. If parameter2 does not describe a valid 4K aligned real address, return H_Parameter. If parameter3 is greater than 256, return H_Parameter. If a scatter/gather list entry specifies an invalid address, or specifies a buffer that crosses a page boundary, return H_SG_LIST. If the coherent platform function is not in a Normal Error State, return H_State. If platform firmware cannot retrieve the VPD from the coherent platform function, return H_Function.
If platform firmware cannot retrieve the VPD because the coherent platform function is not in a downloaded state, return H_NOT_AVAILABLE. Platform firmware collects the VPD from the coherent platform function and places it in the partition buffer described by parameter2 and parameter3. The data will be truncated as necessary to fit in the provided buffer. The data is stored as a byte stream; the first byte in the buffer corresponds to byte 0 of the VPD. If the Collect VPD operation exceeds the time allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy is returned. If a hardware error occurs during the Collect VPD operation, return H_Hardware. Return H_Success, and R4 is set to the length of the available VPD, which may be different than the amount of data actually stored in the partition buffer. It may also be different than the value reported in the “ibm,vpd-size” property, though it will not be greater than that. A length of 0 means no VPD has been provided for the coherent platform function. If the operation is Get Function Error Interrupts: If the coherent platform function is not in a Normal or Disabled Error State, return H_State. If the coherent platform function does not support Get Function Error Interrupts, return H_Function. If the Function Error Interrupts cannot be retrieved from the hardware, return H_Hardware. Platform firmware returns the value of Function Error Interrupts read from hardware in R4. Return H_Success. If the operation is Acknowledge Function Error Interrupts: If the coherent platform function is not in a Normal or Disabled Error State, return H_State. If the coherent platform function does not support Acknowledge Function Error Interrupts, return H_Function. Acknowledge Function Error Interrupts using the value in parameter1. If the Acknowledge Function Error Interrupts cannot be sent to the hardware, return H_Hardware. Return H_Success.
If operation is Get Error Log: If the coherent platform function is not in Disabled or Permanently Unavailable Error State, return H_State. If applicable, platform firmware writes the Platform Log ID (PLID) in R4 for the error log that is associated with the cause of the Temporarily Unavailable or Permanently Unavailable Error State. This data is used to correlate errors between the platform owned resource and the coherent platform function. If there is no associated error log to reference, platform firmware writes zero to R4. Return H_Success. If operation is unknown, return H_Not_Found.
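The operation codes and the Get Error Buffer parameter checks above can be written down as in the following sketch. The enumeration and function names are hypothetical; the numeric operation values are those listed in the parameter description (note that no operation 10 is defined there). The bounds test mirrors the "greater than or equal to" wording of the semantics exactly as stated.

```python
from enum import IntEnum


class CaOperation(IntEnum):
    RESET = 1
    SUSPEND_PROCESS = 2
    RESUME_PROCESS = 3
    READ_ERROR_STATE = 4
    GET_ERROR_BUFFER = 5
    GET_FUNCTION_CONFIG_RECORD = 6
    GET_FUNCTION_DOWNLOAD_STATUS = 7
    TERMINATE_PROCESS = 8
    COLLECT_VPD = 9
    GET_FUNCTION_ERROR_INTERRUPTS = 11
    ACK_FUNCTION_ERROR_INTERRUPTS = 12
    GET_ERROR_LOG = 13


def is_known_operation(value):
    """Unknown operations are answered with H_Not_Found."""
    try:
        CaOperation(value)
        return True
    except ValueError:
        return False


def precheck_get_error_buffer(offset, buffer_addr, length, error_buffer_size):
    """Return 'H_Parameter' if the call would be rejected, else None."""
    if buffer_addr % 4096 != 0:                  # parameter2: 4K aligned
        return "H_Parameter"
    if length > 4096:                            # parameter3: 4K or less
        return "H_Parameter"
    if offset + length >= error_buffer_size:     # as worded in the semantics
        return "H_Parameter"
    return None
```

Such a precheck only avoids predictable rejections; state and authority checks remain with platform firmware.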

H_COLLECT_CA_INT_INFO The architectural intent of this hcall is to collect interrupt info about a coherent platform function after an interrupt occurred. Syntax: Parameters: uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform facility uint64 process-token: process identifier token for the attached process returned in R4 on H_Success return from H_ATTACH_CA_PROCESS call. Return Values: R4 contains the PSL_DSISR_An register value defined in the CAIA on H_Success. R5 contains the PSL_DAR_An register value defined in the CAIA on H_Success. R6 contains the PSL_DSR_An register value defined in the CAIA on H_Success. R7 contains the PSL_PID_An in the upper 32 bits and PSL_TID_An register in the lower 32 bits. R8 contains the AFU_ERR_An register value defined in the CAIA on H_Success. R9 contains the PSL_ErrStat_An register value defined in the CAIA on H_Success. R10 contains a handle for the process element that incurred the fault on H_Success. Semantics: Verify that coherent platform facilities are licensed to be used, else return H_Authority. Verify that the unit-address parameter is valid else return H_Parameter. Verify that the process-token parameter is valid else return H_Parameter. Verify that the coherent platform function is in the proper state to read interrupt information else return H_State. Platform firmware reads the values of PSL_DSISR_An, PSL_DAR_An, PSL_DSR_An, PSL_PID_An, PSL_TID_An, AFU_ERR_An and PSL_ErrStat_An as defined by the CAIA and populates the return registers. AFU_ERR_An value is only valid if PSL_DSISR[AE] is 1 or PSL_SERR_An[AE] is 1. PSL_ErrStat_An value is only valid if PSL_DSISR[PE] is 1. If any of the reads from the hardware fail, H_Hardware is returned and none of the return registers should be considered valid. H_Success is returned.
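The return registers of H_COLLECT_CA_INT_INFO can be gathered into a named structure as sketched below, including the split of R7 into PSL_PID_An (upper 32 bits) and PSL_TID_An (lower 32 bits). The function name and dictionary keys are illustrative only.

```python
def decode_int_info(r4, r5, r6, r7, r8, r9, r10):
    """Name the H_COLLECT_CA_INT_INFO return registers (valid on H_Success)."""
    return {
        "psl_dsisr": r4,
        "psl_dar": r5,
        "psl_dsr": r6,
        "psl_pid": (r7 >> 32) & 0xFFFFFFFF,   # upper 32 bits of R7
        "psl_tid": r7 & 0xFFFFFFFF,           # lower 32 bits of R7
        "afu_err": r8,          # meaningful only when an AE bit is set
        "psl_errstat": r9,      # meaningful only when PSL_DSISR[PE] is set
        "process_handle": r10,  # passed on to H_CONTROL_CA_FAULTS
    }
```

The sketch does not test the AE and PE bits, because their positions are defined by the CAIA rather than by this section.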

H_CONTROL_CA_FAULTS The architectural intent of this hcall is to control the operation of a coherent platform function after a fault occurs. Syntax: Parameters: uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform facility uint64 operation: operation to perform to the coherent platform facility. Valid values are: Respond to page fault - PSL: operation = 1. Respond to page fault - AFU: operation = 2. uint64 parameter1: parameter 1 for operations, meaning changes based on the operation. uint64 parameter2: parameter 2 for operations, meaning changes based on the operation. uint64 parameter3: parameter 3 for operations, meaning changes based on the operation. uint64 parameter4: parameter 4 for operations, meaning changes based on the operation. Operation Parameters Respond to page fault - PSL Parameter1 = process-token as returned from H_ATTACH_CA_PROCESS Parameter2 = control-mask bits 0-59: reserved bit 60: acknowledge non-translation fault interrupt bit 61: continue execution, current translation fault is not resolved and must be retried at a later time bit 62: restart function and indicate address error bit 63: restart the transaction that caused the translation fault Parameter3 = reset-mask bit 0-62: reserved bit 63: reset fault bits for a PSL level process error (PSL_DSISR_An[PE] is set) Respond to page fault - AFU Parameter1 = process-token Parameter2 = process element handle returned from H_COLLECT_CA_INT_INFO. Parameter3 = effective address Parameter4 = resolution, valid values are: 0x0 -- Page Fault Resolved 0x1 -- Addressing Error 0x2 -- Protection Fault on a Read operation 0x3 -- Protection Fault on a Write operation Semantics: Verify that coherent platform facilities are licensed to be used, else return H_Authority. Verify that the unit-address parameter is valid else return H_Parameter. If operation is Respond to page fault - PSL: Verify that the process-token parameter is valid else return H_Parameter. 
Verify that the coherent platform function is in a valid state else return H_State. Using the control-mask set the corresponding bits in PSL_TFC_An as defined by CAIA. Only bits that are set are written. If no bits are set, no changes are performed. If the setting of the bits in the hardware encounters an error, return H_Hardware. If bit 63 of the reset-mask is set, clear the PSL_ErrStat_An bits by reading the register and writing back the value read. If this operation encounters an error with the hardware, return H_Hardware. Perform a read from PSL_TFC_An and place corresponding values in R4. If the read fails, return H_Hardware. Return H_Success with the following in R4: bits 0-60: reserved bit 61: function waiting to continue bit 62: address error pending bit 63: command reissue pending If operation is Respond to page fault - AFU: Verify that the process-token parameter is valid else return H_Parameter. Verify that the resolution parameter is valid else return H_Parameter. Verify that the coherent platform function is in a valid state else return H_State. Verify that the coherent platform function supports paged resolution response (via “ibm,supports-prr” OF property), else return H_Function. Write the effective address and resolution to the corresponding fields in the PRR registers of the AFU. If this operation encounters an error with the hardware, return H_Hardware. Return H_Success. If operation is unknown, return H_Not_Found.
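The control-mask and the R4 status bits of the "Respond to page fault - PSL" operation can be expressed as follows. The sketch assumes MSB 0 bit numbering for the 64-bit values, so bit 63 is the least significant bit. All names are hypothetical.

```python
def ibm_bit(n, width=64):
    """Mask for bit n under MSB 0 numbering (bit 0 is most significant)."""
    return 1 << (width - 1 - n)


# control-mask bits (Parameter2)
ACK_NON_TRANSLATION_FAULT = ibm_bit(60)
CONTINUE_EXECUTION = ibm_bit(61)
RESTART_WITH_ADDRESS_ERROR = ibm_bit(62)
RESTART_TRANSACTION = ibm_bit(63)

# reset-mask bit (Parameter3)
RESET_PSL_PROCESS_ERROR = ibm_bit(63)


def decode_tfc_status(r4):
    """Decode the R4 value returned with H_Success."""
    return {
        "waiting_to_continue": bool(r4 & ibm_bit(61)),
        "address_error_pending": bool(r4 & ibm_bit(62)),
        "command_reissue_pending": bool(r4 & ibm_bit(63)),
    }
```

For example, a caller that has resolved a translation fault would pass RESTART_TRANSACTION as the control-mask and leave every reserved bit zero.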

H_DOWNLOAD_CA_FUNCTION The architectural intent of this hcall is to provide platform support for downloading an application image to the coherent platform function. The partition provides download data to the platform via an image scatter/gather list. The scatter/gather list can architecturally describe up to 1 megabyte of data (256 entries of 4096 bytes each). The OS must subdivide the application image into chunks that are each 1 megabyte or less in size, and call H_DOWNLOAD_CA_FUNCTION for each of those chunks. Syntax: Parameters: uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform facility uint64 scatter-gather-list-address: 4K naturally aligned real buffer containing scatter/gather list entries. All fields in the scatter/gather list and all fields in the image header have big-endian byte ordering. uint64 num-scatter-gather-list-entries: number of entries in the scatter/gather list uint64 continue-token: Used to continue an operation if H_Busy or H_CONTINUE is returned. Set to zero on first call. If H_Busy or H_CONTINUE is returned then call again but use the value returned in R4 from the previous call as the value of continue-token.
Image Scatter/Gather List Entry Format
8 byte logical real address of buffer
8 byte buffer length in bytes (max length is 4096 bytes)

Image Scatter/Gather List Format
Logical real address of buffer 0      Buffer 0 length in bytes
Logical real address of buffer 1      Buffer 1 length in bytes
...                                   ...
Logical real address of buffer N-1    Buffer N-1 length in bytes
Logical real address of buffer N      Buffer N length in bytes
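An image scatter/gather list in the format above can be assembled as in this sketch: each entry is an 8 byte logical real address followed by an 8 byte length, both big-endian, with at most 256 entries. Rejecting zero-length entries is an assumption of the sketch, and the function name is hypothetical.

```python
import struct

PAGE_SIZE = 4096
MAX_ENTRIES = 256


def build_sg_list(buffers):
    """Pack (logical_real_address, length) pairs into a scatter/gather list."""
    if len(buffers) > MAX_ENTRIES:
        raise ValueError("a scatter/gather list holds at most 256 entries")
    out = bytearray()
    for addr, length in buffers:
        if not 0 < length <= PAGE_SIZE:
            raise ValueError("buffer length must be 1..4096 bytes")
        if addr // PAGE_SIZE != (addr + length - 1) // PAGE_SIZE:
            raise ValueError("buffer must not cross a 4K page boundary")
        out += struct.pack(">QQ", addr, length)
    return bytes(out)
```

The packed list itself must be placed in a 4K naturally aligned buffer before its address is passed to the hcall; a full list of 256 entries occupies exactly 4096 bytes.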

Application Image Header, Version 1
Name                 Offset  Length  Description
Version              0       2       Version of the AFU image header, value = 1
Function Number      2       2       Physical function number that the application uses
Application ID       4       2       Application identifier
Reserved             6       2       Set to zero
Vendor ID            8       2       PCI Vendor ID of the adapter owning the coherent platform function
Device ID            10      2       PCI Device ID of the adapter owning the coherent platform function
Subsystem Vendor ID  12      2       PCI Subsystem Vendor ID of the adapter owning the coherent platform function
Subsystem ID         14      2       PCI Subsystem ID of the adapter owning the coherent platform function
Image Offset         16      8       Offset to the application image bitstream
Image Length         24      8       Length of the application image bitstream
Verification Type    32      2       Type of verification required for image: 1 = Bounds Check; all other values reserved
Reserved             34      6       Set to zero
CAIA Version         40      2       Minimum CAIA Version required by this application image
PSL Revision         42      2       Minimum PSL Revision required by this application image
Reserved             44      84      Set to zero
Image Bitstream      X       Y       Application image bitstream, where X = Image Offset and Y = Image Length
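The version 1 header layout above adds up to 128 bytes and can be packed as in the following sketch, using big-endian fields as required for the image header. The function and parameter names are illustrative.

```python
import struct

# 8 halfwords (offsets 0-15), two doublewords (16-31), verification type (32),
# 6 reserved bytes (34-39), CAIA version (40), PSL revision (42),
# 84 reserved bytes (44-127)
HEADER_V1 = struct.Struct(">8H QQ H 6x HH 84x")


def pack_header_v1(function_number, application_id, vendor_id, device_id,
                   subsystem_vendor_id, subsystem_id, image_offset,
                   image_length, caia_version, psl_revision,
                   verification_type=1):
    """Pack an Application Image Header, Version 1 (128 bytes)."""
    return HEADER_V1.pack(
        1,                    # Version
        function_number,
        application_id,
        0,                    # Reserved, set to zero
        vendor_id,
        device_id,
        subsystem_vendor_id,
        subsystem_id,
        image_offset,
        image_length,
        verification_type,    # 1 = Bounds Check
        caia_version,
        psl_revision,
    )
```

The reserved regions are emitted as zero bytes by the pad codes in the format string, so the caller never has to supply them.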

Return Values:
R4 on H_Busy, H_LongBusy, or H_CONTINUE contains the continue-token to be used on the next call.

Semantics:
Verify that coherent platform facilities are licensed to be used, else return H_Authority.
Verify that the unit-address parameter is valid, else return H_Parameter.
If the coherent platform facility cannot be downloaded at this time due to a resource constraint, return H_Resource.
If the coherent platform facility does not support download, return H_Function.
If the coherent platform function is already downloaded, or if a download is in progress, return H_State.
If the partition does not have the authority to perform download (“ibm,privileged-function” property is 0 or not present), return H_Authority.
If the coherent platform facility is in a Temporary Unavailable Error State or has attached processes, return H_State.
If the scatter-gather-list-address does not describe a 4K byte naturally aligned buffer, return H_Parameter.
If the Application Image Header version is not supported by platform firmware, return H_BAD_DATA.
If necessary, platform firmware disables the coherent platform facility from operation.
For each entry in the scatter/gather list described by scatter-gather-list-address:
    Platform firmware validates the address and length in the scatter/gather list entry. The buffer described should not cross a 4K page boundary. If the entry is invalid, return H_SG_LIST.
    Platform firmware copies data from the scatter/gather list entry to the platform firmware buffer.
Platform firmware verifies the image bitstream data chunk in the platform buffer.
    If platform firmware determines that the image bitstream data chunk is not valid, return H_BAD_DATA.
    During this operation, H_Busy or H_LongBusy can be returned due to hcall maximum time limits; the partition should call back until a non-busy return code is returned.
Platform firmware performs the download for the coherent platform facility, using the image bitstream data chunk.
    During this operation, H_Busy or H_LongBusy can be returned due to hcall maximum time limits; the partition should call again until a non-busy return code is returned.
    If the coherent platform facility does not accept the download of the image bitstream data chunk, or an error occurs while communicating with the hardware, return H_Hardware.
If the hcall time limit is exceeded but more data is left to copy in the current scatter/gather list, H_Busy or H_LongBusy is returned. The partition should call back with the current scatter/gather list.
Once every entry in the current scatter/gather list is copied, platform firmware returns H_CONTINUE. The partition then calls back with a new scatter/gather list for the next chunk of image data, and the previous steps are repeated for each new list. This is repeated as long as H_CONTINUE is returned.
The CAIA AFU descriptor is read for the downloaded AFU; if any fields in the AFU descriptor are not compatible with the PSL, return H_UNSUPPORTED.
If the download operation completes successfully, platform firmware re-enables the coherent platform function for operation if necessary, and H_Success is returned.

Any error in the above steps causes the download to be aborted. To complete the download, the partition must retry H_DOWNLOAD_CA_FUNCTION, starting with the Application Image Header.

After H_DOWNLOAD_CA_FUNCTION is performed, the partition should call ibm,update-nodes and ibm,update-properties to receive the current configuration for the coherent platform facility.

When H_DOWNLOAD_CA_FUNCTION is first called, some AFU or adapter resources may be reserved for use during the download sequence, which may span multiple H_DOWNLOAD_CA_FUNCTION calls, until the image download is complete as indicated by a return of H_Success.
When H_CONTINUE is returned, indicating that more data is needed for the complete AFU image, the OS must call H_DOWNLOAD_CA_FUNCTION again within 3 seconds, or the download sequence may be abandoned and the OS may need to reset the AFU and restart the download sequence from the beginning.
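The calling pattern described above (repeat the same list on a busy return, advance to the next list on H_CONTINUE, stop on H_Success) can be sketched as a small loop. This is an illustration under stated assumptions: hcall is a hypothetical callable standing in for the real hypervisor call and returning a (status, R4) pair, status values are represented by their names rather than architected numeric codes, and the time window for the follow-up call is not modelled.

```python
H_SUCCESS = "H_Success"
H_BUSY = "H_Busy"
H_LONG_BUSY = "H_LongBusy"
H_CONTINUE = "H_CONTINUE"


def download_afu(hcall, sg_lists):
    """Drive one download sequence over a list of scatter/gather lists.

    hcall(sg_list, continue_token) -> (status, r4) is a hypothetical
    stand-in for H_DOWNLOAD_CA_FUNCTION. The first scatter/gather list
    is expected to start with the Application Image Header.
    Returns the number of scatter/gather lists consumed.
    """
    token = 0          # continue-token is zero on the first call
    index = 0
    while index < len(sg_lists):
        status, r4 = hcall(sg_lists[index], token)
        if status in (H_BUSY, H_LONG_BUSY):
            token = r4             # call back with the same list
        elif status == H_CONTINUE:
            token = r4             # firmware wants the next chunk
            index += 1
        elif status == H_SUCCESS:
            return index + 1
        else:
            # Any other status aborts the download; the caller has to
            # restart from the Application Image Header.
            raise RuntimeError("download aborted: %s" % status)
    raise RuntimeError("firmware requested more data than was supplied")
```

Raising on every non-busy error keeps the restart decision with the caller, which matches the requirement that a retry begins again from the image header.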

H_DOWNLOAD_CA_FACILITY

The architectural intent of this hcall is to provide platform support for downloading a base adapter image to the coherent platform facility, and for validating the entire image after the download. The partition provides download data to the platform via an image scatter/gather list. The scatter/gather list can architecturally describe up to 1 megabyte of data (256 entries of 4096 bytes each). The OS must subdivide the base adapter image into chunks that are each 1 megabyte or less in size, and call H_DOWNLOAD_CA_FACILITY for each of those chunks.

Base adapter image download requires two separate operations. The first is the download operation, which processes the entire image, possibly returning H_CONTINUE a number of times, and completes when H_Success is returned. The second is the validate operation, which again processes the entire image with a number of H_CONTINUE returns until it completes with H_Success. The base adapter image is not usable until both operations have completed successfully.

Syntax:

Parameters:
uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform facility.
uint64 operation: Operation to perform on the coherent platform facility. Valid values are:
    Download: operation = 1. The base image in the coherent platform facility is first erased, and then programmed using the image supplied in the scatter/gather list.
    Validate: operation = 2. The base image in the coherent platform facility is compared with the image supplied in the scatter/gather list.
uint64 scatter-gather-list-address: 4K naturally aligned real buffer containing scatter/gather list entries. The format of the scatter/gather list is the same as for the H_DOWNLOAD_CA_FUNCTION hcall. All fields in the scatter/gather list and all fields in the image header have big-endian byte ordering.
uint64 num-scatter-gather-list-entries: Number of block list entries in the scatter/gather list.
uint64 continue-token: Used to continue an operation if H_Busy or H_CONTINUE is returned. Set to zero on the first call. If H_Busy or H_CONTINUE is returned, call again using the value returned in R4 from the previous call as the value of continue-token.

Base Adapter Image Header, Version 1

Name                 Offset  Length  Description
Version              0       2       Version of the base adapter image header, value = 1
Reserved             2       6       Set to zero
Vendor ID            8       2       PCI Vendor ID of the coherent platform facility
Device ID            10      2       PCI Device ID of the coherent platform facility
Subsystem Vendor ID  12      2       PCI Subsystem Vendor ID of the coherent platform facility
Subsystem ID         14      2       PCI Subsystem ID of the coherent platform facility
Image Offset         16      8       Offset to the base adapter image bitstream
Image Length         24      8       Length of the base adapter image bitstream
Reserved             32      96      Set to zero
Image Bitstream      X       Y       Base adapter image bitstream, where X = Image Offset and Y = Image Length
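The 1 megabyte limit per call follows from 256 entries of at most 4096 bytes each. The sketch below shows one way an OS might subdivide an image into scatter/gather lists that respect both limits. It is illustrative only: split_into_sg_lists is a hypothetical helper, the image is assumed to be contiguous in real memory starting at base_addr, and entries are modelled as (address, length) tuples rather than in their in-memory format.

```python
PAGE_SIZE = 4096      # a described buffer must not cross a 4K page boundary
MAX_ENTRIES = 256     # entries per scatter/gather list


def split_into_sg_lists(base_addr, length):
    """Return a list of scatter/gather lists covering [base_addr, base_addr+length).

    Each inner list holds at most MAX_ENTRIES (address, length) tuples and
    no tuple describes a buffer that crosses a 4K page boundary, so one
    list never describes more than 1 megabyte.
    """
    lists, current = [], []
    addr, remaining = base_addr, length
    while remaining > 0:
        room_in_page = PAGE_SIZE - (addr % PAGE_SIZE)
        size = min(room_in_page, remaining)
        current.append((addr, size))
        addr += size
        remaining -= size
        if len(current) == MAX_ENTRIES:
            lists.append(current)   # this list is full; start the next one
            current = []
    if current:
        lists.append(current)
    return lists
```

Each returned list corresponds to one chunk, that is, one scatter/gather list handed to the hcall; busy returns repeat the same list.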

Return Values:
R4 on H_Busy, H_LongBusy, or H_CONTINUE contains the continue-token to be used on the next call.

Semantics:
Verify that coherent platform facilities are licensed to be used, else return H_Authority.
Verify that the unit-address parameter is valid, else return H_Parameter.
If the coherent platform facility cannot be downloaded at this time due to a resource constraint, return H_Resource.
If the coherent platform facility does not support download, return H_Function.
If a download is in progress for the coherent platform facility, return H_State.
If the partition does not have the authority to perform download (“ibm,privileged-function” property is 0 or not present), return H_Authority.
If the coherent platform facility is in a Temporary Unavailable Error State, return H_State.
If the scatter-gather-list-address does not describe a 4K byte naturally aligned buffer, return H_Parameter.
If the Base Adapter Image Header version is not supported by platform firmware, return H_BAD_DATA.
If necessary, platform firmware disables the coherent platform facility from operation.
For each entry in the scatter/gather list described by scatter-gather-list-address:
    Platform firmware validates the address and length in the scatter/gather list entry. The buffer described should not cross a 4K page boundary. If the entry is invalid, return H_SG_LIST.
    Platform firmware copies data from the scatter/gather list entry to the platform firmware buffer.
Platform firmware verifies the image bitstream data chunk in the platform buffer.
    If platform firmware determines that the image bitstream data chunk is not valid, return H_BAD_DATA.
    During this operation, H_Busy or H_LongBusy can be returned due to hcall maximum time limits; the partition should call back until a non-busy return code is returned.
Platform firmware performs the download or validate operation for the coherent platform facility, using the image bitstream data chunk.
    During this operation, H_Busy or H_LongBusy can be returned due to hcall maximum time limits; the partition should call again until a non-busy return code is returned.
    If the coherent platform facility does not accept the download of the image bitstream data chunk, or an error occurs while communicating with the hardware, return H_Hardware.
If the hcall time limit is exceeded but more data is left to copy in the current scatter/gather list, H_Busy or H_LongBusy is returned. The partition should call back with the current scatter/gather list.
Once every entry in the current scatter/gather list is copied, platform firmware returns H_CONTINUE. The partition then calls back with a new scatter/gather list for the next chunk of image data, and the previous steps are repeated for each new list. This is repeated as long as H_CONTINUE is returned.
If the validate operation completes successfully, platform firmware re-enables the coherent platform facility for operation if necessary, and H_Success is returned.

Any error in the above steps causes the download to be aborted. To complete the download, the partition must retry both H_DOWNLOAD_CA_FACILITY operations, including the Base Adapter Image Header for each operation.

After H_DOWNLOAD_CA_FACILITY is performed, the partition should call ibm,update-nodes and ibm,update-properties to receive the current configuration for the functions under this coherent platform facility.

When H_DOWNLOAD_CA_FACILITY is first called, some adapter resources may be reserved for use during the download sequence, which may span multiple H_DOWNLOAD_CA_FACILITY calls, until the image download is complete as indicated by a return of H_Success. When H_CONTINUE is returned, indicating that more data is needed for the complete image, the OS must call H_DOWNLOAD_CA_FACILITY again within 3 seconds, or the download sequence may be abandoned and the OS may need to reset the facility and restart the download sequence from the beginning.
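Because an error in either operation means that both operations are retried, a caller typically wraps the download and validate passes in one retry unit. The following sketch illustrates that ordering only. The names run_operation and program_base_image, the hcall callable returning (status, R4), and the max_attempts limit are assumptions of this example, and status values are represented by their names.

```python
DOWNLOAD, VALIDATE = 1, 2   # operation values from the parameter list


def run_operation(hcall, operation, sg_lists):
    """Feed every scatter/gather list to one operation until H_Success."""
    token, index = 0, 0
    while True:
        status, r4 = hcall(operation, sg_lists[index], token)
        if status == "H_Success":
            return
        if status == "H_CONTINUE":
            index += 1                      # next chunk of the image
        elif status not in ("H_Busy", "H_LongBusy"):
            raise RuntimeError(status)      # sequence is aborted
        if index >= len(sg_lists):
            raise RuntimeError("firmware requested more data than supplied")
        token = r4


def program_base_image(hcall, sg_lists, max_attempts=3):
    """Run download then validate; on any error restart with download.

    Returns the attempt number that succeeded. max_attempts is a policy
    choice of this sketch, not something the architecture specifies.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            run_operation(hcall, DOWNLOAD, sg_lists)
            run_operation(hcall, VALIDATE, sg_lists)
            return attempt
        except RuntimeError as err:
            last_error = err
    raise RuntimeError("base image not programmed: %s" % last_error)
```

Both passes are given the same scatter/gather lists, starting with the Base Adapter Image Header, since the validate operation compares the programmed image with the supplied one.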

H_CONTROL_CA_FACILITY

The H_CONTROL_CA_FACILITY hypervisor call allows the partition to manipulate or query certain coherent platform facility behaviors.

Syntax:

Parameters:
uint64 unit-address: Unit Address per the device tree “reg” property of the coherent platform facility.
uint64 operation: Operation to perform on the coherent platform facility. Valid values are:
    Reset: operation = 1. Initiate a reset of the coherent platform facility. This is used when the partition needs to reset the coherent platform facility and all of its child coherent platform functions to a clean state. All attached processes and state are cleared by firmware after this reset. If a new base adapter image has been downloaded, that image will be activated.
    Collect VPD: operation = 2. Collect VPD for the coherent platform facility.
uint64 parameter1: Parameter 1 for operations; its meaning depends on the operation.
uint64 parameter2: Parameter 2 for operations; its meaning depends on the operation.
uint64 parameter3: Parameter 3 for operations; its meaning depends on the operation.
uint64 parameter4: Parameter 4 for operations; its meaning depends on the operation.
uint64 continue-token: Used to continue an operation if H_Busy is returned. Set to zero on the first call. If H_Busy is returned, call again using the value returned in R4 from the previous call as the value of continue-token.

Operation     Parameters
Reset         None
Collect VPD   Parameter1 = 4K naturally aligned real buffer containing scatter/gather list entries. All fields in the scatter/gather list have big-endian byte ordering.
              Parameter2 = number of entries in the scatter/gather list; valid values are between 0 and 256.

Semantics:
Verify that coherent platform facilities are licensed to be used, else return H_Authority.
Verify that the unit-address parameter is valid, else return H_Parameter.
If operation is Reset:
    If the coherent platform facility is in Temporarily Unavailable error state or is already performing a reset, return H_State.
    If the partition is not allowed to perform a Reset (“ibm,privileged-facility” property is 0 or not present), return H_Authority.
    Set Temporarily Unavailable error state for the coherent platform facility and all child coherent platform functions.
    Initiate reset of the coherent platform facility.
    If the Reset operation exceeds the time allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy is returned.
    Return H_Success.
If operation is Collect VPD:
    If parameter1 does not describe a valid 4K aligned real address, return H_Parameter.
    If parameter2 is greater than 256, return H_Parameter.
    If a scatter/gather list entry specifies an invalid address, or specifies a buffer that crosses a page boundary, return H_SG_LIST.
    If the coherent platform facility is not in a Normal Error State, return H_State.
    If platform firmware cannot retrieve the VPD from the coherent platform facility, return H_Function.
    Platform firmware collects the VPD from the coherent platform facility and places it in the partition buffer described by parameter1 and parameter2. The data is truncated as necessary to fit in the provided buffer. The data is stored as a bytestream; the first byte in the buffer corresponds to byte 0 of the VPD.
    If the Collect VPD operation exceeds the time allowed for an hcall, R4 is set to the continue-token and H_Busy or H_LongBusy is returned.
    If a hardware error occurs during the Collect VPD operation, return H_Hardware.
    Return H_Success, with R4 set to the length of the available VPD, which may differ from the amount of data actually stored in the partition buffer. It may also differ from the value reported in the “ibm,vpd-size” property, though it will not be greater than that. A length of 0 means no VPD has been provided for the coherent platform facility.
If operation is unknown, return H_Not_Found.
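For Collect VPD, two details are easy to get wrong on the caller side: the parameter checks that lead to H_Parameter or H_SG_LIST, and the fact that R4 reports the available VPD length rather than the amount stored. The sketch below is a caller-side illustration under assumptions: precheck_collect_vpd and vpd_bytes_stored are hypothetical helper names, scatter/gather entries are modelled as (address, length) tuples, and address validity beyond alignment and page crossing is not checked.

```python
PAGE_SIZE = 4096
MAX_ENTRIES = 256


def precheck_collect_vpd(sg_list_addr, sg_entries):
    """Return the status the described checks would yield, or None if they pass.

    Only the alignment, entry-count and page-crossing checks from the
    semantics are mirrored here; firmware performs further checks.
    """
    if sg_list_addr % PAGE_SIZE != 0:
        return "H_Parameter"            # parameter1 not 4K aligned
    if len(sg_entries) > MAX_ENTRIES:
        return "H_Parameter"            # parameter2 greater than 256
    for addr, length in sg_entries:
        if (addr % PAGE_SIZE) + length > PAGE_SIZE:
            return "H_SG_LIST"          # buffer crosses a page boundary
    return None


def vpd_bytes_stored(r4_length, sg_entries):
    """Number of VPD bytes actually present in the partition buffers.

    R4 on H_Success is the length of the available VPD; the stored data
    is truncated to the buffer capacity described by the list.
    """
    capacity = sum(length for _addr, length in sg_entries)
    return min(r4_length, capacity)
```

If vpd_bytes_stored returns less than the R4 value, the buffers were too small and the caller can repeat the operation with a larger scatter/gather list.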