Virtualized Input/Output

Virtualized I/O is an optional feature of platforms that have hypervisor support. Virtual I/O (VIO) provides to a given partition the appearance of I/O adapters that do not have a one-to-one correspondence with a physical IOA. The hypervisor uses one of three techniques to realize a virtual IOA:

- In the hypervisor simulated class, the hypervisor may totally simulate the adapter. For example, this is used in the virtual ethernet (IEEE VLAN) support (see ). This technique is applicable to communications between partitions that are created by a single hypervisor instance.

- In the partition managed class, a server partition provides the services of one of its IOAs to one or more partner partitions (one or more client partitions, or one or more server partitions). (The term “hosted” is sometimes used for “client” and the term “hosting” is sometimes used for “server.” Note that a server IOA or partition can sometimes also be a client, and vice versa, so the terminology “client” and “server” tends to be less confusing than hosted and hosting.) In limited cases, a client may communicate directly with a client. A server partition provides support to interpret I/O requests from the partner partition, perform those requests on one or more of its devices targeting the partner partition's DMA buffer areas (for example, by using the Remote DMA (RDMA) facilities), and pass I/O responses back to the partner partition. For example, see .

- In the hypervisor managed class, the hypervisor may provide low level hardware management (error and sub-channel allocation) so that partition level code may directly manage its assigned sub-channels.

This chapter is organized from general to specific. The overall structure of this architecture is as shown in .

Terminology used with VIO

Besides the general terminology defined on the first page of this chapter, the following table of terms will assist the reader in understanding the content of this chapter.
Terminology used with VIO

VIO
    Virtual I/O. General term for all virtual I/O classes and virtual IOAs.

ILLAN
    Interpartition Logical LAN. This option uses the hypervisor simulated class of virtual I/O to provide partition-to-partition LAN facilities without a real LAN IOA. See .

VSCSI
    Virtual SCSI. This option provides the facilities for sharing physical SCSI type IOAs between partitions. See .

Client (Client VIO model)
    This terminology is mainly used with the partition managed class of VIO. The client, or client partition, is an entity which generally requests, of a server partition, access to I/O to which it does not have direct access (that is, access to I/O which is under the control of the server partition). Unlike the server, the client does not provide services to other partitions to share the I/O which resides in its partition. However, it is possible for the same partition to be both a server and a client partition, but under different virtual IOAs. The Client VIO model is one where the client partition maps part of its local memory into an RTCE table (as defined by the first window pane of the “ibm,my-dma-window” property), so that the server partition can get access to that client's local memory. An example of this is the VSCSI client (see  for more information).

Server (Server VIO model)
    This terminology is mainly used with the partition managed class of VIO. The server, or server partition, is an entity which provides a method of sharing the resources under its direct control with another partition, virtualizing those resources in the process. The following defines the Server VIO model: the server is a server to a client. An example of this is the VSCSI server (see ). In this case, the Server VIO model is one where the server gets access to the client partition's local memory via what the client mapped into an RTCE table. This access is done through the second window pane of the server's “ibm,my-dma-window” property, which is linked to the first window pane of the client's “ibm,my-dma-window” property.

Partner partition
    This is “the other” partition in a pair of partitions which are connected via a virtual IOA pair. For client partitions, the partner is generally the server (although, in limited cases, client to client connections may be possible). For server partitions, the partner can be a client partition or another server partition.

RTCE table
    Remote DMA TCE table. TCE (Translation Control Entry) and RTCE tables are used to translate I/O DMA operations and provide protection against improper operations (access to what should not be accessed, or protection against improper access modes, like writing to a read-only page). More information on TCEs and TCE tables, which are used for physical IOAs, can be found in . The RTCE table for Remote DMA (RDMA) is analogous to the TCE table for physical IOAs. The RTCE table does, however, have a little more information in it (as placed there by the hypervisor) in order to, among other things, allow the hypervisor to create links to physical IOA TCEs that were created from the RTCE table TCEs. A TCE in an RTCE table is never accessed directly by the partition's software; only through hypervisor hcall()s. For more information on RTCE tables and operations, see , and .

Window pane (“ibm,my-dma-window” property)
    The RTCE tables for VIO DMA are pointed to by the “ibm,my-dma-window” property in the device tree for each virtual device. This property can have one, two, or three triples, each consisting of a Logical I/O Bus Number (LIOBN), a phys value (which is 0), and a size. The LIOBN essentially points to a unique RTCE table (or a unique entry point into a single table). The phys value of 0 indicates that offsets start at 0. The size is the size of the available address space for mapping memory into the RTCE table. This architecture refers to these unique RTCE tables as window panes within the “ibm,my-dma-window” property. Thus, there can be up to three window panes for each virtual IOA, depending on the type of IOA. For more on usage of the window panes, see .

RDMA
    Remote Direct Memory Access: DMA transfer from the server to its client or from the server to its partner partition. DMA refers both to physical I/O to/from memory operations and to memory-to-memory move operations.

Copy RDMA
    This term refers to when the hypervisor is used (possibly with hardware assist) to move data between server partition and client partition memories, or between server partition and partner partition memories. See .

Redirected RDMA
    This term refers to when the TCE(s) for a physical IOA are set up through the use of the RTCE table manipulation hcall()s (for example, H_PUT_RTCE) such that the client or partner partition's RTCE table (through the second window pane of the server partition) is used by the hypervisor during the processing of the hcall() to set up the TCE(s) for the physical IOA, and the physical IOA then DMAs directly to or from the client or partner partition's memory. See  for more information.

LRDMA
    Stands for Logical Remote DMA and refers to the set of facilities for synchronous RDMA operations. See  for more information. LRDMA is a separate option.

Command/Response Queue (CRQ)
    The CRQ is a facility which is used to communicate between partner partitions. Transport events which are signaled from the hypervisor to the partition are also reported in this queue.

Subordinate CRQ (Sub-CRQ)
    Similar to the CRQ, except with notable differences (see ).

Reliable Command/Response Transport
    This is the CRQ facility used for synchronous VIO operations to communicate between partner partitions. Several hcall()s are defined which allow a partition to place an entry on the partner partition's queue. The firmware can also place transport change-of-status messages into the queue to notify a partition when the connection has been lost (for example, due to the other partition crashing or deregistering its queue). See  for more information.

Subordinate CRQ Transport
    This is the Sub-CRQ facility used for synchronous VIO operations to communicate between partner partitions when the CRQ facility by itself is not sufficient. The Subordinate CRQ Transport never exists without a corresponding Reliable Command/Response Transport. See  for more information.
VIO Architectural Infrastructure

VIO is used in conjunction with the Logical Partitioning option as described in . For each of a platform's partitions, the number and type of VIO adapters, with the associated interpartition communications paths (if any), are defined. These definitions take the architectural form of VIO adapters and are communicated to the partitions as device nodes in their OF device tree. Depending upon the specific virtual device, their device tree node may be found as a child of / (the root node) or in the VIO sub-tree (see below).

The VIO infrastructure provides several primitives that may be used
to build connections between partitions for various purposes (that is, for
various virtual IOA types). These primitives include:

- A Command/Response Queue (CRQ) facility which provides a pipe between partitions. A partition can enqueue an entry on its partner's CRQ for processing by that partner. The partition can set up the CRQ to receive an interrupt when the queue goes from empty to non-empty, and hence this facility provides a method for an inter-partition interrupt. (A minimal enqueue sketch follows this list.)

- A Subordinate CRQ (Sub-CRQ) facility that may be used in conjunction with the CRQ facility when the CRQ facility by itself is not sufficient; that is, when more than one queue with more than one interrupt is required by the virtual IOA.

- An extended TCE table called the RTCE table, which allows a partition to provide “windows” into the memory of its partition to its partner partition, while maintaining addressing and access control to its memory.

- Remote DMA services that allow a server partition to transfer data to a partner partition's memory via the RTCE table window panes. This allows a device driver in a server partition to efficiently transfer data to and from a partner, which is key in sharing an IOA in the server partition with its partner partition.
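As a concrete illustration of the CRQ enqueue primitive, here is a minimal sketch assuming the Linux pseries hcall wrapper (plpar_hcall_norets) and the H_SEND_CRQ token; the 16-byte entry format is defined by the transport protocol, not by this sketch.

    /*
     * Minimal sketch: enqueue one 16-byte entry on the partner's CRQ.
     * H_SEND_CRQ takes the entry as two 64-bit words; the partner may
     * have requested an interrupt on its queue's empty-to-non-empty
     * transition.
     */
    #include <linux/types.h>
    #include <asm/hvcall.h>

    static long crq_send(unsigned long unit_address, const u64 msg[2])
    {
        /* H_Closed is returned if the partner's queue is not registered. */
        return plpar_hcall_norets(H_SEND_CRQ, unit_address, msg[0], msg[1]);
    }

In addition to the virtual IOAs themselves, this architecture defines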
a virtual host bridge, and a virtual interrupt source controller. The
virtual host bridge roots the VIO sub-tree. The virtual interrupt source
controller provides the consistent syntax for communicating the interrupt
numbers the partition’s OS sees when the virtual IOAs signal an
interrupt.

The general VIO infrastructure is defined in . There are additional infrastructure requirements for the partition managed class based on the Synchronous VIO model. See .

VIO Infrastructure - General

This section describes the general OF device tree structure for
virtual IOAs and describes in more detail window panes, as well as
describing the interrupt control aspects of virtual IOAs.

Properties of the /vdevice OF Tree Node

Most VIO adapters are represented in the OF device tree as children of the /vdevice node (child of the root node). While the vdevice sub-tree is the preferred architectural home for VIO adapters, selected devices, for historical reasons, are housed outside of the vdevice sub-tree.

R1--1. The /vdevice node must contain the properties as defined in .
Properties of the /vdevice Node

“name” (Required)
    Standard property name per , specifying the virtual device name; the value shall be “vdevice”.

“device_type” (Required)
    Standard property name per , specifying the virtual device type; the value shall be “vdevice”.

“model” (Not present)
    Property not present.

“compatible” (Required)
    Standard property name per , specifying the virtual device programming models; the value shall include “IBM,vdevice”.

“used-by-rtas” (Not present)
    Property not present.

“ibm,loc-code” (Not present)
    The location code is meaningless except when doing dynamic reconfiguration, as in the children of this node.

“reg” (Not present)
    Property not present.

“#size-cells” (Required)
    Standard property name per ; the value shall be 0. No child of this node takes space in the address map as seen by the owning partition.

“#address-cells” (Required)
    Standard property name per ; the value shall be 1.

“#interrupt-cells” (Required)
    Standard property name per ; the value shall be 2. The first cell contains the interrupt number as it will appear in the XIRR and as used as input to interrupt RTAS calls. The second cell contains the value 0, indicating positive edge sense.

“interrupt-map-mask” (Not present)
    Property not present.

“interrupt-ranges” (Required)
    Standard property name that defines the interrupt number(s) and range(s) handled by this unit.

“ranges”
    These will probably be needed for IB virtual adapters.

“interrupt-map” (Not present)
    Property not present.

“interrupt-controller” (Required)
    The /vdevice node appears to contain an interrupt controller.

“ibm,drc-indexes” (For DR)
    Refers to the DR slots. The number provided is the maximum number of slots that can be configured, which is limited by, among other things, the RTCE tables allocated by the hypervisor.

“ibm,drc-power-domains” (For DR)
    Value of -1 to indicate that no power manipulation is possible or needed.

“ibm,drc-types” (For DR)
    Value of “SLOT”. Any virtual IOA can fit into any virtual slot.

“ibm,drc-names” (For DR)
    The virtual location code (see ).

“ibm,drc-info” (For DR)
    When present, replaces the “ibm,drc-indexes”, “ibm,drc-power-domains”, “ibm,drc-types”, and “ibm,drc-names” properties. This single property is a consolidation of the four pre-existing properties and contains all of the required information.

“ibm,max-virtual-dma-size” (See definition)
    The maximum transfer size for the H_SEND_LOGICAL_LAN and H_COPY_RDMA hcall()s. Applies to all VIO which are children of the /vdevice node. Minimum value is 128 KB.
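As an illustration of consuming this last property, here is a minimal sketch assuming the Linux kernel OF API (of_find_node_by_path, of_property_read_u32); the fallback constant is the architected 128 KB minimum.

    /*
     * Minimal sketch: locate the /vdevice node and honor the optional
     * "ibm,max-virtual-dma-size" property when sizing virtual DMA copy
     * operations. Error handling is abbreviated.
     */
    #include <linux/of.h>

    #define VIO_DEFAULT_DMA_SIZE  (128 * 1024)  /* architected minimum */

    static u32 vio_max_virtual_dma_size(void)
    {
        struct device_node *vdev;
        u32 max;

        vdev = of_find_node_by_path("/vdevice");
        if (!vdev)
            return VIO_DEFAULT_DMA_SIZE;

        /* Property is optional; when present its value is >= 128 KB. */
        if (of_property_read_u32(vdev, "ibm,max-virtual-dma-size", &max))
            max = VIO_DEFAULT_DMA_SIZE;

        of_node_put(vdev);
        return max;
    }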
RTCE Table and Properties of the Children of the /vdevice Node

This architecture defines an extended type of TCE table called a Remote DMA TCE (RTCE) table. An RTCE table is one that is not directly used by the hardware to translate an I/O adapter's DMA addresses, but is used by the hypervisor to translate a partition's I/O addresses. RTCE tables have extra data, compared with a standard TCE table, to help firmware manage the use of its mappings. A partition manages the entries for its memory that is to be the target for I/O operations in the RTCE table using the TCE manipulation hcall()s, depending on the type of window pane. More on this later in this section. On platforms implementing the CRQ and LRDMA options, these hcall()s are extended to understand the format of the RTCE table via the LIOBN parameter that is used to address the specific window pane within an RTCE table. (One could also think of each LIOBN pointing to a separate RTCE table, rather than window panes within an RTCE table.)

Children of the
/vdevice node that support operations which use RTCE tables (for example, RDMA) contain the “ibm,my-dma-window” property. This property contains one or more (logical-I/O-bus-number, phys, size) triples. Each triple represents one window pane in an RTCE table which is available to this virtual IOA. The phys value is 0, and hence the logical I/O bus number (LIOBN) points to a unique range of TCEs in the RTCE table which are assigned to this window pane (LIOBN), and hence the I/O addresses for that LIOBN begin at 0.

The LIOBN is an opaque handle which references a window pane within an RTCE table. Since this handle is opaque, its internal structure is not architected, but left to the implementation's discretion. However, it is the architectural intent that the LIOBN be an indirect reference to the RTCE table through a hypervisor table that contains management variables, allowing for movement of the RTCE table and table format specific access methods. The partition uses an I/O address as an offset relative to the beginning of the LIOBN, as part of any I/O request to that memory mapped by that RTCE table's TCEs. A server partition references the partner partition's RTCE table using the LIOBN it received through the second entry in the “ibm,my-dma-window” property associated with the server partition's virtual IOA's device tree node (for example, see ). The mapping between the LIOBN in the second pane of a server virtual IOA's “ibm,my-dma-window” property and the corresponding partner partition IOA's RTCE table is made when the CRQ successfully completes registration.

The window panes and the hcall()s that are applicable to those panes are defined and used as indicated in .
VIO Window Pane Usage and Applicable Hcall()s

First window pane (first triple):
    Hypervisor simulated class: I/O address range which is available to map local partition memory for use by the hypervisor.
    Client VIO model: I/O address range which is available to map local partition memory to make it available for hypervisor use (access to the CRQ and any Sub-CRQs). For clients which support RDMA operations from their partner partition to their local memory (for example, VSCSI), this I/O address range is available to map local partition memory to make it available to the server partition, and this pane gets mapped to the second window pane of the partner partition (client/server relationship).
    Server VIO model: I/O address range which is available to map local partition memory for use by the hypervisor (for access by H_COPY_RDMA requests, and for access to the CRQ and any Sub-CRQs). This window is not available to any other partition.
    Applicable hcall()s: H_PUT_TCE, H_GET_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE

Second window pane (second triple):
    Hypervisor simulated class: Does not exist.
    Client VIO model: Does not exist.
    Server VIO model: I/O address range which corresponds to a window pane of the partner partition; linked to the first window pane for Client/Server model connections. Used to get access to the partner partition's memory from the hypervisor that services the local partition, for use as source or destination in Copy RDMA requests or for redirected DMA operations (for example, H_PUT_RTCE).
    Applicable hcall()s: H_PUT_RTCE, H_REMOVE_RTCE, and H_PUT_RTCE_INDIRECT
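To make the triple layout concrete, the following is a minimal sketch of walking the property, assuming the Linux OF API and, for brevity, fixed cell counts (one cell for the LIOBN, two each for phys and size); a real driver should take the counts from the platform-reported #ibm,#dma-address-cells and #ibm,#dma-size-cells values.

    /*
     * Minimal sketch: read the (LIOBN, phys, size) triples of a virtual
     * IOA's "ibm,my-dma-window" property.
     */
    #include <linux/of.h>

    struct vio_window_pane {
        u32 liobn;   /* opaque handle naming one window pane */
        u64 phys;    /* architecturally 0: I/O addresses start at 0 */
        u64 size;    /* bytes of I/O address space in this pane */
    };

    static int vio_read_window_panes(struct device_node *node,
                                     struct vio_window_pane *panes, int max)
    {
        int len, cells, n = 0;
        const __be32 *p = of_get_property(node, "ibm,my-dma-window", &len);

        if (!p)
            return 0;

        cells = len / sizeof(__be32);
        /* Each triple assumed here: 1 cell LIOBN + 2 cells phys + 2 cells size. */
        while (cells >= 5 && n < max) {
            panes[n].liobn = of_read_number(p, 1);
            panes[n].phys  = of_read_number(p + 1, 2);
            panes[n].size  = of_read_number(p + 3, 2);
            p += 5;
            cells -= 5;
            n++;
        }
        return n;  /* one, two, or three panes depending on IOA type */
    }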
The
“ibm,my-dma-window” property is the per
device equivalent of the
“ibm,dma-window” property found in nodes
representing bus bridges.

Children of the /vdevice node contain virtual location codes in their “ibm,loc-code” properties. The invariant assignment number is uniquely generated when the virtual IOA is assigned to the partition and remains invariably associated with that virtual IOA for the duration of the partition definition. For more information, see .

VIO Interrupt Control

There are two hcall()s that work in conjunction with the RTAS calls
ibm,int-on,
ibm,int-off,
ibm,set-xive and
ibm,get-xive, which manage the state of the
interrupt presentation controller logic. These hcall()s provide the
equivalent of IOA control registers used to control IOA interrupts. The
usage of these two hcall()s is summarized in
. The detail of the
H_VIO_SIGNAL is shown after this table and the detail of the applicable
H_VIOCTL subfunctions can be found in
,
, and
.
VIO Interrupt Control hcall() Usage

  Interrupt from: CRQ
    Virtual IOA definition does not include Sub-CRQs: H_VIO_SIGNAL
    Virtual IOA definition includes Sub-CRQs: H_VIO_SIGNAL or H_VIOCTL
    Interrupt number obtained from: OF device tree “interrupts” property

  Interrupt from: Sub-CRQ
    Virtual IOA definition does not include Sub-CRQs: Not Applicable
    Virtual IOA definition includes Sub-CRQs: H_VIOCTL
    Interrupt number obtained from: H_REG_SUB_CRQ hcall()

H_VIO_SIGNAL

This H_VIO_SIGNAL hcall() manages the interrupt mode of a virtual
adapter’s CRQ interrupt signalling logic. There are two modes:
Disabled and Enabled. The first interrupt of the “interrupts” property is for the CRQ.

Syntax:

Parameters:

  unit-address: Unit address per the device tree node's “reg” property.

  mode: Bit 63 controls the first interrupt specifier given in the virtual IOA's “interrupts” property, and bit 62 the second. High order bits not associated with an interrupt source as defined in the previous sentence should be set to zero by the caller and are ignored by the hypervisor. A bit value of 1 enables the specified interrupt; a bit value of 0 disables the specified interrupt.

Semantics:

- Validate that the unit address belongs to the partition and to a vdevice IOA, else return H_Parameter.

- Validate that the mode is one of those defined, else return H_Parameter.

- Establish the specified mode.

- Return H_Success.
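A minimal usage sketch, assuming the Linux pseries plpar_hcall_norets wrapper and the H_VIO_SIGNAL token; the mode encodings here cover only the CRQ (first) interrupt.

    /*
     * Minimal sketch: toggle the CRQ interrupt, controlled by bit 63
     * of the mode argument (1 = enabled, 0 = disabled).
     */
    #include <linux/types.h>
    #include <asm/hvcall.h>

    #define VIO_IRQ_ENABLE   0x1UL  /* bit 63 set: first interrupt enabled */
    #define VIO_IRQ_DISABLE  0x0UL  /* all interrupts disabled */

    static long crq_set_interrupt_mode(unsigned long unit_address, bool enable)
    {
        /* H_Parameter is returned if unit_address is not a vdevice IOA
         * owned by this partition. */
        return plpar_hcall_norets(H_VIO_SIGNAL, unit_address,
                                  enable ? VIO_IRQ_ENABLE : VIO_IRQ_DISABLE);
    }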
General VIO Requirements

R1--1. For all VIO options: The platform must be running in LPAR mode.

R1--2. For all VIO options: The platform's OF device
tree must include, as a child of the
root node, a node of type
vdevice as the parent of a sub-tree representing the
virtual IOAs assigned to the partition (see
for details).

R1--3. For all VIO options: The platform's /vdevice node must contain properties as defined in .

R1--4. For all VIO options: If the platform is going to limit the size of virtual I/O data copy operations (e.g., H_SEND_LOGICAL_LAN and H_COPY_RDMA), then the platform's /vdevice node must contain the “ibm,max-virtual-dma-size” property, and the value of this property must be at least 128 KB.

R1--5. For all VIO options: The interrupt server numbers for
all interrupt source numbers, virtual and physical, must come from the
same name space and are defined by the
“ibm,interrupt-buid-size” property in the
PowerPC External Interrupt Presentation
Controller Node.

R1--6. For all VIO options: The virtual interrupts for all children of the /vdevice node must, upon transfer of control to the booted partition program, be masked as would be the result of an ibm,int-off RTAS call specifying the virtual interrupt source number.

R1--7. For all VIO options with the Reliable Command/Response option: The platform must specify the CRQ interrupt as the first interrupt in the “interrupts” property for a virtual IOA.

R1--8. For the SMC option: The platform must specify the ASQ interrupt as the second interrupt in the “interrupts” property for a virtual IOA.

R1--9. For all VIO options: The platform must implement the H_VIO_SIGNAL hcall() as defined in .

R1--10. For all VIO options: The platform must assign an invariant virtual location code to each virtual IOA as described in .

R1--11. (Requirement Number Reserved For Compatibility)

R1--12. For all VIO options: The
phys of each
“ibm,my-dma-window” property triple
(window pane) must have a value of zero and the LIOBN must be
unique.

Implementation Note: While the architectural definition of LIOBN would allow the definition of one logical I/O bus number (LIOBN) for all RTCE tables (with IOBA ranges separating IOAs), such an implementation is not permitted for the VIO option, which requires a unique LIOBN (at least per partition, preferably platform wide) for each virtual IOA window pane. Such designs allow the LIOBN handle to be used to validate access rights, and allow each subsequent I/O bus address range to start at zero, providing maximum accommodation for 32-bit OSs.

R1--13. For the VSCSI option:
For the server
partition, there must exist two triples (two window panes) in the
“ibm,my-dma-window” property and the size
field of the second triple (second window pane) of an
“ibm,my-dma-window” property must be
equal to the size field of the corresponding first triple (first window
pane) of the associated partner partition’s
“ibm,my-dma-window” property.

R1--14. For the SMC option:
There must exist three triples (three window panes) in the
“ibm,my-dma-window”
property of all partitions which contain an SMC virtual IOA, and the size field
of the second triple (second window pane) of an
“ibm,my-dma-window” property must be equal to the
size field of the corresponding third triple (third window pane) of the associated partner partition’s
“ibm,my-dma-window” property.

R1--15. For all VIO options:
RTCE tables for
virtual IOAs, as pointed to by the partitions’ first window pane of
the
“ibm,my-dma-window” property, and the
TCEs that they contain (as built by the TCE hcall()s) must be persistent
across partner partition reboots and across partner partition deregister
(free)/re-register operations, even when the partition which connects
after one deregisters is a different partition, and must be available to
have TCEs built in them by said partition, as long as that partition
still owns the corresponding virtual IOA (an LRDR operation which removes
the IOA will also remove the RTCE table).

R1--16. For all VIO options: The connection between the
second window pane of the
“ibm,my-dma-window” property for a
partition and its corresponding window pane in the partner partition
(first window pane) must be broken by the platform when either partition
deregisters its CRQ or when either partition terminates, and the platform
must invalidate any redirected TCEs copied from the said second window
pane (for information on invalidation of TCEs, see
).

R1--17. For all VIO options: The following window panes of the “ibm,my-dma-window” property, when they exist, must support the following specified hcall()s, when they are implemented:

- For the first window pane: H_PUT_TCE, H_GET_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE

- For the second window pane: H_PUT_RTCE, H_REMOVE_RTCE, H_PUT_RTCE_INDIRECT

R1--18. For all VIO options:
The platform must not prohibit the server and partner partition, or client and partner
partition, from being the same partition, unless the user interface used
to set up the virtual IOAs specifically disallows such configurations.

R1--19. For all VIO options: Any child node of the /vdevice node that is not defined by this architecture must contain the “used-by-rtas” property.

Implementation Notes:

- Relative to Requirement , partner partitions being the same partition makes sense from a product development standpoint.

- The ibm,partner-control RTAS call does not make sense if the partner partitions are the same partition.

R1--20. For all VIO options: The platform must implement the
H_VIOCTL hcall() following the syntax of
and semantics specified by
.
Shared Logical Resources

The sharing, within the boundaries of a single coherence domain, of resources owned by a partition owning a server virtual IOA with its clients (those owning the associated client virtual IOAs) is controlled by the hcall()s described in this section. The owning partition retains
control of and access to the resources and can ask for their return or
indeed force it. Refer to
for a graphic representation of
the state transitions involved in sharing logical resources.

Owners of resources can grant, to one or more client
partitions, access to any of its resources. A client partition being
defined as a partition with which the resource owner is authorized to
register a CRQ, as denoted via an OF device tree virtual IOA node.
Granting access is accomplished by requesting that the hypervisor
generate a specific cookie for that resource for a specific sharing
partition. The cookie value thus generated is unique only within the
context of the partition being granted the resource and is unusable for
gaining access to the resource by any other partition. This unique cookie
is then communicated via some inter partition communications channel,
most likely the authorized Command Response Queue. The partner partition
then accepts the logical resource (mapping it into the accepting
partition’s logical address space). The owning partition may grant
shared access of the same logical resource to several clients (by
generating separate cookies for each client). During the time the
resource is shared, both the owner and the sharer(s) have access to the logical resource; the software running in these partitions uses private protocols to synchronize access control. Once the resource has been
accepted into the client’s logical address space, the resource can
be used by the client in any way it wishes, including granting it to one
of its own clients. When the client no longer needs access to the shared
logical resource, it destroys any virtual mappings it may have created
for the logical resource and returns the logical resource, thus unmapping it from its logical address space. The client program could subsequently accept the logical resource again (given that the cookie is still valid).
To complete the termination of sharing, the owner partition rescinds the
cookie describing the shared resource. Normally a rescind operation
succeeds only if the client has returned the resource, however, the owner
can force the rescind in cases where it suspects that the client is
incapable of gracefully returning the resource.

In the case of a forced rescind, the hypervisor marks the client
partition’s logical address map location corresponding to the
shared logical resource such that any future hcall() that specifies the
logical address fails with an H_RESCINDED return code. The hypervisor
then ensures that the client partition’s translation tables contain
no references to a physical address of the shared logical
resource.

Should the server partition fail, the hypervisor automatically
notifies client partitions of the fact via the standard CRQ event
message. In addition, the hypervisor recovers any outstanding shared
logical resources prior to restarting the server partition. This recovery is preceded by a minimum of two seconds of delay to allow the client
partitions time to gracefully return the shared logical resources, then
the hypervisor performs the equivalent of a forced rescind operation on
all the server partition’s outstanding shared logical
resources.
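To tie the grant/accept/return/rescind sequence together, here is an illustrative sketch in C. The wrapper prototypes, flag names, and return-code values below are hypothetical stand-ins; only the call ordering and the R4 result conventions come from this section.

    /* Hypothetical wrappers and return codes; names follow this
     * architecture, numeric values are illustrative only. */
    enum {
        H_SUCCESS = 0, H_BUSY = 1, H_PARTIAL = 2,
        H_RESCINDED = -100, H_RESOURCE = -101
    };

    #define FLAGS_SYSTEM_MEMORY 0x1UL   /* hypothetical resource-type flag */

    long h_grant_logical(unsigned long flags, unsigned long logical_hi,
                         unsigned long logical_lo, unsigned long length,
                         unsigned long unit_addr, unsigned long *cookie);
    long h_accept_logical(unsigned long cookie, unsigned long *laddr);
    long h_return_logical(unsigned long cookie);
    long h_rescind_logical(unsigned long flags, unsigned long cookie);
    void send_cookie_over_crq(unsigned long unit_addr, unsigned long cookie);

    /* Owner: generate a cookie for one 4 K page of System Memory and
     * hand it to the client over the CRQ (private protocol, not shown). */
    static int owner_share_page(unsigned long unit_addr,
                                unsigned long logical_lo)
    {
        unsigned long cookie;

        if (h_grant_logical(FLAGS_SYSTEM_MEMORY, 0, logical_lo, 4096,
                            unit_addr, &cookie) != H_SUCCESS)
            return -1;
        send_cookie_over_crq(unit_addr, cookie);
        return 0;
    }

    /* Client: accept (maps into its logical address space), use, return. */
    static int client_use_page(unsigned long cookie)
    {
        unsigned long laddr;
        long rc;

        if (h_accept_logical(cookie, &laddr) == H_RESCINDED)
            return -1;              /* the owner already revoked it */
        /* ... create, use, and destroy virtual mappings of laddr ... */
        do {
            rc = h_return_logical(cookie);  /* may scan mapping tables */
        } while (rc == H_BUSY);
        return rc == H_SUCCESS ? 0 : -1;
    }

    /* Owner, later: rescind; force only when the client is unresponsive. */
    static void owner_rescind(unsigned long cookie, int force)
    {
        long rc;

        do {
            rc = h_rescind_logical(force ? 0x1 : 0x0, cookie);
        } while (rc == H_PARTIAL || rc == H_BUSY);
    }

This architecture does not specify a method of implementation,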
however, for the sake of clarifying the specified function, the following
example implementation is given, refer to
. Assume that the hypervisor
maintains for each partition a logical to physical translation table (2)
(used to verify the partition’s virtual to logical mapping
requests). Each logical resource (4) mapped within the logical to real
translation table has associated with it a logical resource control
structure (3) (some of the contents of this control structure are defined
in the following text). The original logical resource control structures
(3) describe the standard logical resources allocated to the partition
due to the partition's definition, such as one per Logical Memory Block (LMB), etc.

The platform firmware, when creating the OF device tree for a given
partition knows the specific configuration of virtual IOAs with the
associated quantity of the various types of logical resources types for
each virtual IOA. From that knowledge, it understands the number and type
of resources that must be shared between the server and client partitions
and therefore the number of control structures that will be needed. When
an owning partition grants access to a subset of one of its logical
resources to another partition, the hypervisor chooses a logical resource
control structure to describe this newly granted resource (6), (as stated
above, the required number of these control structures were allocated
when the client virtual IOA was defined) and attaches it to the
grantee’s base partition control structure (5). This logical
resource control structure is linked (9) to the base logical resource
control structure (3) of the resource owner. Subsequently the
grantee’s OS may accept the shared logical resource (4) mapping it
(7) into the grantee’s partition logical to physical map table (8).
This same set of operations may subsequently be performed for other
partition(s) (10). The shared resource is always a subset (potentially
the complete subset) of the original. Once a partition (10) has accepted
a resource, it may subsequently grant a subset of that resource to yet
another partition (14), here the hypervisor creates a logical resource
control structure (13) links it (12) to the logical resource control
structure (11) of the granting partition (10) that is in turn linked (9)
to the owner's logical resource control structure (3).

For the OS to return the logical resource represented by
control structure (11), the grant represented by control structure (13)
needs to be rescinded. This is normally accomplished only after the OS
that is running partition (14) performs a return operation, either
because it has finished using the logical resource, or in response to a
request (delivered by inter partition communications channel) from the
owner. The exceptions are in the case that either partition terminates
(the return operation is performed by the hypervisor) and a
non-responsive client (when the granter performs a forced rescind). A
return operation is much like a logical resource dynamic reconfiguration
isolate operation, the hypervisor removes the logical resource from the
partition’s logical to physical map table, to prevent new virtual
to physical mappings of the logical resource, then ensures that no
virtual to physical mappings of the logical resource are outstanding
(this can either be accomplished synchronously by checking map counts, etc., or asynchronously prior to the completion of the rescind operation).

R1--1. For the Shared Logical Resource option: The platform
must implement the hcall-logical-resource function set following the
syntax and semantics of the included hcall()s as specified in , , , and .

R1--2. For the Shared Logical Resource option: In the event
that the partition owning a granted shared logical resource fails, the
platform must wait for a minimum of 2 seconds after notifying the client
partitions before recovering the shared resources via an automatic
H_RESCIND_LOGICAL (forced) operation.

H_GRANT_LOGICAL

This hcall() creates a cookie that represents the specific instance
of the shared object. That is, the specific subset of the original
logical resource to be shared with the specific receiver partition. The
owning partition makes this hcall() in preparation for the sharing of the
logical resource subset with the receiver partition. The resulting cookie
is only valid for the specified receiver partition.The caller needs to understand the bounds of the logical resource
being granted, such as for example, the logical address range of a given
LMB. The generated cookie does not span multiple elemental logical
resources (that is resources represented by their own Dynamic
Reconfiguration Connector). If the owner wishes to share a range of
resources that does span multiple elemental logical resources, then the
owner uses a series of H_GRANT_LOGICAL calls to generate a set of
cookies, one for each subset of each elemental logical resource to be
shared.

The “logical” parameter identifies the starting
“address” of the subset of the logical resource to be shared.
The form of this “address” is resource dependent, and is
given in
.

Syntax:

Parameters:

Format of H_GRANT_LOGICAL parameters:

Access Restriction (flags bits 16-19): The defined bits in this field have independent meanings, and may appear in combination with all other bits unless specifically disallowed. (An x in the binary field indicates that the bit can take on a value of 0 or 1.)

- 0b1xxx: Read access inhibited (the grantee may not read from, or grant access to read from, this logical resource).
- 0bx1xx: Write access inhibited (the grantee may not write to, or grant access to write to, this logical resource).
- 0bxx1x: Re-Grant rights inhibited (the grantee may not grant access to this logical resource to a subsequent client).
- 0bxxx1: Reserved; calling software should set this bit to zero. Firmware returns H_Parameter if set.

Logical Resource (flags bits 20-23), with supported Access Restriction combinations and the form of the “address” and “length” parameters:

- System Memory (supported combinations 0bxxx0, value 0x1): “address” is a Logical Address (as would be used in H_ENTER) in logical-lo; logical-hi is not used (should be 0). “length” is bytes in units of 4 K on 4 K boundaries (low order 12 bits = 0).

- MMIO Space (supported combinations 0bxxx0, value 0x2): “address” is a Logical Address (as would be used in H_ENTER) in logical-lo; logical-hi is not used (should be 0). “length” is bytes in units of 4 K on 4 K boundaries (low order 12 bits = 0).

- Interrupt Source (supported combinations 0b00x0, value 0x4): “address” is a 24 bit interrupt number (as would be used in ibm,get-xive) in the low order 3 bytes of logical-lo; logical-hi is not used (should be 0). “length” value = 1 (the logical resource is one indivisible unit).

- DMA Window Pane (supported combinations 0b00x0, value 0x5): “address” is a 32 bit LIOBN in logical-hi, with a 64 bit IOBA in logical-lo. “length” is bytes of IOBA in units of 4 K on 4 K boundaries (low order 12 bits = 0). Note: The DMA window only refers to physical DMA windows, not virtual DMA windows. Virtual DMA windows can be directly created with a client virtual IOA definition and need not be shared with those of the server.

- Interprocessor Interrupt Port (supported combinations 0b00x0, value 0x6): “address” is a processor number (as from the processor's Unit ID) in logical-lo; logical-hi is not used (should be 0). “length” value = 1 (the logical resource is one indivisible unit).

unit-address: The unit address of the virtual IOA associated with the shared logical resource, and thus the partner partition that is to share the logical resource.

Semantics:

- Verify that the flags parameter specifies a supported logical resource type, else return H_Parameter.

- Verify that the logical address validly identifies a logical resource of the type specified by the flags parameter and owned/shared by the calling partition, else return H_Parameter; unless:
  - The logical address's page number represents a page that has been rescinded by the owner; then return H_RESCINDED.
  - There exists a grant restriction on the logical resource; then return H_Permission.

- Verify that the length parameter is of valid form for the resource type specified by the flags parameter and that it represents a subset (up to the full size) of the logical resource specified by the logical address parameter, else return H_Parameter.

- Verify that the unit-address is for a virtual IOA owned by the calling partition, else return H_Parameter.

- If the partner partition's client virtual IOA has sufficient resources, generate hypervisor structures to represent, for hypervisor management purposes (including any grant restrictions), the specified shared logical resource; else return H_NOMEM.

- Generate a cookie associated with the hypervisor structures created in the previous step, which the partner partition associated with the unit-address can use to reference said structures via the H_ACCEPT_LOGICAL and H_RETURN_LOGICAL hcall()s, and the calling partition can use to reference said structures via the H_RESCIND_LOGICAL hcall().

- Place the cookie generated in the previous step in R4 and return H_Success.

H_RESCIND_LOGICAL

This hcall() invalidates a logical sharing as created by
H_GRANT_LOGICAL above. This operation may be subject to significant
delays in certain circumstances. Callers may experience an extended
series of H_PARTIAL returns prior to successful completion of this
operation.

If the sharer of the logical resource has not successfully
completed the H_RETURN_LOGICAL operation on the shared logical resource
represented by the specified cookie, the H_RESCIND_LOGICAL hcall() fails
with the H_Resource return code unless the “force” flag is
specified. The use of the “force” flag increases the
likelihood that a series of H_PARTIAL returns will be experienced prior
to successful completion. The “force” flag also causes the
hcall() to recursively rescind any and all cookies that represent
subsequent sharing of the logical resource. That is, if the original
client subsequently granted access to any or all of the logical resource
to a client, those cookies and any other subsequent grants are also
rescinded.

Syntax:

Parameters:

  flags: The flags subfunction code field (bits 16-23); two values are defined: 0x00 “normal” and 0x01 “forced”.

  cookie: The handle returned by H_GRANT_LOGICAL representing the logical resource to be rescinded.

Semantics:

- Verify that the cookie parameter references an outstanding
instance of a shared logical resource owned/accepted by the calling
partition, else return H_Parameter.

- Verify that the flags parameter is one of the supported values, else return H_Parameter. (Implementations should provide mechanisms to ensure that reserved flag field bits are zero; to improve performance, implementations may choose to activate this checking only in “debug” mode. The mechanism for activating an implementation dependent debug mode is outside of the scope of this architecture.)

- If the “force” flag is specified, then perform the functions of H_RETURN_LOGICAL (cookie) as if called by the client partition. Note this involves forcibly rescinding any cookies generated by the client partition that refer to the logical resource referenced by the original cookie being rescinded.

- If the client partition has the resource referenced by cookie available for mapping via its logical to physical mapping table (the resource was accepted and not returned), return H_Resource.

- Verify that the resource referenced by cookie is not mapped by the client partition, else return H_PARTIAL.

- The hypervisor reclaims the control structures referenced by cookie and returns H_Success.

H_ACCEPT_LOGICAL

The H_ACCEPT_LOGICAL hcall() maps the granted logical resource into
the client partition’s logical address space. To provide the most
compact client logical address space, the hypervisor maps the resource
into the lowest applicable logical address for the referenced logical
resource type, consistent with the resource’s size and the resource
type’s constraints upon alignment etc. The chosen logical address
for the starting point of the logical mapping is returned in register
R4.

Syntax:

Parameters:

  cookie: The handle returned by H_GRANT_LOGICAL representing the logical resource to be accepted.

Semantics:

- Verify that the cookie parameter is valid for the calling partition, else return H_Parameter.

- If the cookie represents a logical resource that has been rescinded by the owner, return H_RESCINDED.

- Map the resource represented by the cookie parameter, with any attendant access restrictions, into the lowest available logical address of the calling partition consistent with constraints of size and alignment, and place the selected logical address into R4.

- Return H_Success.

H_RETURN_LOGICAL

The H_RETURN_LOGICAL hcall() unmaps the logical resource from the
client partition’s logical address space. Prior to calling
H_RETURN_LOGICAL the client partition should have destroyed all virtual
mappings to the section of the logical address space to which the logical
resource is mapped. That is: unmapping virtual addresses for MMIO and System Memory space, invalidating TCEs mapping a shared DMA window pane, and disabling/masking shared interrupt sources and/or interprocessor interrupts. Failing to do so may result in parameter errors for other hcall()s and H_Resource from the H_RETURN_LOGICAL hcall(). Part of the semantic of this call is to determine that no such active mapping exists. Implementations may be able to determine this quickly if, for example, they maintain map counts for various logical resources. If an implementation searches a significant amount of platform tables, then the hcall() may return H_Busy and maintain internal state to continue the scan on subsequent calls using the same cookie parameter. The cookie parameter remains valid for the calling client partition until the server partition successfully executes the H_RESCIND_LOGICAL hcall().

Syntax:

Parameters:

  cookie: The handle returned by H_GRANT_LOGICAL representing the
logical resource to be returned.

Semantics:

- Verify that the cookie parameter references an outstanding instance of a shared logical resource accepted by the calling partition, else return H_Parameter.

- Remove the referenced logical resource from the calling partition's logical address map.

- Verify that no virtual to logical mappings exist for the referenced resource, else return H_Resource. This operation may require extensive processing; in some cases the hcall() may return H_Busy to allow for improved system responsiveness. In these cases the state of the mapping scan is retained in the hypervisor's state structures such that after some number of repeated calls the function is expected to finish.

- Return H_Success.

H_VIOCTL

The H_VIOCTL hypervisor call allows the partition to manipulate or
query certain virtual IOA behaviors.

Command Overview

Syntax:

Parameters:

  unit-address: As specified.

  subfunction: Specific subfunction to perform; see .

  parm-1: Specified in subfunction semantics.

  parm-2: Specified in subfunction semantics.

  parm-3: Specified in subfunction semantics.

Semantics:

- Validate that the subfunction is implemented, else return H_Not_Found.

- Validate the unit-address, else return H_Parameter.

- Validate the subfunction is valid for the given virtual IOA, else return H_Parameter.

- Refer to  to determine the semantics for the given subfunction.
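A minimal wrapper sketch, assuming the Linux pseries plpar_hcall primitive and the H_VIOCTL token; per the semantics above, subfunction results beyond the return code come back in R4 and R5 (the retbuf array). The subfunction-specific sketches later in this section build on this wrapper.

    #include <asm/hvcall.h>

    static long h_vioctl(unsigned long unit_address, unsigned long subfunc,
                         unsigned long parm1, unsigned long parm2,
                         unsigned long parm3, unsigned long *r4,
                         unsigned long *r5)
    {
        unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
        long rc;

        rc = plpar_hcall(H_VIOCTL, retbuf, unit_address, subfunc,
                         parm1, parm2, parm3);
        if (r4)
            *r4 = retbuf[0];   /* R4 */
        if (r5)
            *r5 = retbuf[1];   /* R5 */
        return rc;             /* e.g. H_SUCCESS, H_NOT_FOUND, H_PARAMETER */
    }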
Semantics for H_VIOCTL subfunction parameter values

  0x0   (Reserved)
  0x1   GET_VIOA_DUMP_SIZE: For all VIO options.
  0x2   GET_VIOA_DUMP: For all VIO options.
  0x3   GET_ILLAN_NUMBER_VLAN_IDS: For the ILLAN option.
  0x4   GET_ILLAN_VLAN_ID_LIST: For the ILLAN option.
  0x5   GET_ILLAN_SWITCH_ID: For the ILLAN option.
  0x6   DISABLE_MIGRATION: For all vscsi-server and vfc-server.
  0x7   ENABLE_MIGRATION: For all vscsi-server and vfc-server.
  0x8   GET_PARTNER_INFO: For all vscsi-server and vfc-server.
  0x9   GET_PARTNER_WWPN_LIST: For all vfc-server.
  0xA   DISABLE_ALL_VIO_INTERRUPTS: For the Subordinate CRQ Transport option.
  0xB   DISABLE_VIO_INTERRUPT: For the Subordinate CRQ Transport option.
  0xC   ENABLE_VIO_INTERRUPT: For the Subordinate CRQ Transport option.
  0xD   GET_ILLAN_MAX_VLAN_PRIORITY: No.
  0xE   GET_ILLAN_NUMBER_MAC_ACLS: No.
  0xF   GET_MAC_ACLS: No.
  0x10  GET_PARTNER_UUID: For the UUID option.
  0x11  FW_RESET: For the VNIC option.
  0x12  GET_ILLAN_SWITCHING_MODE: For any ILLAN adapter with the “ibm,trunk-adapter” property.
  0x13  DISABLE_INACTIVE_TRUNK_RECEPTION: For any ILLAN adapter with the “ibm,trunk-adapter” property.
  0x14  GET_MAX_REDIRECTED_MAPPINGS: For platforms that support more than a single Redirected RDMA mapping per virtual TCE.
  0x18  VNIC_SERVER_STATUS: For VNIC servers.
  0x19  GET_SESSION_TOKEN: For VNIC clients.
  0x1A  SESSION_ERROR_DETECTED: For VNIC clients.
  0x1B  GET_VNIC_SERVER_INFO: For VNIC servers.
  0x1C  ILLAN_MAC_SCAN: For any ILLAN adapter with the “ibm,trunk-adapter” property.
  0x1D  ENABLE_PREPARE_FOR_SUSPEND: For all vscsi-server and vfc-server.
  0x1E  READY_FOR_SUSPEND: For all vscsi-server and vfc-server.
GET_VIOA_DUMP_SIZE Subfunction Semantics

- Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter.

- The hypervisor calculates the size necessary for passing opaque firmware data describing current virtual IOA state to the partition for purposes of error logging and RAS, and returns H_Success, with the required size in R4.

GET_VIOA_DUMP Subfunction Semantics

If the given virtual IOA has an
“ibm,my-dma-window” property in its
device tree, then parm-1 is an eight byte output descriptor. The high
order byte of an output descriptor is control, the next three bytes are a
length field of the buffer in bytes, and the low order four bytes are a
TCE mapped I/O address of the start of a buffer in I/O address space. The
high order control byte must be set to zero. The TCE mapped I/O address
is mapped via the first window pane of the
“ibm,my-dma-window” property.

If the given virtual IOA has no “ibm,my-dma-window” property in its device tree, then parm-1 shall be a logical real, page-aligned address of a 4 K page used to return the virtual IOA dump.

- Validate parm-2 and parm-3 are set to zero, else return H_Parameter.

- If parm-1 is an output descriptor, then:
  - Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter.
  - Transfer as much opaque hypervisor data as fits into the output buffer as specified by the output descriptor.
  - If all opaque data will not fit due to size, return H_Constrained, else return H_Success.

- If parm-1 is a logical real address, then:
  - Validate the logical real address is valid for the partition, else return H_Parameter.
  - Transfer as much opaque hypervisor data as will fit into the passed logical real page, with a maximum of 4 K.
  - If all opaque data will not fit in the page due to size, return H_Constrained, else return H_Success.
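Since the same eight byte output descriptor recurs in many of the subfunctions that follow, here is a minimal sketch of composing it; the buffer's I/O address must already be TCE mapped through the first window pane (mapping not shown).

    #include <linux/types.h>

    /*
     * Minimal sketch: high byte is control (must be 0), next three
     * bytes are the buffer length in bytes, low four bytes are the
     * TCE mapped I/O address of the buffer.
     */
    static u64 vio_output_descriptor(u32 len, u32 ioba)
    {
        return ((u64)(len & 0x00FFFFFFu) << 32) | ioba;
    }

GET_ILLAN_NUMBER_VLAN_IDS Subfunction Semantics

- Validate parm-1, parm-2, and parm-3 are set to zero, else return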
H_Parameter.

- The hypervisor returns H_Success, with the number of VLAN IDs (PVID + VIDs) in R4. This subfunction allows the partition to allocate the correct amount of space for the H_VIOCTL(GET_ILLAN_VLAN_ID_LIST) call.

GET_ILLAN_VLAN_ID_LIST Subfunction Semantics

- parm-1 is an eight byte output descriptor. The high order byte of
an output descriptor is control, the next three bytes are a length field
of the buffer in bytes, and the low order four bytes are a TCE mapped I/O
address of the start of a buffer in I/O address space. The high order
control byte must be set to zero. The TCE mapped I/O address is mapped
via the first window pane of the
“ibm,my-dma-window” property.

- Validate parm-2 and parm-3 are set to zero, else return H_Parameter.

- Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter.

- Transfer the VLAN ID list into the output buffer as specified by the output descriptor. The data will be an array of two byte values, where the first element of the array is the PVID followed by all the VIDs. The format of the elements of the array is specified by IEEE VLAN documentation. Any unused space in the output buffer will be zeroed.

- If all VLAN IDs do not fit due to size, return H_Constrained.

- Return H_Success.
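Putting the two ILLAN VLAN subfunctions together, a minimal sketch of the size-then-fetch pattern, building on the h_vioctl() and vio_output_descriptor() sketches above; subfunction numbers are from the table earlier in this section.

    #define GET_ILLAN_NUMBER_VLAN_IDS  0x3
    #define GET_ILLAN_VLAN_ID_LIST     0x4

    static long illan_read_vlan_ids(unsigned long unit_address, u32 ioba)
    {
        unsigned long nids;
        long rc;

        rc = h_vioctl(unit_address, GET_ILLAN_NUMBER_VLAN_IDS,
                      0, 0, 0, &nids, NULL);
        if (rc != H_SUCCESS)
            return rc;

        /* Two bytes per ID: the PVID first, then the VIDs. */
        return h_vioctl(unit_address, GET_ILLAN_VLAN_ID_LIST,
                        vio_output_descriptor(nids * 2, ioba), 0, 0,
                        NULL, NULL);
    }

GET_ILLAN_SWITCH_ID Subfunction Semantics

- parm-1 is an eight byte output descriptor. The high order byte of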
an output descriptor is control, the next three bytes are a length field
of the buffer in bytes, and the low order four bytes are a TCE mapped I/O
address of the start of a buffer in I/O address space. The high order
control byte must be set to zero. The TCE mapped I/O address is mapped
via the first window pane of the
“ibm,my-dma-window” property.

- Validate parm-2 and parm-3 are set to zero, else return H_Parameter.

- Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter.

- Transfer the switch identifier into the output buffer as specified by the output descriptor. The data will be a string of ASCII characters uniquely identifying the virtual switch to which the ILLAN adapter is connected. Any unused space in the output buffer will be zeroed.

- If the switch identifier does not fit due to size, return H_Constrained.

- Return H_Success.

DISABLE_MIGRATION Subfunction Semantics

When this subfunction is implemented, the
“ibm,migration-control” property exists
in the
/vdevice OF device tree node.

- Validate that parm-1, parm-2, and parm-3 are all set to zero, else return H_Parameter.

- If no partner is connected, then return H_Closed.

- Prevent the migration of the partner partition to the destination server until either the ENABLE_MIGRATION subfunction is called or H_FREE_CRQ is called.

- Return H_Success.

ENABLE_MIGRATION Subfunction Semantics

When this subfunction is implemented, the “ibm,migration-control” property exists in the /vdevice OF device tree node.

- Validate that parm-1, parm-2, and parm-3 are all set to zero, else return H_Parameter.

- Validate that the migration of the partner partition to the destination server was previously prevented with the DISABLE_MIGRATION subfunction, else return H_Parameter.

- Enable the migration of the partner partition.

- Return H_Success.

GET_PARTNER_INFO Subfunction Semantics

- Parm-1 is an eight byte output descriptor. The high order byte of
an output descriptor is control, the next three bytes are a length field
of the buffer in bytes, and the low order four bytes are a TCE mapped I/O
address of the start of a buffer in I/O address space. The high order
control byte must be set to zero. The TCE mapped I/O address is mapped
via the first window pane of the
“ibm,my-dma-window” property.

- Validate parm-2 and parm-3 are set to zero, else return H_Parameter.

- Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter.

- If the output buffer is not large enough to fit all the data, then return H_Constrained.

- If no partner is connected and more than one possible partner exists, then return H_Closed.

- Transfer the eight byte partner partition ID into the first eight bytes of the output buffer.

- Transfer the eight byte unit address into the second eight bytes of the output buffer.

- Transfer the NULL-terminated Converged Location Code associated with the partner unit address and partner partition ID immediately following the unit address.

- Zero any remaining output buffer.

- Return H_Success.
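A minimal sketch of unpacking the reply buffer laid out above; the 64-byte location code bound is an assumption of this sketch, not of the architecture.

    #include <linux/string.h>
    #include <linux/types.h>

    struct partner_info {
        u64  partner_partition_id;   /* bytes 0-7 of the buffer  */
        u64  partner_unit_address;   /* bytes 8-15               */
        char loc_code[64];           /* NULL-terminated, follows */
    };

    static void parse_partner_info(const u8 *buf, struct partner_info *out)
    {
        memcpy(&out->partner_partition_id, buf, 8);
        memcpy(&out->partner_unit_address, buf + 8, 8);
        /* The Converged Location Code immediately follows the unit address. */
        strscpy(out->loc_code, (const char *)(buf + 16),
                sizeof(out->loc_code));
    }

GET_PARTNER_WWPN_LIST Subfunction Semantics

This subfunction is used to get the WWPNs for the partner from the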
hypervisor. In this way, there is assurance that the WWPNs are
accurate.

- Parm-1 is an eight byte output descriptor. The high order byte of
an output descriptor is control, the next three bytes are a length field
of the buffer in bytes, and the low order four bytes are a TCE mapped I/O
address of the start of a buffer in I/O address space. The high order
control byte must be set to zero. The TCE mapped I/O address is mapped
via the first window pane of the
“ibm,my-dma-window” property.

- Validate parm-2 and parm-3 are set to zero, else return H_Parameter.

- Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter.

- If the output buffer is not large enough to fit all the data, return H_Constrained.

- If no partner is connected, return H_Closed.

- Transfer the first eight byte WWPN, which is represented in the vfc-client node of the partner partition in the “ibm,port-wwn-1” parameter, into the first eight bytes of the output buffer.

- Transfer the second eight byte WWPN, which is represented in the vfc-client node of the partner partition in the “ibm,port-wwn-2” parameter, into the second eight bytes of the output buffer.

- Zero any remaining output buffer.

- Return H_Success.

DISABLE_ALL_VIO_INTERRUPTS Subfunction Semantics

This subfunction is used to disable any and all CRQ and Sub-CRQ
interrupts associated with the virtual IOA designated by the
unit-address, for VIOs that define the use of Sub-CRQs. Software that
controls a virtual IOA that does not define the use of Sub-CRQ facilities
should use the H_VIO_SIGNAL hcall() to disable CRQ interrupts.

Programming Note: On platforms that implement the
partition migration option, after partition migration the support for
this subfunction might change, and the caller should be prepared to
receive an H_Not_Found return code indicating the platform does not
implement this subfunction.

- Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter.

- Disable all CRQ and any Sub-CRQ interrupts associated with unit-address.

- Return H_Success.

DISABLE_VIO_INTERRUPT Subfunction Semantics

This subfunction is used to disable a CRQ or Sub-CRQ interrupt, for
VIOs that define the use of Sub-CRQs. The CRQ or Sub-CRQ is defined by
the unit-address and parm-1. Software that controls a virtual IOA that
does not define the use of Sub-CRQ facilities should use the H_VIO_SIGNAL
hcall() to disable CRQ interrupts.

Programming Note: On platforms that implement the
partition migration option, after partition migration the support for
this subfunction might change, and the caller should be prepared to
receive an H_Not_Found return code indicating the platform does not
implement this subfunction.

Parm-1 is the interrupt number of the interrupt to be disabled. For an interrupt associated with a CRQ, this number is obtained from the “interrupts” property in the device tree. For an interrupt associated with a Sub-CRQ, this number is obtained during the registration of the Sub-CRQ (H_REG_SUB_CRQ).

- Validate parm-1 is a valid interrupt number for a CRQ or Sub-CRQ for the virtual IOA defined by unit-address and that parm-2 and parm-3 are set to zero, else return H_Parameter.

- Disable the interrupt specified by parm-1.

- Return H_Success.

ENABLE_VIO_INTERRUPT Subfunction Semantics

This subfunction is used to enable a CRQ or Sub-CRQ interrupt, for
VIOs that define the use of Sub-CRQs. The CRQ or Sub-CRQ is defined by
the unit-address and parm-1. Software that controls a virtual IOA that
does not define the use of Sub-CRQ facilities should use the H_VIO_SIGNAL
hcall() to enable CRQ interrupts.

Programming Note: On platforms that implement the
partition migration option, after partition migration the support for
this subfunction might change, and the caller should be prepared to
receive an H_Not_Found return code indicating the platform does not
implement this subfunction.

Parm-1 is the interrupt number of the interrupt to be enabled. For an interrupt associated with a CRQ, this number is obtained from the “interrupts” property in the device tree. For an interrupt associated with a Sub-CRQ, this number is obtained during the registration of the Sub-CRQ (H_REG_SUB_CRQ).

- Validate parm-1 is a valid interrupt number for a CRQ or Sub-CRQ for the virtual IOA defined by unit-address and that parm-2 and parm-3 are set to zero, else return H_Parameter.

- Enable the interrupt specified by parm-1.

- Return H_Success.

GET_ILLAN_MAX_VLAN_PRIORITY Subfunction Semantics

- Validate parm-1, parm-2, and parm-3 are set to zero, else return
H_Parameter.

- The hypervisor returns H_Success, with the maximum allowed IEEE 802.1Q VLAN priority returned in R4. If no priority limits are in place, the maximum VLAN priority is returned in R4.

GET_ILLAN_NUMBER_MAC_ACLS Subfunction Semantics

This subfunction allows the partition to allocate the correct amount of space for the GET_MAC_ACLS subfunction call.

- Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter.

- The hypervisor returns H_Success, with the number of allowed MAC addresses returned in R4. If no MAC access control limits are in place, 0 is returned in R4.

GET_MAC_ACLS Subfunction Semantics

- parm-1 is an eight byte output descriptor. The high order byte of
an output descriptor is control, the next three bytes are a length
field of the buffer in bytes, and the low order four bytes are a TCE
mapped I/O address of the start of a buffer in I/O address space. The
high order control byte must be set to zero. The TCE mapped I/O address
is mapped via the first window pane of the “ibm,my-dma-window”
property.

Validate parm-2 and parm-3 are set to zero, else return H_Parameter.

Validate the I/O address range is in the required DMA window and is
mapped by valid TCEs, else return H_Parameter.

Transfer the allowed MAC addresses into the output buffer as specified
by the output descriptor. The data is an array of 8 byte values, each
containing an allowed MAC address in its low order 6 bytes. Any unused
space in the output buffer is zeroed.

If all allowed MAC addresses do not fit due to size, return
H_Constrained.

Return H_Success.
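The output descriptor packing can be sketched in C as follows; the
helper name is ours, and the buffer must already be mapped by valid
TCEs in the first window pane:

    #include <stdint.h>

    /* Byte 0 (most significant) is the control byte and must be zero,
     * bytes 1-3 are the buffer length, and bytes 4-7 are the TCE
     * mapped I/O address -- per the descriptor layout described above. */
    static uint64_t make_output_descriptor(uint32_t len, uint32_t ioba)
    {
        return ((uint64_t)(len & 0x00FFFFFF) << 32) | ioba;
    }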
GET_PARTNER_UUID Subfunction Semantics

Validate parm-1, parm-2, and parm-3 are set to zero, else return
H_Parameter.

If no partner is connected and more than one possible partner exists,
then return H_Closed.

Transfer the UUID of the client partition that owns the virtual device
into registers R4 (high order 8 bytes) and R5 (low order 8 bytes). See
for the format of the UUID string.

Return H_Success.

FW_Reset Subfunction Semantics

This H_VIOCTL subfunction will reset the VNIC firmware associated
with a VNIC client adapter, if currently active. This subfunction is
useful when the associated firmware becomes unresponsive to other
CRQ-based commands. For the case of vTPMs, the firmware will be left
inoperable until the client partition next boots.

Semantics:

Validate that parm-1, parm-2, and parm-3 are all set to zero, else
return H_Parameter.

If the firmware associated with the virtual adapter cannot be reset,
return H_Constrained.

Reset the firmware associated with the virtual adapter.

Return H_Success.

GET_ILLAN_SWITCHING_MODE Subfunction Semantics

Validate parm-1, parm-2, and parm-3 are set to zero, else return
H_Parameter.

Validate that the given virtual IOA is an ILLAN adapter with the
“ibm,trunk-adapter” property, else return H_Parameter.

The hypervisor returns H_Success, with the current switching mode in
R4. If the switching mode is VEB mode, R4 contains 0. If the switching
mode is VEPA mode, R4 contains 1.

DISABLE_INACTIVE_TRUNK_RECEPTION Subfunction Semantics

This subfunction is used to disable the reception of all packets
for an ILLAN trunk adapter that is not the Active Trunk Adapter as set
by the H_ILLAN_ATTRIBUTES hcall().

Note: The default value for this attribute is ENABLED. The value is
reset on a successful H_FREE_LOGICAL_LAN hcall() or reboot/power change
of the partition owning the ILLAN adapter.

Validate parm-1, parm-2, and parm-3 are set to zero, else return
H_Parameter.

Validate that the given virtual IOA is an ILLAN adapter with the
“ibm,trunk-adapter” property, else return H_Parameter.

The hypervisor disables reception of packets for this adapter when it
is not the Active Trunk Adapter.

Return H_Success.

GET_MAX_REDIRECTED_MAPPINGS Subfunction Semantics

This sub-function retrieves the maximum number of additional
redirected mappings for the specified adapter.

Validate parm-1, parm-2, and parm-3 are set to zero, else return
H_Parameter.

Validate that the given virtual IOA is an RDMA-capable server adapter,
else return H_Parameter.

Store the maximum number of additional redirected mappings for the
LIOBN in R4.

Store the maximum number of redirections per client IOBA in R5.

Return H_Success.

VNIC_SERVER_STATUS Subfunction Semantics

This subfunction is used to report the status of the physical
backing device corresponding to a specific VNIC server adapter.
Additionally, this subfunction is used as a heartbeat mechanism
that the hypervisor utilizes to ensure the backing device associated
with the virtual adapter is responsive.

parm-1 is an enumerated value reflecting the physical backing device
status. Validate that parm-1 is one of the following values: 0x1
(Operational), 0x2 (LinkDown), or 0x3 (AdapterError). Otherwise, return
H_Parameter.

parm-2 is a value, in milliseconds, that the caller utilizes to specify
how long the hypervisor should wait for the next server status vioctl
call.

Validate that parm-3 is zero, else return H_Parameter.

If the CRQ for the server adapter has not yet been registered,
return H_State.

Return H_Success.

GET_SESSION_TOKEN Subfunction Semantics

This subfunction is used to obtain a session token from a VNIC client
adapter.
This token is opaque to the caller and is intended to be used in tandem
with the SESSION_ERROR_DETECTED vioctl subfunction.

On platforms that implement the partition migration option, after
partition migration the support for this subfunction might change, and
the caller should be prepared to receive an H_Not_Found return code
indicating the platform does not implement this subfunction.

Validate that parm-1, parm-2, and parm-3 are 0, else return
H_Parameter.

Return H_Success, with the session token in R4.

SESSION_ERROR_DETECTED Subfunction Semantics

This subfunction is used to report that the currently active backing
device for a VNIC client adapter is behaving poorly, and that the
hypervisor should attempt to fail over to a different backing device,
if one is available.

On platforms that implement the partition migration option, after
partition migration the support for this subfunction might change, and
the caller should be prepared to receive an H_Not_Found return code
indicating the platform does not implement this subfunction.

parm-1 is a VNIC session token. This token should be obtained from the
GET_SESSION_TOKEN vioctl subfunction.

Validate that parm-2 and parm-3 are 0, else return H_Parameter.

Validate that the session token parameter corresponds to the current
VNIC session, else return H_State.

Validate that the active server status is Operational, else return
H_Constrained.

If the server status is Operational, change the server status to
NetworkError and attempt to fail over to a different backing device.
If there are no suitable servers to fail over to, return H_Constrained.

If the client successfully failed over to another backing device as a
result of this subfunction call, return H_Success.
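A minimal sketch of the intended GET_SESSION_TOKEN /
SESSION_ERROR_DETECTED pairing follows; the wrapper (which exposes the
R4 return value) and the subfunction code values are illustrative
assumptions:

    #include <stdint.h>

    #define H_SUCCESS 0

    enum vioctl_subfunc {
        GET_SESSION_TOKEN      = 0x1A,   /* placeholder code values */
        SESSION_ERROR_DETECTED = 0x1B,
    };

    /* Hypothetical wrapper that also returns the R4 register value. */
    extern long h_vioctl_r4(uint64_t unit_address, uint64_t subfunc,
                            uint64_t parm1, uint64_t parm2, uint64_t parm3,
                            uint64_t *r4);

    /* Fetch the opaque session token, then request a failover away from
     * the misbehaving backing device. H_State on the second call means
     * the session changed under us; refetch the token and retry. */
    static long vnic_report_backing_error(uint64_t unit_address)
    {
        uint64_t token, unused;
        long rc = h_vioctl_r4(unit_address, GET_SESSION_TOKEN,
                              0, 0, 0, &token);
        if (rc != H_SUCCESS)
            return rc;
        return h_vioctl_r4(unit_address, SESSION_ERROR_DETECTED,
                           token, 0, 0, &unused);
    }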
GET_VNIC_SERVER_INFO Subfunction Semantics

This subfunction is used to fetch information about a VNIC server
adapter.

parm-1 is an eight byte output descriptor. The high order
byte of an output descriptor is control, the next three bytes are
a length field of the buffer in bytes, and the low order four bytes
are a TCE mapped I/O address of the start of a buffer in I/O address
space. The high order control byte must be set to zero. The TCE mapped
I/O address is mapped via the first window pane of the
“ibm,my-dma-window”
property.

Validate that parm-2 and parm-3 are 0, else return H_Parameter.

Populate the TCE mapped buffer with the following information.
Note that if the buffer descriptor (parm-1) describes an output buffer
that is not large enough to hold the following information, the server
information will be truncated to the size of the output buffer and the
buffer will be populated with the truncated information.
The fields of the server information buffer are as follows:

Version (byte offset 0, size 8): The format version of the provided
information. The first supported version is 1.

Active (byte offset 8, size 8): Boolean value describing whether or not
this server adapter is currently the active server for the client
adapter it is mapped to.
  0x0 - The server is not currently active.
  0x1 - The server is currently the active backing device for the
  client.

Status (byte offset 16, size 8): Enumeration value corresponding to the
current virtual adapter status.
  0x1 - Operational - The server adapter is working as expected.
  0x2 - LinkDown - The SR-IOV backing device's physical link is down.
  0x3 - AdapterError - The SR-IOV adapter is undergoing EEH.
  0x4 - PoweredOff - The virtual server adapter or its hosting
  partition is powered off.
  0x5 - NetworkError - The VNIC client detected a network issue with
  this adapter.
  0x6 - Unresponsive - The hypervisor is not reliably receiving
  VNIC_SERVER_STATUS vioctl calls from the VNIC server.

Priority (byte offset 24, size 1): The current priority of this server
adapter. Lower values take precedence over larger values.

Reserved (byte offset 25, size 7): This field is reserved and must be
zero.

Return H_Success.
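The buffer layout lends itself to a packed structure; a sketch (field
names are ours, offsets and sizes follow the description above):

    #include <stdint.h>

    struct vnic_server_info {
        uint64_t version;     /* offset 0: format version, first is 1 */
        uint64_t active;      /* offset 8: 0x1 if currently active */
        uint64_t status;      /* offset 16: 0x1 Operational .. 0x6 Unresponsive */
        uint8_t  priority;    /* offset 24: lower value takes precedence */
        uint8_t  reserved[7]; /* offset 25: must be zero */
    };
    /* Compile-time check that the layout spans the documented 32 bytes. */
    _Static_assert(sizeof(struct vnic_server_info) == 32,
                   "vnic_server_info must be 32 bytes");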
ILLAN_MAC_SCAN Subfunction Semantics

parm-1 is an eight byte output descriptor. The high order byte of an
output descriptor is control, the next three bytes are a length field
of the buffer in bytes, and the low order four bytes are a TCE mapped
I/O address of the start of a buffer in I/O address space. The high
order control byte must be set to zero. The TCE mapped I/O address is
mapped via the first window pane of the “ibm,my-dma-window” property.

Parm-2 and parm-3 should be set to the opaque continuation tokens
CT1 and CT2, respectively. These values are returned by the hypervisor
through the ILLAN_MAC_SCAN Buffer header when a scan cannot be completed
within a single vioctl call. See
for more information about the values CT1 and CT2. Parm-2 and parm-3
should be set to zero when starting a new ILLAN_MAC_SCAN call sequence.Validate that the unit-address corresponds to an active ILLAN
trunk adapter, else return H_Parameter.

Validate parm-2 and parm-3 are both set to zero, or contain valid
continuation tokens, else return H_Parameter.

Validate that the I/O address range supplied by parm-1 is large enough
to hold the header information for the ILLAN_MAC_SCAN Buffer detailed
in , else return H_Parameter.

Validate that the I/O address supplied by parm-1 is 8-byte aligned,
else return H_Parameter.

If any data transfers to the I/O address range supplied by parm-1 fail,
return H_Permission.

Iterate over all VLAN ids associated with the specified trunk adapter.
For each associated VLAN id:
Iterate over all ILLAN adapters, barring any adapters with the
“ibm,trunk-adapter” property, belonging to the current VLAN id.

For each non-trunk ILLAN adapter belonging to the current VLAN id, add
a 64-bit value containing the current 12-bit VLAN id and the 48-bit MAC
address of the ILLAN adapter to the next vacant entry in the MAC/VID
Buffer. Each MAC/VID pair in the buffer is formatted as shown in the
MAC/VID Pair Entry Format below.
Note that when handling H_IN_PROGRESS return codes, the caller
should either copy information from the buffer, immediately
process the information in the buffer, or modify the output
descriptor to utilize a new, non-overlapping buffer I/O address
range after each call. Otherwise, the buffer data will be overwritten
on consecutive calls.
MAC/VID Pair Entry Format

RESERVED: bit offset 0, bit length 4.
802.1q VLAN ID: bit offset 4, bit length 12.
Adapter MAC Address: bit offset 16, bit length 48.
If at any point during iteration the vioctl call exceeds the maximum
allotted time interval, or if the MAC/VID buffer is filled to capacity,
do the following:

- Store 'CT1' and 'CT2' in the buffer header so the operation can be
  continued on the next call.

- Set 'Num Entries' in the buffer header to the number of valid MAC/VID
  pairs in the output buffer.

- Set 'Reconfiguration Occurred' based on the rules described in the
  Dynamic Reconfigurations description below.

- Return H_IN_PROGRESS.
ILLAN_MAC_SCAN Buffer Format

Header:

CT1 (byte offset 0, length 8): Continuation token 1 -- this value
should be used as parm-2 for sequential calls to ILLAN_MAC_SCAN when
handling H_IN_PROGRESS return codes.

CT2 (byte offset 8, length 8): Continuation token 2 -- this value
should be used as parm-3 for sequential calls to ILLAN_MAC_SCAN when
handling H_IN_PROGRESS return codes.

Reserved (byte offset 16, length 15): This field is reserved and should
be set to zero.

Reconfiguration Occurred (byte offset 31, length 1):
  0: The data in this buffer is guaranteed to be consistent with the
  virtual adapter configuration at the point of return.
  1: The data in this buffer may contain data inconsistencies due to
  reconfiguration of ILLAN adapters between consecutive calls to
  ILLAN_MAC_SCAN. See the Dynamic Reconfigurations description below.

Num Entries (byte offset 32, length 8): The number of valid 64-bit
MAC/VID pairs in this buffer.

Data:

MAC/VID Buffer Start (byte offset 40, length variable): A
variably-sized, contiguous array of MAC/VID pairs formatted according
to the MAC/VID Pair Entry Format above. The number of valid entries in
this array is specified by the “Num Entries” field.
If all MAC addresses were successfully scanned for all VLAN ids on the
trunk adapter, do the following:

- Set 'CT1' and 'CT2' to zero in the buffer header.

- Set 'Num Entries' in the buffer header to the number of valid MAC/VID
  pairs in the output buffer.

- Set 'Reconfiguration Occurred' based on the rules described in the
  Dynamic Reconfigurations section below.

- Return H_SUCCESS.

Note that the buffer headers and data are only valid if this vioctl
returns H_IN_PROGRESS or H_SUCCESS.

Note that any unused buffer space outside of the range determined by
the 'Num Entries' field in the ILLAN_MAC_SCAN buffer header is
undefined by this architecture.

Dynamic Reconfigurations:
If the 'Reconfiguration Occurred' field in the ILLAN_MAC_SCAN buffer
header is TRUE (1), the data in all MAC/VID buffers in a call sequence
may contain inconsistencies due to dynamic reconfiguration events for
the trunk adapter itself or any ILLAN adapters associated with the
trunk adapter. In this case, all data collected from the call sequence
should be utilized with caution, or re-queried. Possible
inconsistencies arising from dynamic reconfiguration include the
following:

- MAC addresses in the buffer may correspond to ILLAN adapters that
  have been removed from the switch due to partition suspension,
  hibernation, adapter disablement, or DLPAR operations.

- MAC addresses corresponding to ILLAN adapters that were added to the
  virtual switch due to partition resumption, adapter enablement, or
  DLPAR operations may not be included in the buffer.

- ILLAN adapters may have their VLAN memberships reconfigured, in which
  case certain VID/MAC pairs in the buffer may no longer be valid, and
  some valid VID/MAC pairs for ILLAN adapters may not be included in
  the buffer at all.

- ILLAN adapters may have their MAC addresses reconfigured. Both the
  old and new MAC addresses for the adapter may be included in the
  buffer, or the old and new MAC addresses for the adapter may not be
  included in the buffer at all.
Note that even if the value of the 'Reconfiguration Occurred' field is
FALSE (0), ILLAN adapter reconfigurations may have occurred immediately
after the vioctl completed, and the information in the buffer could be
outdated.
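The continuation-token protocol can be sketched as a simple loop; the
wrapper, the subfunction code, and the return-code values are
illustrative assumptions, and the buffer is the TCE-mapped region named
by the output descriptor in parm-1:

    #include <stdint.h>

    #define H_SUCCESS     0
    #define H_IN_PROGRESS 14    /* placeholder return-code value */

    struct mac_scan_header {     /* matches the buffer format above */
        uint64_t ct1;            /* continuation token 1 */
        uint64_t ct2;            /* continuation token 2 */
        uint8_t  reserved[15];
        uint8_t  reconfig;       /* 1: data may be inconsistent */
        uint64_t num_entries;    /* valid MAC/VID pairs that follow */
    };

    #define ILLAN_MAC_SCAN 0x20  /* placeholder subfunction code */

    extern long h_vioctl(uint64_t unit_address, uint64_t subfunc,
                         uint64_t parm1, uint64_t parm2, uint64_t parm3);
    extern void consume_entries(const uint64_t *pairs, uint64_t n);

    static long scan_all_macs(uint64_t unit_address, uint64_t desc,
                              struct mac_scan_header *hdr)
    {
        uint64_t ct1 = 0, ct2 = 0;   /* zero tokens start a new sequence */
        long rc;

        do {
            rc = h_vioctl(unit_address, ILLAN_MAC_SCAN, desc, ct1, ct2);
            if (rc != H_SUCCESS && rc != H_IN_PROGRESS)
                return rc;
            /* Copy or process entries before the next call reuses the
             * buffer; MAC/VID pairs start at byte offset 40. */
            consume_entries((const uint64_t *)(hdr + 1), hdr->num_entries);
            ct1 = hdr->ct1;
            ct2 = hdr->ct2;
        } while (rc == H_IN_PROGRESS);

        return H_SUCCESS;
    }

A production caller would also honor the 'Reconfiguration Occurred'
flag and re-query when consistency matters.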
ENABLE_PREPARE_FOR_SUSPEND Subfunction Semantics

This subfunction is used to enable the “Prepare For Suspend” transport
event on a VSCSI or VFC server adapter for which this function is
called. If enabled, when a client partition is about to be migrated,
the “Prepare For Suspend” transport event will be enqueued on the
server's command queue, if active. The server should then quiesce all
I/O to the backing store and respond when ready by calling the H_VIOCTL
READY_FOR_SUSPEND subfunction. This subfunction should be called after
each H_REG_CRQ, as H_REG_CRQ disables the support. When enabled, a
“Resume” transport event will be enqueued on the server's command
queue, if active, when the client partition resumes, regardless of
whether it successfully suspended.
Parm-1 is an eight byte timeout value in milliseconds. The timeout
value specifies the maximum amount of time the hypervisor should wait,
after enqueueing the “Prepare for Suspend” transport event on the
server's command queue, to receive the H_VIOCTL READY_FOR_SUSPEND
subfunction from the server. The timeout value should take into account
the maximum amount of time needed to quiesce I/O operations prior to
migration of the client partition.

Validate that the unit-address corresponds to a VSCSI or VFC server
adapter, else return H_Parameter.

Validate parm-1 is less than or equal to 30,000 milliseconds, else
return H_Parameter.

Validate parm-2 and parm-3 are set to zero, else return H_Parameter.

Verify the server adapter is not configured for “Any client can
connect”, else return H_Constrained.

If this subfunction parameter value is not supported by the hypervisor,
return H_Not_Found.

If “Prepare For Suspend” is successfully enabled, H_Success will be
returned and an opaque hypervisor version placed in R4.

READY_FOR_SUSPEND Subfunction Semantics

This subfunction is used to respond to the “Prepare For Suspend” transport
event on a VSCSI or VFC server adapter for which this function is called. If
enabled via the H_VIOCTL ENABLE_PREPARE_FOR_SUSPEND subfunction, the server
should call this H_VIOCTL READY_FOR_SUSPEND subfunction after receiving the
“Prepare For Suspend” transport event and quiescing I/O operations to the
backing store. If the server is unable to call this subfunction within the
timeout specified in the H_VIOCTL ENABLE_PREPARE_FOR_SUSPEND subfunction,
the migration operation on the client partition will be aborted.
Validate that the unit-address corresponds to a VSCSI or VFC server
adapter, else return H_Parameter.

Validate parm-1, parm-2, and parm-3 are set to zero, else return
H_Parameter.

Validate that the server has previously called the H_VIOCTL
ENABLE_PREPARE_FOR_SUSPEND subfunction after the most recent H_REG_CRQ,
else return H_Constrained.

If this subfunction parameter value is not supported by the hypervisor,
return H_Not_Found.
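The handshake might look as follows from the server side; the wrapper
and subfunction code values are illustrative assumptions, and the
transport event format bytes (0x09/0x0A) come from the Transport Event
Codes listed later in this chapter:

    #include <stdint.h>

    #define H_SUCCESS 0

    enum vioctl_subfunc {
        ENABLE_PREPARE_FOR_SUSPEND = 0x30,   /* placeholder code values */
        READY_FOR_SUSPEND          = 0x31,
    };

    extern long h_vioctl(uint64_t unit_address, uint64_t subfunc,
                         uint64_t parm1, uint64_t parm2, uint64_t parm3);
    extern void quiesce_backing_store_io(void);

    /* Call once after each H_REG_CRQ, since registration disables the
     * support; 25,000 ms is an example timeout (must be <= 30,000 ms). */
    static long server_enable_suspend_events(uint64_t unit_address)
    {
        return h_vioctl(unit_address, ENABLE_PREPARE_FOR_SUSPEND,
                        25000, 0, 0);
    }

    /* Call from the CRQ handler on a "Prepare For Suspend" transport
     * event (header byte 0xFF, format byte 0x09). */
    static long server_handle_prepare_for_suspend(uint64_t unit_address)
    {
        quiesce_backing_store_io();
        return h_vioctl(unit_address, READY_FOR_SUSPEND, 0, 0, 0);
    }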
Partition Managed Class Infrastructure - General

In addition to the general requirements for all VIO described in
, the architecture for the partition managed class of VIO defines
several other infrastructures:

- A Command/Response Queue (CRQ) that allows communications back and
  forth between the server partition and its partner partition (see ).

- A Subordinate CRQ (Sub-CRQ) facility that may be used in conjunction
  with the CRQ facility, when the CRQ facility by itself is not
  sufficient; that is, when more than one queue with more than one
  interrupt is required by the virtual IOA. See .

- A mechanism for doing RDMA, which includes:

  - A mechanism called Copy RDMA that can be used by the device driver
    to move blocks of data between memory of the server and partner
    partitions.

  - A mechanism for Redirected RDMA that allows the device driver to
    direct DMA of data from the server partition’s physical IOA to or
    from the partner partition’s memory (see ).

The mechanisms for the synchronous type VIO are described as follows:

Command/Response Queue (CRQ)

The CRQ facility provides ordered delivery of messages between
authorized partitions. The facility is reliable in the sense that
messages are delivered in sequence, and that the sender of a message is
notified if the transport facility is unable to deliver the message to
the receiver’s queue. A notification message is delivered (provided
that there is free space on the receive queue) if the partner partition
either fails or deregisters its half of the transport connection. The
CRQ facility does not police the contents of the payload portions
(after the 1 byte header) of messages that are exchanged between the
communicating pairs; however, this architecture does provide means (via
the Format Byte) for self describing messages, such that the
definitions of the content and protocol between using pairs may evolve
over time without change to the CRQ architecture or its implementation.

The CRQ is used to hold received messages from the partner partition.
The CRQ owner may optionally choose to be notified via an interrupt
when a message is added to its queue.

CRQ Format and Registration

The CRQ is built of one or more 4 KB pages aligned on a 4 KB
boundary within partition memory. The queue is organized as a circular
buffer of 16 byte long elements. The queue is mapped into contiguous I/O
addresses via the TCE mechanism and RTCE table (first window pane). The
I/O address and length of the queue are registered by
. This registration process
tells the hypervisor where to find the virtual IOA’s CRQ.

CRQ Entry Format

Each CRQ entry consists of a 16 byte element. The first byte of a CRQ
entry is the Header byte and is defined in .
CRQ Entry Header Byte Values

0x00: Element is unused -- all other bytes in the element are
undefined.
0x01 - 0x7F: Reserved.
0x80: Valid Command/Response entry -- the second byte defines the entry
format (for example, see ).
0x81 - 0xBF: Reserved.
0xC0: Valid Initialization Command/Response entry -- the second byte
defines the entry format. See .
0xC1 - 0xFE: Reserved.
0xFF: Valid Transport Event -- the second byte defines the specific
transport event. See .
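A C rendering of a CRQ element, as a sketch (field names are ours; only
the header byte, and for valid entries the format byte, are
architected):

    #include <stdint.h>

    #define CRQ_HDR_UNUSED    0x00
    #define CRQ_HDR_CMD_RSP   0x80
    #define CRQ_HDR_INIT      0xC0
    #define CRQ_HDR_TRANSPORT 0xFF

    struct crq_entry {
        uint8_t header;       /* see the header byte values above */
        uint8_t format;       /* meaning depends on the header byte */
        uint8_t payload[14];  /* not policed by the transport */
    };
    _Static_assert(sizeof(struct crq_entry) == 16,
                   "CRQ entries are 16 bytes");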
The platform (transport mechanism) ignores the contents of all
non-header bytes in all CRQ entries.

Valid Command/Response entries (Header byte 0x80) are used to carry
data between communicating partners, transparently to the platform. The
second byte of the entry is reserved for a Format byte to enable the
definitions of the content and protocol between using pairs to evolve
over time. The definition of the second byte of the Valid
Command/Response entry is beyond the scope of this architecture.
presents example VSCSI format
byte values.

The Valid Initialization Command/Response entry (Header byte 0xC0)
is used during virtual IOA initialization sequences. The second byte of
this type entry is architected and is as defined in
. This format is used for
initialization operations between communicating partitions. The remaining
bytes (byte three and beyond) of the Valid Initialization
Command/Response entry are available for definition by the communicating
entities.
Initialization Command/Response Entry Format Byte Definitions

0x00: Unused
0x01: Initialize
0x02: Initialization Complete
0x03 - 0xFE: Reserved
0xFF: Reserved for Expansion

Valid Transport Events (Header byte 0xFF) are used by the platform
to notify communicating partners of conditions associated with the
transport channel, such as the failure of the partner’s partition
or the deregistration of the partner’s queue. The partner’s
queue may be deregistered as a means of resetting the transport channel
or simply to terminate the connection. When the Header byte of the queue
entry specifies a Valid Transport Event, then the second byte of the CRQ
entry defines the type of transport event. The Format byte (second byte)
of a Valid Transport Event queue entry is architected and is as defined
in .

Transport Event Codes

0x00: Unused
0x01: Partner partition failed
0x02: Partner partition deregistered CRQ
0x03: Reserved
0x04: Partner partition terminated via the ibm,partner-control RTAS
call (for the Partner Control option)
0x05: Reserved
0x06: Partner partition suspended (for the Partition Suspension option)
0x07 - 0x08: Reserved
0x09: Prepare for Client Adapter Suspend. See .
0x0A: Client Adapter Resume. See .
0x0B - 0xFF: Reserved

The “partner partition suspended” transport event
disables the associated CRQ such that any H_SEND_CRQ hcall() (
) to the associated CRQ returns
H_Closed until the CRQ has been explicitly enabled using the H_ENABLE_CRQ
hcall (See
).

CRQ Entry Processing

Prior to the partition software registering the CRQ, the partition
software sets all the header bytes to zero (entry invalid). After
registration, the first valid entry is placed in the first element and
the process proceeds to the end of the queue and then wraps around to the
first entry again (given that the entry has been subsequently marked as
invalid). This allows both the partition software and transport firmware
to maintain independent pointers to the next element they will be
respectively using.

A sender uses an infrastructure dependent method to enter a 16 byte
message on its partner’s queue (see ). Prior to enqueueing an entry on
the CRQ, the platform first checks that the session to the partner’s
queue is open and that there is a free entry; if not, it returns an
error. If the checks succeed, the contents of the message are copied
into the next free queue element, the receiver is potentially notified,
and a successful status is returned to the caller.

At the receiver’s option, it may be notified via an interrupt
when an element is enqueued to its CRQ. See
.

When the receiver has finished processing a queue entry, it writes
the header to the value 0x00 to invalidate the entry and free it for
future entries.

Should the receiver wish to terminate or reset the communication
channel, it deregisters the queue and, if it needs to re-establish
communications, proceeds to register either the same or a different
section of memory as the new queue, with the queue pointers reset to
the first entry.
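A sketch of the receiver side of this processing, under the same
illustrative crq_entry layout as above (the handler and cursor
management are ours):

    #include <stdint.h>

    struct crq_entry {
        uint8_t header;       /* 0x00 marks an unused element */
        uint8_t format;
        uint8_t payload[14];
    };

    /* Drain all valid entries, invalidating each header afterwards so
     * the transport can reuse the element; the cursor wraps circularly. */
    static void crq_drain(struct crq_entry *queue, unsigned nelem,
                          unsigned *cursor,
                          void (*handle)(const struct crq_entry *))
    {
        for (;;) {
            struct crq_entry *e = &queue[*cursor];
            if (e->header == 0x00)
                break;                      /* queue is empty */
            handle(e);
            e->header = 0x00;               /* free for future entries */
            *cursor = (*cursor + 1) % nelem;
        }
    }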
CRQ Facility Interrupt Notification

The receiver can set the virtual interrupt associated with its CRQ
to one of two modes. These are:

- Disabled (An enqueue interrupt is not signaled.)

- Enabled (An enqueue interrupt is signaled on every enqueue.)

Note: An enqueue is considered a pulse, not a level.
The pulse then sets the memory element within the emulated interrupt
source controller. This allows the resetting of the interrupt condition
by simply issuing the H_EOI hcall() as is done with the PCI MSI
architecture rather than having to do an explicit interrupt reset as in
the case with PCI Level Sensitive Interrupt (LSI) architecture.The interrupt mechanism is capable of presenting only one interrupt
signal at a time from any given interrupt source. Therefore, no
additional interrupts from a given source are ever signaled until the
previous interrupt has been processed through to the issuance of an H_EOI
hcall(). Specifically, even if the interrupt mode is enabled, the effect
is to interrupt on an empty to non-empty transition of the queue.
However, as with any asynchronous posting operation race conditions are
to be expected. That is, an enqueue can happen in a window around the
H_EOI hcall(). Therefore, the receiver should poll the CRQ after an H_EOI
to prevent losing initiative.

See for information about interrupt control.

Extensions to Other hcall()s for CRQ

H_MIGRATE_DMA

Since the CRQ is RTCE table mapped, the H_MIGRATE_DMA hcall() may
be requested to move a page that is part of the CRQ. The OS owner of the
queue is responsible for preventing its processors from modifying the
page during the migrate operation (as is standard practice with this
hcall()), however, the H_MIGRATE_DMA hcall() serializes with the CRQ
hcall()s to direct new elements to the migrated target page.

H_XIRR, H_EOI

The CRQ facility utilizes a virtual interrupt source number to
notify the queue owner of new element enqueues. The standard H_XIRR and
H_EOI hcall()s are extended to support this virtual interrupt mechanism,
emulating the standard PowerPC Interrupt hardware with respect to the
virtual interrupt source number.

CRQ Facility Requirements

R1--1. For the CRQ facility: The platform must implement the
CRQ as specified in .

R1--2. For the CRQ facility: The platform must reject CRQ definitions
that are not 4 KB aligned.

R1--3. For the CRQ facility: The platform must reject CRQ definitions
that are not a multiple of 4 KB long.

R1--4. For the CRQ facility: The platform must reject CRQ definitions
that are not mapped relative to the TCE mapping defined by the first
window pane of the virtual IOA’s “ibm,my-dma-window” property.

R1--5. For the CRQ facility: The platform must start enqueueing
Commands/Responses to the newly registered CRQ starting at offset zero
and proceeding as in a circular buffer, each entry being 16 byte
aligned.

R1--6. For the CRQ facility: The platform must enqueue
Commands/Responses only if the 16 byte entry is free (header byte
contains 0x00), else the enqueue operation fails.

R1--7. For the CRQ facility: The platform must enqueue the 16 bytes
specified in the validated enqueue request as specified in Requirement
except as required by Requirement .

R1--8. For the CRQ facility: The platform must not enqueue
command/response entries if the CRQ has not been registered
successfully, or if, after a successful registration, the partition has
subsequently deregistered the CRQ.

R1--9. For the CRQ facility: The platform (transport mechanism) must
ignore and must not modify the contents of all non-header bytes in all
CRQ entries.

R1--10. For the CRQ facility: The first byte of a CRQ entry must be the
Header byte and must be as defined in .

R1--11. For the CRQ facility: The Format byte (second byte) of a Valid
Initialization CRQ entry must be as defined in .

R1--12. For the CRQ facility: The Format byte (second byte) of a Valid
Transport Event queue entry must be as defined in .

R1--13. For the CRQ facility: If the partner partition fails, then the
platform must enqueue a 16 byte entry starting with 0xFF01 (last 14
bytes unspecified) as specified in Requirement except as required by
Requirements and .

R1--14. For the CRQ facility: If the partner partition deregisters its
corresponding CRQ, then the platform must enqueue a 16 byte entry
starting with 0xFF02 (last 14 bytes unspecified) as specified in
Requirement except as required by Requirements and .

R1--15. Reserved

R1--16. For the CRQ facility with the Partner Control option: If the
partner partition is terminated by request of this partition via the
ibm,partner-control RTAS call, then the platform must enqueue a 16 byte
entry starting with 0xFF04 (last 14 bytes unspecified) as specified in
Requirement except as required by Requirements and when the partner
partition has been successfully terminated.

R1--17. Reserved

R1--18. For the CRQ facility option: Platforms that implement the
H_MIGRATE_DMA hcall() must implement that function for pages mapped for
use by the CRQ.

R1--19. For the CRQ facility: The platform must emulate the standard
PowerPC External Interrupt Architecture for the interrupt source
numbers associated with the virtual devices via the standard RTAS and
hypervisor interrupt calls, and must extend the H_XIRR and H_EOI
hcall()s as appropriate for CRQ interrupts.

R1--20. For the CRQ facility: The platform’s OF must disable interrupts
from the using virtual IOA before initially passing control to the
booted partition program.

R1--21. For the CRQ facility: The platform’s OF must disable interrupts
from the using virtual IOA upon registering the IOA’s CRQ.

R1--22. For the CRQ facility: The platform’s OF must disable interrupts
from the using virtual IOA upon deregistering the IOA’s CRQ.

R1--23. For the CRQ facility: The platform must present (as appropriate
per RTAS control of the interrupt source number) the partition owning a
CRQ the appearance of an interrupt, from the interrupt source number
associated, through the OF device tree node, with the virtual device,
when a new entry is enqueued to the virtual device’s CRQ and when the
last interrupt mode set was “Enabled”, unless a previous interrupt from
the interrupt source number is still outstanding.

R1--24. For the CRQ facility: The platform must not present the
partition owning a CRQ the appearance of an interrupt, from the
interrupt source number associated, through the OF device tree node,
with the virtual device, if the last interrupt mode set was
“Disabled”, unless a previous interrupt from the interrupt source
number is still outstanding.

Redirected RDMA (Using H_PUT_RTCE and H_PUT_RTCE_INDIRECT)
A server partition uses the hypervisor function, H_PUT_RTCE, which
takes as a parameter the opaque handle (LIOBN) of the partner
partition’s RTCE table (second window pane of
“ibm,my-dma-window”), an offset in the
RTCE table, the handle for one of the server partition's I/O adapter TCE
tables plus an offset within the I/O adapter's TCE table. H_PUT_RTCE then
copies the appropriate contents of the partner partition's RTCE table
into the server partition's I/O adapter TCE table. In effect, this
hcall() allows the server partition's I/O adapter to have access to a
specific section of the partner partition's memory as if it were the
server partition's memory. However, the partner partition, through the
hypervisor, maintains control over exactly which areas of the partner
partition's memory are made available to the server partition without the
overhead of the hypervisor having to directly handle each byte of the
shared data.

The H_PUT_RTCE_INDIRECT hcall(), if implemented, takes as an input
parameter a pointer to a list of offsets into the RTCE table, and builds
the TCEs similarly to H_PUT_RTCE, described above.

A server partition uses the hypervisor function, H_REMOVE_RTCE, to
back-out TCEs generated by the H_PUT_RTCE and H_PUT_RTCE_INDIRECT
hcall()s.

The following rules guide the definition of the RTCE table entries
and implementation of the H_PUT_RTCE, H_PUT_RTCE_INDIRECT, H_REMOVE_RTCE,
H_MASS_MAP_TCE, H_PUT_TCE, H_PUT_TCE_INDIRECT, and H_STUFF_TCE hcall()s.
Other implementations that provide the same external appearance as these
rules are acceptable. The architectural intent is to provide RDMA
performance essentially equivalent to direct TCE operations.

The partner partition's RTCE table is itself never directly
accessed by an I/O Adapter (IOA), it is only accessed by the hypervisor,
and therefore it can be a bigger structure than the regular TCE table as
accessed by hardware (more fields).

When a server partition asks (via an H_PUT_RTCE or
H_PUT_RTCE_INDIRECT hcall()) to have an RTCE table TCE copied to one of
the server partition's physical IOA's TCEs, or asks (via an
H_REMOVE_RTCE) to have an RTCE table entry removed from one of the server
partition’s physical IOA’s TCEs, the hypervisor atomically,
with respect to all RTCE table readers, sets (H_PUT_RTCE or
H_PUT_RTCE_INDIRECT) or removes (H_REMOVE_RTCE) a field in the copied
RTCE table entry. This field is a pointer to the copy of the RTCE table
TCE in the server partition’s IOA’s TCE table. (A per RTCE table TCE
lock is one method for the atomic setting of the copied RTCE table TCE
link pointer.)

Note: This is an example of where the earlier statement “Other
implementations that provide the same external appearance as these
rules are acceptable” comes into effect. For example, for an RTCE table
that is mapped with H_MASS_MAP_TCE, the pointer may not be in a field
of the actual TCE in the RTCE table, but could, for example, be in a
linked list or other such structure, due to the fact that there is not
a one-to-one correspondence from the RTCE to the physical IOA TCE in
that case (H_MASS_MAP_TCE can map up to an LMB into one TCE, while
physical IOA TCEs only map 4 KB).

A server partition is guaranteed that it can create one redirected
mapping per RTCE table entry. By default, if the server partition tries
to create another copy of the same RTCE table TCE, it gets an error
return. Platforms that support the H_VIOCTL hcall() might support
multiple redirected RTCE table mappings, provided that they do not
duplicate existing mappings (that is, the mappings are for different
I/O operations); if multiple mappings are supported, the total number
of such mappings per LIOBN and per RTCE page is communicated by the
“GET_MAX_REDIRECTED_MAPPINGS” sub-function of the H_VIOCTL hcall(). If
the “GET_MAX_REDIRECTED_MAPPINGS” sub-function of the H_VIOCTL hcall()
is not implemented, then only the default single copy is supported.
Multiple mappings of the same physical page are always allowed, as long
as they originate from different RTCE table TCEs, just like with
physical IOA TCEs.

When the partner partition issues an H_PUT_TCE,
H_PUT_TCE_INDIRECT, H_STUFF_TCE, or H_MASS_MAP_TCE hcall() to change
its RTCE table, the hypervisor finds the TCE tables in one of several
states. A number of these states represent unusual conditions that can
arise from timing windows or error conditions. The hypervisor rules for
handling these cases are chosen to minimize its overhead while
preventing one partition’s errors from corrupting another partition’s
state.

The RTCE table TCE is not currently in use: Clear/invalidate the
TCE copy pointer and enter the RTCE table TCE mapping per the input
parameters to the hcall().

The RTCE table TCE contains a valid mapping and the TCE copy
pointer is invalid (NULL or other implementation dependent value) (The
previous mapping was never used for Redirected RDMA): Enter the RTCE
table TCE mapping per the input parameters to the hcall().

The RTCE table TCE contains a valid mapping and the TCE copy
pointers reference TCEs that do not contain a valid copy of the
previous mapping in the RTCE table TCE. (The previous mapping was used
for Redirected RDMA, however, the server partition has moved on and is no
longer targeting the page represented by the old RTCE table TCE mapping):
Clear/invalidate the TCE copy pointers and enter the RTCE table TCE
mapping per the input parameters to the hcall().

The RTCE table TCE contains a valid mapping and the TCE copy
pointers reference TCEs that do contain a valid copy of the previous
mapping in the RTCE table TCE (the previous mapping is still potentially
in use for Redirected RDMA, however, the partner partition has moved on
and is no longer interested in the previous I/O operation). The server
partition’s IOA may still target a DMA operation against the TCE
containing the copy of the RTCE table TCE mapping. The assumption is that
any such targeting is the result of a timing window in the recovery of
resources in the face of errors. Therefore, the server partition’s
TCE is considered invalid, but the server partition may or may not be
able to immediately invalidate the TCEs. For more information on
invalidation of TCEs, see
. The H_Resource return from an
H_PUT_TCE, H_PUT_TCE_INDIRECT, and H_STUFF_TCE may be used to hold off
invalidation in this case.

If a server partition terminates, the partner partition’s
device drivers time out the operations and resource recovery code
recovers the RTCE table TCEs. If the partner partition terminates, the
hypervisor scans the RTCE table and eventually invalidates all active
copies of RTCE table TCEs. For more information on invalidation of TCEs,
see .

The server partition may use any of the supported hcall()s (see
) to manage the TCE tables used
by its IOAs. No extra restrictions are made to changes of the server
partition's TCE table besides those stated in 2 above. The server
partition can only target its own memory or the explicitly granted
partner partition’s memory.

The H_MIGRATE_DMA hcall() made by a partner partition migrates
the page referenced by the RTCE table TCE but follows the RTCE table TCE
copy pointer, if valid, to the server partition’s IOA’s TCE
table to determine which IOA’s DMA to disable, thus allowing
migration of partner partition pages underneath server partition DMA
activity. In this case, however, the H_MIGRATE_DMA algorithm is modified
such that server partition’s IOA’s TCE table is atomically
updated, after the page migration but prior to enabling the IOA’s
DMA, only when its contents still are a valid copy of the partner
partition’s RTCE table TCE contents. The H_MIGRATE_DMA hcall() also
serializes with H_PUT_RTCE so that new copies of the RTCE table TCE are
not made during the migration of the underlying page.

The server partition should never call H_MIGRATE_DMA for any
Redirected RDMA mapped pages, however, to check, the H_MIGRATE_DMA
hcall() is enhanced to check the Logical Memory Block (LMB) owner in the
TCE and reject the call if the LMB did not belong to the
requester.

H_PUT_RTCE

This hcall() maps “count” number of contiguous TCEs in
an RTCE to the same number of contiguous IOA TCEs. The H_REMOVE_RTCE
hcall() is used to back-out TCEs built with the H_PUT_RTCE hcall(). See
for that hcall().

Syntax:

Parameters:

r-liobn: Handle of RDMA RTCE table
r-ioba: IO address per RDMA RTCE table
liobn: Logical I/O Bus Number of server TCE table
ioba: I/O address as seen by server IOA
count: Number of consecutive 4 KB pages to map

Semantics:

Validates r-liobn is from the second triple (second window pane)
of the server partition’s “ibm,my-dma-window” property, else return
H_Parameter.

Validates r-ioba plus (count * 4 KB) is within the range of the RTCE
table specified by the window pane specified by r-liobn, else return
H_Parameter.

Validates that the TCE table associated with liobn is owned by the
calling partition, else return H_Parameter.

If the Shared Logical Resource option is implemented and the LIOBN
represents a logical resource that has been rescinded by the owner,
return H_RESCINDED.

Validates that ioba plus (count * 4 KB) is within the range of the TCE
table specified by liobn, else return H_Parameter.

If the Shared Logical Resource option is implemented and the IOBA
represents a logical resource that has been rescinded by the owner,
return H_RESCINDED.

For count entries:

The following is done in a critical section with respect to updates to
the r-ioba entry of the RTCE table TCE:

Check that the r-ioba entry of the RTCE table contains a valid mapping
(this requires having a completed partner connection), else return
H_R_Parm with the value of the loop count in R4.

Prevent more redirected mappings of the same r-ioba than the platform
supports and/or duplicates: If the r-ioba entry of the RTCE table TCE
contains a valid pointer, and if that pointer references a TCE that is
a clone of the r-ioba entry of the RTCE table TCE, then an additional
redirected mapping, if supported, is used, else return H_Resource with
the value of the loop count in R4.

Validate the liobn and ioba are not already mapped for this entry, else
return H_IN_USE with the value of the loop count in R4.

Validate the number of redirected mappings for the r-ioba does not
exceed the “ibm,max-rtce-mappings” value for any of the adapters mapped
by the RTCE, else return H_Resource with the value of the loop count in
R4.

Validate the number of redirected mappings for the r-ioba does not
exceed the per client IOBA value returned from the H_VIOCTL
GET_MAX_REDIRECTED_MAPPINGS sub-function, else return H_Resource with
the value of the loop count in R4.

Validate the new entry will not cause the number of additional
redirected mappings that have already been made for this r-liobn to
exceed the maximum retrieved by the H_VIOCTL
GET_MAX_REDIRECTED_MAPPINGS sub-function, else return H_Resource with
the value of the loop count in R4.

Copy the DMA address mapping from the r-ioba entry of the r-liobn RTCE
table to the ioba entry of the liobn TCE table, and save a pointer to
the ioba entry of the liobn TCE table in the r-ioba entry of the
r-liobn RTCE table, or in a separate structure associated with the
r-liobn RTCE table.

End Loop (The critical section lasts for one iteration of the loop.)

Return H_Success.

Implementation Note: The PA requires the OS to issue a sync instruction
to precede the signalling of an IOA to start an I/O operation involving
DMA, to guarantee the global visibility of both DMA and TCE data. This
hcall() does not include a sync instruction to guarantee global
visibility of TCE data and in no way diminishes the requirement for the
OS to issue it.

Implementation Note: The execution time for this hcall() is expected to
be a linear function of the count parameter. Excessive size of the
count parameter may cause an extended delay.
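A usage sketch of H_PUT_RTCE from a server partition follows; the
wrapper is an illustrative assumption, r-liobn comes from the second
window pane of the server's “ibm,my-dma-window” property, and liobn
names the server's physical IOA TCE table:

    #include <stdint.h>

    #define H_SUCCESS 0

    extern long h_put_rtce(uint64_t r_liobn, uint64_t r_ioba,
                           uint64_t liobn, uint64_t ioba, uint64_t count);

    /* Redirect 'npages' 4 KB pages of the client's DMA window (starting
     * at r_ioba) into the physical IOA's TCE table at ioba, so the IOA
     * can DMA directly to or from the client's buffer. */
    static long map_client_buffer(uint64_t r_liobn, uint64_t r_ioba,
                                  uint64_t liobn, uint64_t ioba,
                                  uint64_t npages)
    {
        long rc = h_put_rtce(r_liobn, r_ioba, liobn, ioba, npages);
        /* On failure, R4 reports the loop count reached; a real driver
         * would back out the partial mapping with H_REMOVE_RTCE. */
        return rc;
    }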
H_PUT_RTCE_INDIRECT

This hcall() maps “count” number of potentially non-contiguous TCEs in
an RTCE to the same number of contiguous IOA TCEs. The H_REMOVE_RTCE
hcall() is used to back-out TCEs built with the H_PUT_RTCE_INDIRECT
hcall(). See for that hcall().

Syntax:

Parameters:

buff-addr: The Logical Address of a page (4 KB, 4 KB boundary)
containing a list of r-ioba entries to be mapped via the r-liobn RTCE
table
r-liobn: Handle of RTCE table to be used with r-ioba entries in the
indirect buffer (second window pane from the server
“ibm,my-dma-window”)
liobn: Logical I/O Bus Number of server TCE table
ioba: I/O address as seen by server IOA
count: Number of consecutive IOA bus 4 KB pages to map (number of
entries in buffer)

Semantics:

Validates r-liobn is from the second triple (second window pane)
of the server partition’s “ibm,my-dma-window” property, else return
H_Parameter.

Validates buff-addr points to the beginning of a 4 KB page owned by the
calling partition, else return H_Parameter.

If the Shared Logical Resource option is implemented and the logical
address’s page number represents a page that has been rescinded by the
owner, return H_RESCINDED.

Validates that the TCE table associated with liobn is owned by the
calling partition, else return H_Parameter.

If the Shared Logical Resource option is implemented and the LIOBN
represents a logical resource that has been rescinded by the owner,
return H_RESCINDED.

Validates that ioba plus (count * 4 KB) is within the range of the TCE
table specified by liobn, else return H_Parameter.

If the Shared Logical Resource option is implemented and the IOBA
represents a logical resource that has been rescinded by the owner,
return H_RESCINDED.

If the count field is greater than 512, return H_Parameter.

Copy (count * 8) bytes from the page specified by buff-addr to a
temporary hypervisor page for contents verification and processing
(this avoids the problem of the caller changing call by reference
values after they are checked).

For count entries:

Validate the r-ioba entry in the temporary page is within the range of
the RTCE table as specified by r-liobn, else place the count number in
R4 and return H_R_Parm.

End loop

For count validated entries in the hypervisor's temporary page:

The following is done in a critical section with respect to updates to
the r-ioba entry of the RTCE:

Check that the r-ioba entry of the r-liobn RTCE table contains a valid
mapping (this requires having a completed partner connection), else
return H_R_Parm with the count number in R4.

Prevent more redirected mappings of the same r-ioba than the platform
supports and/or duplicates: If the r-ioba entry of the RTCE table TCE
contains a valid pointer, and if that pointer references a TCE that is
a clone of the r-ioba entry of the RTCE table TCE, then an additional
redirected mapping, if supported, is used, else return H_Resource with
the value of the loop count in R4.

Validate the liobn and ioba are not already mapped for this entry, else
return H_IN_USE with the value of the loop count in R4.

Validate the number of redirected mappings for the r-ioba does not
exceed the “ibm,max-rtce-mappings” value for any of the adapters mapped
by the RTCE, else return H_Resource with the value of the loop count in
R4.

Validate the number of redirected mappings for the r-ioba does not
exceed the per client IOBA value returned from the H_VIOCTL
GET_MAX_REDIRECTED_MAPPINGS sub-function, else return H_Resource with
the value of the loop count in R4.

Validate the new entry will not cause the number of additional
redirected mappings that have already been made for this r-liobn to
exceed the maximum retrieved by the H_VIOCTL
GET_MAX_REDIRECTED_MAPPINGS sub-function, else return H_Resource with
the value of the loop count in R4.

Copy the DMA address mapping from the r-ioba entry of the r-liobn RTCE
table to the ioba entry of the liobn TCE table, and save a pointer to
the ioba entry of the liobn TCE table in the r-ioba entry of the
r-liobn RTCE table, or into a separate structure associated with the
r-liobn RTCE table.

End Loop (The critical section lasts for one iteration of the loop.)

Return H_Success.

Implementation Note: The PA requires the OS to issue a sync instruction
to precede the signalling of an IOA to start an I/O operation involving
DMA, to guarantee the global visibility of both DMA and TCE data. This
hcall() does not include a sync instruction to guarantee global
visibility of TCE data and in no way diminishes the requirement for the
OS to issue it.

Implementation Note: The execution time for this hcall() is expected to
be a linear function of the count parameter. Excessive size of the
count parameter may cause an extended delay.

H_REMOVE_RTCE

The H_REMOVE_RTCE hcall() is used to back-out TCEs built with
H_PUT_RTCE and H_PUT_RTCE_INDIRECT hcall()s. That is, to remove the TCEs
from the IOA TCE table and links put into the RTCE table as a result of
the H_PUT_RTCE or H_PUT_RTCE_INDIRECT hcall()s.

Syntax:

Parameters:

r-liobn: Handle of RDMA RTCE table
r-ioba: IO address per RDMA RTCE table
liobn: Logical I/O Bus Number of server TCE table
ioba: I/O address as seen by server IOA
count: Number of consecutive 4 KB pages to unmap
tce-value: TCE value to be put into the IOA TCE(s) after setting the
“Page Mapping and Control” bits to “Page fault (no access)”.

Semantics:

Validates r-liobn is from the second triple (second window pane)
of the server partition’s “ibm,my-dma-window” property, else return
H_Parameter.

Validates r-ioba plus (count * 4 KB) is within the range of the RTCE
table specified by the window pane specified by r-liobn, else return
H_Parameter.

Validates that the TCE table associated with liobn is owned by the
calling partition, else return H_Parameter.

If the Shared Logical Resource option is implemented and the LIOBN
represents a logical resource that has been rescinded by the owner,
return H_RESCINDED.

Validates that ioba plus (count * 4 KB) is within the range of the TCE
table specified by liobn, else return H_Parameter.

If the Shared Logical Resource option is implemented and the IOBA
represents a logical resource that has been rescinded by the owner,
return H_RESCINDED.

For count entries:

The following is done in a critical section with respect to updates to
the r-ioba entry of the RTCE table TCE:

If it exists, invalidate the pointer in the r-ioba entry of the r-liobn
RTCE table (or in a separate structure associated with the r-liobn RTCE
table).

Replace the ioba entry of the liobn TCE table with tce-value after
setting the “Page Mapping and Control” bits to “Page fault (no
access)”.

End Loop (The critical section lasts for one iteration of the loop.)

Return H_Success.

Implementation Note: The execution time for this hcall() is expected to
be a linear function of the count parameter. Excessive size of the
count parameter may cause an extended delay.

Redirected RDMA TCE Recovery and In-Flight DMA

There are certain error or error recovery scenarios that may
attempt to unmap a TCE in an IOA’s TCE table prior to the completion of
the operation which set up the TCE. For example:

- A client attempts to H_PUT_TCE to its DMA window pane, which is
  mapped to the second window pane of the server’s DMA window, and the
  TCE in the RTCE table which is the target of the H_PUT_TCE already
  points to a valid TCE in an IOA’s TCE table.

- A client attempts to H_FREE_CRQ and the server’s second window pane
  for that virtual IOA contains a TCE which points to a valid TCE in an
  IOA’s TCE table.

- A client partition attempts to reboot (which essentially is an
  H_FREE_CRQ).

- A server attempts to H_FREE_CRQ and the server’s second window pane
  for that virtual IOA contains a TCE which points to a valid TCE in an
  IOA’s TCE table.

In such error and error recovery situations, the hypervisor
attempts to prevent the changing of an IOA’s TCE to a value that would
cause a non-recoverable IOA error. One method that the hypervisor may
use to accomplish this is, on a TCE invalidation operation, to set the
value of the read and write enable bits in the TCE to allow DMA writes
but not reads, and to change the real page number in the TCE to target
a dummy page. In this case the IOA receives an error (Target Abort) on
attempts to read, while DMA writes (which were for a defunct operation)
are silently dropped. This works well when all the following are true:

- The platform supports separate TCE read and write enable bits in the
  TCE.

- EEH is enabled and the DD can recover from the MMIO Stopped and DMA
  Stopped states.

- The IOA and the IOA’s DD can recover gracefully from Target Aborts
  (which are received on a read to a page where the read enable bit is
  off).

If these conditions are not true, then the hypervisor will need to
try to prevent or delay invalidation of the TCEs. The H_Resource return
from the H_FREE_CRQ, H_PUT_TCE, H_PUT_TCE_INDIRECT, and H_STUFF_TCE
hcall()s can be used to hold off the invalidation until such time as
the IOA can complete the operation and the server can invalidate the
IOA’s TCE. In addition, the Bit Bucket Allowed LIOBN attribute and the
H_LIOBN_ATTRIBUTES hcall() can be used to help enhance the
recoverability in these error scenarios (see and for more
information).

LIOBN Attributes

There are certain LIOBN attributes that are made visible to and can
be manipulated by partition software. The H_LIOBN_ATTRIBUTES hcall is
used to read and modify the attributes (see
).
defines the attributes that are visible and manipulatable.

LIOBN Attributes

Bits 0-62: Reserved.

Bit 63: Bit Bucket Allowed.

1: For an indirect IOA TCE invalidation operation (that is, via an
operation other than an H_PUT_TCE directly to the TCE by the partition
owning the IOA), the platform may set the value of the read and write
enable bits in the TCE to allow DMA writes but not reads and change the
real page number in the TCE to target a dummy page (the IOA receives an
error (Target Abort) on attempts to read, while DMA writes (which were
for a defunct operation) are silently dropped).

0: The platform must reasonably attempt to prevent an indirect (that
is, via an operation other than an H_PUT_TCE directly to the TCE by the
partition owning the IOA) modification of an IOA’s valid TCE, so that a
possible in-flight DMA does not cause a non-recoverable error.

Software Implementation Notes:

Changing this field when there are valid TCEs for the LIOBN may produce
unexpected results. The hypervisor is not required to prevent such an
operation. Therefore, the H_LIOBN_ATTRIBUTES call to change the value
of this field should be made when there are no valid TCEs in the table
for the IOA.

This field may be implemented but not changeable (the actual value will
be returned in R4 as a result of the H_LIOBN_ATTRIBUTES hcall()
regardless, with a status of H_Constrained if not changeable).
H_LIOBN_ATTRIBUTES

R1--1. If the H_LIOBN_ATTRIBUTES hcall() is implemented, then it must
implement the attributes as they are defined in and the syntax and
semantics as defined in .

R1--2. The H_LIOBN_ATTRIBUTES hcall() must ignore bits in the set-mask
and reset-mask which are not implemented, must process as an exception
those which cannot be changed (H_Constrained returned), and must return
the following for the LIOBN Attributes in R4:

- A value of 0 for unimplemented bit positions.

- The resultant field values for implemented fields.

Syntax:

Parameters:

liobn: The LIOBN on which this Attribute modification is to be
performed.

reset-mask: The bit-significant mask of bits to be reset in the
LIOBN’s Attributes (the reset-mask bit definition aligns with the
bit definition of the LIOBN’s Attributes, as defined in
). The complement of the
reset-mask is ANDed with the LIOBN’s Attributes, prior to applying
the set-mask. See semantics for more details on any field-specific
actions needed during the reset operations. If a particular field
position in the LIOBN Attributes is not implemented, then the
corresponding bit(s) in the reset-mask are ignored.

set-mask: The bit-significant mask of bits to be set in the
LIOBN’s Attributes (the set-mask bit definition aligns with the bit
definition of the LIOBN’s Attributes, as defined in
). The set-mask is ORed with
the LIOBN’s Attributes, after applying the reset-mask. See
semantics for more details on any field-specific actions needed during
the set operations. If a particular field position in the LIOBN
Attributes is not implemented, then the corresponding bit(s) in the
set-mask are ignored.

Semantics:

Validate that liobn belongs to the partition, else return H_Parameter.

If the Bit Bucket Allowed field of the specified LIOBN’s Attributes is
implemented and changeable, then set it to the result of: the Bit
Bucket Allowed field contents ANDed with the complement of the
corresponding bits of the reset-mask, and then ORed with the
corresponding bits of the set-mask.

Load R4 with the value of the LIOBN’s Attributes, with any
unimplemented bits set to 0, and if all requested changes were made,
then return H_Success, otherwise return H_Constrained.
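As an illustration of the mask semantics, the following sketch sets the
Bit Bucket Allowed field; the wrapper is an illustrative assumption,
and bit 63 (the least significant bit in PowerPC big-endian bit
numbering) is expressed accordingly:

    #include <stdint.h>

    #define H_SUCCESS     0
    #define H_CONSTRAINED 20                /* placeholder value */
    #define BIT_BUCKET_ALLOWED (1ULL << 0)  /* attribute bit 63 = LSB */

    extern long h_liobn_attributes(uint64_t liobn, uint64_t reset_mask,
                                   uint64_t set_mask, uint64_t *attrs_r4);

    /* Returns nonzero if Bit Bucket Allowed ends up set. The reset-mask
     * is applied first (ANDed complement), then the set-mask is ORed. */
    static int enable_bit_bucket(uint64_t liobn)
    {
        uint64_t attrs = 0;
        long rc = h_liobn_attributes(liobn, 0, BIT_BUCKET_ALLOWED, &attrs);
        if (rc == H_CONSTRAINED)       /* implemented but not changeable */
            return (attrs & BIT_BUCKET_ALLOWED) != 0;
        return rc == H_SUCCESS && (attrs & BIT_BUCKET_ALLOWED) != 0;
    }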
Extensions to Other hcall()s for Redirected RDMA

H_PUT_TCE, H_PUT_TCE_INDIRECT, and H_STUFF_TCE

These hcall()s are only valid for the first window pane of the
“ibm,my-dma-window” property. See
for information about window
pane types.

The following are extensions that apply to the H_PUT_TCE,
H_PUT_TCE_INDIRECT, and H_STUFF_TCE hcall()s in their use against an RTCE
table.

Recognize the validated (owned by the calling partition, else
H_Parameter) LIOBN as referring to an RTCE table (first window pane)
and access accordingly:

If the TCE is not from the first triple (first window pane) of
the calling partition’s
“ibm,my-dma-window” property, return
H_Parameter.

If the TCE is not currently in use: Clear/invalidate the TCE copy
pointer and enter the TCE mapping per the input parameters to the
hcall().

If the TCE contains a valid mapping and the TCE copy pointer is
invalid: Enter the TCE mapping per the input parameters to the
hcall().

If the TCE contains a valid mapping and the TCE copy pointer
references a TCE that does not contain a valid copy of the previous
mapping in the TCE: Clear/invalidate the TCE copy pointer and enter the
TCE mapping per the input parameters to the hcall().

If the TCE contains a valid mapping and the TCE copy pointer
references a TCE that does contain a valid copy of the previous mapping
in the TCE, then:If the Bit Bucket Allowed Attribute of the LIOBN containing the
TCE is a 1, invalidate the copied TCE and enter the TCE mapping per the
input parameters to the hcall().If the Bit Bucket Allowed Attribute of the LIOBN containing the
TCE is a 0, then return H_Resource or perform some other
platform-specific error recovery.
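The TCE copy pointer cases above collapse into a short decision chain. The following hedged C restatement uses assumed structures (tce_t and its fields are illustrative; real RTCE table internals are implementation dependent):

```c
#include <stdbool.h>
#include <stddef.h>

#define H_SUCCESS  0
#define H_RESOURCE 1   /* illustrative values */

/* Hypothetical RTCE table entry: a mapping plus an optional copy pointer
 * to a redirected (cloned) TCE. */
typedef struct tce {
    bool          valid;     /* entry currently maps a page        */
    unsigned long mapping;   /* the mapping itself (illustrative)  */
    struct tce   *copy_ptr;  /* TCE copy pointer; NULL if invalid  */
} tce_t;

static bool clone_still_valid(const tce_t *rtce)
{
    return rtce->copy_ptr != NULL && rtce->copy_ptr->valid &&
           rtce->copy_ptr->mapping == rtce->mapping;
}

/* Restatement of the H_PUT_TCE rules for a first-window-pane RTCE entry. */
static int put_rtce(tce_t *rtce, unsigned long new_mapping,
                    bool bit_bucket_allowed)
{
    if (!rtce->valid) {
        rtce->copy_ptr = NULL;              /* clear/invalidate copy pointer */
    } else if (rtce->copy_ptr != NULL) {
        if (!clone_still_valid(rtce)) {
            rtce->copy_ptr = NULL;          /* stale pointer: just clear it  */
        } else if (bit_bucket_allowed) {
            rtce->copy_ptr->valid = false;  /* invalidate the copied TCE     */
            rtce->copy_ptr = NULL;
        } else {
            return H_RESOURCE;              /* or platform-specific recovery */
        }
    }
    rtce->mapping = new_mapping;            /* enter mapping per the hcall() */
    rtce->valid   = true;
    return H_SUCCESS;
}
```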
H_MIGRATE_DMA

Check that the pages referenced by the TCEs specified in the mappings to be migrated belong to the calling partition, else
H_Parameter.If the mapping being migrated is via an RTCE table (that is, LIOBN
points to an RTCE table), then follow the valid redirected TCE pointer
and migrate the redirected page (if the redirected TCE mapping is still a
clone of the original RTCE table entry).If the mapping being migrated is via an RTCE table and if the RTCE
table TCEs were built with the H_MASS_MAP_TCE hcall(), then expand each
mass mapped area into smaller 4 KB granularities, as necessary to avoid
performance and locking issues, during the migration process.Insert checking, and potentially delays, to allow IOAs to make
forward progress between successive DMA disables caused by multiple
partner partitions making simultaneous uncoordinated calls to
H_MIGRATE_DMA targeting the same IOA.

Subordinate Command/Response Queue (Sub-CRQ)

The Sub-CRQ facility is used in conjunction with the CRQ facility,
for some virtual IOA types, when more than one queue is needed for the
virtual IOA. For information on the CRQ facility, see
. For information on which
virtual IOAs may use the Sub-CRQ facilities, see the applicable sections
for the virtual IOAs. See
for a comparison of the
differences in the queue structures between CRQs and Sub-CRQs. In
addition to the hcall()s specified in
, all of the following hcall()s
and RTAS calls are applicable to both CRQs and Sub-CRQs: H_XIRR, H_EOI, ibm,int-on, ibm,int-off, ibm,set-xive, ibm,get-xive.
CRQ and Sub-CRQ Comparison

Characteristic | CRQ | Sub-CRQ
Queue entry size (bytes) | 16 | 32
Transport and initialization events | Applicable | Not applicable (coordinated through the CRQ that is associated with the Sub-CRQ)
Registration | H_REG_CRQ | H_REG_SUB_CRQ
Deregistration | H_FREE_CRQ | H_FREE_SUB_CRQ (Note: H_FREE_CRQ for the associated CRQ implicitly deregisters the associated Sub-CRQs)
Enable | H_ENABLE_CRQ | Not applicable
Interrupt number | Obtained from “interrupts” property | Obtained from H_REG_SUB_CRQ
Interrupt enable/disable | H_VIO_SIGNAL | H_VIOCTL subfunction (For virtual IOAs that define the use of Sub-CRQs, the interrupt associated with the CRQ, as defined by the “interrupts” property in the OF device tree, may be enabled or disabled with either the H_VIOCTL or the H_VIO_SIGNAL hcall(). The CRQ interrupt associated with a CRQ of a virtual IOA that does not define the use of Sub-CRQs should be enabled and disabled by use of the H_VIO_SIGNAL hcall().)
hcall() used to place entry on queue | H_SEND_CRQ | H_SEND_SUB_CRQ, H_SEND_SUB_CRQ_INDIRECT
Number of queues per virtual IOA | One | Zero or more, depending on virtual IOA architecture, implementation, and client/server negotiation
Sub-CRQ Format and Registration

Each Sub-CRQ is built of one or more 4 KB pages aligned on a 4 KB
boundary within partition memory, and is organized as a circular buffer
of 32 byte long elements. Each queue is mapped into contiguous I/O
addresses via the TCE mechanism and RTCE table (first window pane). The
I/O address and length of each queue is registered by the process defined
in
. This registration process
tells the hypervisor where to find the virtual IOA’s
Sub-CRQ(s).

Sub-CRQ Entry Format

Each Sub-CRQ entry consists of a 32 byte element. The first byte of
a Sub-CRQ entry is the Header byte and is defined in
.
Sub-CRQ Entry Header Byte Values

Header Value | Description
0x00 | Element is unused -- all other bytes in the element are undefined.
0x01 - 0x7F | Reserved.
0x80 | Valid Command/Response entry.
0x81 - 0xFF | Reserved.
The platform (transport mechanism) ignores the contents of all
non-header bytes in all Sub-CRQ entries.
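A minimal C view of the element layout, assuming nothing beyond the 32 byte size and the header byte values defined above (the payload split is illustrative):

```c
#include <stdint.h>

#define SUBCRQ_HDR_UNUSED 0x00  /* element free; remaining bytes undefined */
#define SUBCRQ_HDR_VALID  0x80  /* valid Command/Response entry            */
                                /* 0x01-0x7F and 0x81-0xFF are reserved    */

/* One 32 byte Sub-CRQ element. Only the first (header) byte is
 * interpreted by the transport; the payload is protocol defined. */
struct subcrq_entry {
    uint8_t header;
    uint8_t payload[31];
};

_Static_assert(sizeof(struct subcrq_entry) == 32,
               "Sub-CRQ entries are 32 bytes");
```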
The operational state of any Sub-CRQs follows the operational state of the CRQ to which the Sub-CRQ is associated. That is, the CRQ transport
is required to be operational in order for any associated Sub-CRQs to be
operational (for example, if an H_SEND_CRQ hcall() would not succeed due
to any reason other than lack of space in the CRQ, then an
H_SEND_SUB_CRQ or H_SEND_SUB_CRQ_INDIRECT hcall() to the associated
Sub-CRQ would also fail). Hence, the Sub-CRQ transport does not implement
the transport and initialization events that are implemented by the CRQ
facility.

Sub-CRQ Entry Processing

During the Sub-CRQ registration (H_REG_SUB_CRQ), the platform
firmware sets all the header bytes of the Sub-CRQ being registered to
zero (entry invalid). After registration, the first valid entry is placed
in the first element and the process proceeds to the end of the queue and
then wraps around to the first entry again (given that the entry has been
subsequently marked as invalid). This allows both the partition software
and transport firmware to maintain independent pointers to the next
element they will be respectively using.A sender uses an H_SEND_SUB_CRQ hcall() to enter one 32 byte
message on its partner’s Sub-CRQ. Prior to enqueueing an entry on
the Sub-CRQ, the platform first checks if the session to the
partner’s associated CRQ is open and that there is enough free space on the Sub-CRQ; if not, it returns an error. If the checks succeed, the contents of the message are copied into the next free queue element, the receiver is potentially notified, and a successful status is returned to
the caller. The caller may also insert more than one entry on the queue
with one hcall() using H_SEND_SUB_CRQ_INDIRECT. Use of this hcall()
requires that there be enough space on the queue for all the entries,
otherwise none of the entries are placed onto the Sub-CRQ.At the receiver’s option, it may be notified via an interrupt
when an element is enqueued to its Sub-CRQ. See
.When the receiver has finished processing a Sub-CRQ entry, it
writes the header to the value 0x00 to invalidate the entry and free it
for future entries.Should the receiver wish to terminate or reset the communication
channel it deregisters the Sub-CRQ (H_FREE_SUB_CRQ), and if it needs to
re-establish communications, proceeds to register (H_REG_SUB_CRQ) either
the same or different section of memory as the new queue, with the queue
pointers reset to the first entry. Deregistering a CRQ (H_FREE_CRQ) is an
implicit deregistration of any Sub-CRQs associated with the CRQ.

Sub-CRQ Facility Interrupt Notification

The receiver can set the virtual interrupt associated with its
Sub-CRQ to one of two modes. These are:Disabled (an enqueued interrupt is not signaled).Enabled (an enqueued interrupt is signaled on every
enqueue).Note: An enqueue is considered a pulse, not a level.
The pulse then sets the memory element within the emulated interrupt
source controller. This allows the resetting of the interrupt condition
by simply issuing the H_EOI hcall() as is done with the PCI MSI
architecture, rather than having to do an explicit interrupt reset, as is
the case with PCI Level Sensitive Interrupt (LSI) architecture.The interrupt mechanism is capable of presenting only one interrupt
signal at a time from any given interrupt source. Therefore, no
additional interrupts from a given source are ever signaled until the
previous interrupt has been processed through to the issuance of an H_EOI
hcall(). Specifically, even if the interrupt mode is enabled, the effect
is to interrupt on an empty to non-empty transition of the queue.
However, as with any asynchronous posting operation race conditions are
to be expected. That is, an enqueue can happen in a window around the
H_EOI hcall(). Therefore, the receiver should poll the Sub-CRQ (that is,
look at the header byte of the next queue entry to see if the entry is
valid) after an H_EOI to prevent losing initiative.
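In practice the receiver drains the queue, issues the H_EOI, and then drains once more to cover the enqueue window around the H_EOI. A sketch under assumed names (plat_h_eoi() and process() are hypothetical placeholders, not the defined firmware interface):

```c
#include <stdint.h>

struct subcrq_entry { volatile uint8_t header; uint8_t payload[31]; };

struct subcrq {
    struct subcrq_entry *ring;
    unsigned int size, next;          /* element count, service cursor */
};

extern void plat_h_eoi(uint32_t xirr);        /* hypothetical H_EOI wrapper */
extern void process(struct subcrq_entry *e);  /* protocol-specific handler  */

static void drain(struct subcrq *q)
{
    while (q->ring[q->next].header == 0x80) {   /* valid entry present */
        process(&q->ring[q->next]);
        q->ring[q->next].header = 0x00;         /* free the element    */
        q->next = (q->next + 1) % q->size;
    }
}

/* Interrupt handler: drain, EOI, then poll again so an enqueue that
 * raced with the H_EOI is not left unserviced (loss of initiative). */
static void subcrq_interrupt(struct subcrq *q, uint32_t xirr)
{
    drain(q);
    plat_h_eoi(xirr);
    drain(q);
}
```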
The hcall() used to enable and disable this Sub-CRQ interrupt notification is H_VIO_SIGNAL (see
).

Extensions to Other hcall()s for Sub-CRQ

H_MIGRATE_DMA

Since Sub-CRQs are RTCE table mapped, the H_MIGRATE_DMA hcall() may
be requested to move a page that is part of a Sub-CRQ. The OS owner of
the queue is responsible for preventing its processors from modifying the
page during the migrate operation (as is standard practice with this
hcall()), however, the H_MIGRATE_DMA hcall() serializes with the Sub-CRQ
hcall()s to direct new elements to the migrated target page.

H_XIRR, H_EOI

The Sub-CRQ facility utilizes a virtual interrupt source number to
notify the queue owner of new element enqueues. The standard H_XIRR and
H_EOI hcall()s are extended to support this virtual interrupt mechanism,
emulating the standard PowerPC Interrupt hardware with respect to the
virtual interrupt source number.

Sub-CRQ Facility Requirements

R1--1.For the Sub-CRQ facility: The platform must implement
the Sub-CRQ as specified in
.R1--2.For the Sub-CRQ facility:
The platform must
start enqueueing Commands/Responses to the newly registered Sub-CRQ
starting at offset zero and proceeding as in a circular buffer, each
entry being 32 byte aligned.R1--3.For the Sub-CRQ facility:
The platform must
enqueue Commands/Responses only if the 32 byte entry is free (header byte
contains 0x00), else the enqueue operation fails.R1--4.For the Sub-CRQ facility: The first byte of a Sub-CRQ
entry must be the Header byte and must be as defined in
.R1--5.For the Sub-CRQ facility option: Platforms that
implement the H_MIGRATE_DMA hcall() must implement that function for
pages mapped for use by the Sub-CRQ.R1--6.For the Sub-CRQ facility: The platforms must emulate
the standard PowerPC External Interrupt Architecture for the interrupt
source numbers associated with the virtual devices via the standard RTAS
and hypervisor interrupt calls and must extend H_XIRR and H_EOI hcall()s
as appropriate for Sub-CRQ interrupts.
Partition Managed Class - Synchronous Infrastructure

The architectural intent of the Synchronous VIO infrastructure is
for platforms where the communicating partitions are under the control of
the same hypervisor. Operations between the partitions are via
synchronous hcall() operations. The Synchronous VIO infrastructure
defines three options:Reliable Command/Response Transport option (see
Subordinate CRQ Transport option (see
Logical Remote DMA (LRDMA) option (see
)

Reliable Command/Response Transport Option

For the synchronous infrastructure, the CRQ facility defined in
is
implemented via the Reliable Command/Response Transport
option. The synchronous nature of this infrastructure allows for the
capability to immediately (synchronously) notify the sender of the
message whether the message was delivered successfully or not.Reliable CRQ Format and RegistrationThe format of the CRQ is as defined in
.The I/O address and length of the queue are registered using the
H_REG_CRQ hcall().
.Reliable CRQ Entry FormatSee
.Reliable CRQ Entry ProcessingA sender uses the H_SEND_CRQ hcall() to enter a 16 byte message on
its partner’s queue. The hcall() takes the entire message as input
parameters in two registers.
.Reliable Command/Response Transport Interrupt
NotificationThe receiver can enable and disable the virtual interrupt
associated with its CRQ. See
.Reliable Command/Response Transport hcall()sThe H_REG_CRQ and H_FREE_CRQ hcall()s are used by both client and
server virtual IOA device drivers. It is the architectural intent that
the hypervisor maintains a connection control structure for each defined
partner/server connection. The H_REG_CRQ and its corresponding H_FREE_CRQ
register and deregister partition resources with that connection control
structure. However, there are several conditions that can arise
architecturally with this connection process (the design of an
implementation may preclude some of these conditions).The association connection to the partner virtual IOA not being
defined (H_Not_Found). The CRQ registration function fails and the CRQ is not registered with the hypervisor.The partner virtual IOA may not have registered its CRQ (H_Closed). The CRQ is registered with the hypervisor and the connection. However, the connection is incomplete because the partner has not registered.The partner virtual IOA may be already connected to another partner virtual IOA (H_Resource). The CRQ registration function fails and the CRQ is not registered with the hypervisor or the connection.
is somewhat different depending upon whether the calling device driver is for a client or a server IOA. Server IOAs in many cases register prior to their
partner IOAs since they are servers and subsequently wait for service
requests from their clients. Therefore, the H_Closed return code is to be
expected when the DD’s CRQ has been registered with the connection
and is just waiting for the partner to register. Should a partner DD
register its CRQ in the future, higher level protocol messages (via the
Initialization Command/Response CRQ entry) can notify the server DD when
the connection is established. If a client IOA registers and receives a
return code of H_Closed, it may choose to deregister the CRQ and fail
since the client IOA would not be in a position to successfully send
service requests using the CRQ facility, or it may wait and rely upon
higher level CRQ messages (via the Initialization Command/Response CRQ
entry) to tell it when its partner has registered. The reactions of virtual IOA DDs to H_Not_Found and H_Resource are dependent upon the functionality of higher level platform and system management policies. While the current registration has failed, higher level system and/or
platform management actions may allow a future registration request to
succeed.When registration succeeds, an association is made between the
partner partition’s LIOBN (RTCE table) and the second window pane
of the server partition. This association is dropped when either partner
deregisters or terminates. However, on deregistration or termination, the
RTCE tables associated with the local partition (first window pane)
remain intact for that partition (see Requirement
).

H_REG_CRQ

This hcall() registers the RTCE table mapped memory that contains
the CRQ.Syntax:Parameters:unit-address: Unit Address per device tree node
“reg” propertyqueue: I/O address (offset into the RTCE table) of the CRQ buffer
(starting on a 4 KB boundary).len: Length of the CRQ in bytes (a multiple of 4 KB)Semantics:Validate unit-address, else H_ParameterValidate queue, which is the I/O address of the CRQ (I/O
addresses for entire buffer length starting at the specified I/O address
are translated by the RTCE table, is 4 KB aligned, and length, len, is a
multiple of 4 KB), else H_ParameterValidate that there is an authorized connection to another
partition associated with the Unit Address, else H_Not_Found.Validate that the authorized connection to another partition
associated with the Unit Address is free, else H_Resource.Initialize the CRQ enqueue pointer and length variables. These
variables are kept in terms of I/O addresses so that page migration works
and any remapping of TCEs is effective.Disable CRQ interrupts.Allow for Logical Remote DMA, when applicable, with associated
partner partition when partner registers.If partner is already registered, then return H_Success, else
return H_Closed.

H_FREE_CRQ

This hcall() deregisters the RTCE table mapped memory that contains
the CRQ. In addition, if there are any Sub-CRQs associated with the CRQ,
the H_FREE_CRQ has the effect of releasing the Sub-CRQs.Syntax:Parameters:unit-address: Unit Address per device tree node
“reg” propertySemantics:Validate unit-address, else H_ParameterMark the connection to the associated partner partition as closed
(so that send hcall()s from the partner partition fail).Mark the CRQ enqueue pointer and length variables as
invalid.For any and all Sub-CRQs associated with the CRQ, do the
following:Mark the connection to the associated partner partition as closed
for the Sub-CRQ (so that send hcall()s from the partner partition
fail).Mark the Sub-CRQ enqueue pointer and length variables for the
Sub-CRQ as invalid.Disable Sub-CRQ interrupts for the Sub-CRQ.Disable CRQ interrupts.If there exists any Redirected TCEs in the local TCE tables
associated with this Virtual IOA, and all of those tables have a Bit
Bucket Allowed attribute of 1, then Disable Logical Remote DMA with
associated partner partition, if enabled, invalidating any Redirected
TCEs in the local TCE tables (for information on invalidation of TCEs,
see ).If there exists any Redirected TCEs in the local TCE tables
associated with this Virtual IOA, and any of those tables have a Bit
Bucket Allowed attribute of 0, then return H_Resource or perform some
other platform-specific error recovery.Send partner terminated message to partner queue (if it is still
registered), overlaying the last valid entry in the queue if the CRQ is
full.Return H_Success.Implementation Note: If the hypervisor returns an
H_Busy, H_LongBusyOrder1mSec, or H_LongBusyOrder10mSec, software must
call H_FREE_CRQ again with the same parameters. Software may choose to
treat H_LongBusyOrder1mSec and H_LongBusyOrder10mSec the same as H_Busy.
The hypervisor, prior to returning H_Busy, H_LongBusyOrder1mSec, or
H_LongBusyOrder10mSec, will have placed the virtual adapter in a state
that will cause it to not accept any new work nor surface any new virtual
interrupts (no new entries will be placed on the CRQ).

H_SEND_CRQ

This hcall() sends one 16 byte entry to the partner
partition’s registered CRQ.Syntax:Parameters:unit-addr: Unit Address per device tree node
“reg” propertymsg-high:header: high order bit is on -- header of value 0xFF is reserved
for transport error and is invalid for input.format: not checked by the firmware.msg-low: not checked by the firmware -- should be consistent with
the definition of the format byte.Semantics:Validate the Unit Address, else return H_ParameterValidate that the msg header byte has its high order bit on and
that it is not = 0xFF, else return H_Parameter.Validate that there is an authorized connection to another
partition associated with the Unit Address and that the associated CRQ is
enabled, else return H_Closed.Enter Critical Section on target CRQValidate that there is room on the receive queue for the message
and allocate that message, else exit critical Section and return
H_Dropped.Store msg-low into the second 8 bytes of the allocated queue
element.Store order barrierStore msg-high into the first 8 bytes of the allocated queue
element (setting the header valid bit).Exit Critical Section.If receiver queue interrupt mode is enabled, then signal
interrupt.Return H_Success.
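From the sender's perspective the entire 16 byte entry is passed in the two 8 byte parameters, with the header in the high order byte of msg-high. A hedged sketch of building the call (hcall_h_send_crq() is an assumed wrapper around the actual hypervisor call):

```c
#include <stdint.h>

extern long hcall_h_send_crq(uint64_t unit_addr, uint64_t msg_hi,
                             uint64_t msg_lo);

/* Build and send one 16 byte CRQ entry. The header byte must have its
 * high order bit set and must not be 0xFF (reserved for transport
 * errors); the format byte and payload are protocol defined and are not
 * checked by the firmware. */
static long send_crq_entry(uint64_t unit_addr, uint8_t header,
                           uint8_t format, uint64_t hdr_data /* low 48 bits */,
                           uint64_t msg_lo)
{
    uint64_t msg_hi = ((uint64_t)header << 56)
                    | ((uint64_t)format << 48)
                    | (hdr_data & 0x0000FFFFFFFFFFFFULL);
    return hcall_h_send_crq(unit_addr, msg_hi, msg_lo);
}
```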
H_ENABLE_CRQ

This hcall() explicitly enables a CRQ that has been disabled due to a Partner partition suspended transport event. As a side effect of this
hcall(), all pages that are mapped via the logical TCE table associated
with the first pane of
“ibm,my-dma-window” property of the
associated virtual IOA are restored prior to successful completion of the
hcall(). It is the architectural intent that this hcall() is made while
the logical TCE contains mappings for all the pages that will be involved
in the recovery of the outstanding I/O operations at the time of the
partition migration. Further, it is the architectural intent that this
hcall() is made from a processing context that can handle the expected
busy wait return code without blocking the processor.Syntax:Parameters:unit-addr: Unit Address per device tree node
“reg” propertySemantics:Validate the Unit Address, else return H_ParameterTest that all pages mapped through the logical TCE table
associated with the first pane of the
“ibm,my-dma-window” property associated
with the unit-address parameter are present; else return
H_LongBusyOrder10mSec.Set the status of the CRQ associated with the unit-address
parameter to enabled.Return H_Success.Reliable Command/Response Transport Option
RequirementsR1--1.For the Reliable Command/Response Transport
option: The platform must implement the CRQ facility, as
defined in
.R1--2.For the Reliable Command/Response Transport
option: The platform must implement the H_REG_CRQ hcall().
.R1--3.For the Reliable Command/Response Transport
option: The platform must implement the H_FREE_CRQ hcall().
.R1--4.For the Reliable Command/Response Transport
option: The platform must implement the H_SEND_CRQ hcall().
.R1--5.For the Reliable Command/Response Transport
option: The platform must implement the H_ENABLE_CRQ hcall().
.

Logical Remote DMA (LRDMA) Option

The Logical Remote Direct Memory Access (LRDMA) option allows a
server partition to securely target memory pages within a partner
partition for VIO operations.This architecture defines two modes of RDMA:Copy RDMA is used to have the hypervisor copy data
between a buffer in the server partition’s memory and a buffer in
the partner partition’s memory. See
for more information on Copy
RDMA with respect to LRDMA.Redirected RDMA allows for a server partition to
securely target its I/O adapter's DMA operations directly at the memory
pages of the partner partition. The platform overhead of Copy RDMA is
generally greater than Redirected RDMA, but this overhead may be offset
if the server partition’s DMA buffer is being used as a data cache
for multiple VIO operations. See
for more information on
Redirected RDMA with respect to LRDMA.The mapping between the LIOBN in the second pane of a server
virtual IOA’s
“ibm,my-dma-window” property and the
corresponding partner IOA’s RTCE table is made when the CRQ
successfully completes registration. The partner partition is not aware
if the server partition is using Copy RDMA or Redirected RDMA. The server
partition uses the Logical RDMA mode that best suits its needs for a
given VIO operation. See
for more information on RTCE
tables.Copy RDMAThe Copy RDMA hcall()s are used to request that the hypervisor move
data between partitions. The specific implementation is optimized to the
platform’s hardware features. There are calls for when both source
and destination buffers are RTCE table mapped (H_COPY_RDMA) and when only
the remote buffers are mapped (H_WRITE_RDMA and H_READ_RDMA).

H_COPY_RDMA

This hcall() copies data from an RTCE table mapped buffer in one
partition to an RTCE table mapped buffer in another partition, with the
length of the transfer being specified by the transfer length parameter
in the hcall(). The
“ibm,max-virtual-dma-size” property, if
it exists (in the
/vdevice node), specifies the maximum length of the
transfer (minimum value of this property is 128 KB).Syntax:Parameters:len: Length of transfer (length not to exceed the value in the
“ibm,max-virtual-dma-size” property, if
it exists)s-liobn: LIOBN (RTCE table handle) of V-DMA source buffers-ioba: IO address of V-DMA source bufferd-liobn: LIOBN (RTCE table handle) of V-DMA destination
bufferd-ioba: I/O address of V-DMA destination bufferSemantics:Serialize access to RTCE tables with H_MIGRATE_DMA.If the
“ibm,max-virtual-dma-size” property exists
in the
/vdevice node of the device tree, then if the value
of len is greater than the value of this property, return
H_Parameter.Source and destination LIOBNs are checked for authorization per
the
“ibm,my-dma-window” property, else return
H_S_Parm or H_D_Parm, respectively.Source and destination ioba’s and length are checked for
valid ranges per the
“ibm,my-dma-window” property, else return
H_S_Parm or H_D_Parm, respectively.The access bits of the associated TCEs are checked for
authorization, else return H_Permission.Copy len number of bytes from the buffer starting at the
specified source address to the buffer starting at the specified
destination address, then return H_Success.
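For example, a server partition might copy a page from its own first-window-pane mapping into the partner's memory through its second window pane. The sketch below shows only the call shape; hcall_h_copy_rdma() and the surrounding names are assumptions for illustration:

```c
#include <stdint.h>

#define H_SUCCESS 0

extern long hcall_h_copy_rdma(uint64_t len,
                              uint64_t s_liobn, uint64_t s_ioba,
                              uint64_t d_liobn, uint64_t d_ioba);

/* Copy one 4 KB page from the server's own RTCE-mapped buffer (first
 * window pane) into the partner's memory via the second window pane. */
static int copy_page_to_partner(uint64_t my_liobn, uint64_t my_ioba,
                                uint64_t partner_liobn, uint64_t partner_ioba)
{
    long rc = hcall_h_copy_rdma(4096, my_liobn, my_ioba,
                                partner_liobn, partner_ioba);
    return (rc == H_SUCCESS) ? 0 : -1;  /* H_S_Parm, H_D_Parm, or
                                           H_Permission otherwise */
}
```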
H_WRITE_RDMA

This hcall() copies up to 48 bytes of data from a set of input parameters to an RTCE table mapped buffer in another partition.Syntax:Parameters:len: Length of transferd-liobn: LIOBN (RTCE table handle) of V-DMA destination
bufferd-ioba: I/O address of V-DMA destination bufferdata1: Source datadata2: Source datadata3: Source datadata4: Source datadata5: Source datadata6: Source dataSemantics:Check that the len parameter >= 0 and <= 48, else return
H_Parameter.The destination LIOBN is checked for authorization per the remote triple of the calling partition’s “ibm,my-dma-window” property, else return H_D_Parm.The destination ioba and length are checked for valid ranges per the remote triple of the calling partition’s
“ibm,my-dma-window” property, else return
H_D_Parm.Serialize access to the destination RTCE table with
H_MIGRATE_DMA.The access bits of the associated RTCE table TCEs are checked for
authorization, else return H_Permission.Copy len number of bytes from the data parameters starting at the
high order byte of data1 toward the low order byte of data6 into the
buffer starting at the specified destination address, then return
H_Success.

H_READ_RDMA

This hcall() copies up to 72 bytes of data from an RTCE table
mapped buffer into a set of return registers.Syntax:Parameters:len: Length of transfers-liobn: LIOBN (RTCE table handle) of V-DMA source buffers-ioba: IO address of V-DMA source bufferSemantics:Check that the len parameter >= 0 and <= 72, else return
H_Parameter.The source LIOBN is checked for authorization per the remote triple of the calling partition’s “ibm,my-dma-window” property, else return H_S_Parm.The source ioba and length are checked for valid ranges per the remote triple of the calling partition’s
“ibm,my-dma-window” property, else return
H_S_Parm.Serialize access to the source RTCE table with
H_MIGRATE_DMA.The access bits of the associated RTCE table TCEs are checked for
authorization, else return H_Permission.Copy len number of bytes from the source data buffer specified by
s-liobn starting at s-ioba, into the registers R4 through R12 starting
with the high order byte of R4 toward the low order byte of R12, then
return H_Success.Logical Remote DMA Option RequirementsR1--1.For the Logical Remote DMA option: The platform must
implement the H_PUT_RTCE hcall() as specified in
.R1--2.For the Logical Remote DMA option: The platform must
implement the extensions to the H_PUT_TCE hcall() as specified in
.R1--3.For the Logical Remote DMA option: The platform must
implement the extensions to the H_MIGRATE_DMA hcall() as specified in
.R1--4.For the Logical Remote DMA option: The platform must
implement the H_COPY_RDMA hcall() as specified in
.R1--5.For the Logical Remote DMA option:
The platform must
disable Logical Remote DMA operations that target an inactive partition
(one that has terminated), including the H_COPY_RDMA hcall() and the
H_PUT_RTCE hcall().Implementation Note: It is expected that as part of
meeting Requirement
, all of the terminating
partition’s TCE table entries (regular and RTCE) are invalidated
along with any clones (for information on invalidation of TCEs, see
). While other mechanisms are
available for meeting this requirement in the case of H_COPY_RDMA, this
is the only method for Redirected RDMA, and since it works in both cases,
it is expected that implementations will use this single
mechanism.

Subordinate CRQ Transport Option

For the synchronous infrastructure, in addition to the CRQ facility
defined in
,
the Subordinate CRQ Transport option may also be implemented
in conjunction with the CRQ facility. That is, the Subordinate CRQ
Transport option requires that the Reliable Command/Response Transport
option also be implemented. For this option, the Sub-CRQ facility defined
in
is
implemented.Sub-CRQ Format and RegistrationThe format of the Sub-CRQ is as defined in
.The I/O address and length of the queue are registered using the
H_REG_SUB_CRQ hcall().
.Sub-CRQ Entry FormatSee
.Sub-CRQ Entry ProcessingA sender uses the H_SEND_SUB_CRQ or H_SEND_SUB_CRQ_INDIRECT hcall()
to enter one or more 32 byte messages on its partner’s queue.
and
.Sub-CRQ Transport Interrupt NotificationThe receiver can enable and disable the virtual interrupt
associated with its Sub-CRQ using the H_VIOCTL hcall(), with the
appropriate subfunction. See
. The interrupt number that is
used in the H_VIOCTL call is obtained from the H_REG_SUB_CRQ call that is
made to register the Sub-CRQ.Sub-CRQ Transport hcall()sThe H_REG_SUB_CRQ and H_FREE_SUB_CRQ hcall()s are used by both
client and server virtual IOA device drivers. It is the architectural
intent that the hypervisor maintains a connection control structure for
each defined partner/server connection. The H_REG_SUB_CRQ and its
corresponding H_FREE_SUB_CRQ register and deregister partition resources
with that connection control structure. However, there are several
conditions that can arise architecturally with this connection process
(the design of an implementation may preclude some of these
conditions).The association connection to the partner virtual IOA not being
defined (H_Not_Found).The partner virtual IOA CRQ connection may not have been
completed (H_Closed).The partner may deregister its CRQ which also deregisters any
associated Sub-CRQs.H_REG_SUB_CRQThis hcall() registers the RTCE table mapped memory that contains
the Sub-CRQ. Multiple Sub-CRQ registrations may be attempted for each
virtual IOA. If resources are not available to establish a Sub-CRQ, the
H_REG_SUB_CRQ call will fail with H_Resource.Programming Note: On platforms that implement the
partition migration option, after partition migration the support for
this hcall() might change, and the caller should be prepared to receive
an H_Function return code indicating the platform does not implement this
hcall(). If a virtual IOA for which this architecture requires the presence of this hcall() exists in the device tree after the migration, then it can be expected that the hcall() will exist as well.Syntax:Parameters:unit-address: Unit Address per device tree node
“reg” property.Sub-CRQ-ioba: I/O address (offset into the RTCE table, as
specified by the first window pane of the virtual IOA’s
“ibm,my-dma-window” property) of the
Sub-CRQ buffer (starting on a 4 KB boundary).Sub-CRQ-length: Length of the Sub-CRQ in bytes (a multiple of 4
KB).Semantics:Validate unit-address, else H_Parameter.Validate Sub-CRQ-ioba, which is the I/O address of the Sub-CRQ
(I/O addresses for entire buffer length starting at the specified I/O
address are translated by the RTCE table, is 4 KB aligned, and length,
Sub-CRQ-length, is a multiple of 4 KB), else H_Parameter.Validate that there are sufficient resources associated with the
Unit Address to allocate the Sub-CRQ, else H_Resource.Initialize the Sub-CRQ enqueue pointer and length variables.
These variables are kept in terms of I/O addresses so that page migration
works and any remapping of TCEs is effective.Initialize all Sub-CRQ entry header bytes to 0 (invalid).Disable Sub-CRQ interrupts.Place cookie representing Sub-CRQ number (will be used in
H_SEND_SUB_CRQ, H_SEND_SUB_CRQ_INDIRECT, and H_FREE_SUB_CRQ) in
R4.Place interrupt number (the same as will be returned by H_XIRR or
H_IPOLL for the interrupt from this Sub-CRQ) in R5.If the CRQ connection is already complete, then return H_Success,
else return H_Closed.H_FREE_SUB_CRQThis hcall() deregisters the RTCE table mapped memory that contains
the Sub-CRQ. Note that the H_FREE_CRQ hcall() also deregisters any
Sub-CRQs associated with the CRQ being deregistered by that
hcall().Programming Note: On platforms that implement the
partition migration option, after partition migration the support for
this hcall() might change, and the caller should be prepared to receive
an H_Function return code indicating the platform does not implement this
hcall(). If a virtual IOA for which this architecture requires the presence of this hcall() exists in the device tree after the migration, then it can be expected that the hcall() will exist as well.Syntax:Parameters:unit-address: Unit Address per device tree node
“reg” property.Sub-CRQ-num: The queue # cookie returned from H_REG_SUB_CRQ
hcall() at queue registration time.Semantics:Validate unit-address and Sub-CRQ-num, else H_ParameterMark the connection to the associated partner partition as closed
for the specified Sub-CRQ (so that send hcall()s from the partner
partition fail).Mark the Sub-CRQ enqueue pointer and length variables for the
specified Sub-CRQ as invalid.Disable Sub-CRQ interrupts for the specified Sub-CRQ.Return H_Success.

H_SEND_SUB_CRQ

This hcall() sends one 32 byte entry to the partner
partition’s registered Sub-CRQ.Programming Note: On platforms that implement the
partition migration option, after partition migration the support for
this hcall() might change, and the caller should be prepared to receive
an H_Function return code indicating the platform does not implement this
hcall(). If a virtual IOA for which this architecture requires the presence of this hcall() exists in the device tree after the migration, then it can be expected that the hcall() will exist as well.Syntax:Parameters:unit-addr: Unit Address per device tree node
“reg” property.Sub-CRQ-num: The queue # cookie returned from H_REG_SUB_CRQ
hcall() at queue registration time.msg-dword0: firmware checks only high order byte.msg-dword1, msg-dword2, msg-dword3: the rest of the message;
firmware does not validate.Semantics:Validate the Unit Address, else return H_Parameter.Validate that the Sub-CRQ, as specified by Sub-CRQ-num, is
properly registered by the partner, else return H_Parameter.Validate that the message header byte (high order byte of
msg-dword0) is 0x80, else return H_Parameter.Validate that there is an authorized CRQ connection to another
partition associated with the Unit Address and that the associated CRQ is
enabled, else return H_Closed.Enter Critical Section on target Sub-CRQ.Validate that there is room on the specified Sub-CRQ for the
message and allocate that message, else exit critical Section and return
H_Dropped.Store msgdword1 into bytes 4-7 of the allocated queue
element.Store msgdword2 into bytes 8-11 of the allocated queue
element.Store msgdword3 into bytes 12-15 of the allocated queue
element.Store order barrier.Store msgdword0 into bytes 0-3 of the allocated queue element
(this sets the valid bit in the header byte).Exit Critical Section.If receiver queue interrupt mode is enabled, then signal
interrupt.Return H_Success.

H_SEND_SUB_CRQ_INDIRECT

This hcall() sends one or more 32 byte entries to the partner
partition’s registered Sub-CRQ. On H_Success, all of the entries
have been put onto the Sub-CRQ. On any return code other than H_Success,
none of the entries have been put onto the Sub-CRQ.Programming Note: On platforms that implement the
partition migration option, after partition migration the support for
this hcall() might change, and the caller should be prepared to receive
an H_Function return code indicating the platform does not implement this
hcall(). If a virtual IOA for which this architecture requires the presence of this hcall() exists in the device tree after the migration, then it can be expected that the hcall() will exist as well. The maximum num-entries has increased on some platforms
from 16 to 128. On platforms that implement the partition migration option,
after partition migration the support for this hcall() might change, and the
caller should be prepared to receive an H_Parameter return code in the situation
where more than 16 num-entries have been sent, indicating the platform does not
support more than 16 num-entries.Syntax:Parameters:unit-addr: Unit Address per device tree node
“reg” property.Sub-CRQ-num: The Sub-CRQ # cookie returned from H_REG_SUB_CRQ
hcall() at queue registration time.ioba: The address of the TCE-mapped page which contains the
entries to be placed onto the specified Sub-CRQ.num-entries: Number of entries to be placed onto the specified
Sub-CRQ from the TCE mapped page starting at ioba (the maximum number of entries is 16, or 128 on platforms that support the larger maximum, in order to minimize the hcall() time).Semantics:Validate the Unit Address, else return H_Parameter.Validate that the Sub-CRQ, as specified by Sub-CRQ-num, is
properly registered by the partner, else return H_Parameter.If ioba is outside of the range of the calling partition assigned
values, then return H_Parameter.If num-entries is not in the range of 1 to the platform maximum (16, or 128 on platforms that support the larger maximum), then return H_Parameter.Validate that there is an authorized CRQ connection to another
partition associated with the Unit Address and that the associated CRQ is
enabled, else return H_Closed.Copy (num-entries * 32) bytes from the page specified starting at
ioba to a temporary hypervisor page for contents verification and
processing (this avoids the problem of the caller changing call by
reference values after they are checked).Validate that the message header bytes for num-entries starting
at ioba are 0x80, else return H_Parameter.Enter Critical Section on target Sub-CRQ.Validate that there is room on the specified Sub-CRQ for
num-entries messages and allocate those messages, else exit critical
Section and return H_Dropped.For each of the num-entries starting at iobaStore entry bytes 1-31 into bytes 1-31 of the allocated queue
element.Store order barrier.Store entry byte 0 into byte 0 of the allocated queue element
(this sets the valid bit in the header byte).LoopExit Critical Section.If receiver queue interrupt mode is enabled, then signal
interrupt.Return H_Success.
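Since the indirect form is all-or-nothing, a caller typically stages the batch in one TCE-mapped page and sends it with a single call. A hedged sketch (the wrapper name and staging convention are illustrative, not the defined interface):

```c
#include <stdint.h>
#include <string.h>

#define H_SUCCESS 0
#define H_DROPPED 1   /* illustrative values */

extern long hcall_h_send_sub_crq_indirect(uint64_t unit_addr,
                                          uint64_t subcrq_num,
                                          uint64_t ioba,
                                          uint64_t num_entries);

/* Copy up to 16 prepared 32 byte entries (header byte 0x80) into the
 * staging page previously TCE-mapped at 'ioba', then send them in one
 * call. On any failure, none of the entries were enqueued. */
static long send_batch(uint64_t unit_addr, uint64_t subcrq_num,
                       void *staging_page, uint64_t ioba,
                       const void *entries, uint64_t n)
{
    if (n == 0 || n > 16)   /* 128 on platforms with the larger maximum */
        return H_DROPPED;
    memcpy(staging_page, entries, n * 32);
    return hcall_h_send_sub_crq_indirect(unit_addr, subcrq_num, ioba, n);
}
```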
Subordinate CRQ Transport Option Requirements

R1--1.For the Subordinate CRQ Transport option: The platform must implement the Reliable Command/Response Transport option,
as defined in
.R1--2.For the Subordinate CRQ Transport option: The
platform must implement the Sub-CRQ facility, as defined in
.R1--3.For the Subordinate CRQ Transport option: The
platform must implement the H_REG_SUB_CRQ hcall().
.R1--4.For the Subordinate CRQ Transport option: The
platform must implement the H_FREE_SUB_CRQ hcall().
.R1--5.For the Subordinate CRQ Transport option: The
platform must implement the H_SEND_SUB_CRQ hcall().
.R1--6.For the Subordinate CRQ Transport option: The
platform must implement the H_SEND_SUB_CRQ_INDIRECT hcall().
.R1--7.For the Subordinate CRQ Transport option: The
platform must implement all of the following subfunctions of the H_VIOCTL
hcall() (
): DISABLE_ALL_VIO_INTERRUPTS, DISABLE_VIO_INTERRUPT, ENABLE_VIO_INTERRUPT.

Interpartition Logical LAN (ILLAN) Option

The Interpartition Logical LAN (ILLAN) option provides the
functionality of IEEE VLAN between LPAR partitions. Partitions are
configured to participate in the ILLAN. The participating partitions have
one or more logical IOAs in their device tree.The hypervisor emulates the functionality of an IEEE VLAN switch.
That functionality is defined in IEEE 802.1Q. The following information on
IEEE VLAN switch functionality is provided for informative reference only
with the referenced document being normative. Logical Partitions may have
one or more Logical LAN IOA’s each of which appears to be connected
to one and only one Logical LAN Switch port of the single Logical LAN
Switch implemented by the hypervisor. Each Logical LAN Switch port is
configured (by platform dependent means) as to whether the attached Logical
LAN IOA supports IEEE VLAN headers or not, and the allowable VLAN numbers
that the port may use (a single number if VLAN headers are not supported,
an implementation dependent number if VLAN headers are supported). When a
message arrives at a Logical LAN Switch port from a Logical LAN IOA, the
hypervisor caches the message’s source MAC address (2nd 6 bytes) to
use as a filter for future messages to the IOA. Then the hypervisor
processes the message differently depending upon whether the port is
configured for IEEE VLAN headers, or not. If the port is configured for
VLAN headers, the VLAN header (bytes offsets 12 and 13 in the message) is
checked against the port’s allowable VLAN list. If the message
specified VLAN is not in the port’s configuration, the message is
dropped. Once the message passes the VLAN header check, it passes onto
destination MAC address processing below. If the port is NOT configured for
VLAN headers, the hypervisor (conceptually) inserts a two byte VLAN header
(based upon the port’s configured VLAN number) after byte offset 11
in the message.Next, the destination MAC address (first 6 bytes of the message) is
processed by searching the table of cached MAC addresses (built from
messages received at Logical LAN Switch ports, see above). If a match for
the MAC address is not found and if there is no Trunk Adapter defined for
the specified VLAN number, then the message is dropped, otherwise if a
match for the MAC address is not found and if there is a Trunk Adapter
defined for the specified VLAN number, then the message is passed on to the
Trunk Adapter. If a MAC address match is found, then the associated switch port’s configured allowable VLAN number table is scanned for a
match to the VLAN number contained in the message’s VLAN header. If a
match is not found, the message is dropped. Next, the VLAN header
configuration of the destination Switch Port is checked, and if the port is
configured for VLAN headers, the message is delivered to the destination
Logical LAN IOA including any inserted VLAN header. If the port is
configured for no VLAN headers, the VLAN header is removed before being
delivered to the destination Logical LAN IOA.
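The ingress half of this switching behavior (the VLAN header check, or the conceptual header insertion for ports without VLAN headers) can be restated compactly. The sketch below uses hypothetical port structures purely to mirror the rules above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical Logical LAN Switch port configuration. */
struct llan_port {
    bool     vlan_headers;   /* port configured for IEEE VLAN headers */
    uint16_t allowed[16];    /* allowable VLAN numbers                */
    unsigned n_allowed;      /* one entry if vlan_headers is false    */
};

static bool vlan_allowed(const struct llan_port *p, uint16_t vlan)
{
    for (unsigned i = 0; i < p->n_allowed; i++)
        if (p->allowed[i] == vlan)
            return true;
    return false;
}

/* Determine the effective VLAN of a frame arriving from a Logical LAN
 * IOA; returns false if the frame is to be dropped. */
static bool ingress_vlan(const struct llan_port *p, const uint8_t *frame,
                         uint16_t *vlan)
{
    if (p->vlan_headers) {
        *vlan = ((uint16_t)frame[12] << 8) | frame[13]; /* byte offsets 12-13 */
        return vlan_allowed(p, *vlan);                  /* else: drop         */
    }
    *vlan = p->allowed[0];  /* port's single configured VLAN number;
                               header conceptually inserted by hypervisor */
    return true;
}
```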
The Logical LAN IOA’s device tree entry includes Unit Address, and
“ibm,my-dma-window” properties. The
“ibm,my-dma-window” property contains a
LIOBN field that represents the RTCE table used by the Logical IOA. The
Logical LAN hcall()s use the Unit Address field to imply the LIOBN and,
therefore, the RTCE table to reference.When the logical IOA is opened, the device driver registers, with the
hypervisor, as the “Buffer List”, a TCE mapped page of
partition I/O mapped memory that contains the receive buffer descriptors.
These receive buffers are mapped via a TCE mechanism from partition memory
into contiguous I/O DMA space. The first descriptor in the buffer list page
is that of the receive queue buffer. The rest of the descriptors are for a
number of buffer pools organized by increasing size of receive buffer. The
format of the descriptor is a 1 byte control field, 3 byte buffer length,
followed by a 4 byte I/O address. The number of buffer pools is determined
by the device driver (up to an architected maximum of 254). The control
field in all unused descriptors is 0x00. The last 8 bytes are reserved for
statistics.When a new message is received by the logical IOA, the list of buffer
pools is scanned starting from the second descriptor in the buffer list
looking for the first available buffer that is equal to or greater than the
received message. That buffer is removed from the pool, filled with the
incoming message, and an entry is placed on the receive queue noting the
buffer status, message length, starting data offset, and the buffer
correlator.The sender of a logical LAN message uses an hcall() that takes as
parameters the Unit Address and a list of up to 6 buffer descriptors
(length, starting I/O address pairs). The sending hcall(), after verifying
the sender owns the Unit Address, correlates the Unit Address with its
associated Logical LAN Switch port and copies the message from the send
buffer(s) into a receive buffer, as described above, for each target
logical LAN IOA that is a member of the specified VLAN. If a given logical
IOA does not have a suitable receive buffer, the message is dropped for
that logical IOA (a return code indicates that one or more destinations did
not receive a message allowing for a reliable datagram service).The logical LAN facility uses the standard H_GET_TCE and H_PUT_TCE
hcall()s to manage the I/O translation tables, along with H_MIGRATE_DMA to aid in dynamic memory reconfiguration.

Logical LAN IOA Data Structures

The Logical LAN IOA defines certain data structures as described in the following paragraphs.
outlines the
inter-relationships between several of these structures. Since multiple
hcall()s as well as multiple partitions access the same structures,
careful serialization is essential.Implementation Note: During shutdown or migration of
TCE mapped pages, implementations may choose to atomically maintain,
within a single, two field variable, a usage count of processors
currently sending data through the Logical LAN IOA combined with a
quiesce request set to the processor that is requesting the quiesce (if
no quiesce is requested, the value of this field is some reserved value).
Then a protocol, such as the following, can manage the quiesce of Logical
LAN DMA. A new sender atomically checks the DMA channel management
variable -- spinning if the quiesce field is set and subsequently
incrementing the usage count field when the quiesce variable is not set.
The sender atomically decreases the use count when through with Logical
Remote DMA copying. A quiesce requester, after atomically setting the
quiesce field with its processor number (as in a lock), waits for the
usage count to go to zero before proceeding.

Buffer Descriptor

The buffer descriptor is an 8 byte quantity, on an 8
byte boundary (so that it can be written atomically). The high order byte
is control, the next 3 bytes consist of a length field of the buffer in
bytes, the low order 4 bytes are a TCE mapped I/O address of the start of
the buffer in I/O address space.Bit 0 of the control field is the valid indicator, 0 means not
valid and 1 is valid. Bits 2-4 are reserved.Bit 1 is used in the receive queue descriptor as the valid toggle
if the descriptor specifies the receive queue, else it is reserved. If
the valid toggle is a 0, then the newly enqueued receive buffer
descriptors have a valid bit value of 1, if the valid toggle is a 1, then
the newly enqueued receive buffer descriptors have a valid bit value of
0. The hypervisor flips the value of the valid toggle bit each time it
cycles from the bottom of the receive queue to the top.Bit 5 is the Large Send Indication bit and indicates that this
packet is a large-send packet. See
for more information on the usage of this bit.Bit 6 is the No Checksum bit and indicates that there is no
checksum in this packet. See
for more information on the
usage of this bit.Bit 7 is the Checksum Good bit and indicates that the checksum in
this packet has already been verified. See
for more information on the
usage of this bit.
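Packing and unpacking the descriptor is simple shifting; a sketch follows, using the big-endian bit numbering of this section (bit 0 is the most significant bit of the control byte). The helper names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define BD_VALID  0x80u  /* control bit 0: descriptor valid        */
#define BD_TOGGLE 0x40u  /* control bit 1: valid toggle (rx queue) */

/* Pack control byte, 24 bit length, and 32 bit I/O address into the
 * 8 byte buffer descriptor (high byte = control, then length, then
 * the TCE-mapped I/O address). */
static uint64_t bd_pack(uint8_t control, uint32_t len, uint32_t ioaddr)
{
    return ((uint64_t)control << 56)
         | ((uint64_t)(len & 0x00FFFFFFu) << 32)
         | ioaddr;
}

static bool     bd_valid(uint64_t bd)  { return ((bd >> 56) & BD_VALID) != 0; }
static uint32_t bd_len(uint64_t bd)    { return (bd >> 32) & 0x00FFFFFFu; }
static uint32_t bd_ioaddr(uint64_t bd) { return (uint32_t)bd; }
```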
Buffer List

This structure is used to record buffer descriptors of various types used by the Logical LAN IOA. Additionally, running statistics about
the logical LAN adapter are maintained at the end of the structure. It
consists of one 4 KB aligned TCE mapped page. By TCE mapping the page,
the H_MIGRATE_DMA hcall() is capable of migrating this structure.The first buffer descriptor (at offset 0) contains the buffer
descriptor for the receive queue.The second buffer descriptor (at offset 8) contains the buffer
descriptor for the MAC multicast filter table.It is the architectural intent that all subsequent buffer
descriptors in the list head a pool of buffers of a given size. Further,
it is the architectural intent that descriptors are ordered in increasing
size of the buffers in their respective pools. The rest of the
description of the ILLAN option is written assuming this intent. However,
the contents of these descriptors are architecturally opaque, none of
these descriptors are manipulated by code above the architected
interfaces. This allows implementations to select the most appropriate
serialization techniques for buffer enqueue/dequeue, migration, and
buffer pool addition and subsequent garbage collection.The final 8 bytes in the buffer list is a counter of frames dropped
because there was not a buffer in the buffer list capable of holding the
frame.

Receive Queue

The receive queue is a circular buffer used to store received
message descriptors. The device driver sizes the buffer used for the
receive queue in multiples of 16 bytes, starting on a 16 byte boundary (to allow atomic store operations), with at least one more 16 byte entry than the maximum number of possible outstanding receive buffers. Failure to have enough receive queue entries may result in received messages and their buffers being lost, since the logical IOA assumes that there are always empty receive queue elements and does not check. When the device driver registers the receive queue buffer, the buffer contents should be all zeros; this ensures that the valid bits are all off.If a message is received successfully, the next 16 byte area
(starting with the area at offset 0 for the first message received after
the registration of the receive queue and looping back to the top after
the last area is used) in the receive queue is written with a message
descriptor as shown in
. Either the entire entry is
atomically written, or the write order is serialized such that the
control field is globally visible after all other fields are
visible.
Receive Queue Entry

Field Name | Byte Offset | Length | Definition
Control | 0 | 1 | Bit 0 = the appropriate valid indicator. Bit 1 = 1 if the buffer contains a valid message; Bit 1 = 0 if the buffer does not contain a valid message, in which case the device driver recycles the buffer. Bits 2-4: Reserved. Bit 5: Large Send Indication bit. If a 1, then this indicates the packet is a large-send packet. Bit 6: No Checksum bit. If a 1, then this indicates that there is no checksum in this packet (see for more information on the usage of this bit). Bit 7: Checksum Good bit. If a 1, then this indicates that the checksum in this packet has already been verified (see for more information on the usage of this bit).
Reserved | 1 | 1 | Reserved for future use.
Message Offset | 2 | 2 | The byte offset to the start of the received message. The minimum offset is 8 (to bypass the message correlator field); larger offsets may be used to allow for optimized data copy operations.
Message Length | 4 | 4 | The byte length of the received message.
Opaque handle | 8 | 8 | Copy of the first 8 bytes contained in the message buffer as passed by the device driver.
So that the device driver never has to write into the receive
queue, the VLAN logical IOA alternates the value of the valid bit on each
pass through the receive queue buffer. On the first pass following
registration, the valid bit value is written as a 1, on the next as a
zero, on the third as a 1, and so on. To allow the device driver to
follow the state of the valid bit, the Logical LAN IOA maintains a valid
bit toggle in bit 1 of the receive queue descriptor control byte. The
Logical LAN IOA increments its enqueue pointer after each enqueue. If the
pointer increment (modulo the buffer size) loops to the top, the valid
toggle bit alternates state.
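A device driver following this protocol keeps its own service pointer and expected valid-bit value, flipping the latter on each wrap, as in the hedged sketch below (the structures and deliver() upcall are illustrative):

```c
#include <stdint.h>

struct rq_entry {                /* 16 byte receive queue element       */
    volatile uint8_t control;    /* bit 0 (mask 0x80) = valid indicator */
    uint8_t  rsvd;
    uint16_t msg_offset;
    uint32_t msg_length;
    uint64_t opaque_handle;
};

struct rq {
    struct rq_entry *ring;
    unsigned n, next;            /* element count, service pointer      */
    uint8_t expected_valid;      /* 0x80 on the first pass, then 0x00,
                                    alternating on each wrap            */
};

extern void deliver(struct rq_entry *e);   /* hypothetical upcall */

static void rq_service(struct rq *q)
{
    while ((q->ring[q->next].control & 0x80) == q->expected_valid) {
        deliver(&q->ring[q->next]);
        q->next = (q->next + 1) % q->n;
        if (q->next == 0)              /* wrapped: the IOA flips the toggle */
            q->expected_valid ^= 0x80;
    }
}
```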
Following the write of the message descriptor, if enqueue interrupts are enabled and there is not an outstanding interrupt signaled
from the Logical LAN IOA’s interrupt source number, an interrupt is
signaled.It is the architectural intent that the first 8 bytes of the buffer
is a device driver supplied opaque handle that is copied into the receive
queue entry. One possible format of the opaque handle is the OS effective
address of the buffer control block that pre-ends the buffer as seen by
the VLAN Logical IOA. Within this control block might be stored the total
length of the buffer, the 8 byte buffer descriptor (used to enqueue this
buffer using the H_ADD_LOGICAL_LAN_BUFFER hcall()) and other control
fields as deemed necessary by the device driver.When servicing the receive interrupt, it is the architectural
intent that the device driver starts to search the receive queue using a
device driver maintained receive queue service pointer (initially
starting, after buffer registration, at the offset zero of the receive
queue) servicing all receive queue entries with the appropriate valid
bit, until reaching the first invalid receive queue entry. The receive
queue service pointer is also post incremented, modulo the receive queue
buffer length, and the device driver’s notion of valid bit state is
also toggled/read from the receive queue descriptor’s valid bit
toggle bit, on each cycle through the circular buffer. After all valid
receive queue entries are serviced, the device driver resets the
interrupt.
. After the interrupt reset,
the device driver again scans from the new value of the receive queue
service pointer to pick up any entries that may have been enqueued during
the interrupt reset window.

MAC Multicast Filter List

This one 4 KB page (aligned on a 4 KB boundary) opaque data
structure is used by firmware to contain multicast filter MAC addresses.
The table is initialized by firmware by the H_REGISTER_LOGICAL_LAN
hcall(). Any modification of this table by the partition software (OS or
device driver) is likely to corrupt its contents which may corrupt/affect
the OS’s partition but not other partitions, that is, the
hypervisor may not experience significant performance degradation due to
table corruption. However, for the partition that corrupted its filter
list, the hypervisor may deliver multicast address packets that had
previously been requested to be filtered out, or it may fail to deliver
multicast address packets that had been requested to be delivered.

Receive Buffers

The Logical LAN firmware requires that the minimum size receive buffer is 16 bytes, aligned on a 4 byte boundary, so that stores of
linkage pointer may be atomic. Minimum IP message sizes, and message
padding areas force a larger minimum size buffer.The first 8 bytes of the receive buffer are reserved for a device
driver defined opaque handle that is written into the receive queue entry
when the buffer is filled with a received message. Firmware never
modifies the first 8 bytes of the receive buffer.From the time of buffer registration via the
H_ADD_LOGICAL_LAN_BUFFER hcall() until the buffer is posted onto the
receive queue, the entire buffer other than the first 8 bytes is subject
to modification by the firmware. Any modification of the buffer contents,
during this time, by non-firmware code subjects receive data within the
partition to corruption. However, any data corruption caused by errors in
partition code does not escape the offending partition, except to the
extent that the corruption involves the data in Logical LAN send
buffers.Provisions are incorporated in the receive buffer format for a
beginning pad field to allow firmware to make use of data transfer
hardware that may be alignment sensitive. While the contents of the Pad
fields are undefined, firmware is not allowed to make visible to the
receiver more data than was specifically included by the sender in the
transfer message, so as to avoid a covert channel between the
communicating partitions.
Receive Buffer Format

Field Name | Byte Offset | Length | Definition
Opaque Handle | 0 | 8 | Per design of the device driver.
Pad 1 | 8 | 0 - L1 cache line size | This field, containing undefined data, may be included by the firmware to align data for optimized transfers.
Message | defined by the “Message Offset” field of the Receive Queue Entry | 12-N | The destination and source MAC address are at the first two 6 byte fields of the message, followed by the message payload.
Pad 2 | -- | To end of buffer | Buffer contents after the Message field are undefined.
Logical LAN Device Tree Node

The Logical LAN device tree node is a child of the
vdevice node which itself is a child of
/ (the root node). There exists one such node for
each logical LAN virtual IOA instance. Additionally, Logical LAN device
tree nodes have associated packages such as obp-tftp and load method as
appropriate to the specific virtual IOA configuration as would the node
for a physical IOA of type network.Logical IOA’s intrinsic MAC address -- This number is
guaranteed to be unique within the scope of the Logical LAN.
Properties of the Logical LAN OF Device Tree Node

“name” (Required: Y)
Standard property name per , specifying the virtual device name; the value shall be “l-lan”.

“device_type” (Required: Y)
Standard property name per , specifying the virtual device type; the value shall be “network”.

“model” (Required: NA)
Property not present.

“compatible” (Required: Y)
Standard property name per , specifying the programming models that are compatible with this virtual IOA; the value shall include “IBM,l-lan”.

“used-by-rtas” (Required: See definition)
Present if appropriate.

“ibm,loc-code” (Required: Y)
Property name specifying the unique and persistent location code associated with this virtual IOA; the value shall be of the form defined in .

“reg” (Required: Y)
Standard property name per , specifying the unit address (unit ID) associated with this virtual IOA, presented as an encoded array as with encode-phys of length “#address-cells”; the value shall be 0xwhatever (a virtual “reg” property used for the unit address; no actual locations are used, therefore the size field has zero cells (does not exist), as determined by the value of the “#size-cells” property).

“ibm,my-dma-window” (Required: Y)
Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int.

“local-mac-address” (Required: Y)
Standard property name per , specifying the local MAC address; locally administered MAC addresses are denoted by the low order two bits of the high order byte being 0b10.

“mac-address” (Required: See definition)
Initial MAC address (may be changed by the H_CHANGE_LOGICAL_LAN_MAC hcall()). Note: There have been requests for a globally unique MAC address per logical LAN IOA. However, the combination of that requiring the platform to ship with an unbounded set of reserved globally unique addresses (which clearly cannot work), plus the availability of IP routing for external connectivity, have overridden those requests.

“ibm,mac-address-filters” (Required: Y)
Property name specifying the number of non-broadcast multicast MAC filters supported by this implementation (between 0 and 255), presented as an encoded array encoded as with encode-int.

“interrupts” (Required: Y)
Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().

“ibm,my-drc-index” (Required: For DR)

“ibm,vserver” (Required: Y)
Property name specifying that this is a virtual server node.

“ibm,trunk-adapter” (Required: See definition)
Property name specifying that this is a Trunk Adapter. This property must be provided when the node is a Trunk Adapter node.

“ibm,illan-options” (Required: See definition)
This property is required when any of the ILLAN sub-options are implemented (see ). The existence of this property indicates that the H_ILLAN_ATTRIBUTES hcall() is implemented, and that hcall() is then used to determine which ILLAN options are implemented.

“supported-network-types” (Required: Y)
Standard property name as per . Reports the possible types of “network” the device can support.

“chosen-network-type” (Required: Y)
Standard property name as per . Reports the type of “network” this device is supporting.

“max-frame-size” (Required: Y)
Standard property name per , to indicate maximum packet size.

“address-bits” (Required: Y)
Standard property name per , to indicate network address length.

“ibm,#dma-size-cells” (Required: See definition)
Property name to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in .

“ibm,#dma-address-cells” (Required: See definition)
Property name to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .
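As an informal illustration of how client software might consume these properties, the sketch below decodes an “ibm,my-dma-window” value fetched from the l-lan node. The accessor get_property() is a hypothetical helper, and the 1/2/2 cell layout (one LIOBN cell, two phys cells, two size cells) is an assumption made for the example; a real driver must honor “#address-cells”, “ibm,#dma-address-cells”, and “ibm,#dma-size-cells” as described above. Device tree cells are big-endian, and a big-endian CPU is assumed here.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical accessor: returns a pointer to the raw value of the named
 * property of a device tree node (or NULL), setting *len to its size. */
extern const uint32_t *get_property(const void *node, const char *name, size_t *len);

struct dma_window {
    uint32_t liobn;   /* logical I/O bus number selecting the RTCE window */
    uint64_t offset;  /* starting I/O (DMA) address of the window         */
    uint64_t size;    /* window size in bytes                             */
};

/* Decode assuming 1 LIOBN cell, 2 phys cells, and 2 size cells. */
static int decode_my_dma_window(const void *node, struct dma_window *w)
{
    size_t len;
    const uint32_t *p = get_property(node, "ibm,my-dma-window", &len);

    if (p == NULL || len < 5 * sizeof(uint32_t))
        return -1;                        /* property absent or malformed */

    w->liobn  = p[0];
    w->offset = ((uint64_t)p[1] << 32) | p[2];
    w->size   = ((uint64_t)p[3] << 32) | p[4];
    return 0;
}
```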
Logical LAN hcall()s

The receiver can set the virtual interrupt associated with its Receive Queue to one of two modes using the H_VIO_SIGNAL hcall(). These are:

- Disabled (An enqueue interrupt is not signaled.)
- Enabled (An enqueue interrupt is signaled on every enqueue.)

Note: An enqueue is considered a pulse, not a level.
The pulse then sets the memory element within the emulated interrupt
source controller. This allows the resetting of the interrupt condition
by simply issuing the H_EOI hcall() as is done with the PCI MSI
architecture rather than having to do an explicit interrupt reset as in
the case with PCI LSI architecture.

The interrupt mechanism, however, is capable of presenting only one
interrupt signal at a time from any given interrupt source. Therefore, no
additional interrupts from a given source are ever signaled until the
previous interrupt has been processed through to the issuance of an H_EOI
hcall(). Specifically, even if the interrupt mode is enabled, the effect
is to interrupt on an empty to non-empty transition of the queue.

H_REGISTER_LOGICAL_LAN

Syntax:

Parameters:

- unit-address: As specified in the Logical LAN device tree node “reg” property
- buf-list: I/O address of a 4 KB page (aligned) used to record registered input buffers
- rec-queue: Buffer descriptor of a receive queue, specifying a receive queue which is a multiple of 16 bytes in length and is 16 byte aligned
- filter-list: I/O address of a 4 KB page aligned broadcast MAC address filter list
- mac-address: The receive filter MAC address

Semantics:

- Validate the Unit Address, else H_Parameter.
- Validate that the I/O addresses of the buf-list and filter-list are translated by the TCE table and are 4 KB aligned, else H_Parameter.
- Validate the Buffer Descriptor of the receive queue buffer (I/O addresses for the entire buffer length starting at the specified I/O address are translated by the RTCE table, length is a multiple of 16 bytes, and alignment is on a 16 byte boundary), else H_Parameter.
- Initialize the one page buffer list.
- Enqueue the receive queue buffer (set valid toggle to 0).
- Initialize the hypervisor’s receive queue enqueue pointer and length variables for the virtual IOA associated with the Unit Address. These variables are kept in terms of DMA addresses so that page migration works and any remapping of TCEs is effective.
- Disable receive queue interrupts.
- Record the low order 6 bytes of mac-address for filtering future incoming messages.
- Return H_Success.

H_FREE_LOGICAL_LAN

Syntax:

Parameters:

- unit-address: Unit Address per device tree node
“reg” property.

Semantics:

- Validate the Unit Address, else H_Parameter.
- Interlock/carefully manipulate tables so that H_SEND_LOGICAL_LAN performs safely.
- Clear the associated page buffer list; prevent further consumption of receive buffers and generation of receive interrupts.
- Return H_Success.

H_FREE_LOGICAL_LAN is the only valid mechanism to reclaim the memory pages registered via H_REGISTER_LOGICAL_LAN.

Implementation Note: If the hypervisor returns an H_Busy, H_LongBusyOrder1mSec, or H_LongBusyOrder10mSec, software must call H_FREE_LOGICAL_LAN again with the same parameters. Software may choose to treat H_LongBusyOrder1mSec and H_LongBusyOrder10mSec the same as H_Busy. The hypervisor, prior to returning H_Busy, H_LongBusyOrder1mSec, or H_LongBusyOrder10mSec, will have placed the virtual adapter in a state that will cause it to not accept any new work nor surface any new virtual interrupts (no new frames will arrive, etc.).

H_ADD_LOGICAL_LAN_BUFFER

Syntax:

Parameters:

- unit-address: Unit Address per device tree node
“reg” property
- buf: Buffer Descriptor of new I/O buffer

Semantics:

- Checks that the unit address is OK, else H_Parameter.
- Checks that the I/O Address is within the range of the DMA window.
- Scans the buffer list for a pool of buffers of the length specified in the Descriptor (if one does not exist, and there is still room in the buffer list, creates a new pool entry, else H_Resource).
- Uses an enqueue procedure that is compatible with the H_SEND_LOGICAL_LAN hcall()’s dequeue procedure.

Implementation Note: Since the buffer queue is based upon I/O addresses that are checked by H_SEND_LOGICAL_LAN, it is only necessary to ensure that the enqueue/dequeue are internally consistent. If the owning OS corrupts its buffer descriptors or buffer queue pointers, this is caught by H_SEND_LOGICAL_LAN and/or the corruption is contained within the OS’s partition.

Architecture Note: Consideration was given to defining the enqueue algorithm and having the DD do the enqueue itself. However, no designs presented themselves that eliminated the timing windows caused by adding and removing pool lists without the introduction of OS/FW interlocks.

H_FREE_LOGICAL_LAN_BUFFER

Syntax:

Parameters:

- unit-address: Unit Address per device tree node “reg” property.
- bufsize: The size of the buffer that is being requested to be removed from the receive buffer pool.

Semantics:

- Check that the unit address is valid, else return H_Parameter.
- Scan the buffer list for a pool of buffers of the length specified in bufsize, and return H_Not_Found if one does not exist.
- Place an entry on the receive queue for a buffer of the specified size, with Control field Bit 1 set to 0, and return H_Success.

H_SEND_LOGICAL_LAN

Syntax:

The H_Dropped return code indicates to the sender that one or more
intended receivers did not receive the message.

Parameters:

- unit-address: Unit Address per device tree node “reg” property
- buff-1: Buffer Descriptor #1
- buff-2: Buffer Descriptor #2
- buff-3: Buffer Descriptor #3
- buff-4: Buffer Descriptor #4
- buff-5: Buffer Descriptor #5
- buff-6: Buffer Descriptor #6
- continue-token: Used to continue a transfer if H_Busy is returned. Set to 0 on the first call. If H_Busy is returned, then call again but use the value returned in R4 from the previous call as the value of continue-token.

Semantics:

- If continue-token is non-zero, then do appropriate checks to see
that parameters and buffers are still valid, and pick up where the previous transfer left off for the specified unit address, based on the value of the continue-token.
- If continue-token is zero and a previous H_SEND_LOGICAL_LAN for the specified unit address was suspended with H_Busy and never completed, then clean up the state from the previously suspended call before proceeding.
- Verifies the VLAN number, else H_Parameter.
- Proceeds down the 6 buffer descriptors until the first one that has a length of 0:
  - If the “ibm,max-virtual-dma-size” property exists in the /vdevice node of the device tree, then if the length is greater than the value of this property, return H_Parameter.
  - For the length of the buffer: verifies that the I/O buffer addresses translate through the sender’s RTCE table, else H_Parameter.
- Verifies the destination MAC address for the VLAN:
  - If the MAC address is not cached and there exists a Trunk Adapter for the VLAN, then flags the message as destined for the Trunk Adapter and continues processing.
  - If the MAC address is not cached and a Trunk Adapter does not exist for the VLAN, then drops the message (H_Dropped).
- For each Destination MAC Address (a broadcast MAC address turns into a multicast to all destinations on the specified VLAN):
  - In the case of multicast MAC addresses, the following algorithm defines the members of the receiver class for a given VLAN:

    For each logical LAN IOA that would be a target for a broadcast from the source IOA:
      If the receiving IOA is not enabled for non-broadcast multicast frames, then continue.
      If the receiving IOA is not enabled for filtering non-broadcast multicast frames, then copy the frame to the IOA's receive buffer.
      Else, if (lookup_filter(table index)), then copy the frame to the IOA's receive buffer;
      else, if the receiving IOA is not enabled for filtering non-broadcast multicast frames, then copy the frame to the IOA's receive buffer. /* allows for races on filter insertion */

    Here int lookup_filter(table index) is a firmware implementation defined algorithm.

  - Searches the receiver’s receive queue for a suitable buffer
and atomically dequeues it:
  - If no suitable buffer is found, the receiver’s dropped packet counter (the last 8 bytes of the buffer list) is incremented and processing proceeds to the next receiver, if any.
  - Copies the send data into the selected receive buffer, builds a receive queue entry, and generates an interrupt to the receiver if the interrupt is enabled.
- If any frames were dropped, return H_Dropped, else return H_Success.

Firmware Implementation Note: If during the
processing of the H_SEND_LOGICAL_LAN call, it becomes necessary to
temporarily suspend the processing of the call (for example, due to the
length of time it is taking to process the call), the firmware may return
a continuation token in R4, along with the return code of H_Busy. The
value of the continuation token is up to the firmware, and will be passed
back by the software as the continue-token parameter on the next call of
H_SEND_LOGICAL_LAN.

This hcall() interlocks with H_MIGRATE_DMA to allow for migration of TCE mapped DMA pages.

Note: It is possible for either or both the sending
and receiving OS to modify its RTCE tables so as to affect the TCE
translations being actively used by H_SEND_LOGICAL_LAN. This is an error
condition on the part of the OS. Implementations need only ensure that
such conditions do not corrupt memory in innocent partitions and should
not add path length to protect guilty partitions. By all means the path
length of H_GET_TCE and H_PUT_TCE should not be increased. If reasonably
possible, without significant path length addition, implementations
should: On send buffer translation corruption, return H_Parameter to the
sender and either totally drop the packet prior to reception, or if the
receive buffer has been processed past the point of transparent
recycling, mark the receive buffer as received in error in the receive
queue. On receive buffer translation corruption, terminate the data copy
to the receive buffer and mark the buffer as received in error in the
receive queue.

H_MULTICAST_CTRL

This hcall() controls the reception of non-broadcast multicast
packets (those with the high order address byte being odd but not the all
1’s address). All implementations support the enabling and
disabling of the reception of all multicast packets on their V-LAN.
Additionally, the l-lan device driver through this call may ask the
firmware to filter multicast packets for it. That is, receive packets
only if they contain multicast addresses specified by the device driver.
The number of simultaneous multicast packet filters supported is
implementation dependent, and is specified in the
“ibm,mac-address-filters” property of the
l-lan device tree node. Therefore, the device driver must be prepared to
have any filter request fail, and fall back to enabling reception of all
multicast packets and filtering them in the device driver. Semantically, the device driver may ask that the reception of multicast packets be enabled or disabled; further, if reception is enabled, packets may be filtered by only allowing reception of packets whose MAC address matches one of the entries in the filter table. The call also manages the contents of the MAC address filter table. Individual MAC addresses may be added or removed, and the filter table may be cleared. If the filter table is modified by a call, there is the possibility that a packet may be improperly filtered (one that was to be filtered out may get through, or one that should have gotten through may be dropped); this is accepted to avoid adding extra locking to the packet processing code. In most cases
higher level protocols will handle the condition (since firmware
filtering is simply a performance optimization), if, however, a specific
condition requires complete accuracy, the device driver can disable
filtering prior to an update, do its own filtering (as would be required
if the number of receivers exceeded the number of filters in the filter
table), update the filter table, and then reenable filtering.

Syntax:

Parameters:

- unit-address: Unit Address per device tree node “reg” property
- flags: Only bits 44-47 and 62-63 are defined; all other bits should be zero.
- multi-cast-address: Multicast MAC address, if flag bits 62 and 63 are 01 or 10; else this parameter is ignored.

Return value in register R4: State of Enables and Count of MAC Filters in table.

Format:

- R = The value of the Receipt Enable bit
- F = The value of the Filter Enable bit
- MAC Filter Count -- 16 bit count of the number of MAC Filters in
the multicast filter table.

Semantics:

- Validate the unit-address parameter, else return H_Parameter.
- Validate that no reserved flag bit = 1, else return H_Parameter.
- If any bits are on in the high order two bytes of the MAC parameter, return H_Parameter.
- Modify the Enables per specification if requested.
- Modify the Filter Table per specification if requested (filtering is disabled during any filter table modification, and the filter enable state is restored after filter table modification):
  - If don't modify: RC=H_Success.
  - If Clear all: initialize the filter table, RC=H_Success.
  - If Add:
    - If there is room in the table, insert the new MAC Filter entry, MAC Filter count++, RC=H_Success.
    - Else RC=H_Constrained.
    - (Duplicates are silently dropped -- the filter count stays the same, RC=H_Success.)
  - If Remove:
    - Locate the specified entry in the MAC Filter Table.
    - If found, remove the entry, MAC Filter count--, RC=H_Success.
    - Else RC=H_Not_Found.
- Load the Enable Bits into R4 bits 46 and 47; load the MAC Filter count into R4 bits 48-63.
- Return RC.

H_CHANGE_LOGICAL_LAN_MAC

This hcall() allows the changing of the virtual IOA’s MAC
address.

Syntax:

Parameters:

- unit-address: Unit Address per device tree node “reg” property
- mac-address: The new receive filter MAC address

Semantics:

- Validates the unit address, else H_Parameter.
- Records the low order 6 bytes of mac-address for filtering future incoming messages.
- Returns H_Success.

H_ILLAN_ATTRIBUTES

There are certain ILLAN attributes that are made visible to, and can be manipulated by, partition software. The H_ILLAN_ATTRIBUTES hcall is used to read and modify the attributes (see ). The following table defines the attributes that are visible and manipulatable.
ILLAN Attributes

Bits 0-45: Reserved

Bit 46: Checksum Offload Non-zero Checksum Field Support
This bit is implemented when PHYP supports sending TCP packets with a non-zero TCP checksum field when bit 6 of the buffer descriptor (the "No Checksum" bit) is set. This bit indicates R1–17.3.6.2.2–3 is not required.

Bit 47: Reserved

Bit 48: Large Send Indication Supported
This bit is implemented when the large send indication bit in the I/O descriptor passed to H_SEND_LOGICAL_LAN is supported by firmware.
0: Software must not request large send indication by setting Bit 5 of the buffer descriptor.
1: Software may request large send indication by setting Bit 5 of the buffer descriptor.

Bit 49: Port Disabled
When the bit is a 1, the port is disabled. When the port is disabled, no Ethernet traffic will be permitted to be transmitted or received. H_Parameter will be returned if this bit is turned on in either the reset or set masks. On firmware that does not support this function, bit 49 is reserved and required to be 0; the OS can infer that this means the port is enabled.

Bit 50: Checksum Offload Padded Packet Support
This bit is implemented when the ILLAN Checksum Offload Padded Packet Support option is implemented. See .
0: Software must not request checksum offload, by setting Bit 6 of the buffer descriptor (the No Checksum bit), for packets that have been padded.
1: Software may request checksum offload, by setting Bit 6 of the buffer descriptor (the No Checksum bit), for packets that have been padded.

Bit 51: Buffer Size Control
This bit is implemented when the ILLAN Buffer Size Control option is implemented. This bit allows the partition software to inhibit the use of too large a buffer for incoming packets when a reasonably sized buffer is not available. The state of this bit cannot be changed between the time that the ILLAN is registered by an H_REGISTER_LOGICAL_LAN and it is deregistered by an H_FREE_LOGICAL_LAN. See also .
1: The hypervisor will keep a history of what buffer sizes have been registered. When a packet arrives, the history is searched to find the smallest buffer size that will contain the packet. If that buffer size is depleted, then the packet is dropped by the hypervisor (H_Dropped) instead of searching for the next larger available buffer.
0: This is the initial value. When a packet arrives, the available buffers are searched for the smallest available buffer that will hold the packet, and the packet is not dropped unless no buffer is available in which the packet will fit.

Bits 52-55: Trunk Adapter Priority
This field is implemented for a VIOA whenever the ILLAN Backup Trunk Adapter option is implemented and the VIOA is a Trunk Adapter (the Active Trunk Adapter bit will be implemented, also, in this case). If this field is a 0, then either the ILLAN Backup Trunk Adapter option is not implemented or it is implemented but this VIOA is not a Trunk Adapter. A non-0 value in this field reflects the priority of the node in the backup Trunk Adapter hierarchy, with a value of 1 being the highest (most favored) priority, the value of 2 being the next highest priority, and so on. This field may or may not be changeable by the partition software via the H_ILLAN_ATTRIBUTES hcall() (platform implementation dependent). If not changeable, then attempts to change this field will result in a return code of H_Constrained. See also .

Bits 56-60: Reserved

Bit 61: TCP Checksum Offload Support for IPv6
This bit is implemented for a VIOA whenever the ILLAN Checksum Offload Support option is implemented for TCP, the IPv6 protocol, and the following extension headers:

- Hop-by-Hop Options
- Routing
- Destination Options
- Authentication
- Mobility

This bit is initially set to 0 by the firmware, and the ILLAN DD may attempt to set it to a 1 by use of the H_ILLAN_ATTRIBUTES hcall() if the DD supports the option for TCP and IPv6. Firmware will not allow changing the state of this bit if it does not support Checksum Offload Support for TCP for IPv6 for the VIOA (H_Constrained would be returned in this case from the H_ILLAN_ATTRIBUTES hcall() when this bit is a 1 in the set-mask). The state of this bit cannot be changed between the time that the ILLAN is registered by an H_REGISTER_LOGICAL_LAN and it is deregistered by an H_FREE_LOGICAL_LAN. See for more information.
1: The partition software has indicated that it supports the ILLAN Checksum Offload Support option for the TCP and IPv6 protocol and for the above stated extension headers by using the H_ILLAN_ATTRIBUTES hcall() with this bit set to a 1 in the set-mask, and the firmware has verified that it supports this protocol for the option for the VIOA.
0: The partition software has not indicated that it supports the ILLAN Checksum Offload Support option for the TCP and IPv6 protocol and for the above stated extension headers by using the H_ILLAN_ATTRIBUTES hcall() with this bit set to a 1 in the set-mask, or it has but the firmware does not support the option, or supports the option but not for this protocol or for this VIOA.

Bit 62: TCP Checksum Offload Support for IPv4
This bit is implemented for a VIOA whenever the ILLAN Checksum Offload Support option is implemented for TCP and the IPv4 protocol. This bit is initially set to 0 by the firmware, and the ILLAN DD may attempt to set it to a 1 by use of the H_ILLAN_ATTRIBUTES hcall() if the DD supports the option for TCP and IPv4. Firmware will not allow changing the state of this bit if it does not support Checksum Offload Support for TCP or IPv4 for the VIOA (H_Constrained would be returned in this case from the H_ILLAN_ATTRIBUTES hcall() when this bit is a 1 in the set-mask). The state of this bit cannot be changed between the time that the ILLAN is registered by an H_REGISTER_LOGICAL_LAN and it is deregistered by an H_FREE_LOGICAL_LAN. See for more information.
1: The partition software has indicated that it supports the ILLAN Checksum Offload Support option for the TCP and IPv4 protocol by using the H_ILLAN_ATTRIBUTES hcall() with this bit set to a 1 in the set-mask, and the firmware has verified that it supports this protocol for the option for the VIOA.
0: The partition software has not indicated that it supports the ILLAN Checksum Offload Support option for TCP and IPv4 by using the H_ILLAN_ATTRIBUTES hcall() with this bit set to a 1 in the set-mask, or it has but the firmware does not support the option, or supports the option but not for this protocol or for this VIOA.

Bit 63: Active Trunk Adapter
This bit is implemented for a VIOA whenever the ILLAN Backup Trunk Adapter option is implemented and the VIOA is a Trunk Adapter (the Trunk Adapter Priority field will be implemented, also, in this case).
This bit is initially set to 0 by the firmware for an inactive Trunk Adapter.
This bit is initially set to 1 by the firmware for an active Trunk Adapter.
This bit will be changed from a 0 to a 1 when all of the following are true: (1) the partition software (via the H_ILLAN_ATTRIBUTES hcall() with this bit set to a 1 in the set-mask) attempts to set this bit to a 1, (2) the firmware supports the Backup Trunk Adapter option, and (3) the VIOA is a Trunk Adapter.
This bit will be changed from a 1 to a 0 by the firmware when another Trunk Adapter has had its Active Trunk Adapter bit changed from a 0 to a 1.
See for more information.
1: The VIOA is the active Trunk Adapter.
0: The VIOA is not an active Trunk Adapter or is not a Trunk Adapter at all.
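The following sketch shows one plausible way for partition software to name the attribute bits above and to use H_ILLAN_ATTRIBUTES, first as a pure query (both masks zero) and then to request TCP/IPv4 checksum offload. The wrapper h_illan_attributes() and its return convention are assumptions made for the example; the bit positions follow the big-endian numbering of the table (bit 0 is the most significant bit of the 64-bit attribute value). A driver would typically negotiate these bits at open time, before H_REGISTER_LOGICAL_LAN, since several of them cannot be changed while the ILLAN is registered.

```c
#include <stdint.h>

/* Bit numbering per the ILLAN Attributes table: bit 0 = MSB. */
#define ILLAN_ATTR_BIT(n)            (1ULL << (63 - (n)))

#define ILLAN_NONZERO_CSUM_FIELD     ILLAN_ATTR_BIT(46)
#define ILLAN_LARGE_SEND_INDICATION  ILLAN_ATTR_BIT(48)
#define ILLAN_PORT_DISABLED          ILLAN_ATTR_BIT(49)
#define ILLAN_CSUM_PADDED_PACKETS    ILLAN_ATTR_BIT(50)
#define ILLAN_BUFFER_SIZE_CONTROL    ILLAN_ATTR_BIT(51)
#define ILLAN_TCP_CSUM_IPV6          ILLAN_ATTR_BIT(61)
#define ILLAN_TCP_CSUM_IPV4          ILLAN_ATTR_BIT(62)
#define ILLAN_ACTIVE_TRUNK_ADAPTER   ILLAN_ATTR_BIT(63)

/* Hypothetical wrapper: issues H_ILLAN_ATTRIBUTES and returns the
 * hypervisor return code (0 assumed to mean H_Success here); *attrs
 * receives the resulting attribute value from R4. */
extern long h_illan_attributes(uint64_t unit_address, uint64_t reset_mask,
                               uint64_t set_mask, uint64_t *attrs);

static long enable_ipv4_csum_offload(uint64_t unit_address)
{
    uint64_t attrs;
    long rc;

    /* Query only: both masks zero return the attributes unchanged. */
    rc = h_illan_attributes(unit_address, 0, 0, &attrs);
    if (rc != 0)
        return rc;

    /* Ask the firmware to turn on TCP/IPv4 checksum offload. A return
     * of H_Constrained means the request was not (fully) honored, for
     * example because the ILLAN is currently registered or the
     * firmware lacks support for this protocol on this VIOA. */
    return h_illan_attributes(unit_address, 0, ILLAN_TCP_CSUM_IPV4, &attrs);
}
```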
R1--1.If the H_ILLAN_ATTRIBUTES hcall is
implemented, then it must implement the attributes as they are defined in
and the syntax and semantics as
defined in
.

R1--2.The H_ILLAN_ATTRIBUTES hcall must ignore
bits in the set-mask and reset-mask which are not implemented for the
specified unit-address and must process as an exception those which
cannot be changed for the specified unit-address (H_Constrained
returned), and must return the following for the ILLAN Attributes in
R4:

- A value of 0 for unimplemented bit positions.
- The resultant field values for implemented fields.

Syntax:

Parameters:

- unit-address: Unit Address per device tree node
“reg” property. The ILLAN unit address on
which this Attribute modification is to be performed.
- reset-mask: The bit-significant mask of bits to be reset in the
ILLAN’s Attributes (the reset-mask bit definition aligns with the
bit definition of the ILLAN’s Attributes, as defined in
). The complement of the
reset-mask is ANDed with the ILLAN’s Attributes, prior to applying
the set-mask. See semantics for more details on any field-specific
actions needed during the reset operations. If a particular field
position in the ILLAN Attributes is not implemented, then the
corresponding bit(s) in the reset-mask are ignored.
- set-mask: The bit-significant mask of bits to be set in the
ILLAN’s Attributes (the set-mask bit definition aligns with the bit
definition of the ILLAN’s Attributes, as defined in
). The set-mask is ORed with
the ILLAN’s Attributes, after applying the reset-mask. See
semantics for more details on any field-specific actions needed during
the set operations. If a particular field position in the ILLAN
Attributes is not implemented, then the corresponding bit(s) in the
set-mask are ignored.

Semantics:

- Validate that the Unit Address belongs to the partition, else H_Parameter.
- Reset/set the bits in the ILLAN Attributes, as indicated by the reset-mask and set-mask, except as indicated in the following conditions.
- If the Buffer Size Control bit is trying to be changed from a 0 to a 1 and any of the following is true, then do not allow the change (H_Constrained will be returned):
  - The ILLAN is active. That is, the ILLAN has been registered (H_REGISTER_LOGICAL_LAN) but has not been deregistered (H_FREE_LOGICAL_LAN).
  - The firmware does not support the ILLAN Buffer Size Control option.
- If the Buffer Size Control bit is trying to be changed from a 1 to a 0 and any of the following is true, then do not allow the change (H_Constrained will be returned):
  - The ILLAN is active. That is, the ILLAN has been registered (H_REGISTER_LOGICAL_LAN) but has not been deregistered (H_FREE_LOGICAL_LAN).
- If either the TCP Checksum Offload Support for IPv4 bit or the TCP Checksum Offload Support for IPv6 bit is trying to be changed from a 0 to a 1 and any of the following is true, then do not allow the change (H_Constrained will be returned):
  - The ILLAN is active. That is, the ILLAN has been registered (H_REGISTER_LOGICAL_LAN) but has not been deregistered (H_FREE_LOGICAL_LAN).
  - The firmware does not support the ILLAN Checksum Offload Support option, or supports it but not for the specified protocol(s), or does not support it for this VIOA.
- If the TCP Checksum Offload Support for IPv4 bit or the TCP Checksum Offload Support for IPv6 bit is trying to be changed from a 1 to a 0 and any of the following is true, then do not allow the change (H_Constrained will be returned):
  - The ILLAN is active. That is, the ILLAN has been registered (H_REGISTER_LOGICAL_LAN) but has not been deregistered (H_FREE_LOGICAL_LAN).
- If the Active Trunk Adapter bit is trying to be changed from a 0 to a 1 and any of the following is true, then do not allow the change (H_Constrained will be returned):
  - The firmware does not support the ILLAN Backup Trunk Adapter option, or this VIOA is not a Trunk Adapter.
- If the Active Trunk Adapter bit is trying to be changed from a 1 to a 0, then return H_Parameter.
- If the Active Trunk Adapter bit is changed from a 0 to a 1 for a VIOA, then also set any previously active Trunk Adapter’s Active Trunk Adapter bit from a 1 to a 0.
- If the Trunk Adapter Priority field is trying to be changed from 0 to a non-0 value, then return H_Parameter.
- If the Trunk Adapter Priority field is trying to be changed from a non-0 value to another non-0 value and either the parameter is not changeable or the change is not within the platform allowed limits, then do not allow the change (H_Constrained will be returned).
- Load R4 with the value of the ILLAN’s Attributes, with any unimplemented bits set to 0, and if all requested changes were made then return H_Success, otherwise return H_Constrained.

Other hcall()s Extended or Used by the Logical LAN
Option

H_VIO_SIGNAL

The H_VIO_SIGNAL hcall() is used by multiple VIO options.

H_EOI

The H_EOI hcall(), when specifying an interrupt source number
associated with an interpartition logical LAN IOA, incorporates the interrupt reset function.

H_XIRR

This call is extended to report the virtual interrupt source number associated with virtual interrupts associated with an ILLAN IOA.

H_PUT_TCE

This standard hcall() is used to manage the ILLAN IOA’s I/O translations.

H_GET_TCE

This standard hcall() is used to manage the ILLAN IOA’s I/O translations.

H_MIGRATE_DMA

This hcall() is extended to serialize with the H_SEND_LOGICAL_LAN hcall() to allow for migration of TCE mapped DMA pages.

RTAS Calls Extended or Used by the Logical LAN
Option

Platforms may combine the Logical LAN option with most other LoPAR
options such as dynamic reconfiguration by including the appropriate OF
properties and extending the associated firmware calls. However, the
ibm,set-xive, ibm,get-xive, ibm,int-off,
and
ibm,int-on RTAS calls are extended as part of the
base support.

Interpartition Logical LAN Requirements

The following requirements are mandated for platforms implementing the ILLAN option.

R1--1.For the ILLAN option: The platform must interpret logical LAN buffer descriptors as defined in .

R1--2.For the ILLAN option: The platform must reject logical LAN buffer descriptors that are not 8 byte aligned.

R1--3.For the ILLAN option: The platform must interpret the first byte of a logical LAN buffer descriptor as a control byte, the high order bit being the valid bit.

R1--4.For the ILLAN option: The platform must set the next to high order bit of the control byte of the logical LAN buffer descriptor for the receive queue to the inverse of the value currently being used to indicate a valid receive queue entry.

R1--5.For the ILLAN option: The platform must interpret the 2nd through 4th bytes of a logical LAN buffer descriptor as the binary length of the buffer in I/O space (relative to the TCE mapping table defined by the logical IOA’s “ibm,my-dma-window” property).

R1--6.For the ILLAN option: The platform must interpret the 5th through 8th bytes of a logical LAN buffer descriptor as the binary beginning address of the buffer in I/O space (relative to the TCE mapping table defined by the logical IOA’s “ibm,my-dma-window” property).

R1--7.For the ILLAN option: The platform must interpret logical LAN Buffer Lists as defined in .

R1--8.For the ILLAN option: The platform must reject logical LAN Buffer Lists that are not mapped relative to the TCE mapping table defined by the logical IOA’s “ibm,my-dma-window” property.

R1--9.For the ILLAN option: The platform must reject logical LAN buffer lists that are not 4 KB aligned.

R1--10.For the ILLAN option: The platform must interpret the first 8 bytes of a logical LAN buffer list as a buffer descriptor for the logical IOA’s Receive Queue.

R1--11.For the ILLAN option: The platform must interpret the logical LAN receive queue as defined in .

R1--12.For the ILLAN option: The platform must reject a logical LAN receive queue that is not mapped relative to the TCE mapping table defined by the logical IOA’s “ibm,my-dma-window” property.

R1--13.For the ILLAN option: The platform must reject a logical LAN receive queue that is not aligned on a 4 byte boundary.

R1--14.For the ILLAN option: The platform must reject a logical LAN receive queue that is not an exact multiple of 12 bytes long.

R1--15.For the ILLAN option: The platform must manage the logical LAN receive queue as a circular buffer.

R1--16.For the ILLAN option: The platform must enqueue 12 byte logical LAN receive queue entries when a new message is received.

R1--17.For the ILLAN option: The platform must set the last 8 bytes of the logical LAN receive queue entry to the value of the user supplied correlator found in the first 8 bytes of the logical LAN receive buffer used to contain the message, before setting the first 4 bytes of the logical LAN receive queue entry.

R1--18.For the ILLAN option: The platform must set the first 4 bytes of the logical LAN receive queue entry such that the first byte contains the control field (high order bit the inverse of the valid toggle in the receive queue buffer descriptor, next bit a one if the message payload is valid) and the last 3 bytes contain the receive message length, after setting the correlator field in the last 8 bytes per Requirement .

R1--19.For the ILLAN option: The platform must, when crossing from the end of the logical LAN receive queue back to the beginning, invert the value of the valid toggle in the receive queue buffer descriptor.

R1--20.For the ILLAN option: The platform’s OF must disable interrupts from the logical LAN IOA before initially passing control to the booted client program.

R1--21.For the ILLAN option: The platform must present (as appropriate per RTAS control of the interrupt source number) to the partition owning a logical LAN receive queue the appearance of an interrupt, from the interrupt source number associated, through the OF device tree node, with the virtual device, when a new entry is enqueued to the logical LAN receive queue and the last interrupt mode set via the H_VIO_SIGNAL was “Enabled”, unless a previous interrupt from the interrupt source number is still outstanding.

R1--22.For the ILLAN option: The platform must NOT present to the partition owning a logical LAN receive queue the appearance of an interrupt, from the interrupt source number associated, through the OF device tree node, with the virtual device, if the last interrupt mode set via the H_VIO_SIGNAL was “Disabled”, unless a previous interrupt from the interrupt source number is still outstanding.

R1--23.For the ILLAN option: The platform must interpret logical LAN receive buffers as defined in .

R1--24.For the ILLAN option: The platform must reject a logical LAN receive buffer that is not mapped relative to the TCE mapping table defined by the logical IOA’s “ibm,my-dma-window” property.

R1--25.For the ILLAN option: The platform must reject a logical LAN receive buffer that is not aligned on a 4 byte boundary.

R1--26.For the ILLAN option: The platform must reject a logical LAN receive buffer that is not a minimum of 16 bytes long.

R1--27.For the ILLAN option: The platform must not modify the first 8 bytes of a logical LAN receive buffer; this area is reserved for a user supplied correlator value.

R1--28.For the ILLAN option: The platform must not allow corruption caused by a user modifying the logical LAN receive buffer to escape the user partition (except as a side effect of some other user partition I/O operation).

R1--29.For the ILLAN option: The platform’s l-lan OF device tree node must contain properties as defined in . (Other standard I/O adapter properties are permissible as appropriate.)

R1--30.For the ILLAN option: The platform must implement the H_REGISTER_LOGICAL_LAN hcall() as defined in .

R1--31.For the ILLAN option: The platform must implement the H_FREE_LOGICAL_LAN hcall() as defined in .

R1--32.For the ILLAN option: The platform must implement the H_ADD_LOGICAL_LAN_BUFFER hcall() as defined in .

R1--33.For the ILLAN option: The platform must implement the H_SEND_LOGICAL_LAN hcall() as defined in .

R1--34.For the ILLAN option: The platform must implement the H_SEND_LOGICAL_LAN hcall() such that an OS requested modification to an active RTCE table entry cannot corrupt memory in other partitions (except indirectly as a result of some other of the partition’s I/O operations).

R1--35.For the ILLAN option: The platform must implement the H_CHANGE_LOGICAL_LAN_MAC hcall() as defined in .

R1--36.For the ILLAN option: The platform must implement the H_VIO_SIGNAL hcall() as defined in .

R1--37.For the ILLAN option: The platform must implement the extensions to the H_EOI hcall() as defined in .

R1--38.For the ILLAN option: The platform must implement the extensions to the H_XIRR hcall() as defined in .

R1--39.For the ILLAN option: The platform must implement the H_PUT_TCE hcall().

R1--40.For the ILLAN option: The platform must implement the H_GET_TCE hcall().

R1--41.For the ILLAN option: The platform must implement the extensions to the H_MIGRATE_DMA hcall() as defined in .

R1--42.For the ILLAN option: The platform must emulate the standard PowerPC External Interrupt Architecture for the interrupt source numbers associated with the virtual devices via the standard RTAS and hypervisor interrupt calls.

Logical LAN Options

The ILLAN option has several sub-options. The hypervisor reports to
the partition software when it supports one or more of these options, and
potentially other information about those option implementations, via the
implementation of the appropriate bits in the ILLAN Attributes, which can
be ascertained by the H_ILLAN_ATTRIBUTES hcall(). The same hcall() may be
used by the partition software to communicate back to the firmware the
level of support for those options where the firmware needs to know the
level of partition software support. The
“ibm,illan-options” property will exist
in the VIOA’s Device Tree node, indicating that the
H_ILLAN_ATTRIBUTES hcall() is implemented, and therefore that one or more
of the options are implemented. The following sections give more
details.

ILLAN Backup Trunk Adapter Option

The ILLAN Backup Trunk Adapter option allows the platform to
provide one or more backups to a Trunk Adapter, for reliability purposes.
Implementation of the ILLAN Backup Trunk Adapter option is specified to
the partition by the existence of the
“ibm,illan-options” property in the
VIOA’s Device Tree node and a non-0 value in the ILLAN Attributes
Backup Trunk Adapter Priority field. A Trunk Adapter becomes the active
Trunk Adapter by calling H_ILLAN_ATTRIBUTES hcall() and setting its
Active Trunk Adapter bit. Only one Trunk Adapter is active for a VLAN at
a time. The protocols that determine which Trunk Adapter is active at any particular time are beyond the scope of this architecture.

R1--1.For the ILLAN Backup Trunk Adapter option: The
platform must implement the ILLAN option.

R1--2.For the ILLAN Backup Trunk Adapter option: The platform must implement the H_ILLAN_ATTRIBUTES hcall().

R1--3.For the ILLAN Backup Trunk Adapter option: The platform must implement the “ibm,illan-options” and “ibm,trunk-adapter” properties in all the Trunk Adapter nodes of the Device Tree.

R1--4.For the ILLAN Backup Trunk Adapter option: The platform must implement the Active Trunk Adapter bit and the Backup Trunk Adapter Priority field in the ILLAN Attributes, as defined in , for all Trunk Adapter VIOAs.

R1--5.For the ILLAN Backup Trunk Adapter option: The platform must allow only one Trunk Adapter to be active for a VLAN at any given time, and must:

- Make the determination of which one is active by whichever was the most recent one to set its Active Trunk Adapter bit in their ILLAN Attributes.
- Turn off the Active Trunk Adapter bit in the ILLAN Attributes for a Trunk Adapter when it is removed from the active Trunk Adapter state.

ILLAN Checksum Offload Support Option

This option allows for the support of IOAs that do checksum offload
processing. This option allows for support at one end (client or server)
but not the other, on a per-protocol basis, with the hypervisor
generating the checksum when the client supports offload but the server
does not, and the operation is a send from the client.

General

The H_ILLAN_ATTRIBUTES hcall is used to establish the common set of
checksum offload protocols to be supported between the firmware and the
partition software. The firmware indicates support for H_ILLAN_ATTRIBUTES
via the
“ibm,illan-options” property in the
VIOA’s Device Tree node. The partition software can determine which
of the Checksum Offload protocols (if any) that the firmware supports by
either attempting to set the bits in the ILLAN Attributes of the
protocols that the partition software supports or by calling the hcall()
with reset-mask and set-mask parameters of all-0’s (the latter
being just a query and not a request to support anything between the
partition and the firmware).

Two bits in the control field of the first buffer descriptor
specify which operations do not contain a checksum and which have had
their checksum already verified. See
. These two bits get
transferred to the corresponding control field of the Receive Queue
Entry, with the exception that the H_SEND_LOGICAL_LAN hcall will
sometimes set these to 0b00 (see
).

R1--1.For the ILLAN Checksum Offload Support option: The
platform must do all the following:

- Implement the ILLAN option.
- Implement the H_ILLAN_ATTRIBUTES hcall().
- Implement the “ibm,illan-options” property in the VIOA’s Device Tree node.
- Implement the appropriate Checksum Offload Support bit(s) of the ILLAN Attributes, as defined in .

Software Implementation Note: Fragmentation and
encryption are not supported when the No Checksum bit of the Buffer
Descriptor is set to a 1.

H_SEND_LOGICAL_LAN Semantic Changes

There are several H_SEND_LOGICAL_LAN semantic changes required for the ILLAN Checksum Offload Support option. See for the base semantics.

R1--1.For the ILLAN Checksum Offload Support option: The H_SEND_LOGICAL_LAN semantics must be changed as follows:

- As shown in , and for multicast operations, the determination in this table must be applied for each destination.
- If the No Checksum bit is set to a 1 in the first buffer descriptor, and the adapter is not a Trunk Adapter, and the source MAC address does not match the adapter's MAC address, then drop the packet.
Summary of H_SEND_LOGICAL_LAN Semantics with Checksum Offload

Columns: (1) Has the sender set the appropriate Checksum Offload Support bit in the ILLAN Attributes for the protocol being used? (2) Has the receiver set the appropriate Checksum Offload Support bit in the ILLAN Attributes for the protocol being used? (3) No Checksum bit and Checksum Good bit in the Buffer Descriptor. (4) H_SEND_LOGICAL_LAN additional semantics. (5) Receiver DD additional requirements.

  (1)  | (2) | (3)              | (4)                                                                                          | (5)
  no   | -   | 00               | None.                                                                                        | None.
  no   | -   | either bit non-0 | Return H_Parameter.                                                                          |
  yes  | -   | 00               | None.                                                                                        | None.
  yes  | no  | 01               | Set the No Checksum and Checksum Good bits in the Buffer Descriptor to 00 on transfer.       | None.
  yes  | no  | 11               | Generate the checksum and set the No Checksum and Checksum Good bits in the Buffer Descriptor to 00 on transfer. | None.
  yes  | yes | 01               | None.                                                                                        | Does not need to do checksum checking.
  yes  | yes | 11               | None.                                                                                        | Does not need to do checksum checking. Generate the checksum if the packet is to be passed on to an external LAN (may be done by the IOA or by the DD).
  -    | -   | 10               | Return H_Parameter.                                                                          |
  yes  | -   | 01 or 11, and packet type not supported by the hypervisor, as indicated by the value returned by the H_ILLAN_ATTRIBUTES hcall() | Return H_Parameter. |
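Read as code, the table collapses to a small per-destination decision procedure. The sketch below is a non-normative restatement of those rows; the names are invented for the example, and proto_supported stands for the packet-type check in the last row.

```c
#include <stdbool.h>

enum send_csum_action {
    CSUM_ACT_NONE,           /* forward the two bits unchanged              */
    CSUM_ACT_CLEAR_BITS,     /* set No Checksum/Checksum Good to 00          */
    CSUM_ACT_GEN_AND_CLEAR,  /* generate the checksum, then set bits to 00   */
    CSUM_ACT_H_PARAMETER     /* reject the request                           */
};

/* Per-destination determination, per the summary table above. */
static enum send_csum_action
csum_offload_action(bool sender_support, bool receiver_support,
                    bool no_csum, bool csum_good, bool proto_supported)
{
    if (!no_csum && !csum_good)        /* bits 00: nothing requested        */
        return CSUM_ACT_NONE;
    if (!sender_support)               /* any non-0 bit without sender support */
        return CSUM_ACT_H_PARAMETER;
    if (no_csum && !csum_good)         /* bits 10: invalid combination      */
        return CSUM_ACT_H_PARAMETER;
    if (!proto_supported)              /* 01 or 11 with unsupported packet type */
        return CSUM_ACT_H_PARAMETER;
    if (!receiver_support)             /* receiver lacks offload support    */
        return no_csum ? CSUM_ACT_GEN_AND_CLEAR : CSUM_ACT_CLEAR_BITS;
    return CSUM_ACT_NONE;              /* both ends support the offload     */
}
```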
R1--2.For the ILLAN Checksum Offload Support option: The
Receiver DD Additional Requirements shown in
must be implemented.

R1--3.For the ILLAN Checksum Offload Support option: When
the caller of H_SEND_LOGICAL_LAN has set the No Checksum bit in the
Control field to a 1, then they must also have set the checksum field in
the packet to 0, unless bit 46 in the ILLAN Attributes (the "Checksum Offload Non-zero
Checksum Field Support" bit) is set.Checksum Offload Padded Packet Support OptionFirmware may or may not support checksum offload for IPv4 packets
that have been padded. The Checksum Offload Padded Packet Support bit of
the ILLAN Attributes specifies whether or not this option is
supported.

R1--1.For the Checksum Offload Padded Packet Support Option: The platform must do all the following:

- Implement the ILLAN Checksum Offload Support option.
- Implement the Checksum Offload Padded Packet Support bit of the ILLAN Attributes, as defined in , and set that bit to a value of 1.

ILLAN Buffer Size Control Option

It is the partition software’s responsibility to keep
firmware supplied with enough buffers to keep packets from being dropped.
The ILLAN Buffer Size Control option gives the partition software a way
to prevent a flood of small packets from consuming buffers that have been
allocated for larger packets.

When this option is implemented and the Buffer Size Control bit in the ILLAN Attributes is set to a 1 for the VLAN, the hypervisor will keep a history of what buffer sizes have been registered. Then, when a packet arrives, the history is searched to find the smallest buffer size that will contain the packet. If that buffer size is depleted, then the packet is dropped by the hypervisor (H_Dropped) instead of searching for the next larger available buffer.

General

The following are the general requirements for this option. For
H_SEND_LOGICAL_LAN changes, see
.

R1--1.For the ILLAN Buffer Size Control option: The platform must do all the following:

- Implement the ILLAN option.
- Implement the H_ILLAN_ATTRIBUTES hcall().
- Implement the “ibm,illan-options” property in the VIOA’s Device Tree node.
- Implement the Buffer Size Control bit of the ILLAN Attributes, as defined in .

H_SEND_LOGICAL_LAN Semantic Changes

The following are the required semantic changes to the
H_SEND_LOGICAL_LAN hcall().

R1--1.For the ILLAN Buffer Size Control option: When the Buffer Size Control bit of the target of an H_SEND_LOGICAL_LAN hcall() is
set to a 1, then the firmware for the H_SEND_LOGICAL_LAN hcall() must not
just search for any available buffer into which the packet will fit, but
must instead only place the packet into the receiver’s buffer if
there is an available buffer of the smallest size previously registered
by the receiver which will fit the packet, and must drop the packet for
that target otherwise.

ILLAN Large Send Indication Option

This option allows the virtual device to send an
indication to the receiver that the data being sent by
H_SEND_LOGICAL_LAN contains a large send packet.

General

The following are the general requirements for this option. For H_SEND_LOGICAL_LAN changes, see .

R1--1.For the ILLAN Large Send Indication option: The platform must do all the following:

- Implement the H_ILLAN_ATTRIBUTES hcall().
- Implement the Large Send Indication bit of the ILLAN Attributes as defined in .

H_SEND_LOGICAL_LAN Semantic Changes

The following are the required semantic changes to the H_SEND_LOGICAL_LAN hcall().

R1--1.For the ILLAN Large Send Indication option:
When the Large Send Indication bit of the first buffer descriptor is set to 1,
then the firmware for the H_SEND_LOGICAL_LAN hcall() must set the Large Send Indication
bit in the receiver's receive queue entry to 1 when the packet is copied to the
destination receive buffer.

Virtual SCSI (VSCSI)

Virtual SCSI (VSCSI) support is provided by code running in a server
partition that uses the mechanisms of the Reliable Command/Response
Transport and Logical Remote DMA of the Synchronous VIO Infrastructure to
service I/O requests for code running in a client partition, such that the
client partition appears to enjoy the services of its own SCSI adapter (see
). The terms server and client
partitions refer to platform partitions that are respectively servers and
clients of requests, usually I/O operations, using the physical I/O
adapters (IOAs) that are assigned to the server partition. This allows a
platform to have more client partitions than it may have physical I/O
adapters because the client partitions share I/O adapters via the server
partition.

The VSCSI architecture is built upon the architecture specified in the following sections:

VSCSI General

This section contains an informative outline of the architectural
intent of the use of the Synchronous VIO Infrastructure to provide VSCSI
support, along with a few architectural requirements. Other
implementations of the server and client partition code, consistent with
this architecture, are possible and may be preferable.

The architectural metaphor for the VSCSI subsystem is that the
server partition provides the virtual equivalent of a single SCSI
DASD/Media string via each VSCSI server virtual IOA. The client partition
provides the virtual equivalent of a single port SCSI adapter via each
VSCSI client IOA. The platform, through the partition definition,
provides means for defining the set of virtual IOAs owned by each
partition and their respective location codes. The platform also
provides, through partition definition, instructions to connect each
client partition’s VSCSI client IOA to a specific server
partition’s VSCSI server IOA. That is, the equivalent of connecting
the adapter cable to the specific DASD/Media string. The mechanism for
specifying this partition definition is beyond the scope of this
architecture. The human readable handle associated with the partition
definition of virtual IOAs and their associated interconnection and
resource configuration is the virtual location code. The OF unit address
(Unit ID) remains the invariant handle upon which the
OS builds its “physical to logical” configuration.The client partition’s device tree contains one or more nodes
notifying the partition that it has been assigned one or more virtual
adapters. The node’s
“type” and
“compatible” properties notify the
partition that the virtual adapter is a VSCSI adapter. The
unit address of the node is used by the client
partition to map the virtual device(s) to the OS’s corresponding
logical representations. The
“ibm,my-dma-window” property communicates
the size of the RTCE table window panes that the hypervisor has
allocated. The node also specifies the interrupt source number that has
been assigned to the Reliable Command/Response Transport connection and
the RTCE range that the client partition device driver may use to map its
memory for access by the server partition via Logical Remote DMA. The
client partition uses the four hcall()s associated with the Reliable
Command/Response Transport facility to register and deregister its CRQ,
manage notification of responses, and send command requests to the server
partition.

The server partition’s device tree contains one or more
node(s) notifying the partition that it is requested to supply VSCSI
services for one or more client partitions. The unit address (
Unit ID) of the node is used by the server partition
to map to the local logical devices that are represented by this VSCSI
device. The node also specifies the interrupt source number that has been
assigned to the Reliable Command/Response Transport connection and the
RTCE range that the server partition device driver may use for its copy
Logical Remote DMA. The server partition uses the four hcall()s
associated with the Reliable Command/Response Transport facility to
register and deregister its Command request queue, manage notification of
new requests, and send responses back to the client partition. In
addition, the server partition uses the hcall()s of the Logical Remote
DMA facility to manage the movement of commands and data associated with
the client requests.

The client partition, upon noting the device tree entry for the
virtual adapter, loads the device driver associated with the value of the
“compatible” property. The device driver,
when configured and opened, allocates memory for its CRQ (an array, large
enough for all possible responses, of 16 byte elements), pins the queue
and maps it into the I/O space of the RTCE window specified in the
“ibm,my-dma-window” property using the
standard kernel mapping services that subsequently use the H_PUT_TCE
hcall(). The queue is then registered using the H_REG_CRQ hcall(). Next,
I/O request control blocks (within which the I/O requests commands are
built) are allocated, pinned, and mapped into I/O address space. Finally,
the device driver registers to receive control when the interrupt source
specified in the virtual IOA’s device tree node signals.

Once the CRQ is set up, the device driver queues an Initialization
Command/Response with the second byte of “Initialize” in
order to attempt to tell the hosting side that everything is setup on the
hosted side. The response to this send may be that the send has been
dropped or has successfully been sent. If successful, the sender should
expect back an Initialization Command/Response with a second byte of
“Initialization Complete,” at which time the communication
path can be deemed to be open. If dropped, then the sender waits for the
receipt of an Initialization Command/Response with a second byte of
“Initialize,” at which time an “Initialization
Complete” message is sent, and if that message is sent
successfully, then the communication path can be deemed to be
open.

When the VSCSI Adapter device driver receives an I/O request from
one of the SCSI device head drivers, it executes the following sequence.
First an I/O request control block is allocated. Then it builds the SCSI
request within the control block, adds a correlator field (to be returned
in the subsequent response), I/O maps any target memory buffers and
places their DMA descriptors into the I/O request control block. With the
request constructed in the I/O request control block, the driver
constructs a DMA descriptor (Starting Offset, and length) representing
the I/O request within the I/O request control block. Lastly, the driver
passes the I/O request’s DMA descriptor to the server partition
using the H_SEND_CRQ hcall(). Provided that the H_SEND_CRQ hcall()
succeeds, the VSCSI Adapter device driver returns, waiting for the
response interrupt indicating that a response has been posted by the
server partition to the device driver’s response queue. The
response queue entry contains the summary status and request correlator.
From the request correlator, the device driver accesses the I/O request
control block, and from the summary status, the device driver determines
how to complete the processing of the I/O request.

Notice that the client partition only uses the Reliable
Command/Response Transport primitives; it does not use the Logical Remote
DMA primitives. Since the server partition’s RTCE tables are not
authorized for access by the client partition, any attempt by the client
partition to modify server partition memory would be prevented by the
hypervisor. RTCE table access is granted on a connection by connection
basis (client/server virtual device pair). If a client partition happens
to be serving some other logical device, then the partition is entitled
to use Logical Remote DMA for the virtual devices that it is serving.

The server partition, upon noting the device tree entry for the
virtual adapter, loads the device driver associated with the value of the
“compatible” property. The device driver,
when configured and opened, allocates memory for its request queue (an
array, large enough for all possible outstanding requests, of 16 byte
elements). The driver then pins the queue and maps it into I/O space, via
the kernel’s I/O mapping services that invoke the H_PUT_TCE
hcall(), using the first window pane specified in the
“ibm,my-dma-window” property. The queue
is then registered using the H_REG_CRQ hcall(). Next, I/O request control
blocks (within which the I/O request commands are built) are allocated,
pinned, and I/O mapped. Finally the device driver registers to receive
control when the interrupt source specified in the virtual IOA’s
device tree node signals.

Once the CRQ is set up, the device driver queues an Initialization
Command/Response with the second byte of “Initialize” in
order to attempt to tell the hosted side that everything is setup on the
hosting side. The response to this send may be that the send has been
dropped or has successfully been sent. If successful, the sender should
expect back an Initialization Command/Response with a second byte of
“Initialization Complete,” at which time the communication
path can be deemed to be open. If dropped, then the sender waits for the
receipt of an Initialization Command/Response with a second byte of
“Initialize,” at which time an “Initialization
Complete” message is sent, and if that message is sent
successfully, then the communication path can be deemed to be
open.

When the server partition’s device driver receives an I/O
request from its corresponding client partition’s VSCSI adapter
drivers, it is notified via the interrupt registered for above. The
server partition’s device driver selects an I/O request control
block for the requested operation. It then uses the DMA descriptor from
the request queue element to transfer the SCSI request from the client
partition’s I/O request control block to its own (allocated above),
using the H_COPY_RDMA hcall() through the second window pane specified in
the
“ibm,my-dma-window” property. The server
partition’s device driver then uses kernel services, which have been extended, to register the I/O request’s DMA descriptors into extended capacity cross memory descriptors (ones capable of recording the DMA descriptors). These cross memory descriptors are later mapped by the server partition’s physical device drivers into the physical I/O DMA address space of the physical I/O adapters using the kernel services, which have been similarly extended to call the H_PUT_RTCE hcall(), based upon the value of the LIOBN field referenced by the cross memory descriptor. At this point, the server partition’s VSCSI device
driver delivers what appears to be a SCSI request to be decoded and
routed through the server partition’s file sub-system for
processing. When the request completes, the server partition’s
VSCSI device driver is called by the file sub-system and it packages the
summary status along with the request correlator into a response message
that it sends to the client partition using the H_SEND_CRQ hcall(), then
recycles the resources recorded in the I/O request control block, and the
block itself.
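The copy step above can be sketched as follows, under the assumption that H_COPY_RDMA takes (length, source LIOBN, source IOBA, destination LIOBN, destination IOBA); the wrapper and variable names are illustrative only.

  #include <stdint.h>

  /* Hypothetical wrapper; parameter order is an assumption of this sketch. */
  extern long h_copy_rdma(uint64_t len, uint64_t src_liobn, uint64_t src_ioba,
                          uint64_t dst_liobn, uint64_t dst_ioba);

  /* Pull the client's I/O request control block into a local one.
   * second_pane_liobn: LIOBN from the second triple of this server IOA's
   * "ibm,my-dma-window" property (indirectly names the client's RTCE table).
   * first_pane_liobn:  LIOBN of the server's own window, where local_ioba
   * was previously mapped with H_PUT_TCE. */
  static long fetch_client_request(uint64_t second_pane_liobn, uint64_t client_ioba,
                                   uint64_t first_pane_liobn, uint64_t local_ioba,
                                   uint32_t len)
  {
      return h_copy_rdma(len, second_pane_liobn, client_ioba,
                         first_pane_liobn, local_ioba);
  }

The LIOBN value in the second window pane of the server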
partition’s
“ibm,my-dma-window” property is intended
to be an indirect reference to the RTCE table of the client partition.
If, for some reason, the physical location of the client
partition’s RTCE table changes or it becomes invalid, this level of
indirection allows the hypervisor to determine the current target without
changing the LIOBN number as seen by the server partition. The H_PUT_TCE
and H_PUT_RTCE hcall()s do not map server partition memory into the
second window pane; the second window pane is only available for use by the server partition, via the Logical RDMA services, to reference memory mapped into it by the client partition’s IOA.

This architecture does not specify the payload format of the
requests or responses. However, the architectural intent is supplied in
the following tables for reference.
General Form of Reliable CRQ Element

  Byte Offset | Field Name | Subfield Name               | Description
  0           | Header     |                             | Contains Element Valid Bit plus Event Type Encodings (see ).
  1           | Payload    | Format/Transport Event Code | For Valid Command Response Entries, see . For Transport Event Codes see .
  2-15        | Payload    | Format Dependent            |

Example Reliable CRQ Entry Format Byte Definitions for VSCSI

  Format Byte Value | Definition
  0x0               | Unused
  0x1               | VSCSI Requests
  0x2               | VSCSI Responses
  0x03 - 0xFE       | Reserved
  0xFF              | Reserved for Expansion

Example VSCSI Command Queue Element

  Byte Offset | Field Value | Description
  0           | 0x80        | Valid Header
  1           | 0x01        | VSCSI Request Format
  2-3         | NA          | Reserved
  4-7         |             | Length of the request block to be transferred
  8-15        |             | I/O address of beginning of request
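For illustration only, a receive-side classifier keyed off the header and format bytes of the tables above might look as follows. The 0xC0 (initialization) and 0xFF (transport event) header values are assumed encodings, and the handler names are placeholders.

  #include <stdint.h>

  extern void handle_init_msg(const uint8_t *e);
  extern void handle_transport_event(const uint8_t *e);
  extern void handle_vscsi_request(const uint8_t *e);
  extern void handle_vscsi_response(const uint8_t *e);

  static void vscsi_crq_dispatch(const uint8_t *e /* 16-byte element */)
  {
      if (!(e[0] & 0x80))                /* valid bit clear: slot empty */
          return;

      if (e[0] == 0xC0) {                /* assumed: initialization message */
          handle_init_msg(e);
          return;
      }
      if (e[0] == 0xFF) {                /* assumed: transport event */
          handle_transport_event(e);
          return;
      }

      switch (e[1]) {                    /* format byte, per the table above */
      case 0x01: handle_vscsi_request(e);  break;   /* server side */
      case 0x02: handle_vscsi_response(e); break;   /* client side */
      default:   break;                  /* 0x03-0xFE reserved, 0xFF expansion */
      }
  }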
See also
.

Virtual SCSI Requirements

This normative section provides the general requirements for the
support of VSCSI.R1--1.For the VSCSI option: The platform must implement the
Reliable Command/Response Transport option as defined in
.R1--2.For the VSCSI option: The platform must implement the
Logical Remote DMA option as defined in
.In addition to the firmware primitives, and the structures they
define, the partition’s OS needs to know specific information
regarding the configuration of the virtual IOAs that it has been
assigned so that it can load and configure the correct device driver
code. This information is provided by the OF device tree node associated
with the virtual IOA (see
and
).Client Partition Virtual SCSI Device Tree NodeClient partition VSCSI device tree nodes have associated packages
such as disk-label, deblocker, iso-13346-files and iso-9660-files as well
as child nodes such as block and byte, as appropriate to the specific
virtual IOA configuration as would the node for a physical IOA of type
scsi-3.R1--1.For the VSCSI option: The platform’s OF device
tree for client partitions must include as a child of the
/vdevice node, a node of name “v-scsi” as
the parent of a sub-tree representing the virtual IOAs assigned to the
partition.R1--2.For the VSCSI option: The platform’s
v-scsi OF node must contain properties as defined in
(other standard I/O adapter
properties are permissible as appropriate).
Properties of the VSCSI Node in the Client Partition

“name” (Required? Y)
Standard property name per , specifying the virtual device name; the value shall be “v-scsi”.

“device_type” (Required? Y)
Standard property name per , specifying the virtual device type; the value shall be “vscsi”.

“model” (Required? NA)
Property not present.

“compatible” (Required? Y)
Standard property name per , specifying the programming models that are compatible with this virtual IOA; the value shall include “IBM,v-scsi”. “IBM,v-scsi-2” precedes “IBM,v-scsi” if it is included in the value of this property.

“used-by-rtas” (Required? See definition)
Present if appropriate.

“ibm,loc-code” (Required? Y)
Property name specifying the unique and persistent location code associated with this virtual IOA, presented as an encoded array as with encode-string. The value shall be of the form specified in .

“reg” (Required? Y)
Standard property name per , specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length “#address-cells”; the value shall be 0xwhatever (virtual “reg” property used for unit address; no actual locations are used, therefore the size field has zero cells (does not exist) as determined by the value of the “#size-cells” property).

“ibm,my-dma-window” (Required? Y)
Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int.

“interrupts” (Required? Y)
Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().

“ibm,my-drc-index” (Required? For DR)
Present if the platform implements DR for this node.

“ibm,#dma-size-cells” (Required? See definition)
Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in .

“ibm,#dma-address-cells” (Required? See definition)
Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .
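As an informative sketch, the “ibm,my-dma-window” value can be unpacked as follows, assuming the cells have already been converted to host byte order and that the address and size cell counts come from the rules cited in the table above; the helper is illustrative, not part of this architecture.

  #include <stdint.h>

  /* DMA window triple from "ibm,my-dma-window": (LIOBN, phys, size). */
  struct dma_window {
      uint32_t liobn;
      uint64_t ioba;   /* phys: encode-phys, "#address-cells" cells  */
      uint64_t size;   /* size: "ibm,#dma-size-cells" (or derived) cells */
  };

  /* Unpack one window triple; returns a pointer past the parsed cells
   * so a server can parse a second pane from the same property. */
  static const uint32_t *parse_window(const uint32_t *cells, int addr_cells,
                                      int size_cells, struct dma_window *w)
  {
      w->liobn = *cells++;                        /* encode-int  */
      w->ioba = 0;
      for (int i = 0; i < addr_cells; i++)        /* encode-phys */
          w->ioba = (w->ioba << 32) | *cells++;
      w->size = 0;
      for (int i = 0; i < size_cells; i++)        /* size cells  */
          w->size = (w->size << 32) | *cells++;
      return cells;
  }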
R1--3.For the VSCSI option: The platform’s
v-scsi node must have as children the appropriate
block (disk) and byte
(tape) nodes.

Server Partition Virtual SCSI Device Tree Node

Server partition VSCSI IOA nodes have no child nodes.

R1--1.For the VSCSI option: The platform’s OF device
tree for server partitions must include as a child of the
/vdevice node, a node of name
“v-scsi-host” as the parent of a sub-tree
representing the virtual IOAs assigned to the partition.R1--2.For the VSCSI option: The platform’s
v-scsi-host node must contain properties as defined
in
(other standard I/O adapter
properties are permissible as appropriate).
Properties of the VSCSI Node in the Server Partition

“name” (Required? Y)
Standard property name per , specifying the virtual device name; the value shall be “v-scsi-host”.

“device_type” (Required? Y)
Standard property name per , specifying the virtual device type; the value shall be “v-scsi-host”.

“model” (Required? NA)
Property not present.

“compatible” (Required? Y)
Standard property name per , specifying the programming models that are compatible with this virtual IOA; the value shall include “IBM,v-scsi-host”. “IBM,v-scsi-host-2” precedes “IBM,v-scsi-host” if it is included in the value of this property.

“used-by-rtas” (Required? See definition)
Present if appropriate.

“ibm,loc-code” (Required? Y)
Property name specifying the unique and persistent location code associated with this virtual IOA, presented as an encoded array as with encode-string. The value shall be of the form specified in .

“reg” (Required? Y)
Standard property name per , specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length “#address-cells”; the value shall be 0xwhatever (virtual “reg” property used for unit address; no actual locations are used, therefore the size field has zero cells (does not exist) as determined by the value of the “#size-cells” property).

“ibm,my-dma-window” (Required? Y)
Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of two sets (two window panes) of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int. Of these two triples, the first describes the window pane used to map server partition memory; the second is the window pane through which the client partition maps the memory that it makes available to the server partition. (Note the mapping between the LIOBN in the second window pane of a server virtual IOA’s “ibm,my-dma-window” property and the corresponding client IOA’s RTCE table is made when the CRQ successfully completes registration. See  for more information on window panes.)

“interrupts” (Required? Y)
Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().

“ibm,my-drc-index” (Required? For DR)
Present if the platform implements DR for this node.

“ibm,vserver” (Required? Y)
Property name specifying that this is a virtual server node.

“ibm,#dma-size-cells” (Required? See definition)
Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in .

“ibm,#dma-address-cells” (Required? See definition)
Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .
Virtual Terminal (Vterm)

This section defines the Virtual Terminal (Vterm) options (the Client Vterm option and the Server Vterm option). Vterm IOAs are of the hypervisor simulated class of VIO. See also .

Vterm General

This section contains an informative outline of the architectural
intent of the use of Vterm support.The architectural metaphor for the Vterm IOA is that of an Async
IOA. The connection at the other end of the Async “cable” may
be another Vterm IOA in a server partition, the hypervisor, or the
HMC.A partition’s device tree contains one or more nodes
notifying the partition that it has been assigned one or more Vterm
client adapters (each LoPAR partition has at least one). The
node’s
“type” and
“compatible” properties notify the
partition that the virtual adapter is a Vterm client adapter. The
unit address of the node is used by the partition to
map the virtual device(s) to the OS’s corresponding logical
representations. The node’s
“interrupts” property, if it exists,
specifies the interrupt source number that has been assigned to the
client Vterm IOA for receive data. The partition uses the
H_GET_TERM_CHAR and H_PUT_TERM_CHAR hcall()s to receive data from and
send data to the client Vterm IOA. If the node contains the
“interrupts” property, the partition may
optionally use the
ibm,int-on,
ibm,int-off,
ibm,set-xive, ibm,get-xive RTAS calls, and the
H_VIO_SIGNAL hcall() to manage the client Vterm IOA interrupt.A partition’s device tree may also contain one or more
node(s) notifying the partition that it is requested to supply server
Vterm IOA services for one or more client Vterm IOAs. The node’s
“type” and
“compatible” properties notify the
partition that the virtual adapter is a server Vterm IOA. The unit
address (Unit ID) of the node is used by the partition to map
the virtual device(s) to the OS’s corresponding logical
representations. The node’s
“interrupts” property specifies the
interrupt source number that has been assigned to the server Vterm IOA
for receive data. The partition uses the H_VTERM_PARTNER_INFO hcall() to
find out which unit address(es) in which partition(s) it is allowed to attach to (that is, to which client Vterm IOAs it is allowed to attach). The partition then uses H_REGISTER_VTERM to set up the connection between a server and a client Vterm IOA, and uses the
H_GET_TERM_CHAR and H_PUT_TERM_CHAR hcall()s to receive data from and
send data to the server Vterm IOA. In addition, the partition may
optionally use the
ibm,int-on,
ibm,int-off,
ibm,set-xive, ibm,get-xive RTAS calls, and the
H_VIO_SIGNAL hcall() to manage the server Vterm IOA interrupt. shows a comparison between the
client and server versions of Vterm.
Client Vterm versus Server Vterm Comparison

Client:
- The following hcall()s apply: H_PUT_TERM_CHAR, H_GET_TERM_CHAR, H_VIO_SIGNAL (optional use with Client).
- H_VTERM_PARTNER_INFO, H_REGISTER_VTERM, and H_FREE_VTERM: N/A.
- vty node.
- The “reg” property of the vty node(s) enumerates the valid client Vterm IOA unit address(es).
- “interrupts” property optional: the platform may or may not provide it; if provided, the Vterm driver may or may not use it.

Server:
- The same hcall()s apply (H_PUT_TERM_CHAR, H_GET_TERM_CHAR, H_VIO_SIGNAL); in addition, the following hcall()s are valid: H_VTERM_PARTNER_INFO, H_REGISTER_VTERM, H_FREE_VTERM.
- vty-server node.
- The “reg” property of the vty-server node(s) enumerates the valid server Vterm IOA unit address(es).
- H_VTERM_PARTNER_INFO is used to get the valid client Vterm IOA partition ID(s) and corresponding unit address(es) to which the server Vterm IOA is allowed to connect.
- “interrupts” property required: the platform must provide it; the Vterm driver may or may not use it.
Vterm Requirements

This normative section provides the general requirements for the support of Vterm.

R1--1.For the LPAR option: The Client Vterm option must be implemented.

Character Put and Get hcall()s

The following hcall()s are used to send data to and get data from both the client and server Vterm IOAs.

H_GET_TERM_CHAR

Syntax:

Parameters:

termno: The unit address of the Vterm IOA, from the
“reg” property of the Vterm IOA.

Semantics:

Hypervisor checks the termno parameter for validity against the Vterm IOA unit addresses assigned to the partition, else returns H_Parameter.

Hypervisor returns H_Hardware if it detects that the virtual console terminal physical connection is not working.

Hypervisor returns H_Closed if it detects that the virtual console associated with the termno parameter is not open (in the case of connection to a server Vterm IOA, this means that the server code has not made the connection to this specific client Vterm IOA).

Hypervisor returns H_Success in all other cases, returning the maximum number of characters available in the partition’s virtual console terminal input buffer (up to 16) -- a len value of 0 indicates that the input buffer is empty.

Upon return with H_Success, register R4 contains the number of bytes (if any) returned in registers R5 and R6.

Upon return with H_Success, the return character string starts in
the high order byte of register R5 and proceeds toward the low order byte
in register R6 for the number of characters specified in R4. The contents
of all other byte locations of registers R5 and R6 are undefined.

H_PUT_TERM_CHAR

Syntax:

Parameters:

termno: The unit address of the Vterm IOA, from the
“reg” property of the Vterm IOA.

len: The length of the string to transmit through the virtual terminal port. Valid values are in the range of 0-16.

char0_7 and char8_15: The string starts in the high order byte of register R6 and proceeds toward the low order byte in register R7.

Semantics:

Hypervisor checks the termno parameter for validity against the Vterm IOA unit addresses assigned to the partition, else returns H_Parameter.

Hypervisor returns H_Hardware if it detects that the virtual console terminal physical connection is not working.

Hypervisor returns H_Closed if it detects that the virtual console session is not open (in the case of connection to a server Vterm IOA, this means that the server code has not made the connection to this specific client Vterm IOA).

If the length parameter is outside of the values 0-16, the hypervisor immediately returns H_Parameter with no other action.

If the partition’s virtual console terminal buffer has room for the entire string, the hypervisor queues the output string and returns H_Success.

Note: There is always room for a zero length string (a zero length write can be used to test the virtual console terminal connection).

If the buffer cannot hold the entire string, no data is enqueued and the return code is H_Busy.
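An informative C sketch of a polled use of these two hcall()s follows; the wrapper names and return code values are illustrative.

  #include <stdint.h>

  /* Return code values are illustrative. */
  #define H_SUCCESS 0L
  #define H_BUSY    1L

  /* Hypothetical wrappers around the register-based interfaces above. */
  extern long h_get_term_char(uint64_t termno, uint64_t *len,
                              uint64_t *c0, uint64_t *c1);
  extern long h_put_term_char(uint64_t termno, uint64_t len,
                              uint64_t c0, uint64_t c1);

  /* Drain whatever is buffered for this Vterm and echo it back. */
  static void vterm_echo(uint64_t termno)
  {
      uint64_t len, c0, c1;

      /* len == 0 means the input buffer is empty. */
      while (h_get_term_char(termno, &len, &c0, &c1) == H_SUCCESS && len > 0) {
          /* H_Busy means nothing was enqueued: retry the whole string. */
          while (h_put_term_char(termno, len, c0, c1) == H_BUSY)
              ;
      }
  }

Interrupts

The interrupt source number is presented in the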
“interrupts” property of the
Vterm node, when receive queue interrupts are
implemented for the Vterm. The
ibm,int-on,
ibm,int-off,
ibm,set-xive, ibm,get-xive RTAS calls, and
H_VIO_SIGNAL hcall() are used to manage the interrupt.Interrupts and the
“interrupts” property are always
implemented for the server Vterm IOA, and may be implemented for the
client Vterm IOA.The interrupt mechanism is edge-triggered and is capable of
presenting only one interrupt signal at a time from any given interrupt
source. Therefore, no additional interrupts from a given source are ever
signaled until the previous interrupt has been processed through to the
issuance of an H_EOI hcall(). Specifically, even if the interrupt mode is
enabled, the effect is to interrupt on an empty to non-empty transition
of the receiver queue or upon the closing of the connection between the
server and client. However, as with any asynchronous posting operation
race conditions are to be expected. That is, an enqueue can happen in a
window around the H_EOI hcall(), so the receiver should poll the receive queue using H_GET_TERM_CHAR after an H_EOI, to prevent losing initiative.

R1--1.For the Server Vterm option: The platform must
implement the
“interrupts” property in all server
Vterm device tree nodes (
vty-server), and must set the interrupt in that
property for the receive data interrupt for the IOA.R1--2.For the Client Vterm and Server Vterm options: When
implemented, the characteristics of the Vterm interrupts must be as
follows:All must be edge-triggered.The receive interrupt must be activated when the Vterm receive
queue goes from empty to non-empty.The receive interrupt must be activated when the Vterm
connection from the client to the server goes from open to closed.
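An informative C sketch of this receive-interrupt discipline follows; the wrapper names, and the 1/0 enable/disable encoding assumed for H_VIO_SIGNAL, are illustrative.

  #include <stdint.h>

  /* Hypothetical wrappers. */
  extern void h_vio_signal(uint64_t unit_addr, uint64_t mode); /* 1 = enable, 0 = disable (assumed) */
  extern void h_eoi(uint32_t xirr);
  extern void drain_receive_queue(uint64_t termno); /* H_GET_TERM_CHAR until len == 0 */

  static void vterm_rx_interrupt(uint64_t termno, uint32_t xirr)
  {
      drain_receive_queue(termno);
      h_eoi(xirr);                 /* end-of-interrupt for this edge */
      drain_receive_queue(termno); /* re-poll: an enqueue can race the H_EOI */
  }

  static void vterm_rx_enable(uint64_t unit_addr, int on)
  {
      h_vio_signal(unit_addr, on ? 1 : 0);
  }

Client Vterm Device Tree Node (vty)

All platforms that implement LPAR also implement at least one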
client Vterm IOA per partition.R1--1.For the Client Vterm option: The H_GET_TERM_CHAR and
H_PUT_TERM_CHAR hcall()s must be implemented.R1--2.For the Client Vterm option: The platform’s OF
device tree must include as a child of the
/vdevice node, one or more nodes of type
“vty”; one for each client Vterm IOA.R1--3.For the Client Vterm option: The platform’s
vty OF node must contain properties as defined in
(other standard I/O adapter
properties are permissible as appropriate).
Properties of the vty Node (Client Vterm IOA)

“name” (Required? Y)
Standard property name per , specifying the virtual device name. The value shall be “vty”.

“device_type” (Required? Y)
Standard property name per , specifying the virtual device type. The value shall be “serial”.

“model” (Required? NA)
Property not present.

“compatible” (Required? Y)
Standard property name per , specifying the programming models that are compatible with this virtual IOA. The value shall include “hvterm1” when the virtual IOA will connect to a server with no special protocol, and shall include “hvterm-protocol” when the virtual IOA will connect to a server that requires a protocol to control modems or hardware control signals.

“used-by-rtas” (Required? NA)
Property not present.

“ibm,loc-code” (Required? Y)
Property name specifying the unique and persistent location code associated with this virtual IOA, presented as an encoded array as with encode-string. The value shall be of the form specified in .

“reg” (Required? Y)
Standard property name per , specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length “#address-cells”; the value shall be 0xwhatever (virtual “reg” property used for unit address; no actual locations are used, therefore the size field has zero cells (does not exist) as determined by the value of the “#size-cells” property).

“interrupts” (Required? See definition)
Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall(). If provided, this property will present one interrupt: the receive data interrupt.

“ibm,my-drc-index” (Required? For DR)
Present if the platform implements DR for this node.
R1--4.For the Client Vterm option: If the compatible
property in the
vty node is
“hvterm-protocol”, then the protocol
that the client must use is defined in the document entitled
Protocol for Support of Physical Serial Port Using a Virtual
TTY Interface.Server VtermServer Vterm IOAs allow a partition to serve a partner
partition’s client Vterm IOA.Server Vterm Device Tree Node (vty-server) and Other
RequirementsR1--1.For the Server Vterm option: The H_GET_TERM_CHAR,
H_PUT_TERM_CHAR, H_VTERM_PARTNER_INFO, H_REGISTER_VTERM, and H_FREE_VTERM
hcall()s must be implemented.R1--2.For the Server Vterm option: The platform’s OF
device tree for partitions implementing server Vterm IOAs must include as
a child of the
/vdevice node, one or more nodes of type
“vty-server”; one for each server Vterm
IOA.R1--3.For the Server Vterm option: The platform’s
vty-server node must contain properties as defined in
(other standard I/O adapter
properties are permissible as appropriate).
Properties of the vty-server Node (Server Vterm IOA)

“name” (Required? Y)
Standard property name per , specifying the virtual device name. The value shall be “vty-server”.

“device_type” (Required? Y)
Standard property name per , specifying the virtual device type. The value shall be “serial-server”.

“model” (Required? NA)
Property not present.

“compatible” (Required? Y)
Standard property name per , specifying the programming models that are compatible with this virtual IOA. The value shall include “hvterm2”.

“used-by-rtas” (Required? NA)
Property not present.

“ibm,loc-code” (Required? Y)
Property name specifying the unique and persistent location code associated with this virtual IOA, presented as an encoded array as with encode-string. The value shall be of the form specified in .

“reg” (Required? Y)
Standard property name per , specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length specified by “#address-cells”; the value shall be 0xwhatever (virtual “reg” property used for unit address; no actual locations are used, therefore the size field has zero cells (does not exist) as determined by the value of the “#size-cells” property).

“interrupts” (Required? Y)
Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall(). This property will present one interrupt: the receive data interrupt.

“ibm,my-drc-index” (Required? For DR)
Present if the platform implements DR for this node.

“ibm,vserver” (Required? Y)
Property name specifying that this is a virtual server node.
Server Vterm hcall()s

The following hcall()s are unique to the server Vterm IOA.

H_VTERM_PARTNER_INFO

This hcall is used to retrieve the list of Vterms to which the
specified server Vterm IOA is permitted to connect. The list is retrieved
by making repeated calls, and returns sets of triples: partner partition
ID, partner unit address, and partner location code. Passing in the
previously returned value will return the next value in the list of
allowed connections. Passing in a value of 0xFF...FF will return the
first value in the list.

Syntax:

Parameters:

unit-address: Virtual IOA’s unit address, as specified in the IOA’s device tree node.

partner-partition-id: The partner partition ID of the last partner partition ID and partner unit address pair returned. If a value of 0xFF...FF is specified, the call returns the first item in the list.

partner-unit-addr: The partner unit address of the last partner partition ID and partner unit address pair returned. If a value of 0xFF...FF is specified, the call returns the first item in the list.

buffr-addr: The logical address of a single page in memory, belonging to the calling partition, which is used to return the next triple of information (partner partition ID, partner unit address, and Converged Location Code). The calling partition cannot migrate the page during the duration of the call, otherwise the call will fail.

Buffer format on return with H_Success:

First eight bytes: Eight byte partner partition ID of the partner
partition ID and partner unit address pair from the list, or 0xFF...FF if
partner partition ID and partner unit address passed in the input
parameters was the last item in the list of valid connections.

Second eight bytes: Eight byte partner unit address associated
with the partner partition ID (as returned in first 8 bytes of the
buffer), or 0xFF...FF if the partner partition ID and partner unit
address passed in the input parameters was the last item in the list of
valid connections.

Beginning at the 17th byte in the buffer: NULL-terminated Converged Location Code associated with the partner unit address and partner partition ID (as returned in the first 16 bytes of the buffer), or just a
NULL string if the partner partition ID and partner unit address passed
in the input parameters was the last item in the list of valid
connections.

Semantics:

Validate that unit-address belongs to the partition and to a server Vterm IOA, else H_Parameter.

If partner-partition-id and partner-unit-addr together do not match a valid partner partition ID and partner unit address pair in the list of valid connections for this unit-address, then return H_Parameter.

If the 4 KB page associated with buffr-addr does not belong to the calling partition, then return H_Parameter.

If the buffer associated with buffr-addr does not begin on a 4 K boundary, then return H_Parameter.

If the calling partition attempts to migrate the buffer page associated with buffr-addr during the duration of the H_VTERM_PARTNER_INFO call, then return H_Parameter.

If partner-partition-id is equal to 0xFF...FF, then select the first item from the list of valid connections, format the buffer as specified above for this item, and return H_Success.

If partner-partition-id and partner-unit-addr together match a valid partner partition ID and partner unit address pair in the list of valid connections, and this is the last valid connection in the list, then format the buffer as specified above, with the partner partition ID and partner unit address both set to 0xFF...FF and the Converged Location Code set to the NULL string, and return H_Success.

If partner-partition-id and partner-unit-addr together match a valid partner partition ID and partner unit address pair in the list of valid connections, then select the next item from the list of valid connections, format the buffer as specified above, and return
H_Success.
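An informative C sketch of walking the allowed-partner list with repeated H_VTERM_PARTNER_INFO calls follows; the wrapper and helper names are illustrative, and the buffer is the caller's 4 KB page described above.

  #include <stdint.h>

  #define H_SUCCESS   0L
  #define END_OF_LIST 0xFFFFFFFFFFFFFFFFULL   /* 0xFF...FF sentinel */

  /* Hypothetical wrapper; buf_logical is the logical address of a
   * 4 KB-aligned page belonging to the caller. */
  extern long h_vterm_partner_info(uint64_t unit_addr, uint64_t partner_id,
                                   uint64_t partner_ua, uint64_t buf_logical);
  extern void record_partner(uint64_t id, uint64_t ua, const char *loc_code);

  static uint64_t be64_at(const uint8_t *p)    /* big-endian doubleword */
  {
      uint64_t v = 0;
      for (int i = 0; i < 8; i++)
          v = (v << 8) | p[i];
      return v;
  }

  static void list_partners(uint64_t unit_addr, uint8_t *page, uint64_t page_logical)
  {
      uint64_t id = END_OF_LIST, ua = END_OF_LIST;  /* start at list head */

      while (h_vterm_partner_info(unit_addr, id, ua, page_logical) == H_SUCCESS) {
          id = be64_at(page);          /* first 8 bytes: partner partition ID */
          ua = be64_at(page + 8);      /* second 8 bytes: partner unit address */
          if (id == END_OF_LIST && ua == END_OF_LIST)
              break;                   /* prior entry was the last in the list */
          record_partner(id, ua, (const char *)page + 16); /* location code */
      }
  }

H_REGISTER_VTERM

This hcall has the effect of “opening” the connection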
to the client Vterm IOA in the specified partition ID and which has the
specified unit address. The architectural metaphor for this is the
connecting of the cable between two Async IOAs. The hcall fails if the
partition does not have the authority to connect to the requested
partition/unit address pair. The hcall() also fails if the specified
partition/unit address is already in use (for example, by another
partition or the HMC).

Syntax:

Parameters:

unit-address: The server virtual IOA’s unit address, as specified in the IOA’s device tree node.

partner-partition-id: The partition ID of the partition ID and unit address pair to which to be connected.

partner-unit-addr: The unit address of the partition ID and unit address pair to which to be connected.

Semantics:

Validate that unit-address belongs to the partition and to a server Vterm IOA, and that there does not exist a valid connection between this server Vterm IOA and a partner, else H_Parameter.

If partner-partition-id and partner-unit-addr together do not match a valid partition ID and unit address pair in the list of valid connections for this unit-address, then return H_Parameter.

Else make the connection between the server Vterm IOA specified by unit-address and the client Vterm IOA specified by the partner-partition-id and partner-unit-addr pair, allowing future H_PUT_TERM_CHAR and H_GET_TERM_CHAR operations to flow between the two Vterm IOAs, and return H_Success.

Software Implementation Note: An H_Parameter will be
returned to the H_REGISTER_VTERM if a DLPAR operation has been performed
which changes the list of possible server to client Vterm connections.
After a DLPAR operation which affects a partition’s server Vterm
IOA connection list, a call to H_VTERM_PARTNER_INFO is needed to get the
current list of possible connections.

H_FREE_VTERM

This hcall has the effect of “closing” the connection to the partition/unit address pair. The architectural metaphor for this is the removal of the cable between two Async IOAs. After closing, the partner partition’s client Vterm IOA would now be available for serving by another partner (for example, another partition or the HMC).

Syntax:

Parameters:

unit-address: Virtual IOA’s unit address, as specified in the IOA’s device tree node.

Semantics:

Validate that the unit address belongs to the partition and to a server Vterm IOA, and that there exists a valid connection between this server Vterm IOA and a partner, else H_Parameter.

Break the connection between the server Vterm IOA specified by the unit address and the client Vterm IOA, preventing further H_PUT_TERM_CHAR and H_GET_TERM_CHAR operations between the two Vterm IOAs (until a future successful H_REGISTER_VTERM operation), and return H_Success.

Implementation Note: If the hypervisor returns an
H_Busy, H_LongBusyOrder1mSec, or H_LongBusyOrder10mSec, software must
call H_FREE_VTERM again with the same parameters. Software may choose to
treat H_LongBusyOrder1mSec and H_LongBusyOrder10mSec the same as H_Busy.
The hypervisor, prior to returning H_Busy, H_LongBusyOrder1mSec, or
H_LongBusyOrder10mSec, will have placed the virtual adapter in a state
that will cause it to not accept any new work nor surface any new virtual
interrupts.
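An informative C sketch of the connect/disconnect pairing, including the busy retry discipline from the implementation note above, follows; the wrapper names and return code values are illustrative.

  #include <stdint.h>

  /* Return code values are illustrative. */
  #define H_SUCCESS                 0L
  #define H_BUSY                    1L
  #define H_LONG_BUSY_ORDER_1_MSEC  9900L
  #define H_LONG_BUSY_ORDER_10_MSEC 9901L

  /* Hypothetical wrappers. */
  extern long h_register_vterm(uint64_t unit_addr, uint64_t partner_id,
                               uint64_t partner_ua);
  extern long h_free_vterm(uint64_t unit_addr);

  /* "Connect the cable": fails with H_Parameter if the pair is not in
   * the allowed list (e.g., stale after a DLPAR operation; re-walk the
   * list with H_VTERM_PARTNER_INFO and retry). */
  static long vterm_connect(uint64_t unit_addr, uint64_t partner_id, uint64_t partner_ua)
  {
      return h_register_vterm(unit_addr, partner_id, partner_ua);
  }

  /* "Disconnect the cable": per the implementation note above, busy
   * returns are retried with the same parameters. */
  static void vterm_disconnect(uint64_t unit_addr)
  {
      long rc;
      do {
          rc = h_free_vterm(unit_addr);
      } while (rc == H_BUSY || rc == H_LONG_BUSY_ORDER_1_MSEC ||
               rc == H_LONG_BUSY_ORDER_10_MSEC);
  }

Virtual Management Channel (VMC)

PAPR Virtual Management Channel (VMC) support is provided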
by code running in a logical partition that uses the
mechanisms of the Reliable Command/Response Transport and
Logical Remote DMA of the Synchronous VIO Infrastructure
to service and to send requests to platform code. The
purpose of this interface is to communicate platform
management information between a designated logical partition and the platform.

The VMC architecture is built upon the architecture specified in the following sections:

Virtual Management Channel Requirements

This normative section provides the general requirements for the support of VMC.

R1--1.For the VMC option: The platform must
implement the Reliable Command/Response Transport option
as defined in .

R1--2.For the VMC option: The platform must
implement the Logical Remote DMA option as defined in
.

In addition to the firmware primitives, and the structures
they define, the partition’s OS needs to know specific information
regarding the configuration of the virtual IOAs that it has been
assigned so that it can load and configure the
correct device driver code. This information is provided by the Open
Firmware device tree node associated with the
virtual IOA (see ).

Partition Virtual Management Channel Device Tree Node

Partition VMC IOA nodes have no child nodes.

R1--1.For the VMC option: The platform’s
Open Firmware device tree for client partitions must include as
a child of the /vdevice
node, a node of type “vmc” as the
parent of a sub-tree representing the virtual IOAs
assigned to the partition.R1--2.For the VMC option: The platform’s
vmc
Open Firmware node must contain properties as defined in
(other standard I/O adapter properties are permissible as appropriate).
Properties of the VMC Node in the Client Partition

“name” (Required? Y)
Standard property name per IEEE 1275, specifying the virtual device name; the value shall be “ibm,vmc”.

“device_type” (Required? Y)
Standard property name per IEEE 1275, specifying the virtual device type; the value shall be “ibm,vmc”.

“model” (Required? NA)
Property not present.

“compatible” (Required? Y)
Standard property name per IEEE 1275, specifying the programming models that are compatible with this virtual IOA; the value shall be “IBM,vmc”.

“used-by-rtas” (Required? See definition)
Present if appropriate.

“ibm,loc-code” (Required? Y)
Property name specifying the unique and persistent location code associated with this virtual IOA, presented as an encoded array as with encode-string. The value shall be of the form specified in the information on Virtual Card Connector Location Codes.

“reg” (Required? Y)
Standard property name per IEEE 1275, specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length “#address-cells”; the value shall be 0xwhatever (virtual “reg” property used for unit address; no actual locations are used, therefore the size field has zero cells (does not exist) as determined by the value of the “#size-cells” property).

“ibm,my-dma-window” (Required? Y)
Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of two sets (two window panes) of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int. Of these two triples, the first describes the window pane used to map server partition (the designated management partition) memory; the second is the window pane through which the client partition (the platform partition) maps the memory that it makes available to the server partition.

“interrupts” (Required? Y)
Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().

“ibm,my-drc-index” (Required? For DR)
Present if the platform implements DR for this node.

“ibm,#dma-size-cells” (Required? See definition)
Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in the section on System Bindings.

“ibm,#dma-address-cells” (Required? See definition)
Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in the section on System Bindings.
Virtual Asynchronous Services Interface (VASI)

The PAPR Virtual Asynchronous Services Interface (VASI) allows an authorized virtual server partition (VSP) to safely access the internal state of a specific partition. The access provided by VASI enables high level administrative services such as partition migration, hibernation, and virtualization of partition logical real memory. VASI uses the mechanisms of the Reliable Command/Response Transport and Logical Remote DMA of the Synchronous VIO Infrastructure to service requests.

VASI is built upon the architecture specified in the following sections:

VASI Overview

VASI Streams, Services and States

A single VASI virtual IOA may be capable of supporting
multiple streams of operations (up to the number presented in the
“ibm,#vasi-streams”
property, see
)
each representing a specific high level operation such as an
individual logical partition migration, or a unique logical
partition hibernation, etc. The hypervisor and the various
logical partitions use the VASI_Stream_ID as a handle to associate
the services that each provide to the specific high level function.
Similarly a single VASI virtual IOA may be
capable of supporting multiple service sessions (opens) for
each VASI_Stream_ID (up to the number negotiated by the
#Opens field of the capabilities string, see
).VASI streams and individual service sessions may be in
one of several states. Refer to the specific high level function description in
for the state descriptions and transition triggers that are defined for each high level function.

VASI Handles

VASI defines several versions of handles. The VASI Stream
ID is used to associate the elements of the same high level
function (such as a specific partition migration operation).
In this case, the various partitions are assigned roles and a
VASI Stream ID. By opening a VASI virtual IOA with a given
VASI Stream ID and Service parameter, the partition declares
its intent to perform the specified service for the specific
high level operation. By means outside the scope of
PAPR, the platform is told to expect such service from the
specific partition; thus when the match happens, the high
level operation is enabled. At open time, the platform and
partition negotiate a pair of convenient handles to use as a
substitute for the architecturally opaque VASI Stream ID.
This pair of 4 byte handles is called the TOP/BOTTOM.
The TOP field is used by the partition to denote its
operations for the specific VASI Stream ID, while the BOTTOM
field provides that function for the platform firmware.

The first 8 bytes of a VASI data buffer are reserved for
Virtual Server Partition (VSP) use as the buffer correlator field.
The buffer correlator field is architecturally opaque. The
architectural intent is that the buffer correlator field is a VSP
handle to the data buffer control block.

Semantic Conventions

The convention for the specification of VASI CRQ message
processing semantics is via a specifically ordered sequence
of operations. Implementations are not required to code in these
sequences but are required to appear as if they
did. In general, parameters and operational state are first
verified, followed by the operation to be performed if all the
parameter/state checks succeed. If a check fails, a response is
generated at that point and further processing of the message
is abandoned. Note that standard CRQ processing operations
(message enqueue and dequeue processing, such as finding the next valid message and resetting the valid message bit when processing is complete; see )
are assumed and not explicitly included in the semantic
specification.

VASI Data Buffers (Normative)

Data buffers used by VASI are defined as from ILLAN (see ). VASI references data buffers via a valid buffer descriptor (Control Byte = 0x80), as from ILLAN (see ), relative to the first pane of the VASI virtual IOA “ibm,my-dma-window” property. The first 8 bytes of a
data buffer are reserved for an OS opaque handle.
A filled data buffer contains either a VASI Download Request
Specifier or a VASI Operation Request Specifier; refer to
or
respectively, following the opaque handle. Buffers are supplied to
the VASI virtual IOA via the VASI Add Buffer CRQ request message,
and returned to the VASI device driver in operation requests such as the VASI
Operation CRQ request message or, for those that have not
been used by operation requests, via responses to the VASI
Free Buffer CRQ request message. Closing a VASI service
session releases buffers queued for that service session in
the VASI virtual IOA, while deregistering the VASI virtual
IOA CRQ does the same for all of the VASI virtual IOA
service sessions.R1--1.For the VASI option: The platform
must implement the Reliable Command/Response Transport option (See
).R1--2.For the VASI option: The storage
for VASI data buffers to be used by a given VASI virtual IOA
must be TCE mapped within the first pane of the
“ibm,my-dma-window”
property as presented in the Open Firmware device tree node of the VASI virtual IOA.

R1--3.For the VASI option: Firmware must
not modify the first 8 bytes (Buffer Correlator field) of a VASI data buffer.R1--4.For the VASI option: Immediately following
the first 8 bytes of a filled VASI data buffer must be either
a VASI Download Request Specifier or a VASI Operation Request Specifier.R1--5.For the VASI option: The VASI Download
Request Specifier must be formatted as per
.R1--6.For the VASI option: The VASI Operation
Request Specifier must be formatted as per
.

VASI Download Request Specifier

The VASI Download Request Specifier is presented in
.
The VASI Download Request Specifier is used with the VASI Download Request
message; see
.

VASI Operation Request Specifier

The VASI Operation Request Specifier is presented in
.
The TOP/BOTTOM (8 bytes) field is a pair of 4 byte opaque handles
as negotiated by the VASI Open Request/Response pair; see
.

Expected semantics of VASI operation requests:

Operation length is communicated by the summation of the lengths of the buffer segment control structures following the operation correlator field.

Operations that write at the end of the file normally extend the file. If extending the file is not possible due to resource constraints, then the operation is aborted at the end of the file, and the VASI operation response message carries the “End of File” status with the Residual field containing the number of requested bytes that were not transferred (a Residual value of all ones indicates Residual field overflow).

Read operations that access beyond the end of the file are aborted at the end of the file. The VASI operation response message carries the “End of File” status with the Residual field containing the number of requested bytes that were not transferred (a Residual value of all ones indicates Residual field overflow).

Sequential writes deliver the input stream of bytes to the receiver in the same order, but not necessarily in the same blocking as originated by the sender.

Index operations carry the additional semantic over the corresponding sequential operation that they are a collection of one or more sub-operations of the same type (read/write). Each sub-operation specification starts with a control field encoding of 0xC0 that carries the 512 byte file block index of the start of the operation. The file cursor can then be adjusted within the block using a control field of 0x40 followed by a 3 byte binary offset (legal values 0-511). This offset allows operations to begin on any byte boundary within the specified 512 byte block index. The remainder of each sub-operation specification is a scatter gather list. The sub-operation length is defined by the number of bytes of data/buffer supplied in the sub-operation scatter gather list.

The “Hardware” status code indicates a failure due to any hardware problem including physical I/O.

The “Invalid Buffer Correlator” status code is reserved for failure to find the operation buffer.

The “Invalid VASI Operation Request Specifier” status code is used for any failure in the decode of the operation buffer not specifically called out by a previous semantic.

The first control field of a scatter gather list may be a byte offset encoded with a control field of 0x40 and followed by a 3 byte binary offset (legal values 0-511). This offset allows operations to begin on any byte boundary within the specified 512 byte block index.

The control field encoding 0xC0 indicates that the original operation is conjoined with a second indexed operation of the same direction starting at a new 512 byte block index (as indicated in the following 7 bytes). The conjoined index operation has its own scatter gather list, optionally starting with a byte offset, followed by one or more data
buffers.

Operation Modifiers:

000: Base Operation

001: Server Takeover Warning: informs the targeted VASI server that another VASI server had previously hosted the operation stream and that it may need to perform additional steps to process this request.

010 - 111: Reserved
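As an informative sketch, the control fields that introduce one indexed sub-operation can be encoded as follows; the function is illustrative and assumes big-endian packing of the 7-byte block index and 3-byte offset described above.

  #include <stddef.h>
  #include <stdint.h>

  /* Encode the control fields that introduce one indexed sub-operation:
   * 0xC0 plus a 7-byte 512-byte-block index, optionally followed by
   * 0x40 plus a 3-byte intra-block offset (0-511). The caller appends
   * the scatter/gather buffer descriptors afterwards. */
  static size_t encode_indexed_subop(uint8_t *out, uint64_t block_index,
                                     uint32_t offset /* 0-511 */)
  {
      size_t n = 0;

      out[n++] = 0xC0;                           /* new block index follows */
      for (int shift = 48; shift >= 0; shift -= 8)
          out[n++] = (uint8_t)(block_index >> shift);   /* 7-byte index */

      if (offset != 0) {
          out[n++] = 0x40;                       /* byte offset follows */
          out[n++] = (uint8_t)(offset >> 16);    /* 3-byte offset */
          out[n++] = (uint8_t)(offset >> 8);
          out[n++] = (uint8_t)offset;
      }
      return n;   /* caller appends the scatter/gather list here */
  }

VASI CRQ Message Definition (Normative)

For the VASI interface, all CRQ messages are defined to use the following base format: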
General Form of VASI Reliable CRQ Element

  Byte Offset | Field Name                  | Description
  0           | Header                      | Contains Element Valid Bit plus Event Type Encodings ( ).
  1           | Format/Transport Event Code | For Valid Command Response Entries, see . For Transport Event Codes see .
  2-15        | Payload                     | Format dependent.
R1--1.For the VASI option: The format byte of VASI
CRQ messages must be as defined in
.
Reliable CRQ Entry Format Byte Definitions for VASI (Header=0x80)

  Format Byte Value | Definition
  0x0               | Unused
  0x1               | VASI Capabilities Request
  0x2               | VASI Open Request
  0x3               | VASI Close Request
  0x4               | VASI Add Buffer Request
  0x5               | VASI Free Buffer Request
  0x6               | VASI Download Request
  0x07              | VASI Operation Request
  0x8               | VASI Signal Request
  0x9               | VASI State Request
  0x0A - 0x0F       | Reserved
  0x10              | VASI Progress Request
  0x11 - 0x80       | Reserved
  0x81              | VASI Capabilities Response
  0x82              | VASI Open Response
  0x83              | VASI Close Response
  0x84              | VASI Add Buffer Response
  0x85              | VASI Free Buffer Response
  0x86              | VASI Download Response
  0x87              | VASI Operation Response
  0x88              | VASI Signal Response
  0x89 - 0x8F       | Reserved
  0x90              | VASI Progress Response
  0x91 - 0xFF       | Reserved
R1--2.For the VASI option: The status byte
of VASI CRQ response messages must be as defined in
Table 252, “VASI Reliable CRQ Response Status Values,” on page 721.
VASI Reliable CRQ Response Status Values

  Status Value | Definition
  0x0          | Success
  0x1          | Hardware Error
  0x2          | Invalid Stream ID
  0x3          | Stream ID Abort
  0x4          | Invalid Buffer Descriptor: Either the IOBA is too large for the LIOBN or its logical TCE does not contain a valid logical address mapping.
  0x5          | Invalid buffer length: Either the buffer is less than the minimum useful buffer size or it does not match one of the first “ibm,#buffer-pools” sizes that were added.
  0x6          | Empty: The request could not be satisfied because the buffer pool was empty.
  0x7          | Invalid VASI Download Request Specifier
  0x8          | Invalid VASI Download data: The download data format is invalid.
  0x9          | Invalid Buffer Correlator: Does not correspond to an outstanding data buffer.
  0x0A         | Invalid VASI Operation Request Specifier
  0x0B         | Invalid Service Specifier
  0x0C         | Too many opens
  0x0D         | Invalid BOTTOM
  0x0E         | Invalid TOP
  0x0F         | End of File
  0x10         | Invalid Format
  0x11         | Unknown Reserved Value
  0x12         | Invalid State Transition
  0x13         | Race Lost
  0x14         | Invalid Signal Code
  0x15 - 0xFF  | Reserved
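For illustration, the 16-byte element of the general-form table, together with the request/response convention from the format byte table (responses set the high order format byte bit), can be expressed as the following C sketch; the names are illustrative.

  #include <stdint.h>

  /* 16-byte VASI CRQ element, per the general-form table above. The
   * status byte position within a response is defined by the
   * individual message formats and is not shown here. */
  struct vasi_crq {
      uint8_t header;        /* element valid bit plus event type encodings */
      uint8_t format;        /* requests 0x1-0x10; responses have 0x80 set */
      uint8_t payload[14];   /* format dependent */
  };

  static int vasi_is_response(const struct vasi_crq *e)
  {
      return (e->format & 0x80) != 0;   /* e.g., 0x84 = VASI Add Buffer Response */
  }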
VASI Request/Response Pairs

R1--1.For the VASI option:
The platform must validate the format byte in all VASI messages that it receives.R1--2.For the VASI option:
The platform must initiate the processing of VASI messages in the order received
on a given CRQ.R1--3.For the VASI option:
If the format byte value of a received VASI message, as specified in
,
is “Unused”, “Reserved”, “VASI Operation Request”, or a response other
than “VASI Operation Response”, the platform must declare the format byte invalid.R1--4.For the VASI option:
If the format byte value is invalid, then the platform must generate a response
message on the corresponding CRQ by copying the received
message with the high order format byte bit set
to a one and the status byte with the “Invalid Format”
status code, and discard the received CRQ message.R1--5.For the VASI option:
The platform must fill in all reserved fields in VASI messages that it generates with zeros.R1--6.For the VASI option:
The platform must check that all reserved fields in a VASI message that it receives, except for the Capability String of the VASI Exchange Capabilities message, are filled with zeros,
else return the corresponding VASI reply message with a status of “Unknown Reserved Value”.R1--7.For the VASI option:
The VASI Exchange Capabilities message must be as defined in
.R1--8.For the VASI option:
The VASI Open message must be as defined in
.R1--9.For the VASI option:
The platform must process the VASI Open Request message per the semantics described in
.R1--10.For the VASI option:
The VASI Close message must be as defined in
.R1--11.For the VASI option:
The platform must process the VASI Close Request message per the semantics described in
.R1--12.For the VASI option:
The VASI Add Buffer message must be as defined in
.R1--13.For the VASI option:
The platform must process the VASI Add Buffer Request message per the semantics described in
.R1--14.For the VASI option:
The VASI Free Buffer message must be as defined in
.R1--15.For the VASI option:
The platform must process the VASI Free Buffer Request message per the semantics described in
.R1--16.For the VASI option:
The platform must process the VASI Download Request message per the semantics described in
.R1--17.For the VASI option:
The VASI Download message must be as defined in
.R1--18.For the VASI option:
The platform must process the VASI Operation Response message per the semantics described in
.R1--19.For the VASI option:
The VASI Operation message must be as defined in
.R1--20.For the VASI option:
The platform must process the VASI State Request message per the semantics described in
.R1--21.For the VASI option:
The VASI State message must be as defined in
.R1--22.For the VASI option:
The platform must process the VASI Progress Request message per the semantics described in
.R1--23.For the VASI option:
The VASI Progress message must be as defined in
.R1--24.For the VASI option:
The platform must process the VASI Signal Request message per the semantics described in
.R1--25.For the VASI option:
The VASI Signal message must be as defined in
.R1--26.For the VASI option:
To avoid a return code of “Invalid TOP” or “Invalid BOTTOM”, the VASI messages VASI Progress, VASI Add Buffer, VASI Free Buffer, VASI Download, VASI Operation, VASI Signal, and VASI State requests must only be sent after successful VASI Opens and prior to a VASI Close.

VASI Exchange Capabilities

The VASI Exchange Capabilities command/response pair is used to negotiate run time characteristics of the VASI virtual
IOA. The using partition issues one VASI Exchange Capabilities request message for each service that it plans to
support, filling in the Capability String field of the exchange capabilities request (see
)
with the values that it plans to use for that service and enqueues the request. The VASI virtual
IOA copies to the response Capability String, the values from the request capability string that it can support. The Capability
string boolean fields are defined such that zero indicates that the characteristic is not supported. Capability
string fields that represent numeric values may be reduced by the VASI virtual IOA from the requested value to the
supported value, with the numeric value zero being possible.

Status Values defined for the VASI Exchange Capabilities response message:

Success
Hardware

Capability String Fields (locations given as Byte:Bit - Byte:Bit):

Service (3:0 - 3:7)
Supported Services code; see .

Reserved 1 (4:1 - 13:5)
Reserved for future expansion.

Supported Download Forms (13:6 Immediate, 13:7 Indirect)
The forms of VASI Download that are supported. This is a bit field, so any combination is possible to represent. Immediate and indirect refer to the buffer placement, either directly following in the operation specifier or at a location specified by an address.

Supported Operations (14:0 Read Sequential Immediate, 14:1 Read Sequential Indirect, 14:2 Read Indexed Immediate, 14:3 Read Indexed Indirect, 14:4 Write Sequential Immediate, 14:5 Write Sequential Indirect, 14:6 Write Indexed Immediate, 14:7 Write Indexed Indirect)
The forms of VASI Operations that are supported. This is a bit field, so any combination is possible to represent. Sequential and indexed refer to the starting point of the operation (following the last operation or at a specific block offset). Immediate and indirect refer to the buffer placement, either directly following in the operation specifier or at a location specified by an address.

#Opens (15:0 - 15:7)
Number of opens (unique TOP/BOTTOM pairs) per VASI stream that are supported on this VASI virtual IOA. Valid values are 1-255.
VASI Open

The VASI Open Command message, see
,
indicates to the hypervisor that the originator VASI device driver is prepared to provide
the indicated processing service (role) for the indicated VASI stream.

The VASI Open Response message indicates to the originating VASI device driver that the hypervisor is prepared to
proceed with the indicated VASI stream.

Status Values defined for the VASI Open response message:

Success
Hardware
Invalid Stream ID: The Stream ID parameter is not currently valid for this VASI virtual device.
Stream ID Aborted
Too many opens
Invalid Service Specifier: Either a reserved value or a service not defined for this VASI stream.

Semantics for VASI Open Request Message:

Construct VASI Open Response message prototype (including the service parameter from the request).

Copy the low order 8 bytes from the request message to the response prototype.

Verify that the Stream ID parameter of the VASI Open Request message is valid for the caller, else respond with the
status of “Invalid Stream ID”.

Verify that the Service parameter of the VASI Open Request message is valid for the caller plus Stream ID pair, else
respond with the status of “Invalid Service Specifier”. Note that the valid “Service” values vary with the specific
high level function being performed (see
)
and the role assigned to the calling partition by mechanisms outside of the scope of PAPR.

If the state of the VASI stream specified by the Stream ID of a VASI Open Request message is “Aborted”, respond
with the status value of “Stream ID Aborted”.

If the maximum number of opens has not been reached, then allocate control structures to maintain the state of this
open instance and associate them with a unique BOTTOM parameter -- copy BOTTOM parameter to response message;
else respond with “Too many opens”.

Record the associated TOP parameter value for use in subsequent VASI response and operation request messages.

Respond with Success.

VASI Close

The VASI Close Command message, see
,
requests the receiver to close the indicated BOTTOM instance of the VASI stream.
Note that other BOTTOM instances remain open.

The VASI Close Response message indicates that the VASI Close command receiver has processed the associated
VASI Close Command message and all previously enqueued messages to the BOTTOM instance. No further CRQ
messages will be enqueued by the closed BOTTOM service, and all enqueued buffers are forgotten.

Status Values defined for the VASI Close response message:

Success
Hardware
Invalid BOTTOM

Semantics for VASI Close Request Message:

Construct VASI Close Response message prototype (copy the low order 14 bytes from the request message).

Validate that the BOTTOM parameter is valid for the caller, else respond “Invalid BOTTOM”.

Transition the service for the specified VASI stream instance to the “Closed” state -- this process ends after all previously
initiated VASI request messages for the BOTTOM instance have completed.Insert the TOP recorded at open time for the specified BOTTOM into the response prototype.Free queued buffers and deallocate the control structures associated with the BOTTOM parameter, then respond
Success.VASI Add BufferThe VASI Add Buffer Command message, see
VASI Add Buffer

The VASI Add Buffer Command message, see
,
indicates to the hypervisor that the originator VSP device driver is providing the hypervisor with an empty buffer for the specific BOTTOM instance.

The hypervisor organizes its input buffers into N buffer pools per service, by size, as indicated by the buffer descriptor. The VASI
“ibm,#buffer-pools”
device tree property relates how many buffer size pools the firmware implements. The first N different sizes supplied by the device driver specify the sizes of the N buffer size pools -- buffers of other sizes are rejected.

The VASI Add Buffer Response message indicates to the originating VASI device driver that the hypervisor has processed the associated VASI Add Buffer Command message. All VASI Add Buffer CRQ messages generate a VASI Add Buffer Response message to provide feedback to the VASI device driver for flow control of the firmware's VASI CRQ.

The successful Add Buffer Response CRQ message indicates the buffer size of the pool upon which the buffer was enqueued,
and the number of free buffers on the indicated buffer size pool after the add (to indicate buffer utilization).

Status Values defined for the VASI Add Buffer response message:
- Success
- Hardware
- Invalid BOTTOM
- Invalid Buffer Descriptor
- Invalid Buffer Length

Semantics for VASI Add Buffer Request Message:
1. Construct VASI Add Buffer Response message prototype (copy the low order 14 bytes from the request message to the response prototype).
2. Validate the BOTTOM field, else respond “Invalid BOTTOM”.
3. Insert the TOP recorded for the open BOTTOM into the response prototype.
4. Validate that the high order Buffer Descriptor bit is 0b1, else respond with “Invalid Buffer Descriptor”.
5. Validate that the Buffer Descriptor address translates through the LIOBN of the first pane of the
“ibm,my-dma-window”
property, else respond with “Invalid Buffer Descriptor”.
6. Copy the first 8 bytes at the translated Buffer Descriptor address into the low order 8 bytes of the response prototype.
7. If the Buffer Descriptor length field does not match the buffer length of one of the buffer pools, then: if buffer lengths are assigned to all buffer pools, respond with “Invalid Buffer Length”; else select an unassigned buffer pool and assign its length to match the length field of the Buffer Descriptor.
8. Enqueue the buffer descriptor onto the per service session (“BOTTOM”) pool whose buffer length matches the length field of the Buffer Descriptor and increment the Free Buffers in Pool count for the pool; insert the result into the response prototype along with the buffer size, clear the reserved fields, and respond with “Success”.
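A brief informative sketch of the pool-assignment rule in steps 7 and 8: the first N distinct buffer sizes supplied define the N pools, and later buffers must match one of them. The pool bookkeeping structure here is an illustrative stand-in.

```
#include <stdint.h>
#include <stddef.h>

struct buffer_pool { uint32_t length; int assigned; uint32_t free_count; };

/* Returns the pool the buffer lands in, or NULL for "Invalid Buffer Length". */
struct buffer_pool *
pool_for_buffer(struct buffer_pool *pools, size_t npools, uint32_t buf_len)
{
    size_t i;
    for (i = 0; i < npools; i++)        /* match an already assigned pool */
        if (pools[i].assigned && pools[i].length == buf_len)
            return &pools[i];
    for (i = 0; i < npools; i++)        /* first N distinct sizes assign pools */
        if (!pools[i].assigned) {
            pools[i].assigned = 1;
            pools[i].length   = buf_len;
            return &pools[i];
        }
    return NULL;   /* all pools assigned and the size matches none: reject */
}
```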
VASI Free Buffer

The VASI Free Buffer Command message, see
,
requests the hypervisor to return an empty data buffer of the specified size to the originator VSP device driver. This call is used to recover buffers, for example at the completion of a VASI operation stream. All buffers added to a VASI virtual IOA service session (“BOTTOM”) are forgotten by the virtual IOA when the service session is closed or the IOA’s CRQ is deregistered.

The VASI Free Buffer Response message indicates to the originating VASI device driver that the hypervisor has processed
the associated VASI Free Buffer Command message. All VASI Free Buffer CRQ messages generate a VASI Free Buffer Response message. If the Status field of the VASI Free Buffer Response CRQ message is “Success”, then the last 8 bytes contain the Buffer Correlator (first 8 bytes of the data buffer) of the selected empty data buffer. The last 8 bytes of the VASI Free Buffer Response CRQ message are undefined for any non-“Success” Status value.

Status Values defined for the VASI Free Buffer response message:
- Success
- Hardware
- Invalid BOTTOM
- Invalid Buffer Length
- Empty

Semantics for VASI Free Buffer Request Message:
1. Construct VASI Free Buffer Response message prototype with the Buffer Correlator field zero.
2. Validate the BOTTOM field, else respond “Invalid BOTTOM”.
3. Insert the TOP recorded for the open BOTTOM into the response prototype.
4. If the request message Buffer Length field does not match one of the pool lengths, then respond “Invalid Buffer Length”.
5. If the buffer pool associated with the Buffer Length field is empty, then respond “Empty”.
6. Dequeue a Buffer Descriptor from the buffer pool associated with the Buffer Length field.
7. Copy the first 8 bytes at the translated Buffer Descriptor address into the low order 8 bytes of the response prototype and respond “Success”.

VASI Download

The VASI Download Command message, see
requests the hypervisor to process the VASI Download data buffer specified by the originator VSP device driver.

The VASI Download Response message indicates to the originating VSP that the hypervisor has processed the associated VASI Download Command message. Unless the Status field of the VASI Download Response CRQ message is “Invalid Buffer Descriptor”, the last 8 bytes contain the Buffer Correlator (first 8 bytes of the data buffer) of the data buffer specified by the Buffer Descriptor field of the VASI Download Command CRQ message. The last 8 bytes of the VASI Download Response CRQ message are undefined for the “Invalid Buffer Descriptor” Status value.

Status Values defined for the VASI Download response message:
- Success
- Hardware
- Invalid BOTTOM
- Invalid Buffer Descriptor
- Invalid VASI Download Request Specifier
- Invalid VASI Download data

Semantics for VASI Download Request Message:
1. Construct VASI Download Response message prototype (copy the low order 14 bytes from the Request message to the response prototype).
2. Validate the BOTTOM field, else respond “Invalid BOTTOM”.
3. Insert the TOP recorded for the open BOTTOM into the response prototype.
4. Validate that the high order Buffer Descriptor bit is 0b1, else respond with “Invalid Buffer Descriptor”.
5. Validate that the Buffer Descriptor address translates through the LIOBN of the first pane of the
“ibm,my-dma-window”
property, else respond with “Invalid Buffer Descriptor”.
6. Copy the first 8 bytes at the translated Buffer Descriptor address into the low order 8 bytes of the response prototype.
7. Verify that the BOTTOM parameter of the buffer’s VASI Download Request Specifier is valid for the caller and that the Download service for the associated Stream ID is Open by the caller, else respond with “Invalid VASI Download Request Specifier”.
8. The Download service processes the buffer data; if an error is detected in the buffer data, respond with “Invalid VASI Download data”, else respond with “Success”.

VASI Operation

The VASI Operation Request message, see Figure 47, “Format of the VASI Operation CRQ elements,” on page 731,
requests the receiving VSP to process the VASI Operation specified in the data buffer indicated by the Buffer Correlator field. The Buffer Correlator field is copied from the first 8 bytes of the data buffer as supplied by the VSP using the VASI Add Buffer request. VASI Operation Requests are used to upload data on migration/hibernation (Write Sequential) and for VPM paging requests (using indexed Read/Write).

The VASI Operation Response message indicates to the hypervisor that the VSP has processed the associated VASI Operation Command message. Unless the Status field of the VASI Operation Response CRQ message is “Invalid Buffer Correlator”, the last 8 bytes contain the Operation Correlator (fourth 8 bytes of the data buffer) of the data buffer that the hypervisor selected for this operation. The last 8 bytes of the VASI Operation Response CRQ message are undefined for the “Invalid Buffer Correlator” Status value. The VSP validates that the TOP parameter corresponds to an open instance against a VASI stream ID, else it responds “Invalid TOP”. Similarly, the VSP validates the format of the remainder of the buffer, else responds “Invalid VASI Operation Request Specifier”.

Status Values defined for the VASI Operation response message:
- Success
- Hardware
- Invalid Buffer Correlator
- Invalid TOP
- Invalid VASI Operation Request Specifier
- Stream ID Aborted
- End of File

Semantics for VASI Operation Response Message:
1. Verify that the Operation Correlator references a valid outstanding VASI Operation, else discard the message. NOTE: while an invalid Operation Correlator is a very serious error, there is no obvious instance to which to deliver the error.
2. Mark the operation control block referenced by the Operation Correlator with the Status and Residual values (refer to
) from the Response message, and mark the response message as being processed.
3. Further processing of the operation control block is covered in the specification for the specific VASI Operation Stream. See
.

VASI Signal

The VASI Signal Command message (See
)
informs the VASI Virtual IOA of the VASI Stream associated with the BOTTOM parameter of the condition specified by the “Signal Code” parameter; optionally, a non-zero reason code may be associated with the event so that firmware may record the event using facilities and methods that are outside the scope of this architecture.

The valid signal codes and reason codes are unique to the specific VASI operation stream. See
and
respectively for more details.

Status Values defined for the VASI Signal response message:
- Success
- Hardware
- Invalid BOTTOM
- Invalid Signal Code

Semantics for processing the VASI Signal Request Message:
1. Construct VASI Signal Response message prototype (copy the low order 14 bytes from the Request message to the response prototype).
2. Validate the BOTTOM parameter for the caller; else respond “Invalid BOTTOM”.
3. Insert the TOP recorded for the open BOTTOM into the response prototype.
4. Determine the VASI stream associated with the BOTTOM parameter.
5. If the “Signal” parameter represents an invalid signal code for the VASI operation stream represented by the BOTTOM parameter (refer to
),
then respond “Invalid Signal Code”.
6. Initiate the VASI stream event processing for the VASI operation stream represented by the BOTTOM parameter as defined under
for the current state and the condition represented by the “Signal” parameter, record the value of the “Reason” parameter, and respond “Success”.

VASI State

The VASI virtual IOA generates a VASI State Request message, see
,
to each VASI open session instance (TOP) that is associated (through a VASI Open) with the VASI Stream ID, each time the VASI stream changes state. Such state change request messages may include an optional non-zero reason code.

No VASI State Response message is defined. The VASI State Request message is informational, and the receiver does not generate a response.

The valid states, state transitions, and reason codes are unique to the specific VASI operation stream, see
.

Semantics for the VASI State Request Message (sent only after all other VASI stream state transition processing completes):
1. For each TOP opened for the VASI stream that changed state:
   a. Construct VASI State Request message prototype.
   b. Fill in the TOP from the values recorded at VASI open time.
   c. Fill in the “Reason” and “To” fields per the completed transition.
   d. Enqueue the request message to the CRQ used to open the associated TOP.
2. Mark the VASI stream state transition complete.

VASI Progress

The VASI Progress Command message, see
,
is applicable to the Migration and Hibernation high level operations. It requests the hypervisor to report the number of bytes of partition state that remain to be processed for the VASI migration/hibernation stream associated with the “BOTTOM” parameter. If this request is made prior to any state transfer requests, the reported value represents the total size of the partition state data.

If the Status field of the VASI Progress Response CRQ message is “Invalid BOTTOM”, the last 8 bytes of the VASI Progress Response CRQ message are copied unchanged from the corresponding VASI Progress Request message.

Status Values defined for the VASI Progress response message:
- Success
- Hardware
- Invalid BOTTOM
- Invalid Service Specifier

Semantics for VASI Progress Request Message:
1. Construct VASI Progress Response message prototype (copy the low order 14 bytes from the Request message to the response
prototype).
2. Validate the BOTTOM parameter for the caller, else respond “Invalid BOTTOM”.
3. Insert the TOP recorded for the open BOTTOM into the response prototype.
4. Validate that the operation stream associated with the BOTTOM parameter is either a migration or a hibernation; else respond “Invalid Service Specifier”.
5. Estimate the number of bytes left to transfer (this is best effort, since the number may constantly change), place the value into the “Number of Bytes Left” field, and respond “Success”.
   - For the source side of an operation, the estimate of the number of bytes left is the number of bytes of dirty state.
   - For the destination side of an operation, the estimate of the number of bytes left is the number of expected state bytes that the destination knows are not valid (either they were never sent or there has been an indication that they were subsequently made invalid).

VASI Virtual IOA Device Tree
Properties of the VASI Node in a Partition

“name” (Y): IBM,VASI

“device_type” (Y): IBM,VASI-1

“model” (NA): Property not present.

“compatible” (Y): IBM,VASI-1

“used-by-rtas” (N): Property not present.

“ibm,loc-code” (Y): Property name specifying the unique and persistent location code associated with this virtual IOA, presented as an encoded array as with
encode-string.
The value shall be of the form specified in
(information on Virtual Card Connector Location Codes).

“reg” (Y): Standard property name per IEEE 1275 specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length
“#address-cells”.
The value may be any unique value (the virtual
“reg”
property is used only for the unit address; no actual locations are used; therefore, the size field has zero cells (does not exist) as determined by the value of the
“#size-cells” property).

“ibm,my-dma-window” (Y): Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of triples, each consisting of three values (LIOBN, phys, size) encoded as with
encode-int,
encode-phys, and
encode-int respectively.

“interrupts” (Y): Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with
encode-int, with the first cell
containing the interrupt source number and the second cell containing the sense code 0 indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().

“ibm,my-drc-index” (For DR): Present if the platform implements DR for this node.

“ibm,#dma-size-cells” (N): Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. If the
“ibm,#dma-size-cells”
property is missing, the default value is the
“#size-cells”
property for the parent package.

“ibm,#dma-address-cells” (N): Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of the child's dma-window properties. If the
“ibm,#dma-address-cells”
property is missing, the default value is the
“#address-cells”
property for the parent package.

“ibm,#buffer-pools” (Y): Property name to define the number, encoded as with
encode-int,
of different data buffer size pools supported by the VASI virtual IOA service sessions.

“ibm,crq-size” (Y): Property name to define the size, in bytes, of the VASI virtual IOA CRQ; encoded as with
encode-int.

“ibm,#vasi-streams” (Y): Property name to define the number of simultaneous unique VASI stream IDs that may be supported by the VASI virtual IOA CRQ; encoded as with
encode-int.
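As an informative aside, the following C sketch walks an "ibm,my-dma-window" property value. Cell counts come from "ibm,#dma-address-cells"/"ibm,#dma-size-cells" (or the parent's "#address-cells"/"#size-cells"); this sketch assumes 1 cell for the LIOBN and 2 cells each for phys and size, a common 64-bit layout, and is not normative.

```
#include <stdint.h>
#include <stddef.h>

struct dma_window { uint32_t liobn; uint64_t phys; uint64_t size; };

static uint64_t cells2(const uint32_t *p)   /* join two 32-bit cells */
{
    return ((uint64_t)p[0] << 32) | p[1];
}

size_t parse_dma_windows(const uint32_t *prop, size_t ncells,
                         struct dma_window *out, size_t max)
{
    size_t n = 0;
    while (ncells >= 5 && n < max) {   /* 1 + 2 + 2 cells per triple */
        out[n].liobn = prop[0];
        out[n].phys  = cells2(&prop[1]);
        out[n].size  = cells2(&prop[3]);
        prop += 5;
        ncells -= 5;
        n++;
    }
    return n;   /* number of (LIOBN, phys, size) triples decoded */
}
```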
VASI Support hcall()s

The hcall()s of this section support the VASI option. H_DONOR_OPERATION supplies the hypervisor with processor cycles to perform administrative services. H_VASI_SIGNAL allows partitions to signal anomalous conditions, such as the need to abort the administrative service stream, without having to have an open VASI virtual IOA. H_VASI_STATE allows partitions that do not have an open VASI virtual IOA for a given VASI stream ID to poll the state of their administrative service streams.

H_DONOR_OPERATION

This hcall() supplies donor partition cycles to perform hypervisor operations for a given VASI Stream. The TOP/BOTTOM
parameter indicates the VASI stream, and thus a specific operating context relative to the caller and callee. The
cycles donated by any and all TOP/BOTTOMs associated with the VASI Stream are combined by the platform to perform
the needed processing for the stream. A platform may use the cycles from different TOP/BOTTOM pairs to create
parallel processes to improve the stream performance.

Syntax:

Parameters:
- TOP/BOTTOM_ID (The opaque handle of a specific VASI operation stream relative to the caller and callee.)

Semantics:
1. If the TOP/BOTTOM_ID parameter is invalid relative to the calling partition, return H_Parameter.
2. If the VASI stream is in the aborted state, return H_Aborted.
3. Perform the next operation associated with the specified VASI stream. Note that the amount of processing performed on any one call is limited by the interrupt hold off constraints of standard hypervisor calls. (The format of the platform operation state structure is outside of the scope of this architecture.)
4. If the specific VASI stream operation is fully complete, return H_Success.
5. If the specific VASI stream requires more processing to fully complete the platform operation and is not blocked waiting for asynchronous agent(s), return H_IN_PROGRESS.
6. If the VASI stream is blocked waiting for asynchronous agent(s), return H_LongBusyOrder* (where * is the appropriate expected waiting time).
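An illustrative donor loop built on these semantics follows. The hcall wrapper plpar_hcall_donor() and the return-code values are stand-ins for whatever the OS's hypervisor-call veneer provides; they are assumptions, not normative definitions.

```
#include <stdint.h>

#define H_SUCCESS      0      /* illustrative values, not normative */
#define H_IN_PROGRESS  19
#define H_ABORTED      -301
#define H_IS_LONG_BUSY(rc) ((rc) >= 9900 && (rc) <= 9905)

extern long plpar_hcall_donor(uint64_t top_bottom_id);  /* hypothetical */
extern void delay_for(long rc);             /* back off per the hint    */

/* Donate cycles until the stream's current operation completes or aborts. */
long donate_cycles(uint64_t top_bottom_id)
{
    long rc;
    do {
        rc = plpar_hcall_donor(top_bottom_id);
        if (H_IS_LONG_BUSY(rc))
            delay_for(rc);   /* stream blocked on an asynchronous agent */
    } while (rc == H_IN_PROGRESS || H_IS_LONG_BUSY(rc));
    return rc;               /* H_SUCCESS, H_ABORTED, or an error       */
}
```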
R1--1.For the VASI option:
The platform must implement the H_DONOR_OPERATION hcall() following
the syntax and semantics of
.

H_VASI_SIGNAL

This hcall() signals the condition specified by the “signal code”
parameter to the VASI Virtual IOA of the VASI Stream
associated with the “handle” parameter; optionally, a non-zero
“reason code” may be associated with the signal code so
that firmware may record the signal using facilities and methods
that are outside the scope of this architecture.

Syntax:

Parameters:
- handle -- the VASI Stream ID (The opaque handle of a specific VASI operation stream.)
- signal_code -- one of the values listed in
, right justified with high order bytes zero.
- reason_code -- the code the user gives as the reason for the signal, right justified with high order bytes zero -- the value is simply transported, not checked by the platform.

Semantics:
1. If the “handle” parameter is invalid relative to the calling partition, then return H_Parameter.
2. If the “signal_code” is invalid based upon the values listed in
,
then return H_P2.
3. If the “signal_code” is valid for the current VASI stream state, initiate the processing defined for the “signal_code”; else return H_NOOP.
VASI Signal Codes

Name | Value | Description | VASI Operation Stream | Valid for VASI Signal Message | Valid for H_VASI_SIGNAL
Null | 0x0 | Not used (invalid) | All | N | N
Cancel | 0x1 | Gracefully cancel processing if possible | Partition Migration, Partition Hibernation | Y | Y
Abort | 0x2 | Immediately halt function | Partition Migration, Partition Hibernation | Y | N
Suspend | 0x3 | Suspend target partition | Partition Migration, Partition Hibernation | Y | N
Complete | 0x4 | Complete paging operation | Paging | Y | N
Enable | 0x5 | Enabling paging operation | Paging | Y | N
Reserved | 0x6-0xFFFF | Reserved | All | N | N
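For illustration, a caller such as an orchestrator might request graceful cancellation of a stream as sketched below. The wrapper name is a stand-in for the OS's hcall veneer, and the reason-code encoding shown merely follows the abort reason code layout defined later (byte 0 = aborting entity); it is an assumption for the example.

```
#include <stdint.h>

#define VASI_SIGNAL_CANCEL 0x1   /* from the VASI Signal Codes table */
#define H_SUCCESS          0

extern long plpar_hcall_vasi_signal(uint64_t handle, uint64_t signal_code,
                                    uint64_t reason_code);  /* hypothetical */

int cancel_stream(uint64_t stream_id)
{
    /* The reason code is opaque to the platform; it is merely recorded.
     * Here byte 0 = 1 (Orchestrator), per the abort reason code layout. */
    uint64_t reason = 0x01000000;
    return plpar_hcall_vasi_signal(stream_id, VASI_SIGNAL_CANCEL, reason)
           == H_SUCCESS;
}
```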
R1--1.For the VASI option:
The platform must implement the H_VASI_SIGNAL hcall() following the syntax and semantics of
.

H_VASI_STATE

This hcall() returns the state of the specific VASI operation stream.

Syntax:

Parameters:
- handle -- the VASI Stream ID (The opaque handle of a specific VASI operation stream relative to the caller and callee.)

Semantics:
1. If the “handle” parameter is invalid relative to the calling partition, return H_Parameter.
2. Else enter the value of the VASI state variable (see
)
for the indicated stream into R4 and return H_Success.
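A migrating partition might poll this hcall as sketched below until its stream reaches “Completed”. The state values follow the VASI Migration Session States table later in this chapter; the wrapper returning R4 through a pointer is an assumed veneer, not a defined interface.

```
#include <stdint.h>

#define H_SUCCESS          0
#define VASI_MIG_ABORTED   2   /* per VASI Migration Session States */
#define VASI_MIG_COMPLETED 6

extern long plpar_hcall_vasi_state(uint64_t handle,
                                   uint64_t *state /* R4 */);  /* hypothetical */

int wait_for_migration(uint64_t stream_id)
{
    uint64_t state = 0;
    for (;;) {
        if (plpar_hcall_vasi_state(stream_id, &state) != H_SUCCESS)
            return -1;                  /* invalid handle */
        if (state == VASI_MIG_COMPLETED)
            return 0;
        if (state == VASI_MIG_ABORTED)
            return -2;
        /* sleep briefly between polls (left to the OS) */
    }
}
```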
R1--1.For the VASI option:
The platform must implement the H_VASI_STATE hcall() following the syntax and semantics of
.

VASI Operation Stream Specifications

This section defines the usage of VASI to accomplish specific administrative services. Each section specifies the valid
VASI state codes, state transitions, and reason codes, the action of the VASI virtual IOA in each state and the expected
behavior of the VASI device driver in order to achieve the operational goal.
VASI Stream Services for Partition Migration

Name | Value | Description
Unused | 0 |
Source Mover for Partition Migration | 1 | VASI device will be used to extract partition state from the source platform to the target platform, using VASI Operations (Write Sequential) to extract partition state and VASI Download commands to give the source platform paging requests. See
.
Target Mover for Partition Migration | 2 | VASI device will be used to insert the migrating partition’s state into the target platform. VASI Download requests will be used to give platform firmware partition state, and VASI Operations (Write Sequential) will be used by platform firmware to give paging requests to the Mover partition to deliver to the source platform. See
.
Pager for the CMO option | 3 | VASI device will be used to handle CMO paging requests. See
.
Partition Migration
defines the VASI Services for Partition Migration for use in the VASI Open CRQ request, as defined in
.

Requirements:

R1--1.For the Partition Migration Option:
If any partition code uses the value of the processor PVR to modify its operation, then to ensure
valid operation after the resume from suspension, and prior to executing any such
modified operation code, partition code must reread the PVR value and be prepared to remodify its operation.
R1--2.For the Partition Migration Option:
In order that LAN communication partners may learn of the
new MAC address that may be associated with a migrated partition, the migrated partition must generate
“gratuitous ARP” messages. It is suggested that these “gratuitous ARP” messages be sent at the rate of once
per second between the time that the migrating partition resumes and the H_VASI_STATE hcall() responds
with “Completed”.R1--3.For the Partition Migration Option:
To maintain the platform capability to perform live firmware
updates, the OS must call the
ibm,activate-firmware RTAS service after waking from a migration suspension.R1--4.For the Partition Migration Option:
The platform must implement the ILLAN option (see
).R1--5.For the Partition Migration Option:
Platform firmware must support both immediate and indirect
data in its VASI Download data buffers.R1--6.For the Partition Migration Option:
If multiple partition migrations are being performed using a
single VASI device, to ensure none of the migrations are starved, partition software must call
H_DONOR_OPERATION with TOP/BOTTOMs associated with each migration (VASI Stream ID).R1--7.For the Partition Migration Option:
If the platform detects any unrecoverable error in processing
a VASI Download command, it must transition the associated VASI stream to the “Aborted” state.R1--8.For the Partition Migration Option:
The VASI stream ID for the specific high level migration
function must be the same value in both the source and target platforms.Programming Note:
If partition software wishes to get an accurate count of the number of bytes to be transferred using
the VASI Progress CRQ message, it should be issued immediately following a VASI open and before any cycles
are donated for that migration via H_DONOR_OPERATION.

Partition Migration Abort Reason Codes
defines the Abort reason code layout for Partition Migration for use with the
H_VASI_SIGNAL hypervisor call and the VASI Signal and State CRQ requests, as defined in
.
Partition Migration Abort Reason Codes

Name | Byte | Description
Aborting Entity | 0 | 1=Orchestrator, 2=VSP providing VASI partition source migration service, 3=Partition Firmware, 4=Platform Firmware, 5=VSP providing VASI partition target migration service, 6=Migrating partition
Detailed Error | 1-3 | Bytes one through three of the reason code are opaque values, architected by individual aborting entities.
Partition Migration VASI States

This section defines the partition migration VASI states as used in the VASI State request CRQ message and as returned from the H_VASI_STATE hcall.

VASI Migration Session States

Invalid (0x0): This state is defined on both the source and destination platform and indicates either that the specified Stream ID is not valid (or is no longer valid) or that the invoking partition is not authorized to utilize the Stream ID.

Enabled (1): This state is defined on both the source and destination platform and indicates that the partition has been enabled for migration but has not progressed beyond this initial state. The transition to this state is triggered by events outside of the scope of PAPR. The partition on the source server transitions to this state first, and then the partition on the destination server.

Aborted (2): This state is defined on both the source and the destination platform and indicates that the abort processing has completed. If the migration has been aborted, this is the final state of the migration, and platform firmware ensures that all interested partitions see this state at least once. Platform firmware continues to return this state until events outside of the scope of PAPR terminate the operation and all interested partitions have seen this final state. In this state VASI download request information is flushed, returning success status. VASI signal requests other than “abort” are treated as an invalid state transition. The transition to this state occurs on the two servers independently, and thus it is a race condition which server transitions to this state first.

Suspending (3): This state is defined only on the source platform and indicates that the partition is in the process of suspending itself. When the migrating partition sees this state, it enters its suspension sequence, which concludes with the call to ibm,suspend-me. The transition to this state occurs when the source VSP directs platform firmware to suspend the partition via a VASI Signal request (Signal Code = Suspend) on the VASI device.

Suspended (4): This state is defined only on the source platform and indicates that the partition has suspended itself via the ibm,suspend-me RTAS call. This is the point in the sequence where platform firmware will reject attempts by the user to abort the migration.

Resumed (5): This state is defined on both the source and destination platform and indicates that the partition has resumed execution on the destination platform. The transition to this state occurs on the destination platform first, when it receives the dirty page bitmap from the source platform firmware. It is at this point that the virtual processors of the migrating partition are dispatched on the destination platform.

Completed (6): This state is defined on both the source and destination platform and indicates that the migration has completed and all partition state has been transferred to the destination platform. This is the final state of the migration, and platform firmware ensures that all interested partitions see this state at least once. Platform firmware continues to return this state until events outside of the scope of PAPR terminate the operation and all interested partitions have seen this final state. The transition to this state occurs on the source platform first, as soon as all of the residual state of the migrating partition has been successfully transferred to the destination platform. The transition to this state on the destination platform occurs when all of the partition state has been received from the source platform. For an inactive migration, the partition is transferred as a single unit, so the partition in the destination platform just moves from Enabled to Completed on a successful inactive migration.
Partition Hibernation

R1--1.For the Partition Hibernation Option:
The platform must ensure that all hibernating partition dynamic
reconfiguration operations are complete prior to signaling suspension of the partition.R1--2.For the Partition Hibernation Option:
If any partition code uses the value of the processor PVR
to modify its operation, after the resume from suspension, but prior to executing any such modified operation
code, it must reread the PVR value and be prepared to remodify its operation.R1--3.For the Partition Hibernation Option:
In order that LAN communication partners may learn of
the new MAC address that may be associated with a hibernated partition the hibernated partition must generate
“gratuitous ARP” messages. It is suggested that these “gratuitous ARP” messages be sent at the rate of
once per second between the time that the hibernated partition resumes and the H_VASI_STATE hcall() responds
with “Completed”.R1--4.For the Partition Hibernation Option:
To maintain the platform capability to perform live firmware
updates, the OS must call the ibm,activate-firmware
RTAS service after waking from a hibernation suspension.R1--5.For the Partition Hibernation Option:
The platform must implement the ILLAN option (see
).R1--6.For the Partition Hibernation Option:
The VASI stream ID for the specific high level migration
function must be the same value for both the suspend and wake phases.

Cooperative Memory Overcommitment

The CMO option defines the stream service value 3 for “Pager”. The Pager VASI device is used to page out partition state to the VASI Server Partition (VSP) using VASI Operation requests (Write Indexed) and to page in partition state from the VSP using VASI Operation requests (Read Indexed). The Pager VASI service utilizes a subset of the VASI Operation request architecture. Specifically, in the VASI Operation Request Specifier structure, the “File offset of the start for indexed operations” field (Bytes 9:15) is not used (value = 0x00). The scatter/gather list is a series of 1-N sub-operation specifications: each starts by positioning the file cursor using a type 0xC0 control element to establish the file block location, optionally followed by a type 0x40 control element to position the file cursor to a byte within the established file block; this is followed by one and only one type 0x80 control element per sub-operation to transfer the sub-operation data. The VASI Operation Request Specifier structure terminates, as always, with a type 0x00 control element with a segment length field of 0x000000.
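A brief informative sketch of building one Pager sub-operation from these control elements follows. The element layout assumed here -- a one-byte type, a 3-byte segment length, and an 8-byte operand -- is an illustrative assumption for the example, not the normative wire format; only the element types (0xC0, 0x40, 0x80, 0x00) come from the text above.

```
#include <stdint.h>
#include <string.h>

static uint8_t *put_elem(uint8_t *p, uint8_t type, uint32_t len24,
                         uint64_t operand)
{
    p[0] = type;
    p[1] = (uint8_t)(len24 >> 16);   /* 3-byte segment length field */
    p[2] = (uint8_t)(len24 >> 8);
    p[3] = (uint8_t)len24;
    memcpy(&p[4], &operand, 8);      /* 8-byte operand (layout assumed) */
    return p + 12;
}

/* Emits: 0xC0 (file block), optional 0x40 (byte offset), exactly one
 * 0x80 (data transfer), then the 0x00 terminator for a one-sub-op list. */
uint8_t *build_pager_subop(uint8_t *p, uint64_t file_block,
                           uint32_t byte_offset, uint64_t data_ioba,
                           uint32_t data_len)
{
    p = put_elem(p, 0xC0, 0, file_block);       /* establish file block */
    if (byte_offset != 0)
        p = put_elem(p, 0x40, 0, byte_offset);  /* position within block */
    p = put_elem(p, 0x80, data_len, data_ioba); /* transfer the data     */
    return put_elem(p, 0x00, 0x000000, 0);      /* list terminator       */
}
```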
.
The Pager Service VASI States, as used in the state request CRQ message and as returned from the
H_VASI_STATE hcall are as defined in
.
CMO VASI Abort Reason Codes

Name | Byte | Description
Entity (who is issuing the state change or signal) | 0 | 1 = VASI, 2 = I/O Provider, 3 = Platform Firmware
Detailed Reason | 1-3 | Bytes one through three of the reason code are opaque values, architected by individual entities.

CMO VASI States

Invalid (0x0): This state indicates that the specified Stream ID is not valid (or is no longer valid) or the invoking partition is not authorized to utilize the Stream ID.

Disabled (1): This state indicates that the specified Stream ID is valid, but the stream has not yet been opened by the VSP providing the VASI paging service. The transition to this state is triggered by events outside of the scope of PAPR.

Suspended (2): This state indicates that the specified Stream ID is valid, but the client partition has not yet been powered on.

Enabled (3): This state indicates that the stream has been opened by the VSP providing the VASI paging service and the client partition is powered on.

Stopped (4): This state indicates that the specified Stream ID is valid; however, platform firmware is no longer using the stream to perform paging. The transition to this state is triggered by events outside of the scope of PAPR.

Completed (5): This state indicates that paging has been terminated for this stream by a request to halt paging for this Stream ID.
Virtual Fibre Channel (VFC) using NPIV

N_Port ID Virtualization (NPIV) is part of the Fibre Channel (FC)
standards. NPIV allows multiple World Wide Port Names (WWPNs) to be mapped
to a single physical port of a FC adapter. This section defines a Virtual
Fibre Channel (VFC) interface to a server partition interfacing to a
physical NPIV adapter that allows multiple partitions to share a physical
port using different WWPNs. The implementation support is provided by code
running in a server partition that uses the mechanisms of the Reliable
Command/Response Transport and Logical Remote DMA of the Synchronous VIO
Infrastructure to service I/O requests for code running in a client
partition. The client partition appears to enjoy the services of its own FC
adapter (see
) with a WWPN visible to the FC
fabric. The terms server and client partitions refer to platform partitions
that are respectively servers and clients of requests, usually I/O
operations, using the physical I/O adapters (IOAs) that are assigned to the
server partition. This allows a platform to have more client partitions
than it may have physical I/O adapters because the client partitions share
I/O adapters via the server partition.The VFC model makes use of Remote DMA which is built upon the
architecture specified in the following sections:

VFC and NPIV General

This section contains an informative outline of the architectural intent of the use of VFC and NPIV, and it assumes the user is familiar with
concerning VSCSI architecture
and with the FC standards. Other implementations of the server and
client partition code, consistent with this architecture, are possible
and may be preferable.The client partition provides the virtual equivalent of a single
port FC adapter via each VFC client IOA. The platform, through the
partition definition, provides means for defining the set of virtual
IOA’s owned by each partition and their respective location codes.
The platform also provides, through partition definition, instructions to
connect each client partition’s VFC client IOA to a specific server
partition’s VFC server IOA. The mechanism for specifying this
partition definition is beyond the scope of this architecture. The human
readable handle associated with the partition definition of virtual IOAs
and their associated interconnection and resource configuration is the
virtual location code. The OF unit address (Unit ID) remains the invariant handle upon which the
OS builds its “physical to logical” configuration. The
platform also provides a method to assign unique WWPNs for each VFC
client adapter. The port names are used by a SAN administrator to grant
access to storage to a client partition. The mechanism for allocating
port names is beyond the scope of this architecture.The client partition’s device tree contains one or more nodes
notifying the partition that it has been assigned one or more virtual
adapters. The node’s
“type” and
“compatible” properties notify the
partition that the virtual adapter is a VFC adapter. The
unit address of the node is used by the client
partition to map the virtual device(s) to the OS’s corresponding
logical representations. The
“ibm,my-dma-window” property communicates
the size of the RTCE table window panes that the hypervisor has
allocated. The node also specifies the interrupt source number that has
been assigned to the Reliable Command/Response Transport connection and
the RTCE range that the client partition device driver may use to map its
memory for access by the server partition via Logical Remote DMA. The
client partition also reads its WWPNs from the device tree. Two WWPNs
are presented to the client in the properties
“ibm,port-wwn-1”, and
“ibm,port-wwn-2”, and the server tells
the client, through a CRQ protocol exchange, which one of the two to use.
The client partition uses the four hcall()s associated with the Reliable
Command/Response Transport facility to register and deregister its CRQ,
manage notification of responses, and send command requests to the server
partition.

The server partition’s device tree contains one or more
node(s) notifying the partition that it is requested to supply VFC
services for one or more client partitions. The unit address (
Unit ID) of the node is used by the server partition
to map to the local logical devices that are represented by this VFC
device. The node also specifies the interrupt source number that has been
assigned to the Reliable Command/Response Transport connection and the
RTCE range that the server partition device driver may use for its copy
Logical Remote DMA. The server partition uses the four hcall()s
associated with the Reliable Command/Response Transport facility to
register and deregister its Command request queue, manage notification of
new requests, and send responses back to the client partition. In
addition, the server partition uses the hcall()s of the Logical Remote
DMA facility to manage the movement of commands and data associated with
the client requests.

The client partition, upon noting the device tree entry for the
virtual adapter, loads the device driver associated with the value of the
“compatible” property. The device driver,
when configured and opened, allocates memory for its CRQ (an array, large
enough for all possible responses, of 16 byte elements), pins the queue
and maps it into the I/O space of the RTCE window specified in the
“ibm,my-dma-window” property using the
standard kernel mapping services that subsequently use the H_PUT_TCE
hcall(). The queue is then registered using the H_REG_CRQ hcall(). Next,
I/O request control blocks (within which the I/O request commands are
built) are allocated, pinned, and mapped into I/O address space. Finally,
the device driver registers to receive control when the interrupt source
specified in the virtual IOA’s device tree node signals.
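The setup sequence just described can be sketched as follows. The hcall wrappers, virt_to_phys() helper, and TCE permission bits are stand-ins for the OS's own veneer and are assumptions for the example; pinning and error unwinding are abbreviated.

```
#include <stdint.h>
#include <stdlib.h>

#define CRQ_ELEM_SIZE 16
#define H_SUCCESS     0

extern long h_put_tce(uint32_t liobn, uint64_t ioba, uint64_t tce); /* stand-in */
extern long h_reg_crq(uint64_t unit_addr, uint64_t ioba, uint64_t len);
extern uint64_t virt_to_phys(void *va);

long setup_crq(uint32_t liobn /* first window pane */, uint64_t unit_addr,
               unsigned nelems)
{
    uint64_t len = (uint64_t)nelems * CRQ_ELEM_SIZE;
    void *q = aligned_alloc(4096, len);  /* would be pinned in a real driver */
    if (q == NULL)
        return -1;

    /* Map each 4 KB page of the queue into the first DMA window pane;
     * the low-order 0x3 read/write bits are illustrative. */
    for (uint64_t off = 0; off < len; off += 4096)
        if (h_put_tce(liobn, off, virt_to_phys((char *)q + off) | 0x3)
            != H_SUCCESS)
            return -1;

    /* Register the queue; ioba 0 assumes the queue starts the window. */
    return h_reg_crq(unit_addr, 0, len);
}
```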
Once the CRQ is set up, the device driver queues an Initialization Command/Response with the second byte of “Initialize” in
order to attempt to tell the hosting side that everything is setup on the
hosted side. The response to this send may be that the send has been
dropped or has successfully been sent. If successful, the sender should
expect back an Initialization Command/Response with a second byte of
“Initialization Complete,” at which time the communication
path can be deemed to be open. If dropped, then the sender waits for the
receipt of an Initialization Command/Response with a second byte of
“Initialize,” at which time an “Initialization
Complete” message is sent, and if that message is sent
successfully, then the communication path can be deemed to be
open.

When the VFC Adapter device driver receives an I/O request from one
of the FC device head drivers, it executes the following sequence. First
an I/O request control block is allocated. Then it builds the FC
Information Unit (FC IU) request within the control block, adds a
correlator field (to be returned in the subsequent response), I/O maps
any target memory buffers and places their DMA descriptors into the I/O
request control block. With the request constructed in the I/O request
control block, the driver constructs a DMA descriptor (Starting Offset,
and length) representing the FC IU within the I/O request control block.
It also constructs a DMA descriptor for the FC Response Unit. Lastly, the
driver passes the I/O request’s DMA descriptor to the server
partition using the H_SEND_CRQ hcall(). Provided that the H_SEND_CRQ
hcall() succeeds, the VFC Adapter device driver returns, waiting for the
response interrupt indicating that a response has been posted by the
server partition to the device driver’s response queue. The
response queue entry contains the summary status and request correlator.
From the request correlator, the device driver accesses the I/O request
control block, the summary status, and the FC Response Unit and
determines how to complete the processing of the I/O request.
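The send step can be sketched as below: the 16-byte CRQ element carries the DMA descriptor (I/O address and length) of the FC IU built in the client's I/O request control block. The field offsets follow the example element layout given later in this section, and the hcall wrapper is a stand-in; the big-endian byte packing is an assumption for the example.

```
#include <stdint.h>
#include <string.h>

extern long h_send_crq(uint64_t unit_addr, uint64_t word0, uint64_t word1);

long send_vfc_request(uint64_t unit_addr, uint64_t iu_ioba, uint32_t iu_len)
{
    uint8_t elem[16] = { 0 };

    elem[0] = 0x80;                  /* valid element header        */
    elem[1] = 0x01;                  /* format byte: VFC Requests   */
    /* bytes 4-7: length of the request block to be transferred     */
    elem[4] = (uint8_t)(iu_len >> 24); elem[5] = (uint8_t)(iu_len >> 16);
    elem[6] = (uint8_t)(iu_len >> 8);  elem[7] = (uint8_t)iu_len;
    /* bytes 8-15: I/O address of the beginning of the request      */
    for (int i = 0; i < 8; i++)
        elem[8 + i] = (uint8_t)(iu_ioba >> (56 - 8 * i));

    uint64_t w0, w1;                 /* H_SEND_CRQ takes the element */
    memcpy(&w0, &elem[0], 8);        /* as two 8-byte values         */
    memcpy(&w1, &elem[8], 8);
    return h_send_crq(unit_addr, w0, w1);
}
```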
Notice that the client partition only uses the Reliable Command/Response Transport primitives; it does not use the Logical Remote
DMA primitives. Since the server partition’s RTCE tables are not
authorized for access by the client partition, any attempt by the client
partition to modify server partition memory would be prevented by the
hypervisor. RTCE table access is granted on a connection by connection
basis (client/server virtual device pair). If a client partition happens
to be serving some other logical device, then the partition is entitled
to use Logical Remote DMA for the virtual devices that it is serving.

The server partition, upon noting the device tree entry for the
virtual adapter, loads the device driver associated with the value of the
“compatible” property. The device driver,
when configured and opened, allocates memory for its request queue (an
array, large enough for all possible outstanding requests, of 16 byte
elements). The driver then pins the queue and maps it into I/O space, via
the kernel’s I/O mapping services that invoke the H_PUT_TCE
hcall(), using the first window pane specified in the
“ibm,my-dma-window” property. The queue
is then registered using the H_REG_CRQ hcall(). Next, I/O request control
blocks (within which the I/O request commands are built) are allocated,
pinned, and I/O mapped. Finally the device driver registers to receive
control when the interrupt source specified in the virtual IOA’s
device tree node signals.

Once the CRQ is set up, the device driver queues an Initialization
Command/Response with the second byte of “Initialize” in
order to attempt to tell the hosted side that everything is setup on the
hosting side. The response to this send may be that the send has been
dropped or has successfully been sent. If successful, the sender should
expect back an Initialization Command/Response with a second byte of
“Initialization Complete,” at which time the communication
path can be deemed to be open. If dropped, then the sender waits for the
receipt of an Initialization Command/Response with a second byte of
“Initialize,” at which time an “Initialization
Complete” message is sent, and if that message is sent
successfully, then the communication path can be deemed to be
open.

When the server partition’s device driver receives an I/O
request from its corresponding client partition’s VFC adapter
drivers, it is notified via the interrupt registered for above. The
server partition’s device driver selects an I/O request control
block for the requested operation. It then uses the DMA descriptor from
the request queue element to transfer the FC IU request from the client
partition’s I/O request control block to its own (allocated above),
using the H_COPY_RDMA hcall() through the second window pane specified in
the
“ibm,my-dma-window” property. The server
partition’s device driver then uses kernel services, that are
extended, to register the I/O request’s DMA descriptors into
extended capacity cross memory descriptors (ones capable of recording the
DMA descriptors). These cross memory descriptors are later mapped by the
server partition’s physical device drivers into the physical I/O
DMA address space of the physical I/O adapters using the kernel services,
that have been similarly extended to call the H_PUT_RTCE hcall(), based
upon the value of the LIOBN field referenced by the cross memory
descriptor. At this point, the server partition’s VFC device driver
delivers what appears to be a FC IU request to be routed through the
server partition’s adapter driver. When the request completes, the
server partition’s VFC device driver is called through a registered
entry point and it packages the summary status along with the request
correlator into a response message that it sends to the client partition
using the H_SEND_CRQ hcall(), then recycles the resources recorded in the
I/O request control block, and the block itself.
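The server-side copy step can be sketched as follows: the client's FC IU is pulled into a local control block with H_COPY_RDMA, with the second window pane's LIOBN as the source and the server's own first-pane LIOBN as the destination. The wrapper and window-pane bookkeeping are stand-ins assumed for the example.

```
#include <stdint.h>

extern long h_copy_rdma(uint64_t len,
                        uint32_t src_liobn, uint64_t src_ioba,
                        uint32_t dst_liobn, uint64_t dst_ioba);

struct dma_desc { uint64_t ioba; uint32_t len; };  /* from the CRQ element */

long fetch_client_iu(const struct dma_desc *req,
                     uint32_t client_pane_liobn,   /* 2nd window pane   */
                     uint32_t server_liobn,        /* 1st window pane   */
                     uint64_t local_ioba)          /* mapped ctl block  */
{
    /* Copy the FC IU from the client's mapped memory into the server's
     * own I/O request control block. */
    return h_copy_rdma(req->len, client_pane_liobn, req->ioba,
                       server_liobn, local_ioba);
}
```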
The LIOBN value in the second window pane of the server partition’s
“ibm,my-dma-window” property is intended
to be an indirect reference to the RTCE table of the client partition.
If, for some reason, the physical location of the client
partition’s RTCE table changes or it becomes invalid, this level of
indirection allows the hypervisor to determine the current target without
changing the LIOBN number as seen by the server partition. The H_PUT_TCE
and H_PUT_RTCE hcall()s do not map server partition memory into the
second window pane; the second window pane is only available for use by
server partition via the Logical RDMA services to reference memory mapped
into it by the client partition’s IOA.

This architecture does not specify the payload format of the
requests or responses. However, the architectural intent is supplied in
the following tables for reference.
General Form of Reliable CRQ Element

Byte Offset | Field Name | Subfield Name | Description
0 | Header | | Contains Element Valid Bit plus Event Type Encodings (see
).
1 | Payload | Format/Transport Event Code | For Valid Command Response Entries, see
. For Transport Event Codes see
.
2-15 | Payload | | Format Dependent.

Example Reliable CRQ Entry Format Byte Definitions for VFC

Format Byte Value | Definition
0x0 | Unused
0x01 | VFC Requests
0x02 - 0x03 | Reserved
0x04 | Management Datagram
0x05 - 0xFE | Reserved
0xFF | Reserved for Expansion

Example VFC Command Queue Element

Byte Offset | Field Value | Description
0 | 0x80 | Valid Header
1 | 0x01 | VFC Requests
1 | 0x04 | Management Datagram
2-3 | NA | Reserved
4-7 | | Length of the request block to be transferred
8-15 | | I/O address of beginning of request
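For illustration only, a C view of the 16-byte example element could look like the sketch below. Interpretation beyond byte 1 is format dependent, and the multi-byte fields are big-endian on the wire; this struct mirrors the example VFC command element only.

```
#include <stdint.h>

struct crq_element {
    uint8_t  header;      /* 0x80 = valid element                        */
    uint8_t  format;      /* 0x01 = VFC Requests, 0x04 = Mgmt Datagram   */
    uint8_t  reserved[2]; /* bytes 2-3                                   */
    uint32_t req_len;     /* bytes 4-7: length of request block (BE)     */
    uint64_t req_ioba;    /* bytes 8-15: I/O address of request (BE)     */
};

/* The element is exactly one CRQ slot wide. */
_Static_assert(sizeof(struct crq_element) == 16,
               "CRQ element must be 16 bytes");
```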
VFC and NPIV Requirements

This normative section provides the general requirements for the
support of VFC.R1--1.For the VFC option: The platform must implement the
Reliable Command/Response Transport option as defined in
.R1--2.For the VFC option: The platform must implement the
Logical Remote DMA option as defined in
.R1--3.For the VFC option: The platform must allocate a WWPN
pair for each VFC client and must present the WWPNs to the VFC clients in
their OF device tree
.In addition to the firmware primitives, and the structures they
define, the partition’s OS needs to know specific information
regarding the configuration of the virtual IOA’s that it has been
assigned so that it can load and configure the correct device driver
code. This information is provided by the OF device tree node associated
with the virtual IOA (see
and
).

Client Partition VFC Device Tree Node

Client partition VFC device tree nodes have associated packages
such as disk-label, deblocker, iso-13346-files and iso-9660-files as well
as children nodes such as block and byte as appropriate to the specific
virtual IOA configuration as would the node for a physical FC IOA.R1--1.For the VFC option: The platform’s OF device
tree for client partitions must include as a child of the
/vdevice node, a node of name
“vfc-client” as the parent of a sub-tree
representing the virtual IOAs assigned to the partition.R1--2.For the VFC option: The platform’s
vfc-client OF node must contain properties as defined
in
(other standard I/O adapter
properties are permissible as appropriate).
Properties of the VFC Node in the Client
Partition

Property Name | Required? | Definition

“name” | Y | Standard property name per
,
specifying the virtual device name, the
value shall be
“vfc-client”.“device_type”YStandard property name per
,
specifying the virtual device type, the
value shall be
“fcp”.“model”NAProperty not present.“compatible”YStandard property name per
,
specifying the programming models that are
compatible with this virtual IOA, the value shall include
“IBM,vfc-client”.“used-by-rtas”See Definition ColumnPresent if appropriate.“ibm,loc-code”YProperty name specifying the unique and persistent
location code associated with this virtual IOA presented as an
encoded array as with
encode-string. The value shall be of the
form specified in
.“reg”YStandard property name per
,
specifying the register addresses, used as
the unit address (unit ID), associated with this virtual IOA
presented as an encoded array as with
encode-phys of length
“#address-cells”. The value may be any unique value (the virtual
“reg” property is used only for the unit
address; no actual locations are used; therefore, the size field has
zero cells (does not exist) as determined by the value of the
“#size-cells” property).“ibm,my-dma-window”YProperty name specifying the DMA window associated with
this virtual IOA presented as an encoded array of three values
(LIOBN, phys, size) encoded as with
encode-int,
encode-phys, and
encode-int.“interrupts”YStandard property name specifying the interrupt source
number and sense code associated with this virtual IOA
presented as an encoded array of two cells encoded as with
encode-int with the first cell containing
the interrupt source number, and the second cell containing the
sense code 0 indicating positive edge triggered. The interrupt
source number being the value returned by the H_XIRR or H_IPOLL
hcall().“ibm,my-drc-index”For DRPresent if the platform implements DR for this
node.“ibm,#dma-size-cells”See Definition ColumnProperty name, to define the package’s dma address
size format. The property value specifies the number of cells
that are used to encode the size field of dma-window
properties. This property is present when the dma address size
format cannot be derived using the method described in the
definition for the
“ibm,#dma-size-cells” property
in
.“ibm,#dma-address-cells”See Definition ColumnProperty name, to define the package’s dma address
format. The property value specifies the number of cells that
are used to encode the physical address field of dma-window
properties. This property is present when the dma address
format cannot be derived using the method described in the
definition for the
“ibm,#dma-address-cells” property in
.“ibm,port-wwn-1”See Definition ColumnProperty that represents one of two WWPNs assigned to
this VFC client node. This property is a
prop-encoded-array each encoded with
encode-int. The array consists of the high
order 32 bits and low order 32 bits of the WWPN such that (32
bits high | 32 bits low) is the 64 bit WWPN. The WWPN that the
client is to use (
“ibm,port-wwn-1” or
“ibm,port-wwn-2”) is
communicated to the client by the server as part of the
client-server communications protocol.“ibm,port-wwn-2”See Definition ColumnProperty that represents one of two WWPNs assigned to
this VFC client node This property is a
prop-encoded-array each encoded with
encode-int. The array consists of the high
order 32 bits and low order 32 bits of the WWPN such that (32
bits high | 32 bits low) is the 64 bit WWPN. The WWPN that the
client is to use (
“ibm,port-wwn-1” or
“ibm,port-wwn-2”) is
communicated to the client by the server as part of the
client-server communications protocol.
R1--3.For the VFC option: The platform’s
vfc-client node must have as children the appropriate
block (disk) and byte (tape) nodes.Server Partition VFC Device Tree NodeServer partition VFC IOA nodes have no children nodes.R1--1.For the VFC option: The platform’s OF device
tree for server partitions must include as a child of the
/vdevice node, a node of name
“vfc-server” as the parent of a sub-tree
representing the virtual IOAs assigned to the partition.R1--2.For the VFC option: The platform’s
vfc-server node must contain properties as defined in
(other standard I/O adapter
properties are permissible as appropriate).
Properties of the VFC Node in the Server
Partition

Property Name | Required? | Definition

“name” | Y | Standard property name per
,
specifying the virtual device name, the
value shall be
“vfc-server”.“device_type”YStandard property name per
,
specifying the virtual device type, the
value shall be
“fcp”.“model”NAProperty not present.“compatible”YStandard property name per
,
specifying the programming models that are
compatible with this virtual IOA, the value shall include
“IBM,vfc-server”.“used-by-rtas”See Definition ColumnPresent if appropriate.“ibm,loc-code”YProperty name specifying the unique and persistent
location code associated with this virtual IOA presented as an
encoded array as with
encode-string. The value shall be of the
form
.“reg”YStandard property name per
,
specifying the register addresses, used as
the unit address (unit ID), associated with this virtual IOA
presented as an encoded array as with
encode-phys of length
“#address-cells”. The value may be any unique value (the virtual
“reg” property is used only for the unit
address; no actual locations are used; therefore, the size field has
zero cells (does not exist) as determined by the value of the
“#size-cells” property).“ibm,my-dma-window”YProperty name specifying the DMA window associated with
this virtual IOA presented as an encoded array of two sets (two
window panes) of three values (LIOBN, phys, size) encoded as
with
encode-int,
encode-phys, and
encode-int. Of these two triples, the
first describes the window pane used to map server partition
memory, the second is the window pane through which the client
partition maps its memory that it makes available to the server
partition. (Note the mapping between the LIOBN in the second
window pane of a server virtual IOA’s
“ibm,my-dma-window” property
and the corresponding client IOA’s RTCE table is made
when the CRQ successfully completes registration. See
for more information
on window panes.)“interrupts”YStandard property name specifying the interrupt source
number and sense code associated with this virtual IOA
presented as an encoded array of two cells encoded as with
encode-int with the first cell containing
the interrupt source number, and the second cell containing the
sense code 0 indicating positive edge triggered. The interrupt
source number being the value returned by the H_XIRR or H_IPOLL
hcall()“ibm,my-drc-index”For DRPresent if the platform implements DR for this
node.“ibm,vserver”YProperty name specifying that this is a virtual server
node.“ibm,#dma-size-cells”See Definition ColumnProperty name, to define the package’s dma address
size format. The property value specifies the number of cells
that are used to encode the size field of dma-window
properties. This property is present when the dma address size
format cannot be derived using the method described in the
definition for the
“ibm,#dma-size-cells” property
in
.“ibm,#dma-address-cells”See Definition ColumnProperty name, to define the package’s dma address
format. The property value specifies the number of cells that
are used to encode the physical address field of dma-window
properties. This property is present when the dma address
format cannot be derived using the method described in the
definition for the
“ibm,#dma-address-cells” property in
.
Virtual Network Interface Controller (VNIC)

This section defines a Virtual Network Interface Controller (VNIC)
interface to a server partition interfacing to a physical Network Interface
Controller (NIC) adapter that allows multiple partitions to share a
physical port. The implementation support is provided by code running in a
server partition that uses the mechanisms of the Synchronous VIO
Infrastructure (or equivalent thereof as seen by the client) to service I/O
requests for code running in a client partition. The client partition appears to enjoy the services of its own
NIC adapter. The terms server and client partitions refer to platform
partitions that are respectively servers and clients of requests, usually
I/O operations, using the physical NIC that is assigned to the server
partition. This allows a platform to have more client partitions than it
may have physical NICs because the client partitions share I/O adapters via
the server partition.

The VNIC model makes use of Remote DMA, which is built upon the
architecture specified in the following sections:

The use of Remote DMA implies that the physical NIC must be able to do some of its own virtualization; for example, an Ethernet adapter must be able to route receive requests, via DMA, to the appropriate client partition based on the addressing in the incoming packet.

VNIC General

This section contains an informative outline of the architectural
intent of the use of VNIC. Other implementations of the server and client
partition code, consistent with this architecture, are possible and may
be preferable.The client partition provides the virtual equivalent of a single
port NIC adapter via each VNIC client IOA. The platform, through the
partition definition, provides means for defining the set of virtual
IOA’s owned by each partition and their respective location codes.
The platform also provides, through partition definition, instructions to
connect each client partition’s VNIC client IOA to a specific
server partition’s VNIC server IOA. The mechanism for specifying
this partition definition is beyond the scope of this architecture. The
human readable handle associated with the partition definition of virtual
IOAs and their associated interconnection and resource configuration is
the virtual location code. The OF unit address (unit ID) remains the
invariant handle upon which the OS builds its “physical to
logical” configuration. The platform also provides a method to
assign unique MAC addresses for each VNIC client adapter. The mechanism
for allocating port names is beyond the scope of this
architecture.

The client partition’s device tree contains one or more nodes
notifying the partition that it has been assigned one or more virtual
adapters. The node’s
“device_type” and
“compatible” properties notify the
partition that the virtual adapter is a VNIC. The unit address of the
node is used by the client partition to map the virtual device(s) to the
OS’s corresponding logical representations. The
“ibm,my-dma-window” property communicates
the size of the RTCE table window panes that the hypervisor has
allocated. The node also specifies the interrupt source number that has
been assigned to the Reliable Command/Response Transport connection and
the RTCE range that the client partition device driver may use to map its
memory for access by the server partition via Logical Remote DMA. The
client partition uses the hcall()s associated with the Reliable
Command/Response Transport facility to register and deregister its CRQ,
manage notification of responses, and send command requests to the server
partition. The client partition uses the hcall()s associated with the
Subordinate CRQ Transport facility to register and deregister any
sub-CRQs necessary for the operations of the VNIC.
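The following informative C sketch shows how a client OS might recognize a VNIC node in its device tree. The device-tree accessor is a hypothetical OS service; only the property name and the “ibm,vnic” value come from this architecture:

    #include <stdbool.h>
    #include <string.h>

    struct dt_node; /* opaque OS representation of a device tree node */
    /* Hypothetical accessor: returns the raw property value and its length. */
    extern const char *dt_get_property(struct dt_node *node,
                                       const char *name, int *len);

    /* "compatible" holds a list of NUL-terminated strings; treat the node as
     * a VNIC client IOA if the list includes "ibm,vnic". */
    static bool node_is_vnic(struct dt_node *node)
    {
        int len = 0;
        const char *s = dt_get_property(node, "compatible", &len);
        const char *end = s ? s + len : NULL;

        while (s && s < end) {
            if (strcmp(s, "ibm,vnic") == 0)
                return true;
            s += strlen(s) + 1; /* advance past this string and its NUL */
        }
        return false;
    }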
The client partition, upon noting the device tree entry for the
virtual adapter, loads the device driver associated with the value of the
“compatible” property. The device driver,
when configured and opened, allocates memory for its CRQ (an array, large
enough for all possible responses, of 16 byte elements), pins the queue
and maps it into the I/O space of the RTCE window specified in the
“ibm,my-dma-window” property using the
standard kernel mapping services that subsequently use the H_PUT_TCE
hcall(). The queue is then registered using the H_REG_CRQ hcall(). Next,
I/O request control blocks (within which the I/O requests commands are
built) are allocated, pinned, and mapped into I/O address space. Finally,
the device driver registers to receive control when the interrupt source
specified in the virtual IOA’s device tree node signals.
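A minimal sketch of this allocate, pin, map, and register sequence follows. The hcall() wrappers and the pinned-allocation helper are assumed, simplified stand-ins (a real driver goes through its kernel’s mapping services), and the queue depth, I/O addresses, and TCE permission bits are illustrative:

    #include <stdint.h>
    #include <stddef.h>

    #define CRQ_ENTRIES  128  /* illustrative: room for all possible responses */
    #define CRQ_ENTRY_SZ 16   /* CRQ elements are 16 bytes */
    #define PAGE_SZ      4096

    /* Assumed wrappers for the firmware primitives. */
    extern long h_put_tce(uint64_t liobn, uint64_t ioba, uint64_t tce);
    extern long h_reg_crq(uint64_t unit_addr, uint64_t ioba, uint64_t len);
    /* Hypothetical helper: allocate pinned memory, return its physical address. */
    extern void *alloc_pinned(size_t bytes, uint64_t *phys);

    static long crq_setup(uint64_t liobn, uint64_t unit_addr)
    {
        uint64_t phys;
        size_t bytes = CRQ_ENTRIES * CRQ_ENTRY_SZ;
        void *crq = alloc_pinned(bytes, &phys); /* queue kept by the driver */

        if (crq == NULL)
            return -1;

        /* Map the queue pages into the RTCE window named by
         * "ibm,my-dma-window"; I/O address 0 and the read/write
         * permission bits (low-order 0x3) are illustrative. */
        for (uint64_t off = 0; off < bytes; off += PAGE_SZ)
            h_put_tce(liobn, off, (phys + off) | 0x3);

        /* Hand the mapped queue to the hypervisor. */
        return h_reg_crq(unit_addr, 0, bytes);
    }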
Once the CRQ is set up, the device driver in the client queues an
Initialization Command/Response with the second byte of
“Initialize” in order to attempt to tell the hosting side
that everything is set up on the hosted side. The response to this send
may be that the send has been dropped or has successfully been sent. If
successful, the sender should expect back an Initialization
Command/Response with a second byte of “Initialization
Complete,” at which time the communication path can be deemed to be
open. If dropped, then the sender waits for the receipt of an
Initialization Command/Response with a second byte of
“Initialize,” at which time an “Initialization
Complete” message is sent, and if that message is sent
successfully, then the communication path can be deemed to be
open.
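This exchange can be sketched as below. The structure comes from the text above; the specific byte values (0xC0 marking an initialization element, second byte 0x01 for “Initialize” and 0x02 for “Initialization Complete”) and the return codes are assumptions to be checked against the Reliable Command/Response Transport definition:

    #include <stdint.h>
    #include <string.h>

    extern long h_send_crq(uint64_t unit_addr, uint64_t hi, uint64_t lo); /* assumed */

    #define H_SUCCESS 0
    #define H_CLOSED  2 /* illustrative: partner queue not ready, send dropped */

    static long send_init_msg(uint64_t unit_addr, uint8_t second_byte)
    {
        uint8_t msg[16] = { 0xC0, second_byte }; /* initialization element */
        uint64_t hi, lo;
        /* Copy the 16-byte element into the two hcall() arguments. */
        memcpy(&hi, &msg[0], 8);
        memcpy(&lo, &msg[8], 8);
        return h_send_crq(unit_addr, hi, lo);
    }

    /* Returns 0 when the caller should wait for "Initialization Complete"
     * (0xC0, 0x02); returns 1 when the send was dropped and the caller must
     * instead wait for the partner's "Initialize" and answer it with an
     * "Initialization Complete" of its own. */
    static int crq_begin_handshake(uint64_t unit_addr)
    {
        long rc = send_init_msg(unit_addr, 0x01 /* "Initialize" */);
        if (rc == H_SUCCESS)
            return 0;
        if (rc == H_CLOSED)
            return 1;
        return -1; /* unexpected failure */
    }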
Once the CRQ connection is complete between the client and the
server, the client receives from the server the number of sub-CRQs that
can be supported on the client side. The client allocates memory for the
first sub-CRQ (an array, large enough for all possible responses, of 32
byte elements), pins the queue and maps it into the I/O space of the RTCE
window specified in the
“ibm,my-dma-window” property using the
standard kernel mapping services that subsequently use the H_PUT_TCE
hcall(). The queue is then registered using the H_REG_SUB_CRQ hcall().
This process continues until all desired sub-CRQs are registered or until
the H_REG_SUB_CRQ hcall() indicates that the resources allocated to the
client for sub-CRQs for the virtual IOA have already been allocated
(H_Resource returned). Interrupt numbers for the Sub-CRQs that have been
registered are returned by the H_REG_SUB_CRQ hcall() (see
).
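A sketch of this registration loop follows, with assumed wrappers; the H_RESOURCE value and the exact H_REG_SUB_CRQ outputs are illustrative, the architecture guaranteeing only that registered sub-CRQs report their interrupt numbers:

    #include <stdint.h>

    #define SUBCRQ_ENTRIES  256   /* illustrative queue depth */
    #define SUBCRQ_ENTRY_SZ 32    /* sub-CRQ elements are 32 bytes */
    #define H_RESOURCE      (-16) /* illustrative value of H_Resource */

    /* Assumed wrapper: registers a mapped queue, returning a handle and the
     * interrupt source number assigned to the new sub-CRQ. */
    extern long h_reg_sub_crq(uint64_t unit_addr, uint64_t ioba, uint64_t len,
                              uint64_t *handle, uint64_t *irq);
    /* Hypothetical helper: allocate, pin, and H_PUT_TCE-map a queue,
     * returning its I/O address in the RTCE window. */
    extern uint64_t map_queue(uint64_t bytes);

    static int register_sub_crqs(uint64_t unit_addr, int wanted,
                                 uint64_t *handles, uint64_t *irqs)
    {
        int n;
        for (n = 0; n < wanted; n++) {
            uint64_t bytes = SUBCRQ_ENTRIES * SUBCRQ_ENTRY_SZ;
            long rc = h_reg_sub_crq(unit_addr, map_queue(bytes), bytes,
                                    &handles[n], &irqs[n]);
            if (rc == H_RESOURCE)
                break;     /* the client's sub-CRQ allotment is exhausted */
            if (rc != 0)
                return -1; /* unexpected failure */
        }
        return n;          /* number of sub-CRQs actually registered */
    }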
Once all the CRQs and Sub-CRQs are set up, the communications
between the client and server device drivers may commence for purposes of
further setup operations, and then normal I/O requests, error
communications, etc. The protocol for these communications is beyond the
scope of this architecture.

VNIC Requirements

This normative section provides the general requirements for the
support of VNIC.

R1--1. For the VNIC option: The platform must implement the Reliable Command/Response Transport option as defined in .

R1--2. For the VNIC option: The platform must implement the Subordinate CRQ Transport option as defined in .

R1--3. For the VNIC option: The platform must implement the Logical Remote DMA option as defined in .

R1--4. For the VNIC option: The platform’s OF device tree for client partitions must include, as a child of the /vdevice node, at least one node of name “vnic”.

R1--5. For the VNIC option: The platform’s vnic OF node must contain properties as defined in (other standard I/O adapter properties are permissible as appropriate).
Properties of the vnic Node in the OF Device Tree
(Columns: Property Name; Required for vnic?; Required for vnic-server?; Definition)

“name”
  vnic: Y (value = “ibm,vnic”). vnic-server: Y (value = “ibm,vnic-server”).
  Standard property name per , specifying the virtual device name.

“device_type”
  vnic: Y. vnic-server: N.
  Standard property name per , specifying the virtual device type; the value shall be “network”.

“model”
  vnic: NA. vnic-server: NA.
  Property not present.

“compatible”
  vnic: Y (value includes “ibm,vnic”). vnic-server: Y (value includes “ibm,vnic-server”).
  Standard property name per , specifying the programming models that are compatible with this virtual IOA.

“used-by-rtas”
  vnic: present if appropriate. vnic-server: present if appropriate.

“ibm,loc-code”
  vnic: Y. vnic-server: Y.
  Property name specifying the unique and persistent location code associated with this virtual IOA.

“reg”
  vnic: Y. vnic-server: Y.
  Standard property name per , specifying the unit address (unit ID) associated with this virtual IOA, presented as an encoded array, as with encode-phys, of length “#address-cells”; the value shall be 0xwhatever (the virtual “reg” property is used only for the unit address; no actual locations are used, therefore the size field has zero cells (does not exist), as determined by the value of the “#size-cells” property).

“ibm,my-dma-window”
  vnic: Y (value = a single triplet). vnic-server: Y (value = two triplets).
  Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of one or more sets of three values (triplets) (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int. For the vnic-server, the two triplets describe two window panes: the first describes the window pane used to map server partition memory; the second is the window pane through which the client partition maps the memory that it makes available to the server partition. (Note: the mapping between the LIOBN in the second window pane of a server virtual IOA’s “ibm,my-dma-window” property and the corresponding client IOA’s RTCE table is made when the CRQ successfully completes registration. See .)

“interrupts”
  vnic: Y. vnic-server: Y.
  Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().

“ibm,my-drc-index”
  vnic: for DR. vnic-server: for DR.
  Present if the platform implements DR for this node.

“ibm,#dma-size-cells”
  vnic: see definition column. vnic-server: see definition column.
  Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in .

“ibm,#dma-address-cells”
  vnic: see definition column. vnic-server: see definition column.
  Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .

“local-mac-address”
  vnic: Y. vnic-server: NA.
  Standard property name per , specifying the local MAC address. Locally administered MAC addresses are denoted by the low-order two bits of the high-order byte being 0b10.

“mac-address”
  vnic: Y. vnic-server: NA.
  Standard property name per , specifying the initial MAC address (may be changed by a VNIC CRQ command).

“supported-network-types”
  vnic: Y. vnic-server: NA.
  Standard property name as per . Reports the possible types of “network” the device can support.

“chosen-network-type”
  vnic: Y. vnic-server: NA.
  Standard property name as per . Reports the type of “network” this device is supporting.

“max-frame-size”
  vnic: Y. vnic-server: NA.
  Standard property name per , to indicate the maximum packet size.

“address-bits”
  vnic: Y. vnic-server: NA.
  Standard property name per , to indicate the network address length.

“interrupt-ranges”
  vnic: Y. vnic-server: Y.
  Standard property name that defines the interrupt number(s) and range(s) handled by this device. Subordinate CRQs associated with this VNIC use interrupt numbers from these ranges.

“ibm,vf-loc-code”
  vnic: NA. vnic-server: Y.
  Vendor unique property name to define the physical device virtual function upon which the vnic-server runs. The value is that of the “ibm,loc-code” property of the physical device virtual function.

“ibm,vnic-mode”
  vnic: NA. vnic-server: Y.
  Vendor unique property that represents the operational mode in which the vnic-server runs.

“ibm,vnic-client-mac”
  vnic: NA. vnic-server: Y.
  Vendor unique property that represents a vNIC server’s client MAC address.
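As an informative companion to the “ibm,my-dma-window” row above, the following C sketch decodes the property’s triplets. It assumes one cell for the LIOBN and two cells each for phys and size; in general these counts must be derived from “ibm,#dma-address-cells” and “ibm,#dma-size-cells” (or their defaults), as the table states:

    #include <stdint.h>

    struct dma_window { uint32_t liobn; uint64_t phys; uint64_t size; };

    static uint32_t of_cell(const uint8_t *p) /* OF cells are big-endian */
    {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }

    /* Decodes up to max window panes from the raw property value; a client
     * IOA has one pane, a vnic-server has two. Returns the number decoded. */
    static int decode_dma_windows(const uint8_t *prop, int len,
                                  struct dma_window *out, int max)
    {
        const int pane = (1 + 2 + 2) * 4; /* assumed cell counts, in bytes */
        int n = 0;
        while (len >= pane && n < max) {
            out[n].liobn = of_cell(prop);
            out[n].phys  = ((uint64_t)of_cell(prop + 4)  << 32) | of_cell(prop + 8);
            out[n].size  = ((uint64_t)of_cell(prop + 12) << 32) | of_cell(prop + 16);
            prop += pane;
            len  -= pane;
            n++;
        }
        return n;
    }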
Virtual Trusted Platform Module (VTPM)

This section defines the Virtual Trusted Platform Module (VTPM) option.
Firmware can provide the service of a VTPM device to a partition using the
mechanisms of the Reliable Command/Response Transport and Logical Remote DMA
of the Synchronous VIO Infrastructure. A VTPM device primarily allows VTPM
aware system firmware and operating systems to perform a trusted boot.

The VTPM architecture is built upon the architecture specified in the
following sections:

VTPM General

This informative section provides an outline of the architectural intent
of the use of the VTPM.

The platform, through the partition definition, can define multiple VTPMs, and it ensures that only a single VTPM is associated with any one partition.

The client partition may be assigned various virtual adapters, each with
a corresponding node in the device tree. The node's
“device_type” and
“compatible”
properties may be used to distinguish between adapter types and thus locate a
VTPM. The node's unit address is an invariant handle to the
adapter and is given by the
“reg” property.
The “ibm,my-dma-window”
property encodes the adapter's LIOBN and RTCE table size for use with the
CRQ and LRDMA mechanisms. The CRQ's assigned interrupt source number is given
by the node's
“interrupts” property.

The presence of a VTPM device tree node causes the client to load a
device driver associated with the node's
“compatible” property.
The driver first allocates and pins memory for the CRQ (an array of
16 byte elements, large enough to contain all possible responses).
The queue is then RTCE mapped using the H_PUT_TCE hcall() and the values of the
“ibm,my-dma-window” property.
The CRQ is registered via the H_REG_CRQ hcall(),
and the partition may request interrupt notification using the source given by the
“interrupts” property.

The driver then follows the VTPM initialization steps as described in
resulting in the allocation and RTCE mapping of memory buffers with which to send and receive TPM commands and responses of the format described in the Trusted Computing Group TPM Specification, version 1.2 [29]. Once initialized, the client may send commands by writing them directly to the RTCE-mapped memory buffer and issuing the H_SEND_CRQ hcall() with the buffer's I/O address. If successful, the driver awaits an interrupt indicating that a response to the command is available; the response is placed in the same buffer used for command transmission. Note that the client does not use LRDMA facilities itself; firmware is the only entity that copies data.
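A minimal sketch of this command path follows. The 16-byte CRQ element layout shown (a valid-command header followed by the buffer length and I/O address) is an assumption for illustration only, not the normative format, and the hcall() wrapper is assumed:

    #include <stdint.h>
    #include <string.h>

    extern long h_send_crq(uint64_t unit_addr, uint64_t hi, uint64_t lo); /* assumed */

    /* cmd_ioba: RTCE-mapped I/O address of the buffer that already holds the
     * TPM command; the response later appears in this same buffer. */
    static long vtpm_send_command(uint64_t unit_addr,
                                  uint64_t cmd_ioba, uint32_t cmd_len)
    {
        uint8_t msg[16] = { 0x80 }; /* assumed: valid command element header */

        /* Assumed layout: big-endian length at bytes 4-7 and big-endian
         * I/O address at bytes 8-15, built bytewise to stay host-neutral. */
        msg[4] = (uint8_t)(cmd_len >> 24);
        msg[5] = (uint8_t)(cmd_len >> 16);
        msg[6] = (uint8_t)(cmd_len >> 8);
        msg[7] = (uint8_t)cmd_len;
        for (int i = 0; i < 8; i++)
            msg[8 + i] = (uint8_t)(cmd_ioba >> (56 - 8 * i));

        uint64_t hi, lo;
        memcpy(&hi, &msg[0], 8);
        memcpy(&lo, &msg[8], 8);
        return h_send_crq(unit_addr, hi, lo); /* then await the response interrupt */
    }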
VTPM Requirements

This normative section provides the general requirements for the support of VTPM.

R1--1. For the VTPM option: The platform must implement the Reliable Command/Response Transport option as defined in .

R1--2. For the VTPM option: The platform must implement the Logical Remote DMA option as defined in .

In addition to the firmware primitives, and the structures they define,
the partition’s OS needs to know specific information regarding the configuration
of the virtual IOAs that it has been assigned so that it can load and configure
the correct device driver code. This information is provided by the OF device
tree node associated with the virtual IOA
().
Properties of the vtpm Node in the OF Device Tree
(Columns: Property Name; Required?; Definition)

“name”
  Required? Y
  Definition: Standard property name per , specifying the virtual device name; the value shall be “vtpm”.

“device_type”
  Required? Y
  Definition: Standard property name per , specifying the virtual device type; the value shall be “IBM,vtpm”.

“compatible”
  Required? Y
  Definition: Standard property name per , specifying the programming models that are compatible with this virtual IOA; the value shall be either “IBM,vtpm” for VTPM version 1.2 or “IBM,vtpm20” for VTPM version 2.0.

“reg”
  Required? Y
  Definition: Standard property name per , specifying the unit address (unit ID) associated with this virtual IOA, presented as an encoded array, as with encode-phys, of length “#address-cells”; the value shall be 0xwhatever (the virtual “reg” property is used only for the unit address; no actual locations are used, therefore the size field has zero cells (does not exist), as determined by the value of the “#size-cells” property).

“interrupts”
  Required? Y
  Definition: Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().

“ibm,phandle”
  Required? Y
  Definition: The device’s phandle encoded with encode-int; present only if DRC is enabled.

“ibm,my-drc-index”
  Required? For DR
  Definition: The integer index for the connector between the device and its parent; present only if DRC is enabled.

“ibm,#dma-address-cells”
  Required? See definition column
  Definition: Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .

“ibm,#dma-size-cells”
  Required? See definition column
  Definition: Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in .

“ibm,my-dma-window”
  Required? Y
  Definition: Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of three values (LIOBN, phys, size).

“ibm,loc-code”
  Required? Y
  Definition: Property name specifying the unique and persistent location code associated with this virtual IOA.

“ibm,adjunct-virtual-addresses”
  Required? Y
  Definition: Vendor unique property name indicating ranges of the client program virtual address space that are used by the virtual device serving partition adjunct. See the information about the children of the /vdevice node.
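As a small informative example tied to the “compatible” row above, a client driver might select the TPM command set as follows; the string values come from the table, and scanning a “compatible” list of NUL-terminated strings is standard OF practice:

    #include <string.h>

    enum tpm_family { TPM_FAMILY_1_2, TPM_FAMILY_2_0, TPM_FAMILY_UNKNOWN };

    /* compat/len: raw value of the vtpm node's "compatible" property. */
    static enum tpm_family vtpm_family(const char *compat, int len)
    {
        const char *end = compat + len;
        for (const char *s = compat; s < end; s += strlen(s) + 1) {
            if (strcmp(s, "IBM,vtpm20") == 0)
                return TPM_FAMILY_2_0; /* VTPM version 2.0 */
            if (strcmp(s, "IBM,vtpm") == 0)
                return TPM_FAMILY_1_2; /* VTPM version 1.2 */
        }
        return TPM_FAMILY_UNKNOWN;
    }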