A Protocol for VSCSI Communications
Introduction
The purpose of this appendix is to define the protocol used by virtual SCSI (vscsi) client drivers and vscsi server drivers in sufficient detail to ensure compatibility between unlike operating systems implementing these features. The SCSI Architecture Model (SAM-2) defines the following simplified abstract model and terminology for a SCSI system.
SCSI Initiator/Target Architecture
In this model, the Application Client is the application producing or consuming the data being stored. The SCSI Initiator Port is the virtual SCSI (vscsi) client adapter running in the client partition. The Service Delivery System is the Hypervisor. The SCSI Target Port is the vscsi host (vhost) adapter running in the VIO server (VIOS). The Logical Unit is the entity providing the data storage services.
Note that the model is not symmetrical. Client adapters may communicate only with host adapters and host adapters may communicate only with client adapters. Each may communicate with a maximum of one partner at any point in time. Client adapters may exist only in client partitions. Host adapters may exist only in VIOSs. A client partition may have multiple client adapters and they may communicate with host adapters in the same or different VIOSs. A SCSI host adapter may have multiple Logical Units defined to it for use.
Almost all messages are initiated by the client. The client and host adapters communicate using Command/Response Queues (CRQ) defined earlier in this document. A client may not read or write VIOS memory; it may only write to the VIOS CRQ. The VIOS may read and write client partition memory if the client passes the VIOS a DMA mapped address for that memory.
SCSI Remote DMA Protocol (SRP)
The protocol used for transferring data between the application client and the logical unit is the SCSI Remote DMA Protocol (SRP), revision 16.a, as defined by the InterNational Committee for Information Technology Standards (INCITS). Copies of the standard are available at the INCITS website at T10.org.
The client builds an SRP request in its address space, then DMA maps that request so that the VIOS can access it. The client notifies the VIOS of the request by including that mapped address in a CRQ message. A SCSI Command Descriptor Block (CDB) is encapsulated within the SRP request. Also within the SRP request is a tag field, which is private to the client. The VIOS must not modify that tag value in any way. When the request is complete, the VIOS notifies the client of the completion by including that tag field in the CRQ message to the client. The client then uses that tag value to locate the request being completed.
If the SRP request expects to transfer any data, it also contains one of the two types of memory descriptors specified by the SRP standard, to describe the buffer(s) to be used in the data transfer. In the SRP memory descriptor, the virtual address field is the DMA mapped address of the buffer, to be used by the VIOS to transfer the data. The memory handle field is not used and should be initialized to zero.
Using the H_SEND_CRQ call, the client sends the SRP request to the VIOS. The first 64 bits of the message describe the type of message, the format, and the length. The second 64 bits of the CRQ message contain the DMA mapped address of the SRP request in the client partition memory. The H_SEND_CRQ call in the client generates a virtual interrupt in the VIOS if the CRQ is going from empty to non-empty (edge-triggered interrupt).
The vhost driver uses the H_COPY_RDMA call and the mapped address to copy the SRP request from client partition memory into VIOS memory, examines the LUN to which the request is addressed, builds the appropriate structure to represent the request, according to the type of backing device, then queues the request to the backing device. The backing device may be an actual physical storage device, a software emulator, or some combination of device and emulation. In the request is an SRP memory descriptor which contains one or more address/length pairs describing one or more buffers in client partition memory address space. The memory handle field of the SRP memory descriptor is not used by vscsi and should be initialized to zero. The virtual address field in the SRP memory descriptor is the DMA mapped address of a buffer in client partition memory that the backing device uses to transfer data. When the backing device services the request, it uses the same DMA services as it would to handle a request that had originated locally on the VIOS. However, DMA services on the VIOS use the H_COPY_RDMA call and the mapped address(es) in the SRP memory descriptor(s) to copy data directly between the client partition and the device, transparent to the device. When the backing device has completed the request, it returns the request along with the results back to the VIOS driver. The VIOS driver builds an SRP response structure and copies that response back into client partition memory over the original SRP request. The SRP response includes any sense data that may have been returned with the request. All virtual devices are “auto-sense” devices. The vhost driver then notifies the client partition of the completed request by using H_SEND_CRQ to place a message in the client CRQ. The first 64 bits of the message describe the type, format, and length of the message. The second 64 bits are the “tag” field from the original SRP request. 
The client uses the tag to locate the SRP response and processes the response as appropriate. It is important to note that the client partition must not unmap or modify in any way any of the memory associated with the request between the time that it notifies the VIOS of the request and the time that the VIOS notifies the client of the response.
Connection Establishment
Before any data can be transferred, the two partitions have to establish a connection. Each partition is required to use H_REG_CRQ to register a Command/Response Queue (CRQ) with the Hypervisor to receive messages from the other partition. The size of the queue must be a multiple of 4KB. That memory must be DMA mapped. The size of the CRQ merely determines the number of requests that a client may send to the VIOS in a single burst. The VIOS dequeues the requests as soon as it can, so in evenly balanced systems, where the VIOS has enough CPUs and memory to deal with all of its clients, the size of the CRQ is not a major limiting factor.
After H_REG_CRQ returns H_SUCCESS, each partition uses H_SEND_CRQ to attempt to send the Initialization message described previously in this document. This is a race condition that only one partition will win. The first partition to send the Initialization message receives an H_CLOSED return value from the Hypervisor, because the other partition has not yet registered its queue. The winning partition must wait to receive the Initialization message from its partner. The second partition to send the Initialization message receives an H_SUCCESS return value from the Hypervisor. That partition must wait for the Initialization Complete message from its partner. When a partition receives an Initialization message during connection establishment, it must respond with the Initialization Complete message and may then proceed to the next step. When a partition receives the Initialization Complete message during connection establishment, it may then proceed to the next step.
The next step in connection establishment is for the client to send one or more of the Management Datagram (MAD) messages, described in detail later in this chapter.
Since this is before the completion of the SRP Login request, no flow control has been established between the client and VIOS, so the client may send only one message at a time and must wait for the response from the VIOS before sending the next one. The exception is the optional MAD_EMPTY_IU message. The client may follow that immediately with another message. The VIOS handles flow control violations by logging an informative error, then closing and reopening the CRQ.
The client is required to send the MAD_ADAPTER_INFO_REQUEST. This provides the information that the VIOS displays with the lsmap command. The client may find it useful to save and display the information that the VIOS returns in the response to the MAD_ADAPTER_INFO_REQUEST. Customers and service personnel frequently find this kind of information useful in unravelling some of the more elaborate configurations. The client is required to send the MAD_CAPABILITIES_EXCHANGE if it wishes to participate in Partition Mobility operations. If it does not send this message, the VIOS does not consider it to be capable of being migrated. If the client wishes to take advantage of the “fast fail” feature, it should send the MAD_ENABLE_FAST_FAIL message before the SRP login request.
The last step in connection establishment is the SRP login request. The Target Port Identifier field of the SRP Login request is not used by vscsi and should be initialized to zero. The client uses the SRP login request to specify the size of the largest SRP Information Unit that it will send to the VIOS and the format of the memory descriptors it intends to use. The size of the largest SRP Information Unit must also account for the size of the largest Management Datagram that the client expects to send. The VIOS may reject the SRP login if it cannot support the requested options. The VIOS will delay sending the response to the SRP login if it does not have any LUNs defined to it yet.
This may be the case if both partitions are booted simultaneously and the VIOS has not completed the configuration process when the client sends the SRP login. If the VIOS accepts the SRP login, it sends the SRP login response and notifies the client of this by placing the tag value from the SRP Login in the CRQ message.
The request limit delta field of the SRP login response contains the maximum number of requests that the VIOS will allow the client to have active on the VIOS at any one time. This is the flow control mechanism. If the client violates this limit by sending too many requests, the VIOS will terminate the connection to the client. Note that each SRP response message also contains a request limit delta field. Typically, this is set to 1, to indicate that this completed request means another can be initiated. But if the VIOS has substantial resources added to it, it may increase the number of requests a client may have active, and will do so by setting a value greater than one in this field. Once the SRP login has been accepted, the VIOS may increase the number of requests, but it may never decrease that number until this connection is terminated.
After receiving an SRP Login Response from the VIOS, the client may then proceed with normal I/O data traffic. Usually, this starts with device discovery, where the client sends a REPORT_LUNS SCSI request to the VIOS. The VIOS responds with the list of LUNs that have been defined to this host adapter. The client may then use other SCSI requests to determine the identity and capabilities of each LUN.
A connection is established once the VIOS has sent the SRP login response and the client has received it. If, after that point, a partition receives another Initialization message, Initialization Complete message, SRP Login, or SRP Login response without some indication that the connection has been terminated, usually a Transport Event (described later), that is a protocol violation.
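The request limit accounting described above can be sketched as a small credit counter on the client. This is only an illustration: the structure and function names are hypothetical, and the counter is assumed to start from the request limit delta of the SRP login response and to grow by the delta carried in each SRP response.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical client-side credit counter; names are illustrative only. */
struct vscsi_credits {
    int32_t available; /* requests the client may still have active */
};

/* Seed the counter from the request limit delta in the SRP login response. */
void credits_on_login(struct vscsi_credits *c, int32_t req_lim_delta)
{
    assert(c != NULL);
    c->available = req_lim_delta;
}

/* Take one credit before sending a request; false means the limit is reached. */
bool credits_try_send(struct vscsi_credits *c)
{
    assert(c != NULL);
    if (c->available <= 0)
        return false;
    c->available--;
    return true;
}

/* Add the request limit delta carried by an SRP response (typically 1). */
void credits_on_response(struct vscsi_credits *c, int32_t req_lim_delta)
{
    assert(c != NULL);
    c->available += req_lim_delta;
}
```

A client built this way queues a request locally whenever credits_try_send returns false, rather than risking termination of the connection by the VIOS.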
Protocol violations are handled by logging an error, then closing and reopening the CRQ. Likewise, after a connection has been terminated, the first messages must be either the Initialization or the Initialization Complete messages, as appropriate. Any other message is a protocol violation. And any SRP message received before a successful SRP Login is a protocol violation.
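The Initialization handshake at the start of connection establishment can be modeled as a small state machine. The sketch below is an assumption-laden illustration: the enum and function names are hypothetical, the send result stands in for the H_SUCCESS and H_CLOSED return values of H_SEND_CRQ, and the hypercalls themselves are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handshake states for one partition. */
enum hs_state {
    HS_WAIT_INIT,          /* our Initialization got H_CLOSED; wait for the partner's */
    HS_WAIT_INIT_COMPLETE, /* our Initialization got H_SUCCESS; wait for the reply */
    HS_READY_FOR_MADS      /* handshake done; MADs and SRP login come next */
};

enum hs_send_rc { HS_SEND_SUCCESS, HS_SEND_CLOSED }; /* stand-ins for H_SUCCESS / H_CLOSED */
enum hs_msg { HS_MSG_INIT, HS_MSG_INIT_COMPLETE };

/* State after trying to send the Initialization message. */
enum hs_state hs_after_sending_init(enum hs_send_rc rc)
{
    return rc == HS_SEND_CLOSED ? HS_WAIT_INIT : HS_WAIT_INIT_COMPLETE;
}

/*
 * State after receiving a handshake message. An incoming Initialization
 * message must be answered with Initialization Complete, which is signaled
 * through *send_init_complete.
 */
enum hs_state hs_on_message(enum hs_state state, enum hs_msg msg,
                            bool *send_init_complete)
{
    assert(state != HS_READY_FOR_MADS);
    assert(send_init_complete != NULL);
    *send_init_complete = (msg == HS_MSG_INIT);
    return HS_READY_FOR_MADS;
}
```

The sketch covers only the handshake; detecting protocol violations after the SRP login would need additional states.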
Connection Termination
A connection may be terminated by the client sending the VIOS an SRP_I_LOGOUT Information Unit. The VIOS may send the client an SRP_T_LOGOUT Information Unit, but only if the client has provided resources for this by sending the MAD_EMPTY_IU first. In the current implementation, neither is used and the drivers just call H_FREE_CRQ to terminate the connection.
A connection may also be terminated by the abnormal termination of a partition. When a partition crashes, the Hypervisor invalidates all of the memory mappings for that partition and places a Transport Event in the CRQ of the partner. If the partition that crashed was a client with requests active on the VIOS, when the storage drivers attempt to service those “in flight” requests, they find that the DMA mappings associated with the requests are no longer valid and usually will log one or more errors to that effect.
When a partition calls H_FREE_CRQ or crashes, the Hypervisor notifies the partner partition by placing a Transport Event in the partner’s CRQ. The first byte of the Transport Event is set to 0xFF, to indicate that this is a Transport Event. The second byte describes the event. A value of 0x01 indicates that the partner partition failed (crashed). A value of 0x02 indicates that the partner partition called H_FREE_CRQ. A value of 0x06 indicates to a client that it has been migrated. Only clients that send the MAD_CAPABILITIES_EXCHANGE message are candidates for being migrated. A VIOS cannot be migrated.
When a partition receives a Transport Event, it is not required to close its CRQ. It may instead just wait for an Initialization message from the partner partition when it is ready to communicate again.
Client Migration
When a client receives the migrated Transport Event, it must unmap any memory associated with any requests currently active on the VIOS. The client will never receive any completions for those requests and must remap and restart them at the end of the migration. Then the client must call H_ENABLE_CRQ until it returns H_SUCCESS. When the CRQ has been successfully enabled, the client sends the Initialization message and waits for the Initialization Complete message. It then goes through the rest of the connection establishment process, followed by the SRP login. After the VIOS sends the SRP Login response, the client may resume normal data transfers, starting with any requests that may have been active on the VIOS when the client was migrated.
Note that the partition identification information that the client sends in the MAD_ADAPTER_INFO_REQUEST message immediately after the migration event may be stale and reflect the identification of the original client partition before the migration. A client may register for DLPAR notification of migration, use that notification to obtain the current partition identification, and send another MAD_ADAPTER_INFO_REQUEST message to the VIOS with the correct information.
VSCSI Message Formats
All virtual SCSI communications between client and server occur using the Reliable Command/Response Transport and Logical Remote DMA functions defined earlier in this document. No other channels of communication are required to perform virtual SCSI functions. These communications are made up of three classes of messages:
- Messages contained entirely within a single CRQ message
- SRP requests and responses, as defined by the SRP standard
- Management Datagrams
CRQ Message formats
CRQ messages are 16 bytes (128 bits) in length. Only the first byte is architected by the Reliable Command/Response Transport specification described earlier in this document. That specification is repeated in the following table.
First Byte of the CRQ Message
Value        Description
0x00         Element is unused -- all other bytes in the element are undefined
0x01 - 0x7F  Reserved
0x80         Valid Command/Response Entry -- the second byte defines the entry format
0x81 - 0xFE  Reserved
0xFF         Valid Transport Event -- the second byte defines the specific transport event
If the first byte of a CRQ message is 0x80, then it is a valid Command/Response entry and the second byte describes the format of the message. Possible values for the second byte of the CRQ message when the first byte is 0x80 are shown in the following table.
Second Byte of the CRQ Message
Format Byte Value  Definition
0x00               Unused
0x01               VSCSI SRP format
0x02               Management Datagram (MAD) format
0x03               i5os private format
0x04               AIX private format
0x05               Linux private format
0x06               Message in CRQ format
0x07 - 0xFF        Reserved
If the format byte is 0x01, then the rest of the message is a vscsi SRP request or response message. The rest of the CRQ contents for this type of message is shown in the CRQ VSCSI Client Message table, for messages from the clients, and in the CRQ VSCSI VIOS Message table, for messages from the VIOS. Messages with a format byte of 0x02 are Management Datagram messages, defined later in this chapter. Message formats 0x03, 0x04, and 0x05 are reserved for private, Operating System-specific messages, and are currently unused by this implementation. Messages with a format byte of 0x06 are messages contained entirely within the CRQ.
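The two tables above translate directly into a dispatch on the first two bytes of a CRQ element. The following is a minimal sketch; the enum and function names are hypothetical, and the format value 0x00 (unused) is grouped with the reserved values for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a 16-byte CRQ element. */
enum crq_kind {
    CRQ_UNUSED,          /* first byte 0x00 */
    CRQ_RESERVED,        /* reserved or unused valid/format value */
    CRQ_SRP,             /* 0x80, format 0x01 */
    CRQ_MAD,             /* 0x80, format 0x02 */
    CRQ_OS_PRIVATE,      /* 0x80, format 0x03..0x05 */
    CRQ_MESSAGE_IN_CRQ,  /* 0x80, format 0x06 */
    CRQ_TRANSPORT_EVENT  /* first byte 0xFF */
};

enum crq_kind crq_classify(const uint8_t msg[16])
{
    assert(msg != NULL);

    switch (msg[0]) {
    case 0x00: return CRQ_UNUSED;
    case 0xFF: return CRQ_TRANSPORT_EVENT;
    case 0x80: break;                 /* valid entry: inspect the format byte */
    default:   return CRQ_RESERVED;   /* 0x01-0x7F and 0x81-0xFE */
    }

    switch (msg[1]) {
    case 0x01: return CRQ_SRP;
    case 0x02: return CRQ_MAD;
    case 0x03:
    case 0x04:
    case 0x05: return CRQ_OS_PRIVATE;
    case 0x06: return CRQ_MESSAGE_IN_CRQ;
    default:   return CRQ_RESERVED;   /* 0x00 and 0x07-0xFF */
    }
}
```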
CRQ VSCSI Client Message Format
Client messages are sent from the client partitions to the VIOS. The following table shows the format of these messages.
CRQ VSCSI Client Message
Byte Offset  0          1           2-3       4-5      6-7
0x00         CRQ Valid  CRQ Format  Reserved  Timeout  IU Length
0x08         IU Data Pointer (TCE)
For this type of message, the first byte (CRQ Valid) must be 0x80, and the second byte (CRQ Format) must be 0x01. Bytes 6 and 7 of the first long word are the IU Length, the length in bytes of the SRP Information Unit being passed. The second long word, IU Data Pointer, is the DMA mapped address of the SRP Information Unit being passed, typically an SRP Request. The VIOS uses the IU length and IU Data Pointer to copy the SRP Request into VIOS local memory for interpretation and processing.
Bytes 4 and 5 of the first long word, Timeout, are an optional suggested timeout value for this request. If this value is greater than zero, then the value may be passed along to the backing device as a suggestion for how long this request is expected to take to complete. The VIOS does not enforce any timeout values, but relies upon the underlying backing devices.
Management Datagram (MAD) messages also use this same format, with the exception that the second byte (CRQ Format) must be set to 0x02. Bytes 6 and 7 of the first long word are the length of the MAD message, and the second long word, IU Data Pointer, is the DMA mapped address of the MAD message being passed. MAD data structures are defined later in this chapter. For MAD messages, the timeout value is not used.
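As an illustration of the client message layout, the sketch below packs the 16 bytes by offset. The function name is hypothetical, and multi-byte fields are assumed to be stored big-endian (most significant byte first); the byte order is not stated in this section, so treat it as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Pack a client CRQ message (SRP format 0x01 or MAD format 0x02).
 * Multi-byte fields are written big-endian, which is an assumption here.
 */
void crq_pack_client(uint8_t out[16], uint8_t format, uint16_t timeout,
                     uint16_t iu_length, uint64_t iu_data_ptr)
{
    assert(out != NULL);
    assert(format == 0x01 || format == 0x02);

    memset(out, 0, 16);                 /* reserved bytes 2-3 stay zero */
    out[0] = 0x80;                      /* CRQ Valid */
    out[1] = format;                    /* CRQ Format */
    out[4] = (uint8_t)(timeout >> 8);   /* Timeout, bytes 4-5 */
    out[5] = (uint8_t)(timeout & 0xFF);
    out[6] = (uint8_t)(iu_length >> 8); /* IU Length, bytes 6-7 */
    out[7] = (uint8_t)(iu_length & 0xFF);
    for (int i = 0; i < 8; i++)         /* IU Data Pointer, bytes 8-15 */
        out[8 + i] = (uint8_t)(iu_data_ptr >> (56 - 8 * i));
}
```

For a MAD message the caller passes format 0x02 and a timeout of zero, since the timeout value is not used for MADs.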
CRQ VSCSI VIOS Message Format
VIOS messages are sent from the VIOS to the clients, usually in response to a request from the client. The VIOS message format is shown in the following table.
CRQ VSCSI VIOS Message
Byte Offset  0          1           2         3       4-5       6-7
0x00         CRQ Valid  CRQ Format  Reserved  Status  Reserved  IU Length
0x08         IU TAG
For this type of message, the first byte, CRQ Valid, must be 0x80. This same type of message is used for SRP Responses and for responses to MAD messages. If this is an SRP Response, the second byte, CRQ Format, is 0x01. If this is the response to a MAD message, the second byte is 0x02. Bytes 6 and 7 of the first long word, IU Length, contain the length of the response. The second long word contains the tag field from the original request. Both the SRP Request data structures and the MAD message data structures contain a tag field for use in this message.
The Status field of the VIOS message is for reporting special, non-SCSI status back to the client. This status is used for improving failover times in configurations where the same storage device is visible to this client over multiple adapters or when the same storage device is being shared by multiple clients in clustered configurations. If the client enables the “fast fail” feature using the MAD_ENABLE_FAST_FAIL message, and if the VIOS determines that all paths to a device on that client adapter have failed, the VIOS will report a status of ADAPTER_FAILED (0x10) in response to a request to that device. If the storage devices that the client is using are being shared by other clients, as is the case in an IBM General Parallel File System (GPFS™) configuration, and if the VIOS determines that all error recovery efforts on a device have failed so that there is no point in any more retries from the client, the VIOS will report a status of DEVICE_BUSY (0x08) in response to a request to that device.
In both cases (ADAPTER_FAILED and DEVICE_BUSY), the client response should be the same. The device is no longer accessible and the client should abandon any error recovery or attempts to recover access to the device using this client adapter. The client should attempt to fail over to another path to the device, using another adapter, if that is possible.
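A client might decode a VIOS message and act on the Status field along the following lines. This is a sketch under stated assumptions: the structure and function names are hypothetical, the Status field is assumed to sit at byte 3 (inferred from the column order of the table), and multi-byte fields are assumed to be big-endian.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIOS_STATUS_DEVICE_BUSY    0x08
#define VIOS_STATUS_ADAPTER_FAILED 0x10

/* Hypothetical decoded form of a VIOS CRQ message. */
struct vios_msg {
    uint8_t  format;    /* 0x01 SRP response, 0x02 MAD response */
    uint8_t  status;    /* non-SCSI status; byte 3 is an assumed offset */
    uint16_t iu_length; /* length of the response */
    uint64_t tag;       /* tag from the original request */
};

/* Decode a VIOS message; returns false if the element is not a valid entry. */
bool crq_parse_vios(const uint8_t in[16], struct vios_msg *m)
{
    assert(in != NULL && m != NULL);
    if (in[0] != 0x80)
        return false;

    m->format = in[1];
    m->status = in[3];
    m->iu_length = (uint16_t)((in[6] << 8) | in[7]);
    m->tag = 0;
    for (int i = 0; i < 8; i++)
        m->tag = (m->tag << 8) | in[8 + i];
    return true;
}

/* Both special statuses call for the same reaction: try another path. */
bool vios_should_fail_over(const struct vios_msg *m)
{
    assert(m != NULL);
    return m->status == VIOS_STATUS_ADAPTER_FAILED ||
           m->status == VIOS_STATUS_DEVICE_BUSY;
}
```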
Transport Events
If the first byte (CRQ Valid) of the CRQ message is 0xFF, then this message is a Transport Event from the Hypervisor and the connection to the partner has been terminated. The second byte will be the reason for the Transport Event, and may be one of the following values:
0x01 - Partner Failed. The partner partition has crashed.
0x02 - Partner de-registered the CRQ. The partner partition called H_FREE_CRQ for this CRQ. This may be as a result of error recovery, as in the case of a protocol error, or it may be the result of the system administrator removing a client or VIOS adapter.
0x06 - Client has been migrated as the result of a Partition Mobility operation. Only clients can be migrated and only clients that send the MAD_CAPABILITIES_EXCHANGE message are considered to be candidates for migration.
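The reason codes listed above can be decoded with a small helper. The sketch is illustrative only; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded Transport Event reasons. */
enum transport_event {
    TE_NOT_AN_EVENT         = 0,     /* first byte is not 0xFF */
    TE_PARTNER_FAILED       = 0x01,  /* partner partition crashed */
    TE_PARTNER_DEREGISTERED = 0x02,  /* partner called H_FREE_CRQ */
    TE_CLIENT_MIGRATED      = 0x06,  /* Partition Mobility operation */
    TE_UNRECOGNIZED         = 0x100  /* 0xFF with an unlisted reason */
};

enum transport_event transport_event_decode(const uint8_t msg[16])
{
    assert(msg != NULL);
    if (msg[0] != 0xFF)
        return TE_NOT_AN_EVENT;

    switch (msg[1]) {
    case 0x01: return TE_PARTNER_FAILED;
    case 0x02: return TE_PARTNER_DEREGISTERED;
    case 0x06: return TE_CLIENT_MIGRATED;
    default:   return TE_UNRECOGNIZED;
    }
}
```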
Messages in CRQs
If the first byte (CRQ Valid) of the CRQ message is 0x80, and the second byte (CRQ Format) is 0x06, then this is a message contained entirely within the CRQ. The rest of the message, including the IU Data Pointer, is unused and must be initialized to zero. These messages do not require any resources on the client or VIOS, and are not subject to flow control, so they may be sent at any time. However, they should be used sparingly, because they do take up an entry in the CRQ and they do require interrupt processing time to respond to them. The third byte defines the message. Only two messages of this type have been defined to this point:
0xF5 - PING
0xF6 - PING RESPONSE
If the VIOS is not able to process interrupts, the client will likely be hung, waiting on a completion from the VIOS. To detect this condition, the client may send a PING to the VIOS. If the VIOS is capable of processing an interrupt, it responds to the PING with a PING RESPONSE, directly at interrupt level. If the client does not receive the PING RESPONSE within a reasonably short period of time, it may choose to declare the VIOS dead and attempt to fail over to another client adapter. Likewise, if the VIOS for some reason needs to determine if the client is still alive, it may send a PING to the client. The client should respond as expeditiously as possible with a PING RESPONSE.
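Answering a PING only requires rewriting the third byte of an otherwise zeroed message. A minimal sketch, with a hypothetical function name, follows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRQ_FORMAT_MESSAGE_IN_CRQ 0x06
#define CRQ_MSG_PING              0xF5
#define CRQ_MSG_PING_RESPONSE     0xF6

/*
 * If `in` is a PING, build the PING RESPONSE in `out` and return true.
 * All bytes after the third are left as zero, as required for this format.
 */
bool crq_make_ping_response(const uint8_t in[16], uint8_t out[16])
{
    assert(in != NULL && out != NULL);
    if (in[0] != 0x80 || in[1] != CRQ_FORMAT_MESSAGE_IN_CRQ ||
        in[2] != CRQ_MSG_PING)
        return false;

    memset(out, 0, 16);
    out[0] = 0x80;
    out[1] = CRQ_FORMAT_MESSAGE_IN_CRQ;
    out[2] = CRQ_MSG_PING_RESPONSE;
    return true;
}
```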
VSCSI Management Datagrams (MADs)
VSCSI uses a number of messages that are not defined by the SRP standard. The paradigm used for these messages is the Management Datagram, discussed in the SRP and Fibre Channel specifications. Like all SRP messages, the MADs are initiated by the client partition and the VIOS responds to them. To initiate a MAD, the client sets the valid field to 0x80, sets the format field to 0x02 (MAD_FORMAT), sets the length field to the length of the data structure describing the MAD, sets the ioba field to the mapped memory address of the data structure describing the MAD, and uses the H_SEND_CRQ service provided by the Hypervisor to send the request to the VIOS.
Most of these MADs can be initiated any time after the initialization messages (INIT, INIT_COMPLETE) have been exchanged. Some of them are most appropriately done before the SRP login message and the start of normal data transfer operations. These are: MAD_EMPTY_IU; MAD_ADAPTER_INFO_REQUEST; MAD_CAPABILITIES_EXCHANGE; and MAD_ENABLE_FAST_FAIL. Note that before the SRP login message, resources allocated by the VIOS for a client are limited, so a client should wait for one MAD to complete before issuing another, with the single exception of the MAD_EMPTY_IU message. None of them are required for normal data transfer operations between the client and VIOS. However, the MAD_ADAPTER_INFO_REQUEST provides information that customers find highly desirable, so using it is strongly recommended. In addition, the MAD_ADAPTER_INFO_REQUEST returns the size of the largest data transfer operation that the VIOS will accept from this client. Failure to honor this limit can result in client failure. And the MAD_CAPABILITIES_EXCHANGE message is required before a client is allowed to participate in a Partition Mobility operation.
The inter_op structure is used to specify the type of MAD being sent. The type field describes the MAD and will be discussed in the paragraphs that follow.
The status field describes the result of the MAD operation. The client is required to initialize the status field to zero. The VIOS responds in one of three ways: a successful status is returned if the MAD completed successfully; MAD_NOT_SUPPORTED is returned if the VIOS is down-level; MAD_FAILED is returned in every other situation where the MAD did not succeed. The length field is set to the length of the data structure(s) used in the command. The tag field is reflected back to the client in the response to the MAD. The VIOS uses H_SEND_CRQ to send a response with the format set to 0x02 (MAD_FORMAT) and the ioba field set to the tag field specified by the client. The type field may be set to one of the following:
#define MAD_EMPTY_IU 0x01
The client sends a MAD_EMPTY_IU command if it wishes to receive an SRP target_logout before the VIOS closes the CRQ. The target_logout SRP response contains the reason that the VIOS is closing the CRQ. The MAD_EMPTY_IU command uses the following data structure:
The inter_op structure is initialized with the type field set to 0x01 (MAD_EMPTY_IU), the status field set to zero, the length field set to the size of the mad_empty_iu structure, and the tag field set as described above. The desp field is set to the mapped memory address of the SRP_T_LOGOUT response data structure. The client must not unmap, free, or re-use this memory until it receives the SRP target_logout or the CRQ is closed. The port field is unused at this time.
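The structure listing itself is not reproduced in this text, so the following is only a sketch of how the initialization described above might look. The field names (type, status, length, tag, desp, port) come from the description; the field widths and ordering are assumptions, fields are kept in host byte order, and sizeof includes any compiler padding, so a real implementation would need an exact wire layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAD_EMPTY_IU 0x01

/* Assumed layout of the common MAD header; widths are illustrative. */
struct inter_op {
    uint32_t type;    /* which MAD this is */
    uint16_t status;  /* client initializes to zero */
    uint16_t length;  /* length of the data structure(s) used */
    uint64_t tag;     /* reflected back in the response */
};

/* Assumed layout of the MAD_EMPTY_IU structure. */
struct mad_empty_iu {
    struct inter_op inter_op;
    uint64_t desp;    /* mapped address of the SRP_T_LOGOUT response buffer */
    uint32_t port;    /* unused at this time */
};

/* Fill in a MAD_EMPTY_IU as described in the text above. */
void mad_empty_iu_init(struct mad_empty_iu *iu, uint64_t tag,
                       uint64_t logout_buf_mapped_addr)
{
    assert(iu != NULL);
    memset(iu, 0, sizeof(*iu));          /* status and port become zero */
    iu->inter_op.type = MAD_EMPTY_IU;
    iu->inter_op.length = (uint16_t)sizeof(*iu);
    iu->inter_op.tag = tag;
    iu->desp = logout_buf_mapped_addr;
}
```

The buffer whose address is stored in desp must stay mapped until the SRP target_logout arrives or the CRQ is closed, so its lifetime should be tied to the connection rather than to this request.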
#define MAD_ERROR_LOGGING_REQUEST 0x02
The client sends the MAD_ERROR_LOGGING_REQUEST when it wishes the VIOS to write an entry in the system error log on its behalf. Hardware errors in physical storage components on the VIOS usually result in errors on the client partition using that physical storage. The MAD_ERROR_LOGGING_REQUEST places client errors in the system error log in proximity to the original hardware error to enable service personnel to assess the impact of the original hardware error. The MAD_ERROR_LOGGING_REQUEST uses the following data structure:
The inter_op structure is initialized with the type field set to 0x02 (MAD_ERROR_LOGGING_REQUEST), the status field set to zero, the length field set to the size of the mad_error_log structure plus the size of the buffer of additional data, if any, and the tag field set as described above. The buffer field points to a mad_error_log structure.
The lun field is set to the Logical Unit Number (LUN) of the device on the client that is logging the error. The correlator field is optional. If used, it should have a unique value that can be used to correlate the error message on the client with the error message on the VIOS. The error_id field is set to a client-specific number associated with the error. The buffer_size is set to the length of the buffer of additional data, which is optional. The client_name array is set to the name by which this client adapter instance is known on the client partition, for example “vscsi0”. The device_name array is set to the name by which the device logging the error is known on the client partition, for example “hdisk0”. The partition field is set to the number of the client partition requesting that the error be logged. The flags field specifies the type of data contained in the optional buffer. The buffer, if used, starts immediately after the mad_error_log structure. The buffer is not logged by the VIOS at this time.
#define MAD_ADAPTER_INFO_REQUEST 0x03
The client sends the MAD_ADAPTER_INFO_REQUEST to the VIOS to inform the VIOS of the client’s identity. The VIOS responds with the equivalent information about itself. The VIOS uses the client information provided in the MAD_ADAPTER_INFO_REQUEST for the display in the “lsmap” command. Use of this MAD is not enforced by VIOS. However, customers have found the information useful enough to insist that it be used. The MAD_ADAPTER_INFO_REQUEST may also be used after a Partition Mobility operation to allow the client to update the information on the VIOS, which may have changed during the migration. The MAD_ADAPTER_INFO_REQUEST uses the following data structure:
The inter_op structure is initialized with the type field set to 0x03 (MAD_ADAPTER_INFO_REQUEST), the status field set to zero, the length field set to the size of the mad_adapter_information_payload structure, and the tag field set as described above. The buffer field points to the mapped memory address of a mad_adapter_information_payload structure.
The srp_version field is a NULL-terminated character array with the version number of the SRP standard to which the partition complies. Current versions of the VIOS and clients all support SRP revision 16.a. The VIOS does not validate or enforce this field currently. The partition name is the ASCII string representing the name of the partition from the root node in the Open Firmware device tree. The partition number is the integer number identifying the partition from the root node in the Open Firmware device tree. Note that partition number 0 is reserved for the hypervisor. The mad_version field is set to the version of MAD messages supported by the partition. The MAD messages described in this document are version 1. The VIOS does not currently validate or enforce this version. The os_type field is set to the type of Operating System being run on the partition.
The VIOS uses this information to allocate additional resources for client partitions that have unique requirements and to return different values for sense data in error situations. The VIOS has been able to make minor behavior changes to the device on behalf of clients that use this field.
The port_max_txu array is used by the VIOS to report the size of the largest single request that it can handle. Currently only the first entry (port_max_txu[0]) is used. The client initializes this field to zero. The VIOS responds with at least a value of 0x40000, meaning that it is prepared to deal with a request to transfer at least 262,144 bytes (256 KB) of data. The VIOS can respond with a larger value, depending on the resources available and the capabilities of the physical device providing storage.
NOTE: If the VIOS reports a maximum transfer value larger than the minimum of 0x40000, and subsequently a device which cannot support that larger maximum transfer value is added to the device inventory of this host adapter, the VIOS will log an informative error and not report that new device in a REPORT_LUNS request until the client has issued another MAD adapter information request. This prevents the client from passing a data transfer request that is too large for that device to handle. The VIOS will return such requests with an error. Optical devices typically have small maximum transfer values.
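A client can use the reported port_max_txu[0] value to bound or split its transfers. The sketch below is illustrative; the helper names are hypothetical, and the choice to split an oversized transfer into several requests is a design assumption rather than something this document prescribes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimum value the VIOS reports in port_max_txu[0]: 0x40000 = 262,144 bytes. */
#define VSCSI_MIN_MAX_TXU 0x40000u

/* True if a single request of `nbytes` stays within the reported limit. */
bool transfer_fits(uint32_t max_txu, uint64_t nbytes)
{
    assert(max_txu >= VSCSI_MIN_MAX_TXU);
    return nbytes <= max_txu;
}

/* Number of requests needed to move `nbytes` without exceeding the limit. */
uint64_t transfer_request_count(uint32_t max_txu, uint64_t nbytes)
{
    assert(max_txu >= VSCSI_MIN_MAX_TXU);
    return (nbytes + max_txu - 1) / max_txu; /* ceiling division */
}
```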
#define MAD_CAPABILITIES_EXCHANGE 0x05
The MAD_CAPABILITIES_EXCHANGE command is used to allow the client and VIOS to negotiate support for capabilities that may be required with a partition migration. The data structures used are the capabilities structure, followed by at least one specific capability structure. The client uses a bit-mask to advertise the capabilities that it can support by setting the bits representing those capabilities to one. The VIOS responds by turning off (setting to zero) the bits for any capabilities that it cannot support. This allows clients and VIOSs at a variety of levels to cooperate in the partition migration operation. The client is required to support a minimum level of capabilities in order to be considered to be a candidate for migration. The MAD_CAPABILITIES_EXCHANGE command uses the following data structure:
The inter_op field is initialized with the type field set to 0x05 (MAD_CAPABILITIES_EXCHANGE), the status field initialized to zero, the length field set to the size of the capabilities structures being passed, and the tag field set as described above. The capabilities structures must include at least the capabilities structure and the mig_cap structure. The buffer field contains the mapped memory address of a buffer containing these structures.
The flags field is always set to at least CAP_LIST_SUPPORTED by the client. If the client is sending this command as the result of a successful partition migration operation, it should also set the CLIENT_MIGRATED flag. If the client is sending this command as the result of a VIOS reboot or the VIOS has reset the CRQ, it should also set the CLIENT_RECONNECT flag. If the VIOS cannot support all of the capabilities in the list passed by the client, it will turn off the CAP_LIST_SUPPORTED flag. If the VIOS overwrites some of the data in the capabilities list, it will set the CAP_LIST_DATA flag.
The name array is filled with the NULL-terminated string representing the name by which this client adapter instance is known on the client partition, for example “vscsi0”. The loc array is filled with the NULL-terminated string from the “loc-code” field of the adapter node in the Open Firmware device tree for this client adapter, for example “U9117.MMA.107086C-V6-C5-T1”.
Following the capabilities structure is a list of capabilities to be negotiated. MIGRATION_CAPABILITIES and RESERVATION_CAPABILITIES are the only types of capabilities currently supported by the VIOS. The capability_common structure is included in each capability structure and describes the type of capability being negotiated. The cap_type field is set to the type of capability. The length field is set to the size of the specific capability structure, currently either mig_cap or reserve_cap. The server_support field is initialized by the client to 1. If the VIOS does not support that capability, it clears the field.
The capabilities structure used for negotiating migration capabilities is as follows:
The ecl field contains the effective capability level. The client sets it to the current migration capability level that this client is capable of supporting. If this level is lower than the lowest level that the VIOS can support, or higher than the level that the VIOS currently supports, the VIOS sets the server_support field to SERVER_CAP_DATA, sets the ecl field to the lowest level it can support or to the level currently supported, as appropriate, and sets the CAP_LIST_DATA flag in the flags field of the capabilities structure, to inform the client of the difference in the levels of migration capabilities supported. Currently, the only migration capability level supported is 1.
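The ecl negotiation can be read as clamping the client's level into the range the VIOS supports. The C sketch below follows that reading. The field names cap_type, length, server_support, and ecl come from the text, but the field widths, the constant values, and the helper names client_init_mig_cap and vios_negotiate_ecl are assumptions made for illustration. Byte-order conversion is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: constant values and field widths are placeholders. */
#define MIGRATION_CAPABILITIES 0x01u
#define SERVER_CAP_DATA        0x02u

struct capability_common {
    uint32_t cap_type;       /* type of capability being negotiated */
    uint16_t length;         /* size of the specific capability structure */
    uint16_t server_support; /* client sets 1; VIOS clears or overwrites */
};

struct mig_cap {
    struct capability_common common;
    uint32_t ecl;            /* effective capability level */
};

/* Client side: fill in a migration capability before sending it. */
static void client_init_mig_cap(struct mig_cap *cap, uint32_t level)
{
    assert(cap != NULL);
    cap->common.cap_type = MIGRATION_CAPABILITIES;
    cap->common.length = (uint16_t)sizeof(*cap);
    cap->common.server_support = 1;
    cap->ecl = level;
}

/* VIOS side: clamp ecl into [vios_min, vios_max]. Returns true when the
 * level was overwritten, in which case the caller would also set
 * CAP_LIST_DATA in the flags field of the capabilities structure. */
static bool vios_negotiate_ecl(struct mig_cap *cap,
                               uint32_t vios_min, uint32_t vios_max)
{
    assert(cap != NULL && vios_min <= vios_max);
    if (cap->ecl < vios_min)
        cap->ecl = vios_min;
    else if (cap->ecl > vios_max)
        cap->ecl = vios_max;
    else
        return false; /* levels agree; server_support stays 1 */
    cap->common.server_support = SERVER_CAP_DATA;
    return true;
}
```

With the single level currently defined, a client offering level 1 to a VIOS supporting level 1 passes through unchanged.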
The structure used in negotiating reservation capabilities is as follows:
If the client is capable of breaking and re-establishing SCSI-2 reservations after a migration event, it should set the type field to CLIENT_RESERVE_SCSI_2. Otherwise, it should initialize the type field to zero.
#define MAD_PHYS_ADAP_INFO_REQUEST 0x06
The MAD_PHYS_ADAP_INFO_REQUEST returns data about the physical adapter to which the target device is attached, if the device supports it. The only device currently supporting this request is virtual tape. The data structure used with the MAD_PHYS_ADAP_INFO_REQUEST is as follows:
The client initializes the inter_op field, with the type set to 0x06 (MAD_PHYS_ADAP_INFO_REQUEST), the status field initialized to zero, the length field set to the size of the mad_phys_adapter_info structure, the tag field set as described above, and the buffer field set to the mapped memory address of a mad_phys_adapter_info structure. The client sets the lun field to the Logical Unit Number (LUN) of the virtual device for which it is requesting the physical adapter information, and it sets the version to 0x01 (MAD_PHYS_ADAP_INFO_VERSION). If the target device supports returning the physical adapter information, the VIOS copies the Field Replaceable Unit (FRU) part number, the FRU serial number, and the physical location code into the appropriate arrays and returns that information to the client. This information is intended for use by customer service engineers, to assist them in repairing physical tape devices.
#define MAD_TAPE_PASSTHROUGH_REQUEST 0x07
The MAD_TAPE_PASSTHROUGH_REQUEST enables or disables the passing of SCSI command data blocks (CDBs) directly to the physical tape device driver without examination or emulation by the VIOS drivers. The structure used with the MAD_TAPE_PASSTHROUGH_REQUEST is as follows:
The client initializes the inter_op structure by setting the type field to 0x07 (MAD_TAPE_PASSTHROUGH_REQUEST), setting the status field to zero, setting the length field to the size of the mad_tape_passthrough structure, and setting the tag field as described above. The lun field is set to the Logical Unit Number of a virtual tape device on this client adapter. The version field is set to 0x00000001 (MAD_TAPE_PASSTHRU_VERSION). The passThru field is set to either 0x00000001 (TAPE_PASSTHROUGH_ENABLE) or 0x00000002 (TAPE_PASSTHROUGH_DISABLE). When tape passthrough is enabled, the SCSI Command Data Blocks are sent directly to the tape head driver, without examination or emulation by the VIOS drivers.
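A minimal C sketch of the initialization steps above follows. The constant values are taken from the text. The field order and widths of inter_op and mad_tape_passthrough are assumptions made so that the example compiles, and they are not the layout defined by this appendix. The helper tape_passthrough_init is hypothetical, and byte-order conversion and DMA mapping are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAD_TAPE_PASSTHROUGH_REQUEST 0x07
#define MAD_TAPE_PASSTHRU_VERSION    0x00000001
#define TAPE_PASSTHROUGH_ENABLE      0x00000001
#define TAPE_PASSTHROUGH_DISABLE     0x00000002

/* ASSUMPTION: field widths and ordering are illustrative only. */
struct inter_op {
    uint32_t type;
    uint16_t status;
    uint16_t length;
    uint64_t tag;    /* private to the client; echoed back by the VIOS */
};

struct mad_tape_passthrough {
    struct inter_op inter_op;
    uint64_t lun;
    uint32_t version;
    uint32_t passThru;
};

/* Fill in a passthrough request for one virtual tape LUN. */
static void tape_passthrough_init(struct mad_tape_passthrough *req,
                                  uint64_t tag, uint64_t lun, bool enable)
{
    assert(req != NULL);
    memset(req, 0, sizeof(*req));
    req->inter_op.type = MAD_TAPE_PASSTHROUGH_REQUEST;
    req->inter_op.status = 0;
    req->inter_op.length = (uint16_t)sizeof(*req);
    req->inter_op.tag = tag;
    req->lun = lun;
    req->version = MAD_TAPE_PASSTHRU_VERSION;
    req->passThru = enable ? TAPE_PASSTHROUGH_ENABLE
                           : TAPE_PASSTHROUGH_DISABLE;
}
```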
#define MAD_ENABLE_FAST_FAIL 0x08
The MAD_ENABLE_FAST_FAIL command enables the VIOS to provide a hint to the client that a physical device is no longer accessible, so that a failover to alternate paths, if any, should be attempted. The only structure used with the MAD_ENABLE_FAST_FAIL command is the inter_op structure. The type field is set to 0x08 (MAD_ENABLE_FAST_FAIL), the status field is initialized to zero, the length field is set to the size of the inter_op structure, and the tag field is set as described above. After MAD_ENABLE_FAST_FAIL has completed successfully, if the VIOS determines that a device is no longer responding, it sets the status field in the CRQ message to 0x10 (ADAPTER_FAILED) when it completes an I/O request for that device back to the client, in addition to returning the normal device error and sense data. Fast fail is disabled by closing the CRQ.
Two additional messages may be exchanged between clients and a VIOS: PING and PING_RESPONSE. If a partition needs to know whether the other partition is still functional and at least able to respond to an interrupt, it can send a PING message to the other partition. The other partition should respond with a PING_RESPONSE. These are very lightweight messages that require no resources. They fit entirely within the first 64-bit quantity of the CRQ message. The PING_RESPONSE should be sent from the interrupt code, immediately after receiving the PING. To send a PING, the valid bit is set to one, the CRQ format field is set to 0x06 (MESSAGE_IN_CRQ), and the status field is set to 0xF5 (PING). To send a PING_RESPONSE, the valid bit is set to one, the CRQ format field is set to 0x06 (MESSAGE_IN_CRQ), and the status field is set to 0xF6 (PING_RESPONSE). It is strongly recommended that PING messages be used very sparingly. For example, a CRQ can fill with PING messages if the VIOS is halted in kdb while an AIX client has requests active on it.
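The PING exchange can be sketched in C as below. The format and status values come from the text. The byte positions within the first 64 bits (valid in byte 0 as the high-order bit, format in byte 1, status in byte 3) are an assumption of this sketch, as are the helper names. Delivering the message with H_SEND_CRQ is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MESSAGE_IN_CRQ 0x06
#define PING           0xF5
#define PING_RESPONSE  0xF6

/* ASSUMPTION: the valid bit is the high-order bit of byte 0. */
#define CRQ_VALID 0x80

/* Build the first 64-bit quantity of a CRQ message that carries only a
 * status code, such as PING or PING_RESPONSE. */
static void crq_build_status_msg(uint8_t msg[8], uint8_t status)
{
    assert(msg != NULL);
    memset(msg, 0, 8);
    msg[0] = CRQ_VALID;      /* valid bit set to one */
    msg[1] = MESSAGE_IN_CRQ; /* CRQ format field */
    msg[3] = status;         /* PING or PING_RESPONSE */
}

/* True if the received message is a valid PING. */
static bool crq_is_ping(const uint8_t msg[8])
{
    assert(msg != NULL);
    return (msg[0] & CRQ_VALID) != 0 &&
           msg[1] == MESSAGE_IN_CRQ &&
           msg[3] == PING;
}

/* If in is a PING, write the matching PING_RESPONSE to out and return
 * true; otherwise leave out untouched and return false. */
static bool crq_answer_ping(const uint8_t in[8], uint8_t out[8])
{
    if (!crq_is_ping(in))
        return false;
    crq_build_status_msg(out, PING_RESPONSE);
    return true;
}
```

Because the reply needs no buffers or allocations, a handler shaped like crq_answer_ping could run directly in interrupt context, which matches the recommendation to respond immediately.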