Introduction

RTAS provides a mechanism which helps OSs avoid the need for platform-dependent code that checks for, or recovers from, errors or exceptional conditions. The mechanism is used to return information about hardware errors which have occurred as well as information about non-error events, such as environmental conditions (for example, temperature or voltage out-of-bounds) which may need OS attention. This permits RTAS to pass hardware event information to the OS in a way which is abstracted from the platform hardware.

This mechanism primarily presents itself to the OS via two RTAS functions, event-scan and check-exception, which are described further in . A further RTAS function, rtas-last-error, is also provided to return information about hardware failures detected specifically within an RTAS call.

The event-scan function is called periodically to check for the presence or past occurrence of a hardware event, such as a soft failure or voltage condition, which did not cause a program exception or interrupt (for example, an ECC error detected and corrected by background scrubbing activity). The check-exception function is called to provide further detail on what platform event has occurred when certain exceptions or interrupts are signaled. The events reported by these two functions are mutually exclusive on any given platform; that is, a platform may choose to notify the OS of a particular event type either through event-scan or through an interrupt and check-exception, but not both.

Since firmware is platform-specific, it can examine hardware registers, can often diagnose many kinds of hardware errors down to a root cause, and may even perform some very limited kinds of error recovery on behalf of the OS.
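The split between the polled path and the interrupt path described above can be sketched as a small model. This is an illustration only: `FakePlatform`, its `event_scan` and `check_exception` methods, `raise_event`, `periodic_poll`, and the event type names are all hypothetical stand-ins. They do not reflect the actual RTAS calling convention, arguments, or return values, which this introduction does not define.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakePlatform:
    """Hypothetical stand-in for platform firmware; not a real RTAS binding."""

    # Event types this platform reports through the periodic poll.
    polled_types: frozenset
    # Event types this platform reports through an interrupt plus a query.
    interrupt_types: frozenset
    pending_polled: list = field(default_factory=list)
    pending_interrupt: list = field(default_factory=list)

    def __post_init__(self):
        # A given event type is reported through one path only, never both.
        overlap = self.polled_types & self.interrupt_types
        if overlap:
            raise ValueError(f"event types reported both ways: {sorted(overlap)}")

    def raise_event(self, event_type: str) -> bool:
        """Queue an event; return True if an interrupt would be signaled."""
        if event_type in self.polled_types:
            self.pending_polled.append(event_type)
            return False
        if event_type in self.interrupt_types:
            self.pending_interrupt.append(event_type)
            return True
        raise KeyError(event_type)

    def event_scan(self) -> Optional[str]:
        """Model of the periodic poll: one queued event, or None."""
        return self.pending_polled.pop(0) if self.pending_polled else None

    def check_exception(self) -> Optional[str]:
        """Model of the query made after an exception or interrupt."""
        return self.pending_interrupt.pop(0) if self.pending_interrupt else None


def periodic_poll(platform: FakePlatform) -> list:
    """Drain everything the poll path currently has to report."""
    events = []
    while (event := platform.event_scan()) is not None:
        events.append(event)
    return events
```

In this model, an OS timer would call `periodic_poll`, while an interrupt handler would call `check_exception`. Because the two sets of event types are disjoint, no event is ever delivered twice.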
The reporting format, described in this chapter, permits firmware to report the type of error which has occurred, what entities in the platform were involved in the error, and whether firmware has successfully recovered from the error without the need for further OS involvement. Firmware may not, in many cases, be able to determine all the details of an error, so there are also returned values which indicate this fact. Firmware may optionally provide extended error diagnostic information, as described in . The abstractions provided by this architecture enable the handling of most platform errors and events without integrating platform-specific code into each supported OS.

Architecture Note: It is not a goal of the firmware to diagnose all hardware failures. Most I/O device failures, for example, will be detected and recovered by an associated device driver. Firmware attempts to determine the cause of a problem and report what it finds, to aid the end user (by providing meaningful diagnostic data for messages) and to prevent the loss of error syndrome information. Firmware is never required to correct any problem, but in some cases may attempt to do so. System vendors who want more extensive error diagnosis may create OS error handlers which contain specific hardware knowledge, or could use firmware to collect a minimum set of error information which could then be used by diagnostics to further analyze the cause of the error.
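As a rough illustration of the kind of information the reporting format carries, the sketch below models a report with an event type, the entities involved, and a recovery indication, any of which firmware may be unable to determine. The names `EventReport`, `event_type`, `entities`, `recovered`, `needs_os_attention`, and `describe` are hypothetical and do not represent the actual log layout or field encodings defined by the architecture.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EventReport:
    """Hypothetical, simplified view of a firmware event report."""

    event_type: Optional[str]         # None: firmware could not classify it
    entities: Tuple[str, ...] = ()    # empty: involved entities unknown
    recovered: Optional[bool] = None  # None: recovery status unknown


def needs_os_attention(report: EventReport) -> bool:
    """True unless firmware positively reports a successful recovery."""
    return report.recovered is not True


def describe(report: EventReport) -> str:
    """Build an end-user message from whatever firmware determined."""
    kind = report.event_type or "unknown event"
    where = ", ".join(report.entities) or "unknown entities"
    if report.recovered is None:
        status = "recovery status unknown"
    elif report.recovered:
        status = "recovered by firmware"
    else:
        status = "not recovered"
    return f"{kind} involving {where}: {status}"
```

Treating an unknown recovery status the same as "not recovered" is a conservative assumption of this sketch: the OS only skips further handling when firmware states that recovery succeeded.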