Error Management Solutions Synergy With WHEA

John Strange
Software Design Engineer
Core OS
JohnStra @ microsoft.com
Microsoft Corporation

Session Outline

WHEA Overview

Hardware Error Sources

Hardware Error Management Solutions

WHEA Integration

PCI Express Advanced Error Reporting (AER) Example

Session Goals

Attendees should leave this session with the following:

A good understanding of how platform hardware/firmware, device drivers, and error management software integrate with WHEA

Knowledge of where to find resources for WHEA

WHEA Overview

Architecture - Overview

[Figure: WHEA architecture block diagram spanning user mode and kernel mode. Labels: HW Error Event Consumer; ETW Event; WheaReportHwErr; I/O Bus Driver; HAL; MCE LLHEH; CPEI LLHEH; PCIe LLHEH; LLHEH; Other Error Source; Platform-specific Hardware Error Driver; Plug-Ins; Platform (HW/FW).]

Key Components

Platform Specific Hardware Error Driver (PSHED)

Low-Level Hardware Error Handler (LLHEH)

WheaReportHwError – Entry point to OS common error handling

Error Record – Common OS error record

Error Event Consumers

Error Sources

Hardware Error Source

An error source is a mechanism that notifies software of hardware error conditions and provides information to describe the error condition

Notification may be via interrupt, polling of error status registers, or callback from system firmware

Error data may be recorded in hardware registers, mapped to PCI configuration space, provided by a system firmware interface, etc.

Hardware Error Sources and WHEA

WHEA targets platform-level error sources

Platform-level error sources usually aggregate error reporting for multiple devices

Error Source                Hardware
Machine Check               Processor, Cache, TLBs, Memory
Corrected Platform Error    Memory controller
Non-maskable Interrupt      IO Bus
PCI Express                 Device, Root Complex

Managing Error Sources

Managing Hardware Error Sources

WHEA enables management of error sources

A number of attributes associated with a given error source may be manageable

Platform OEMs specify this functionality

They can decide which attributes are exposed to be viewed and/or modified

WHEA enables programmatic control over the attributes associated with an error source

Whether an error source is enabled/disabled

Thresholds associated with an error source

Control register settings of a particular error source:

Error Severity Mappings

Error Masking Settings
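
As a rough illustration of the attribute model on this slide, the sketch below represents one error source in Python. The class, field, and method names (ErrorSource, oem_writable, set_attribute) are hypothetical and are not WHEA interfaces. The only idea taken from the slide is that the OEM decides which attributes may be modified.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorSource:
    """Toy model of one manageable error source (all names hypothetical)."""
    name: str
    enabled: bool = True
    threshold: int = 1                      # errors tolerated before notification
    severity_map: dict = field(default_factory=dict)
    error_mask: int = 0
    # Attributes the OEM has chosen to expose for modification (assumed set).
    oem_writable: frozenset = frozenset({"enabled", "threshold"})

    def set_attribute(self, attr, value):
        """Apply a setting only if the OEM exposed it as writable."""
        if attr not in self.oem_writable:
            return False
        setattr(self, attr, value)
        return True


src = ErrorSource(name="PCI Express root port")
changed = src.set_attribute("threshold", 10)       # exposed, so applied
rejected = src.set_attribute("error_mask", 0xFF)   # not exposed, so ignored
```

In this sketch a rejected write simply returns False; a real interface would more likely report a status code to the caller.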

Managing Hardware Error Sources (cont’d)

OS queries the PSHED for a table of all the error sources on a given platform

PSHED interfaces with the platform to extract this information and return it to the OS

The OS makes this information available to management applications

Some of this information may be settable only by privileged entities

These interfaces will be available during OS install, so platform-appropriate settings may be applied during setup

This capability solves BIOS/OS conflicts over error source settings
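
The query path described on this slide can be pictured with a small Python model. Everything here is an assumption made for illustration: Pshed, ErrorSourceTable, and their methods are invented names, and the two table entries are sample data rather than output from any real platform.

```python
class Pshed:
    """Stand-in for the platform-specific driver (hypothetical interface)."""

    def get_error_sources(self):
        # A real PSHED would obtain this from the platform; sample data here.
        return [
            {"id": 0, "type": "Machine Check", "enabled": True},
            {"id": 1, "type": "PCI Express", "enabled": True},
        ]


class ErrorSourceTable:
    """OS-side view of the error sources reported by the PSHED."""

    def __init__(self, pshed):
        self._sources = {s["id"]: dict(s) for s in pshed.get_error_sources()}

    def query(self):
        # Management applications get copies, never the live entries.
        return [dict(s) for s in self._sources.values()]

    def set_enabled(self, source_id, enabled, privileged):
        # Settings may be changed only by privileged callers.
        if not privileged:
            raise PermissionError("setting requires a privileged caller")
        self._sources[source_id]["enabled"] = enabled


table = ErrorSourceTable(Pshed())
sources = table.query()
table.set_enabled(1, False, privileged=True)
```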

Hardware Error Management Solutions

Existing hardware error management solutions are necessarily proprietary

Even those based on standards such as the Intelligent Platform Management Interface (IPMI) record error information in proprietary format in the SEL (system event log)

A generic SDR (sensor data record) is used and record size constraints limit the richness of the error records

Proprietary applications can consume and perform management operations on the proprietary error data

These applications retrieve the error information in a proprietary manner – usually via a collection of device drivers that present the information to the management application

Hardware Error Management Solutions (cont’d)

WHEA enables generic hardware error management solutions

Published error record format

ETW-based error eventing model allows management applications to subscribe for the events in which they are interested

WHEA permits value-add extensibility by having unstructured (e.g. proprietary) error data added to error records

WHEA error records are potentially very rich in content and include OS context information to aid in problem diagnosis and resolution
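
The eventing model above can be thought of as a publish/subscribe arrangement. The Python sketch below is only a toy stand-in for it: it does not use the ETW API, and the class name, the record fields, and the sample values are all assumptions made for illustration.

```python
class ErrorEventChannel:
    """Toy publish/subscribe channel standing in for ETW (hypothetical)."""

    def __init__(self):
        self._subscribers = {}   # source type -> list of callbacks

    def subscribe(self, source_type, callback):
        self._subscribers.setdefault(source_type, []).append(callback)

    def publish(self, record):
        # Deliver the record only to consumers interested in its source type.
        delivered = 0
        for callback in self._subscribers.get(record["source_type"], []):
            callback(record)
            delivered += 1
        return delivered


received = []
channel = ErrorEventChannel()
channel.subscribe("PCI Express", received.append)

record = {
    "source_type": "PCI Express",
    "severity": "corrected",
    "os_context": {"processor": 0},   # OS-supplied context (illustrative)
    "vendor_data": b"\x01\x02",       # unstructured, proprietary section
}
pcie_count = channel.publish(record)
mce_count = channel.publish({"source_type": "Machine Check", "severity": "fatal"})
```

The record carries both commonly understood fields and an opaque vendor section, mirroring the mix of published format and value-add data described above.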

WHEA Integration

How do solution providers integrate with WHEA?

System firmware/platform support: Implement platform interfaces required by WHEA (e.g. Error Source Discovery and Error Record Serialization)

PSHED plug-ins: Augment and/or override the behavior of the default per-processor-architecture PSHED

LLHEHs: Device drivers for some hardware error sources may be made WHEA-aware to report hardware errors to the system

Consumer applications: User-mode applications that perform health-monitoring and other higher-level error management functions

WHEA Integration - Platform Support

OEMs will be required to implement at least minimal WHEA support to obtain Logo

Error Source Discovery

Error Record Serialization

Opportunities exist for even tighter integration with the OS

Adopting the WHEA error record format as the platform’s native error record

Improved platform-level mechanisms for reporting error conditions to the OS (e.g. using extended PCI config space and a structured error data format)

WHEA Integration - LLHEHs

Bus drivers might be in charge of error sources that need to be exposed to WHEA

Endpoint devices are not expected to do this

Device drivers that fall into this category implement LLHEHs which handle errors and report them to the kernel

WHEA Integration - PSHED Plug-Ins

The PSHED houses all hardware error related interactions between the OS and the platform

The PSHED represents an opportunity for OEMs to rethink how some error handling features are implemented

Some functionality may be moved into the PSHED rather than BIOS/FW

Portions of the functionality may stay in BIOS/FW and PSHED plug-ins may interface with these functions

WHEA Integration – Management Applications

Management applications implement high-level error monitoring, reporting, and potentially recovery capabilities

These applications subscribe to receive error event notifications via ETW

Generic processing of all error events is possible given the common error record format

Extended processing of error events is possible through unstructured (private) error information recorded in the error record
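
The split between generic and extended processing might look like the following Python sketch. The record fields and the vendor data layout (two little-endian 16-bit values) are invented for this example; a real consumer would follow the published record format and its vendor's own documentation.

```python
import struct


def summarize(record):
    """Generic processing: relies only on fields every record shares."""
    return f"{record['source_type']}: {record['severity']}"


def decode_vendor_data(record):
    """Extended processing: needs private knowledge of the vendor layout.

    Assumed layout (invented for this sketch): two little-endian
    16-bit values, a register id followed by a register value.
    """
    data = record.get("vendor_data", b"")
    if len(data) < 4:
        return None
    register_id, value = struct.unpack_from("<HH", data)
    return {"register_id": register_id, "value": value}


record = {
    "source_type": "PCI Express",
    "severity": "corrected",
    "vendor_data": struct.pack("<HH", 0x0010, 0x00FF),
}
summary = summarize(record)
details = decode_vendor_data(record)
no_details = decode_vendor_data({"source_type": "Machine Check", "severity": "fatal"})
```

A consumer without the private knowledge can still call summarize on every record and simply skip the vendor section.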

PCI Express Example

PCI Express AER Example

PCI Express Advanced Error Reporting (AER) represents a good technology to use in an example

This example will show how PCI Express AER support can be integrated into WHEA

PCI Express AER Example – Platform-Level Support

The platform BIOS must surface PCI Express AER as a platform error source

Possible mechanisms include: ACPI Table or EFI runtime interface

The platform must grant OS control of PCI Express error handling via ACPI _OSC

Assume our example platform implements some non-standard PCI Express error registers that capture platform-specific information in addition to the standardized AER error registers.

PCI Express AER Example – LLHEH

The PCI bus driver will implement the root port interrupt handler which receives error interrupts

Therefore, the PCI bus driver will implement the LLHEH for PCI Express AER

To accomplish this, the PCI bus driver must:

Implement an ErrorSourceInitializer callback routine to initialize error reporting resources

From its DriverEntry routine, register the ErrorSourceInitializer callback by calling WheaRegisterErrSrcInitializer

After the initializer routine has been called, the bus driver can report hardware errors to the kernel
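
The ordering described on this slide (register in DriverEntry, get initialized later, report only afterwards) can be modeled in a few lines of Python. This is a simulation, not driver code: Kernel, PciBusDriver, and the snake_case method names are hypothetical stand-ins for the routines named above, and their signatures are assumptions.

```python
class Kernel:
    """Toy stand-in for the OS side of the registration handshake."""

    def __init__(self):
        self._initializers = []
        self.initialized = set()

    def register_error_source_initializer(self, source_type, callback):
        self._initializers.append((source_type, callback))

    def initialize_error_sources(self):
        # The OS calls each registered initializer at a time of its choosing.
        for source_type, callback in self._initializers:
            if callback():
                self.initialized.add(source_type)

    def report_hw_error(self, packet):
        if packet["source_type"] not in self.initialized:
            raise RuntimeError("error source not initialized")
        return "accepted"


class PciBusDriver:
    def __init__(self, kernel):
        self.kernel = kernel
        self.resources_ready = False

    def driver_entry(self):
        # Registration only; no error reporting resources exist yet.
        self.kernel.register_error_source_initializer(
            "PCIe", self.error_source_initializer)

    def error_source_initializer(self):
        self.resources_ready = True   # allocate reporting resources here
        return True


kernel = Kernel()
driver = PciBusDriver(kernel)
driver.driver_entry()
kernel.initialize_error_sources()
status = kernel.report_hw_error({"source_type": "PCIe"})
```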

PCI Express AER Example – LLHEH (cont’d)

Upon detecting a PCI Express error, the PCI bus driver does the following:

Creates and initializes a WHEA_ERROR_PACKET using the error information it extracts from the PCI Express AER error status in extended config space

The driver is responsible for mapping the error severity reported by the device into one of WHEA’s error severity levels

Calls the PSHED’s PshedRetrieveErrorInfo routine, passing a pointer to the WHEA_ERROR_PACKET

Calls WheaReportHwError, supplying a pointer to the WHEA_ERROR_PACKET
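
The three steps above can be sketched as a single handler. This Python model uses a plain dictionary in place of WHEA_ERROR_PACKET and takes the two routines as parameters so it can run stand-alone. The severity names on the WHEA side, the packet fields, and the mapping itself are assumptions for illustration; the slide leaves the actual mapping to the driver.

```python
# AER distinguishes three error classes; the WHEA-side names are illustrative.
AER_TO_WHEA_SEVERITY = {
    "correctable": "corrected",
    "uncorrectable_nonfatal": "recoverable",
    "uncorrectable_fatal": "fatal",
}


def handle_pcie_error(aer_status, aer_class, retrieve_error_info, report_hw_error):
    # Step 1: build the packet from the AER status and map the severity.
    packet = {
        "source_type": "PCI Express",
        "severity": AER_TO_WHEA_SEVERITY[aer_class],
        "aer_status": aer_status,
        "platform_data": b"",
    }
    # Step 2: let the PSHED (and any plug-in) add platform-specific data.
    retrieve_error_info(packet)
    # Step 3: hand the packet to the OS common error-handling entry point.
    return report_hw_error(packet)


reported = []


def fake_retrieve_error_info(packet):
    packet["platform_data"] = b"\xAA"   # pretend a plug-in added one byte


def fake_report_hw_error(packet):
    reported.append(packet)
    return True


ok = handle_pcie_error(0x1000, "uncorrectable_nonfatal",
                       fake_retrieve_error_info, fake_report_hw_error)
```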

PCI Express AER Example – PSHED

Remember, our example platform implements a set of non-standard PCI Express error registers

A PSHED plug-in might participate in the error source discovery functionality to ensure that the OS sizes the WHEA_ERROR_PACKET for the PCI Express error source to accommodate the additional error information

PshedRetrieveErrorInfo is called by the LLHEH when it detects an error condition

A plug-in could extract the information in the non-standard error registers and add that information to the error packet

PCI Express AER Example – PSHED (cont’d)

The PSHED will be called by WheaReportHwError to finalize construction of the error record

At this point, a PSHED plug-in could use platform-specific information to populate additional error sections in the error record

Note that the suggested approach gracefully accommodates platform differentiation

An entry-level server line might ship without the PSHED plug-in and its error reporting capabilities would not include the additional non-standard registers

A higher-level server line should ship with the plug-in and therefore offer extended error reporting (and possibly recovery) capabilities
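
The differentiation point can be made concrete with a short Python sketch in which the same record-building path runs with or without a plug-in. The function names, the section layout, and the register value are all hypothetical; only the idea that a plug-in contributes an extra section comes from the slide.

```python
def read_nonstandard_registers():
    """Hypothetical plug-in hook; returns made-up register contents."""
    return {"vendor_error_reg": 0x2A}


def build_error_record(packet, plugin=None):
    """Finalize a record; a plug-in, if present, adds an extra section."""
    record = {"sections": [{"type": "pcie_aer", "status": packet["aer_status"]}]}
    if plugin is not None:
        record["sections"].append(
            {"type": "platform_specific", "data": plugin()})
    return record


packet = {"aer_status": 0x1000}
entry_level = build_error_record(packet)                          # no plug-in
high_end = build_error_record(packet, read_nonstandard_registers)
```

The entry-level record still contains the standard AER section, so generic consumers work on both product lines.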

PCI Express AER Example – Consumers

A targeted consumer (management application) might be written with special knowledge of the information contained in the platform’s non-standard PCI Express error registers

The consumer might implement extended error reporting, health monitoring, and even fail-over services

Call To Action

Send us your questions

Watch for WHEA logo requirements for Windows codenamed “Longhorn”

Evaluate how your products will integrate with WHEA

Community Resources

Windows Hardware & Driver Central (WHDC): www.microsoft.com/whdc/default.mspx

Technical Communities: www.microsoft.com/communities/products/default.mspx

Non-Microsoft Community Sites: www.microsoft.com/communities/related/default.mspx

Microsoft Public Newsgroups: www.microsoft.com/communities/newsgroups

Technical Chats and Webcasts: www.microsoft.com/communities/chats/default.mspx and www.microsoft.com/webcasts

Microsoft Blogs: www.microsoft.com/communities/blogs

Additional Resources

Email: Send feedback and questions to WHEAFB @ microsoft.com

© 2005 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.