changeset 731:69e10455038e

PCI-Express AER implementation: aer howto document

Backported from upstream Linux.

Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
author Keir Fraser <keir.fraser@citrix.com>
date Mon Nov 24 11:00:47 2008 +0000 (2008-11-24)
parents 832aac894efd
children 5f10331cb88b
files Documentation/pcieaer-howto.txt
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/Documentation/pcieaer-howto.txt	Mon Nov 24 11:00:47 2008 +0000
     1.3 @@ -0,0 +1,253 @@
     1.4 +   The PCI Express Advanced Error Reporting Driver Guide HOWTO
     1.5 +		T. Long Nguyen	<tom.l.nguyen@intel.com>
     1.6 +		Yanmin Zhang	<yanmin.zhang@intel.com>
     1.7 +				07/29/2006
     1.8 +
     1.9 +
    1.10 +1. Overview
    1.11 +
    1.12 +1.1 About this guide
    1.13 +
    1.14 +This guide describes the basics of the PCI Express Advanced Error
    1.15 +Reporting (AER) driver and provides information on how to use it, as
    1.16 +well as how to enable the drivers of endpoint devices to conform with
    1.17 +the PCI Express AER driver.
    1.18 +
    1.19 +1.2 Copyright (C) Intel Corporation 2006.
    1.20 +
    1.21 +1.3 What is the PCI Express AER Driver?
    1.22 +
    1.23 +PCI Express error signaling can occur on the PCI Express link itself
    1.24 +or on behalf of transactions initiated on the link. PCI Express
    1.25 +defines two error reporting paradigms: the baseline capability and
    1.26 +the Advanced Error Reporting capability. The baseline capability is
    1.27 +required of all PCI Express components and provides a minimum defined
    1.28 +set of error reporting requirements. The Advanced Error Reporting
    1.29 +capability is implemented with a PCI Express advanced error reporting
    1.30 +extended capability structure and provides more robust error reporting.
    1.31 +
    1.32 +The PCI Express AER driver provides the infrastructure to support PCI
    1.33 +Express Advanced Error Reporting capability. The PCI Express AER
    1.34 +driver provides three basic functions:
    1.35 +
    1.36 +-	Gathers comprehensive error information when errors occur.
    1.37 +-	Reports errors to users.
    1.38 +-	Performs error recovery actions.
    1.39 +
    1.40 +The AER driver attaches only to root ports that support the PCI
    1.41 +Express AER capability.
    1.42 +
    1.43 +
    1.44 +2. User Guide
    1.45 +
    1.46 +2.1 Include the PCI Express AER Root Driver into the Linux Kernel
    1.47 +
    1.48 +The PCI Express AER Root driver is a Root Port service driver attached
    1.49 +to the PCI Express Port Bus driver. To use it, the driver must be
    1.50 +compiled into the kernel. Option CONFIG_PCIEAER enables it. It depends
    1.51 +on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and
    1.52 +CONFIG_PCIEAER=y.
    1.53 +
    1.54 +2.2 Load PCI Express AER Root Driver
    1.55 +Some systems have AER support in the BIOS. Enabling the AER Root
    1.56 +driver while the BIOS also handles AER may result in unpredictable
    1.57 +behavior. To avoid this conflict, a successful load of the AER Root driver
    1.58 +requires ACPI _OSC support in the BIOS so that the AER Root driver can
    1.59 +request native control of AER. See the PCI FW 3.0 Specification for
    1.60 +details regarding _OSC usage. Currently, many firmware implementations
    1.61 +do not provide _OSC support even though they use PCI Express. To support
    1.62 +such firmware, forceload, a parameter of type bool, lets AER
    1.63 +initialization continue even though the firmware has no _OSC support. To
    1.64 +enable the workaround, please add aerdriver.forceload=y to the kernel
    1.65 +boot parameter line. Note that forceload=n by default.
    1.66 +
    1.67 +2.3 AER error output
    1.68 +When a PCI-E AER error is captured, an error message is printed to the
    1.69 +console. If it is a correctable error, it is printed as a warning.
    1.70 +Otherwise, it is printed as an error. Users can therefore choose a log
    1.71 +level that filters out correctable error messages.
    1.72 +
    1.73 +An example is shown below.
    1.74 ++------ PCI-Express Device Error -----+
    1.75 +Error Severity          : Uncorrected (Fatal)
    1.76 +PCIE Bus Error type     : Transaction Layer
    1.77 +Unsupported Request     : First
    1.78 +Requester ID            : 0500
    1.79 +VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h
    1.80 +TLB Header:
    1.81 +04000001 00200a03 05010000 00050100
    1.82 +
    1.83 +In the example, 'Requester ID' is the ID of the device that sent the
    1.84 +error message to the root port. Please refer to the PCI Express
    1.85 +specifications for the other fields.
    1.86 +
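As a reading aid for the sample output above, here is a minimal sketch of how a 16-bit Requester ID splits into bus, device and function numbers. It assumes the standard PCI Express layout (bus in bits 15:8, device in bits 7:3, function in bits 2:0). The struct and function names are made up for illustration and are not part of the AER driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type for illustration; not a kernel structure. */
struct requester_id {
	uint8_t bus;	/* bits 15:8 */
	uint8_t dev;	/* bits 7:3 */
	uint8_t fn;	/* bits 2:0 */
};

/* Split a 16-bit Requester ID into its bus/device/function fields. */
struct requester_id decode_requester_id(uint16_t id)
{
	struct requester_id r;

	r.bus = (uint8_t)(id >> 8);
	r.dev = (uint8_t)((id >> 3) & 0x1f);
	r.fn = (uint8_t)(id & 0x07);
	return r;
}
```

For the Requester ID 0500 in the sample, decode_requester_id(0x0500) gives bus 05h, device 00h, function 00h, which agrees with the Bus/Device/Function line printed below it.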
    1.87 +
    1.88 +3. Developer Guide
    1.89 +
    1.90 +To be AER aware, a software driver must configure the AER capability
    1.91 +structure within its device and provide callbacks.
    1.92 +
    1.93 +To support AER well, developers first need to understand how AER
    1.94 +works.
    1.95 +
    1.96 +PCI Express errors are classified into two types: correctable errors
    1.97 +and uncorrectable errors. This classification is based on the impact
    1.98 +of those errors, which may result in degraded performance or function
    1.99 +failure.
   1.100 +
   1.101 +Correctable errors have no impact on the functionality of the
   1.102 +interface. The PCI Express protocol can recover without any software
   1.103 +intervention or any loss of data. These errors are detected and
   1.104 +corrected by hardware. Unlike correctable errors, uncorrectable
   1.105 +errors impact functionality of the interface. Uncorrectable errors
   1.106 +can cause a particular transaction or a particular PCI Express link
   1.107 +to be unreliable. Depending on those error conditions, uncorrectable
   1.108 +errors are further classified into non-fatal errors and fatal errors.
   1.109 +Non-fatal errors cause the particular transaction to be unreliable,
   1.110 +but the PCI Express link itself is fully functional. Fatal errors, on
   1.111 +the other hand, cause the link to be unreliable.
   1.112 +
   1.113 +When AER is enabled, a PCI Express device automatically sends an
   1.114 +error message to the PCI Express Root Port above it when the device
   1.115 +captures an error. The Root Port, upon receiving an error reporting
   1.116 +message, internally processes and logs the error message in its PCI
   1.117 +Express capability structure. Logging includes storing the error
   1.118 +reporting agent's requester ID into the Error Source Identification
   1.119 +Registers and setting the error bits of the Root Error Status
   1.120 +Register accordingly. If AER error reporting is enabled in the Root
   1.121 +Error Command Register, the Root Port generates an interrupt when an
   1.122 +error is detected.
   1.123 +
   1.124 +Note that the errors as described above are related to the PCI Express
   1.125 +hierarchy and links. These errors do not include any device specific
   1.126 +errors because device specific errors will still get sent directly to
   1.127 +the device driver.
   1.128 +
   1.129 +3.1 Configure the AER capability structure
   1.130 +
   1.131 +AER aware drivers of PCI Express components need to change the device
   1.132 +control registers to enable AER. They can also change AER registers,
   1.133 +including the mask and severity registers. The helper function
   1.134 +pci_enable_pcie_error_reporting can be used to enable AER. See
   1.135 +section 3.3.
   1.136 +
   1.137 +3.2 Provide callbacks
   1.138 +
   1.139 +3.2.1 Callback reset_link to reset the PCI Express link
   1.140 +
   1.141 +This callback is used to reset the PCI Express physical link when a
   1.142 +fatal error happens. The root port AER service driver provides a
   1.143 +default reset_link function, but different upstream ports might
   1.144 +have different specifications to reset the PCI Express link, so all
   1.145 +upstream ports should provide their own reset_link functions.
   1.146 +
   1.147 +In struct pcie_port_service_driver, a new pointer, reset_link, is
   1.148 +added.
   1.149 +
   1.150 +pci_ers_result_t (*reset_link) (struct pci_dev *dev);
   1.151 +
   1.152 +Section 3.2.2 provides more detailed info on when reset_link is
   1.153 +called.
   1.154 +
   1.155 +3.2.2 PCI error-recovery callbacks
   1.156 +
   1.157 +The PCI Express AER Root driver uses error callbacks to coordinate
   1.158 +with the downstream device drivers in the affected hierarchy
   1.159 +when performing error recovery actions.
   1.160 +
   1.161 +struct pci_driver has a pointer, err_handler, that points to a
   1.162 +pci_error_handlers structure consisting of several callback function
   1.163 +pointers. The AER driver follows the rules defined in
   1.164 +pci-error-recovery.txt except for the PCI Express specific parts (e.g.
   1.165 +reset_link). Please refer to pci-error-recovery.txt for detailed
   1.166 +definitions of the callbacks.
   1.167 +
   1.168 +The sections below specify when the error callback functions are called.
   1.169 +
   1.170 +3.2.2.1 Correctable errors
   1.171 +
   1.172 +Correctable errors have no impact on the functionality of
   1.173 +the interface. The PCI Express protocol can recover without any
   1.174 +software intervention or any loss of data. These errors do not
   1.175 +require any recovery actions. The AER driver clears the device's
   1.176 +correctable error status register accordingly and logs these errors.
   1.177 +
   1.178 +3.2.2.2 Non-correctable (non-fatal and fatal) errors
   1.179 +
   1.180 +If an error message indicates a non-fatal error, performing a link reset
   1.181 +upstream is not required. The AER driver calls error_detected(dev,
   1.182 +pci_channel_io_normal) for all drivers within the hierarchy in
   1.183 +question. For example:
   1.184 +EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
   1.185 +If Upstream port A captures an AER error, the hierarchy consists of
   1.186 +Downstream port B and EndPoint.
   1.187 +
   1.188 +A driver may return PCI_ERS_RESULT_CAN_RECOVER,
   1.189 +PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
   1.190 +whether it can recover and whether the AER driver should call
   1.190 +mmio_enabled next.
   1.191 +
   1.192 +If an error message indicates a fatal error, the kernel broadcasts
   1.193 +error_detected(dev, pci_channel_io_frozen) to all drivers within
   1.194 +the hierarchy in question. Then a link reset at the upstream
   1.195 +component is necessary. As different kinds of devices might use
   1.196 +different approaches to reset the link, the AER port service driver is
   1.197 +required to provide the function that resets the link. First, the kernel
   1.198 +checks whether the upstream component has an AER driver. If it does,
   1.199 +the kernel uses the reset_link callback of that AER driver. If the
   1.200 +upstream component has no AER driver and the port is a downstream port,
   1.201 +the kernel uses the AER driver of the root port that reported the AER
   1.202 +error. Upstream ports should provide their own AER service drivers
   1.203 +with a reset_link function. If error_detected returns
   1.204 +PCI_ERS_RESULT_CAN_RECOVER and reset_link returns
   1.205 +PCI_ERS_RESULT_RECOVERED, the error handling goes to mmio_enabled.
   1.206 +
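The decisions described in section 3.2.2 can be summarized in a small userspace model. This is only a sketch of the text above, not the AER driver's code: the enum names and values are local stand-ins chosen for illustration, and every outcome the text does not spell out is collapsed into STEP_OTHER (pci-error-recovery.txt defines those cases).

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the kernel names used in the text.
 * The numeric values are arbitrary and are NOT the kernel's. */
enum severity { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };
enum ers_result {
	ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED
};
enum next_step {
	STEP_LOG_ONLY,		/* correctable: clear status and log */
	STEP_MMIO_ENABLED,	/* continue with the mmio_enabled callback */
	STEP_OTHER		/* not covered here, see pci-error-recovery.txt */
};

/* Per the text, only fatal errors require a link reset upstream. */
bool needs_link_reset(enum severity sev)
{
	return sev == SEV_FATAL;
}

/* Decide what follows error_detected (and reset_link, if it ran).
 * The reset_link argument is ignored for non-fatal errors. */
enum next_step after_error_detected(enum severity sev,
				    enum ers_result detected,
				    enum ers_result reset_link)
{
	if (sev == SEV_CORRECTABLE)
		return STEP_LOG_ONLY;
	if (detected != ERS_CAN_RECOVER)
		return STEP_OTHER;
	if (needs_link_reset(sev) && reset_link != ERS_RECOVERED)
		return STEP_OTHER;
	return STEP_MMIO_ENABLED;
}
```

For example, a fatal error where error_detected reports that recovery is possible and reset_link succeeds maps to STEP_MMIO_ENABLED, while the same error with a failed link reset falls into STEP_OTHER.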
   1.207 +3.3 Helper functions
   1.208 +
   1.209 +3.3.1 int pci_find_aer_capability(struct pci_dev *dev);
   1.210 +pci_find_aer_capability locates the PCI Express AER capability
   1.211 +in the device configuration space. If the device doesn't support
   1.212 +PCI-Express AER, the function returns 0.
   1.213 +
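To illustrate what locating the AER capability involves, the sketch below walks the PCI Express extended capability list in a plain byte image of configuration space. It assumes the standard layout: the list starts at offset 100h, and each 32-bit header holds the capability ID in bits 15:0 and the next offset in bits 31:20, with AER using ID 0001h. The function names and the in-memory image are illustrative assumptions. The kernel helper works on a struct pci_dev and real config reads, not on a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_SIZE	4096	/* PCI Express extended config space */
#define EXT_CAP_START	0x100	/* first extended capability header */
#define EXT_CAP_ID_AER	0x0001	/* AER extended capability ID */

/* Read a little-endian 32-bit value from a config space image. */
uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
	return (uint32_t)cfg[off] |
	       ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) |
	       ((uint32_t)cfg[off + 3] << 24);
}

/* Return the offset of the AER capability, or 0 if it is absent.
 * cfg must point to a CFG_SPACE_SIZE-byte image. */
int find_aer_capability(const uint8_t *cfg)
{
	size_t pos = EXT_CAP_START;
	int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8;	/* bound the walk */

	assert(cfg != NULL);
	while (ttl-- > 0 && pos >= EXT_CAP_START &&
	       pos <= CFG_SPACE_SIZE - 4) {
		uint32_t hdr = cfg_read32(cfg, pos);

		if (hdr == 0)		/* no (more) extended capabilities */
			break;
		if ((hdr & 0xffff) == EXT_CAP_ID_AER)
			return (int)pos;
		pos = (hdr >> 20) & 0xffc;	/* next offset, dword aligned */
	}
	return 0;
}
```

As with the helper described above, a return value of 0 means the device image has no AER capability. The ttl counter only guards against a malformed list that loops back on itself.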
   1.214 +3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
   1.215 +pci_enable_pcie_error_reporting enables the device to send error
   1.216 +messages to the root port when an error is detected. Note that devices
   1.217 +don't enable error reporting by default, so device drivers need to
   1.218 +call this function to enable it.
   1.219 +
   1.220 +3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
   1.221 +pci_disable_pcie_error_reporting stops the device from sending error
   1.222 +messages to the root port when an error is detected.
   1.223 +
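The enable and disable helpers come down to setting or clearing the error reporting enable bits in the Device Control register of the PCI Express capability. The sketch below models that on a plain 16-bit value. It assumes the helpers toggle all four reporting bits (correctable, non-fatal, fatal, unsupported request). The macro and function names are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Device Control register bits in the PCI Express capability. */
#define DEVCTL_CERE	0x0001	/* Correctable Error Reporting Enable */
#define DEVCTL_NFERE	0x0002	/* Non-Fatal Error Reporting Enable */
#define DEVCTL_FERE	0x0004	/* Fatal Error Reporting Enable */
#define DEVCTL_URRE	0x0008	/* Unsupported Request Reporting Enable */

#define DEVCTL_ERR_MASK \
	(DEVCTL_CERE | DEVCTL_NFERE | DEVCTL_FERE | DEVCTL_URRE)

/* Return the register value with all error reporting bits set. */
uint16_t devctl_enable_error_reporting(uint16_t devctl)
{
	return (uint16_t)(devctl | DEVCTL_ERR_MASK);
}

/* Return the register value with all error reporting bits cleared. */
uint16_t devctl_disable_error_reporting(uint16_t devctl)
{
	return (uint16_t)(devctl & ~DEVCTL_ERR_MASK);
}
```

Both functions leave the other Device Control bits untouched, which is why a driver would read the register, modify it, and write it back, not write a constant.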
   1.224 +3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
   1.225 +pci_cleanup_aer_uncorrect_error_status clears the uncorrectable
   1.226 +error status register.
   1.227 +
   1.228 +3.4 Frequently Asked Questions
   1.229 +
   1.230 +Q: What happens if a PCI Express device driver does not provide an
   1.231 +error recovery handler (pci_driver->err_handler is equal to NULL)?
   1.232 +
   1.233 +A: The devices attached to the driver won't be recovered. If the
   1.234 +error is fatal, the kernel will print warning messages. Please refer
   1.235 +to section 3 for more information.
   1.236 +
   1.237 +Q: What happens if an upstream port service driver does not provide
   1.238 +callback reset_link?
   1.239 +
   1.240 +A: Fatal error recovery will fail if the errors are reported by the
   1.241 +upstream ports to which the service driver is attached.
   1.242 +
   1.243 +Q: How does this infrastructure deal with a driver that is not PCI
   1.244 +Express aware?
   1.245 +
   1.246 +A: This infrastructure calls the error callback functions of the
   1.247 +driver when an error happens. But if the driver is not aware of
   1.248 +PCI Express, the device might not report its own errors to the root
   1.249 +port.
   1.250 +
   1.251 +Q: What modifications does that driver need to make it compatible
   1.252 +with the PCI Express AER Root driver?
   1.253 +
   1.254 +A: It can call the helper functions to enable AER in its devices and
   1.255 +clean up the uncorrectable status register. Please see section 3.3.
   1.256 +