Documentation/edac.txt +110 −0

@@ -6,6 +6,8 @@
Written by Doug Thompson <dougthompson@xmission.com>
7 Dec 2005
17 Jul 2007	Updated

(c) Mauro Carvalho Chehab <mchehab@redhat.com>
05 Aug 2009	Nehalem interface

EDAC is maintained and written by:

@@ -717,3 +719,111 @@
unique drivers for their hardware systems.

The 'test_device_edac' sample driver is located at the
bluesmoke.sourceforge.net project site for EDAC.

=======================================================================
NEHALEM USAGE OF EDAC APIs

This chapter documents some EXPERIMENTAL mappings of the EDAC API used by
the Nehalem EDAC driver. They are likely to change in future versions
of the driver.

Due to the way Nehalem exports Memory Controller data, some adjustments
were made in the i7core_edac driver. This chapter covers those differences.

1) On Nehalem, there is one Memory Controller per QuickPath Interconnect
   (QPI). In the driver, the term "socket" means one QPI. It should also be
   associated with the physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees them as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API treats the csrow as the smallest unit, the
   driver exports one DIMM per csrow.

   Currently, it also exports the several memory controllers as just one.
   This limit will be removed in future versions of the driver.

2) The Nehalem MC has the ability to generate errors. The driver implements
   this functionality via some error injection nodes.

   For injecting a memory error, there are some sysfs nodes under
   /sys/devices/system/edac/mc/mc0/:

   inject_addrmatch:
	Controls the error injection mask register. It is possible to specify
	several characteristics of the address to match an error code:
	   dimm = the affected dimm. Numbers are relative to a channel;
	   rank = the memory rank;
	   channel = the channel that will generate an error;
	   bank = the affected bank;
	   page = the page address;
	   column (or col) = the address column.
	Each of the above values can be set to "any" to match any valid value.

	At driver init, all values are set to any.

	For example, to generate an error at rank 1 of dimm 2, for any channel,
	any bank, any page, any column:
		echo "dimm:2 rank:1" >/sys/devices/system/edac/mc/mc0/inject_addrmatch

	To return to the default behaviour of matching any, you can do:
		echo "dimm:any rank:any" >/sys/devices/system/edac/mc/mc0/inject_addrmatch

   inject_eccmask:
	Specifies which bits will be affected by the error.

   inject_section:
	Specifies which ECC cache section will get the error:
		3 for both
		2 for the highest
		1 for the lowest

   inject_socket:
	Specifies which QPI (or processor socket) will generate the error.
	On Xeon 35xx, it should be 0.
	On Xeon 55xx, it should be 0 or 1.

   inject_type:
	Specifies the type of error, being a combination of the following bits:
		bit 0 - repeat
		bit 1 - ecc
		bit 2 - parity

   inject_enable:
	Starts the error generation when a value other than 0 is written.

   All inject vars can be read. Root permission is needed to write them.

   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.
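Since inject_type is a combination of bits, the value to write is the
bitwise OR of the bits wanted (repeat = 1, ecc = 2, parity = 4). The
following is only a sketch of that arithmetic: inject_type_mask is a
hypothetical helper name, not something provided by the driver.

```shell
#!/bin/sh
# Hypothetical helper (not part of i7core_edac): build an inject_type
# value from the bit names documented above.
#   bit 0 - repeat, bit 1 - ecc, bit 2 - parity
inject_type_mask() {
	mask=0
	for flag in "$@"; do
		case "$flag" in
		repeat) mask=$((mask | 1)) ;;	# bit 0
		ecc)    mask=$((mask | 2)) ;;	# bit 1
		parity) mask=$((mask | 4)) ;;	# bit 2
		*) echo "unknown flag: $flag" >&2; return 1 ;;
		esac
	done
	echo "$mask"
}
```

With this helper, "inject_type_mask repeat ecc" prints 3, and the result
could be written to inject_type as root.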
   For example, the following commands will generate an error for any write
   access at socket 0, on any DIMM/address on channel 2:

	echo "channel:2" > /sys/devices/system/edac/mc/mc0/inject_addrmatch
	echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
	echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
	echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
	echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket
	echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
	dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   The generated error message will look like:

	EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

3) Nehalem specific Corrected Error memory counters

   Nehalem has some registers that count memory errors, reporting them in a
   way that differs from what the EDAC API allows. Because of that, a
   separate sysfs node was created to handle such counters.

   They can be read by looking at the contents of the
   "corrected_error_counts" node:

	$ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
	dimm0: 15866
	dimm1: 0
	dimm2: 27285
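The per-DIMM counters shown above can be post-processed with standard
tools. The following is a minimal sketch, assuming the node keeps the
"dimmN: count" one-entry-per-line layout shown in the sample output;
total_corrected_errors is a hypothetical helper name, and it reads from
standard input so it can be tried without Nehalem hardware.

```shell
#!/bin/sh
# Hypothetical helper: sum the per-DIMM corrected error counters.
# Assumes input lines of the form "dimmN: <count>", as in the sample
# output above; any other line is ignored.
total_corrected_errors() {
	awk -F': *' '/^[[:space:]]*dimm[0-9]+:/ { sum += $2 }
		     END { print sum + 0 }'
}
```

On a machine with the driver loaded, it could be used as:

	total_corrected_errors < /sys/devices/system/edac/mc/mc0/corrected_error_counts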