CPU and Memory Events
Topics
CPU architecture
Error reporting banks
Types of errors and handling
Addressing memory discussion and example
Examples of various error messages
Utilities and programs
X64 DIMM replacement guidelines
CPU Architecture
Opteron Processor Overview
Dual Core Opteron
Cache and Memory
Cache Organisation
Cache Details
L1 64 KB per core, 2-way set associative
L1 data cache protected by ECC
L1 instruction cache protected by parity
L2 cache 16-way set associative
L2 1 MB per core, both data and instructions
L2 protected by ECC
Least Recently Used (LRU) replacement algorithm
Translation Lookaside Buffer
L1 32 Entries
L1 Fully associative
L2 512 Entries
L2 4-way associative
Traditional Northbridge
Opteron Northbridge
On Processor Die (Node)
Up to 3 HyperTransport Link Interfaces
Memory controller
Interface to memory
Interface to CPU cores
ECC errors are detected and corrected here
On dual core Nodes it is shared between the two CPU cores
Opteron server overview
Rev E CPUs DDR1 memory
Rev F CPUs (M2 systems) DDR2 memory
4 DIMM slots per CPU (at present)
Servers utilise both memory channels in parallel, allowing a 128-bit access to memory + 16 ECC bits
Chipkill mode (able to correct up to 4 bits in error if the bits lie within nybble boundaries)
Capability to address up to 1TB
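The nybble-boundary rule for chipkill correction can be sketched in a few lines of Python. This is a hypothetical helper written for this TOI (not part of any shipped tool), assuming bit positions are numbered 0-143 across the 144-bit word:

```python
def correctable_by_chipkill(failing_bits):
    """Return True if all failing bit positions sit in one aligned nybble.

    failing_bits: iterable of bit positions (0-143) observed in error.
    Chipkill (x4 symbol correction) can fix up to 4 bad bits, provided
    they all belong to the same aligned 4-bit nybble, i.e. the same
    x4 DRAM chip.
    """
    bits = set(failing_bits)
    if not 1 <= len(bits) <= 4:
        return False
    # integer-divide by 4: every bit must map to the same nybble index
    return len({b // 4 for b in bits}) == 1

# bits 8-11 share nybble 2 -> correctable
print(correctable_by_chipkill([8, 9, 10, 11]))   # True
# bits 7 and 8 straddle a nybble boundary -> not correctable
print(correctable_by_chipkill([7, 8]))           # False
```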
Error Reporting Banks
Opteron Error Reporting Banks
Bank 0 Data cache(DC)
Bank 1 Instruction Cache(IC)
Bank 2 Bus Unit (BU)
Bank 3 Load/Store Unit (LS)
Bank 4 Northbridge(NB)
Error Reporting Bank Registers
Machine check control register (MCi_CTL)
Error reporting control register mask (MCi_CTL_MASK)
Machine check status register (MCi_STATUS)
Machine check address register(MCi_ADDR)
Role of registers
MCi_CTL allows control over which errors will be reported
MCi_CTL_MASK allows additional control over the errors reported
MCi_STATUS where error information gets reported, eg syndrome, type of error
MCi_ADDR physical address of failure - important in memory errors (Northbridge - bank 4)
Decoding MCi Status Registers
First discover which CPU or Node is reporting the error, and which error bank is reporting
The decode of the status register is dependent on the failing bank
To decode the error: often the OS or a package on the OS will do much of the work for you
If you have a Windows system available then consider using MCAT ( machine check analysis tool) http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_871_9033,00.html
Decoding MCi Status Registers Cont
Utilities on the web, eg parsemce - use with caution
Use Infodoc 78336, 82833
Manually, use the BIOS and Kernel Developer's Guides (make sure you use the correct one - note Rev F has a different guide) http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_739_9003,00.html
Open a collaboration task
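Much of the first pass can be done by hand on the architectural (bank-independent) bits of MCi_STATUS. A minimal Python sketch, written for this TOI: only the standard MCA bits are decoded here; bank-specific fields (syndromes, extended error codes) vary by bank and CPU revision and still need the Developer's Guide.

```python
def decode_mci_status(status):
    """Decode the bank-independent fields of an MCi_STATUS value.

    Only architectural MCA bits are handled; bank-specific fields
    are deliberately left out because their layout depends on the
    reporting bank and CPU revision.
    """
    return {
        "valid":           bool(status >> 63 & 1),  # VAL: error logged
        "overflow":        bool(status >> 62 & 1),  # OVER: earlier error lost
        "uncorrected":     bool(status >> 61 & 1),  # UC
        "enabled":         bool(status >> 60 & 1),  # EN: reporting enabled
        "misc_valid":      bool(status >> 59 & 1),  # MISCV
        "addr_valid":      bool(status >> 58 & 1),  # ADDRV: MCi_ADDR valid
        "context_corrupt": bool(status >> 57 & 1),  # PCC
        "error_code":      status & 0xFFFF,
    }

# Northbridge status taken from the mcelog example in this TOI
info = decode_mci_status(0xD422400040080A13)
print(info["valid"], info["overflow"], info["uncorrected"])  # True True False
print(hex(info["error_code"]))                               # 0xa13
```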
CHIPKILL + SYNDROMES
In the Opteron world chipkill is the ability to correct up to 4 contiguous memory bits
128 data bits + 16 ECC bits = 144 bits
Single symbol correction, double symbol detection
1 failing x4 memory chip can generate 16 separate syndromes
Syndromes can identify the failing bit or bits within the word
Syndromes will tell you which DIMM in a DIMM pair is failing - they will not identify the DIMM pair or associated CPU
Portion of chipkill syndrome table
128 bit memory word
64 bit memory word
You may see this on workstations
Configurations with only 1 DIMM
64 bits + 8 bits ECC
Can only correct single bits
Detect double bit errors
Syndrome is 8 bits
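The syndrome idea - ECC computed on write, recomputed on read, and XORed, with a non-zero result pinpointing the failing bit - can be demonstrated with a toy Hamming code over 8 data bits. This is far smaller than the real 64+8 or 128+16 codes, and written for this TOI only, but the principle is the same.

```python
def hamming_encode(byte):
    """Encode 8 data bits into a 12-bit Hamming codeword.

    Toy single-error-correcting code: parity bits sit at positions
    1, 2, 4, 8 (1-indexed); data bits fill the remaining positions.
    """
    data_pos = [3, 5, 6, 7, 9, 10, 11, 12]
    bits = {p: (byte >> i) & 1 for i, p in enumerate(data_pos)}
    for p in (1, 2, 4, 8):
        # parity p covers every position whose index has bit p set
        bits[p] = sum(bits[q] for q in data_pos if q & p) & 1
    return bits

def syndrome(bits):
    # XOR of the positions of all set bits: 0 means a clean read,
    # a non-zero value is the position of the flipped bit
    s = 0
    for pos, b in bits.items():
        if b:
            s ^= pos
    return s

word = hamming_encode(0xA5)
assert syndrome(word) == 0   # clean read
word[6] ^= 1                 # flip one bit in flight
print(syndrome(word))        # 6 -> failing bit identified
```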
64 bit word ECC syndrome table
Error Types and handling
Correctable ECC errors
BIOS will log to DMI/SEL during BIOS/POST
It is the responsibility of the OS to handle correctable errors
On V20z/40z, nps reports errors to the SP if the threshold (2 errors in 6 hours) is exceeded. Note the threshold does not correspond to the DIMM replacement guidelines (CR 6494195, 6386838); NSV 2.4.0.24 will fix this
How, if and where correctable ECC errors are reported is dependent on the type and revision of the OS and what packages are installed.
Handling Uncorrectable errors
Two main methods.
Sync Flood - analogous to a SPARC fatal reset
Machine Check Exception - an interrupt which the OS handles (panics)
Sync Flood
Sync Flooding is a HyperTransport method used to stop data propagation in the case of a serious error.
The device that detects the error initiates the sync flood.
All others cease operation and transmit sync flood packets.
Packets finally reach the South Bridge (eg nVidia CK8-04).
The BIOS has pre-programmed the SB to trigger the system RESET signal when a sync flood is detected.
The system reboots.
During Boot Block and POST, the BIOS analyzes the related error bits in all Nodes and reports the Sync Flood reasons.
First step in debugging: get hold of the SEL.
 001 | 01/03/2007 | 21:43:00 | OEM #0x12 | | Asserted
2101 | OEM record e0 | 00000000040f0c0200400000f2
2201 | OEM record e0 | 01000000040000000000000000
2301 | 01/03/2007 | 21:43:15 | Memory | Uncorrectable ECC | Asserted | CPU 1 DIMM 0
2401 | 01/03/2007 | 21:43:15 | Memory | Memory Device Disabled | Asserted | CPU 1 DIMM 0
2501 | 01/03/2007 | 21:43:18 | Memory p1.d1.fail | Predictive Failure Asserted
2601 | 01/03/2007 | 20:43:12 | System Firmware Progress | Motherboard initialization | Asserted
Sync Flood example SEL
Another example of sync flood error
- not so friendly -
1501 | 04/10/2007 | 04:18:02 | OEM #0x12 | | Asserted
1601 | OEM record e0 | 00004800001111002000000000
1701 | OEM record e0 | 10ab0000000810000006040012
1801 | OEM record e0 | 10ab0000001111002011110020
1901 | OEM record e0 | 1800000000f60000010005001b
1a01 | OEM record e0 | 180000000000000000dffe0000
1b01 | OEM record e0 | 1900000000f200002000020c0f
1c01 | OEM record e0 | 1a00000000f200001000020c0f
1d01 | OEM record e0 | 1b00000000f200003000020c0f
1e01 | OEM record e0 | 80004800001111032000000000
Machine check exception
For certain unrecoverable errors Machine Check Exceptions are generated
Generates an interrupt and the OS handles or tries to handle the error eg panics.
Linux machine check exception example
CPU 0: Machine Check Exception: 0000000000000004
CPU 0: Machine Check Exception: 0000000000000004
Bank 0: b600000000000185 at 0000000000000940
Kernel panic: CPU context corrupt
The above is from kernel:
2.4.21-27.0.1.ELsmp #1 SMP
Machine check exception example Solaris
WARNING: MCE: Bank 2: error code 0x863, mserrcode = 0x0
ifying DMI Pool Data ....
sched: #mc Machine check
pid=0, pc=0xfffffffffb8233ea, sp=0xfffffe8000293ad8, eflags=0x216
cr0: 8005003b cr4: 6f0
cr2: 8073c62 cr3: d3a7000 cr8: c
rdi: ffffffff812dadf0 rsi: ffffffff815f4df0 rdx: 1000
rcx: 42 r8: 1 r9: 1
rax: fffffe8000293c80 rbx: ffffffff81282e00 rbp: fffffe8000293b10
r10: 1 r11: 1 r12: 0
r13: ffffffff81282e00 r14: ffffffff81283318 r15: fffffe800025db40
fsb: ffffffff80000000 gsb: ffffffff81034000
ds: 43 es: 43 fs: 0 gs: 1c3
trp: 12 err: 0 rip: fffffffffb8233ea
cs: 28 rfl: 216 rsp: fffffe8000293ad8
Memory Addressing and Interleaving
Example of a DIMM layout
Contiguous addressing versus Interleaving
Contiguous: sequential addresses are allocated to the same rank of chips until its capacity is exhausted, and then another rank of chips is addressed
Interleaving: contiguous addresses are switched between different ranks of memory
Performance benefit to interleaving
Good discussion at URL:
http://systems-tsc/twiki/pub/Products/SunFireX4100FaqPts/OpteronMemInterlvNotes.pdf
Interleaving
Memory DIMMs need to be identical, and a power of 2 in number
Interleave at DIMM level (dual rank)
Interleave at DIMM pair level
Interleave at node level (not so common)
BIOS parameters
Complicates mapping address to DIMM pair
Rev F DIMM Interleave Addresses
Example of addressing
X4100 2 CPUs
4 x 1GB DIMMs per CPU
Micron 18VDDF12872G-40BD3
Dual rank DIMM
8 x 64 Meg memory chips/side + ECC chip
Simplified addressing no interleave
Possible 40 bits 0-39 to address 1TB
128-bit memory access, so the first 4 bits are the byte address and are not used to address memory
Bits 4 -14 Column address
Bits 15 16 Internal bank addressing
Bits 17-29 Row address
Bit 30 Chip select ( other side of DIMM)
Bit 31 Chip select ( other DIMM pair)
Bit 32 Selects other node
Simplified addressing - interleave
Possible 40 bits 0-39 to address 1TB
128-bit memory access, so the first 4 bits are the byte address and are not used to address memory
Bits 4 -14 Column address
Bits 15 16 Internal bank addressing
Bit 17 Chip select (swapped with bit 30)
Bit 18 Chip select (swapped with bit 31)
Bits 19-31 Row address (bits 30, 31 swapped with bits 17 and 18)
Bit 32 Selects other node
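The two simplified bit layouts above can be turned into a small decoder. This Python sketch follows the slide layouts only - real decoding depends on the DRAM CS base/mask registers and BIOS settings - and the function name is invented for illustration:

```python
def decode_simple(addr, interleaved=False):
    """Split a physical address per the simplified layouts above.

    Non-interleaved: bits 4-14 column, 15-16 bank, 17-29 row,
    30/31 chip selects, 32 node.  Interleaved: bits 17/18 act as the
    chip selects (swapped with 30/31), so consecutive 128 KB blocks
    alternate between ranks; the row moves up to bits 19-31.
    """
    fields = {
        "byte":   addr & 0xF,            # bits 0-3: within the 128-bit access
        "column": (addr >> 4) & 0x7FF,   # bits 4-14
        "bank":   (addr >> 15) & 0x3,    # bits 15-16: internal DRAM bank
        "node":   (addr >> 32) & 0x1,    # bit 32: selects other node
    }
    if interleaved:
        fields["cs_side"] = (addr >> 17) & 1       # swapped with bit 30
        fields["cs_pair"] = (addr >> 18) & 1       # swapped with bit 31
        fields["row"]     = (addr >> 19) & 0x1FFF  # bits 19-31
    else:
        fields["row"]     = (addr >> 17) & 0x1FFF  # bits 17-29
        fields["cs_side"] = (addr >> 30) & 1       # other side of DIMM
        fields["cs_pair"] = (addr >> 31) & 1       # other DIMM pair
    return fields

# the same address lands on different ranks in the two modes
print(decode_simple(0x40000000)["cs_side"])                    # 1
print(decode_simple(0x40000000, interleaved=True)["cs_side"])  # 0
```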
Memory/PCI Hole
Gap in memory left for legacy I/O devices and drivers that use 32-bit addressing - situated below 4G (0xffffffff)
Can cause RAM to be unavailable
Opterons have the capability to map around the hole, allowing all of the installed RAM to be visible, but this means Node address ranges are altered.
This is known as memory hoisting
For memory hole discussion see URL: http://techfiles.de/dmelanchthon/files/memory_hole.pdf
Effect of the memory hole on address ranges
Actual values will depend on configuration, BIOS revision etc
The example is for an X4100 M2 with no HBAs installed, BIOS revision 0ABJX034, running Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
Technique to discover memory ranges on CPU for Linux systems
cd /var/log
grep -i bootmem *
This is recorded in various files depending on the version and type of OS, most commonly in dmesg
Memory Hole address range
without remapping
Node address range displayed at boot. Each Node has 4GB; node 0 has lost memory (a full 4G address range would be 0000000000000000-00000000ffffffff). The memory hole exists between dfffffff and ffffffff = 20000000.
[root@va64-x4100f-gmp03 log]# pwd
/var/log
[root@va64-x4100f-gmp03 log]# grep -i Bootmem mess*
Bootmem setup node 0 0000000000000000-00000000dfffffff
Bootmem setup node 1 0000000100000000-00000001ffffffff
Address range with memory remapping around hole (hoisting)
In this case we do not lose the memory. RAM addressing is remapped around the memory hole, so the address range on node 0 grows by 20000000, and the base + limit of node 1 grow by 20000000.
Bootmem setup node 0 0000000000000000-000000011fffffff
Bootmem setup node 1 0000000120000000-000000021fffffff
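The arithmetic behind those bootmem ranges can be checked directly. A quick sketch using the values from the example above:

```python
# memory hole below 4 GB in the example: dfffffff .. ffffffff
hole = 0xFFFFFFFF - 0xDFFFFFFF           # 0x20000000 = 512 MB

# without hoisting, node 0 simply loses the hole
node0_no_hoist = 0xDFFFFFFF - 0x0 + 1    # 0xE0000000 = 3.5 GB visible

# with hoisting, the hole is remapped above: node 0's limit and
# node 1's base each grow by the hole size
node0_hoist_end  = 0xFFFFFFFF + hole     # 0x11fffffff
node1_hoist_base = 0x100000000 + hole    # 0x120000000

print(hex(hole), hole // (1 << 20), "MB")  # 0x20000000 512 MB
print(hex(node0_hoist_end))                # 0x11fffffff
print(hex(node1_hoist_base))               # 0x120000000
```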
Some examples of error reporting
Red Hat 3 Update 2
kernel: CPU 0: Silent Northbridge MCE
kernel: Northbridge status 9443c100e3080a13
kernel: ECC syndrome bits e307
kernel: extended error chipkill ecc error
kernel: link number 0
kernel: dram scrub error
kernel: corrected ecc error
kernel: error address valid
kernel: error enable
kernel: previous error lost
kernel: error address 00000000cf31f8f0
Later Red Hat 3 example
kernel: CPU 3: Silent Northbridge MCE
kernel: Northbridge status d4194000:9b080a13
kernel: Error chipkill ecc error
kernel: ECC error syndrome 9b32
kernel: bus error local node response, request didn't time out
kernel: generic read
kernel: memory access, level generic
kernel: link number 0
kernel: corrected ecc error
kernel: error overflow
kernel: previous error lost
kernel: NB error address 0000000ef28df0d8
Example of Red Hat 3 GART error
CPU 3: Silent Northbridge MCE
Northbridge status a60000010005001b
processor context corrupt
error address valid
error uncorrected
previous error lost
GART TLB error generic level generic
error address 000000007ffe40f0
extended error gart error
link number 0
err cpu1
processor context corrupt
error address valid
error uncorrected
previous error lost
error address 000000007ffe40f0
Example of EDAC output
EDAC MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
EDAC MC0: CE page 0x1fe8e0, offset 0x128, grain 8, syndrome 0x3faf, row 3, channel 1, label "": k8_edac
EDAC MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
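When triaging a large messages file, the CE details can be pulled out programmatically. A hypothetical sketch written for this TOI - the regex matches the k8_edac "CE page ..." line format shown above and would need adjusting for other EDAC driver versions:

```python
import re

# matches the "CE page ..." line emitted by k8_edac, as shown above
CE_RE = re.compile(
    r"CE page 0x(?P<page>[0-9a-f]+), offset 0x(?P<offset>[0-9a-f]+), "
    r"grain (?P<grain>\d+), syndrome 0x(?P<syndrome>[0-9a-f]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)

def parse_edac_ce(line):
    """Return the CE fields from a k8_edac log line, or None if absent."""
    m = CE_RE.search(line)
    if not m:
        return None
    d = m.groupdict()
    return {
        "page":     int(d["page"], 16),
        "offset":   int(d["offset"], 16),
        "grain":    int(d["grain"]),
        "syndrome": int(d["syndrome"], 16),
        "row":      int(d["row"]),
        "channel":  int(d["channel"]),
    }

line = ('EDAC MC0: CE page 0x1fe8e0, offset 0x128, grain 8, '
        'syndrome 0x3faf, row 3, channel 1, label "": k8_edac')
print(parse_edac_ce(line)["syndrome"] == 0x3faf)   # True
```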
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 4 northbridge TSC e169139a35188
ADDR fa00f7f8
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 4044
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic'
STATUS d422400040080a13 MCGSTATUS 0
Suse mcelog example kernel 2.6.16.27
Further Suse mcelog example
MCE 31
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 1 instruction cache TSC 3e2dc434cdb5
ADDR fa378ac0
Instruction cache ECC error
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out instruction fetch mem transaction memory access, level generic'
STATUS d400400000000853 MCGSTATUS 0
ECC ( non chipkill example)
CPU 2 4 northbridge TSC 3da2afa1102b
ADDR f9076000
Northbridge ECC error
ECC syndrome = 31
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic'
STATUS d418c00000000a13 MCGSTATUS 0
Confusing EDAC example - note two MC numbers reporting.
eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error
eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
eaebe242 kernel: MC1: CE page 0x25a58c, offset 0x688, grain 8, syndrome 0xf4, row 0, channel 1, label "": k8_edac
eaebe242 kernel: MC1: CE - no information available: k8_edac Error Overflow set
eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error
eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
FMA information examples
This is the same error as the EDAC error example.
# fmdump -v -u 3dadae66-a6e0-67fc-ecf4-d9b7d46aea86
TIME                 UUID                                 SUNW-MSG-ID
Feb 18 15:42:41.1662 3dadae66-a6e0-67fc-ecf4-d9b7d46aea86 AMD-8000-3K
100% fault.memory.dimm_ck
Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=3
   Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=3
       FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=3
fmd: [ID 441519 daemon.error] SUNW-MSG-ID: AMD-8000-3K, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Sat Mar 10 00:52:13 MET 2007
PLATFORM: Sun Fire X4100 Server, CSN: 0606AN1288 , HOSTNAME: siegert
SOURCE: eft, REV: 1.16
EVENT-ID: 13441a52-c465-629b-ca9d-fc77b0e66354
DESC: The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-3K for more information.
AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
IMPACT: Total system memory capacity will be reduced as pages are retired.
REC-ACTION: Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u to identify the module.
# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K
# fmadm faulty
STATE    RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded mem:///motherboard=0/chip=0/memory-controller=0/dimm=1
         13441a52-c465-629b-ca9d-fc77b0e66354
-------- ----------------------------------------------------------------------
# fmdump -v -u 13441a52-c465-629b-ca9d-fc77b0e66354
TIME                 UUID                                 SUNW-MSG-ID
Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K
100% fault.memory.dimm_ck
Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
   Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=1
       FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
Example of FMA detecting CPU error
Solaris handles the machine check exception, and FMA information is available on reboot
SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
EVENT-TIME: 0x459d66e9.0xbf18650 (0x687a83db95e45) i86pc, CSN: -, HOSTNAME:
SOURCE: SunOS, REV: 5.10 Generic_118855-14
DESC: Errors have been detected that require a reboot to ensure system integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.
[Thu Jan 4 21:43:21 2007] AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
REC-ACTION: Save the error summary below in case telemetry cannot be saved
[Thu Jan 4 21:43:21 2007]
[Thu Jan 4 21:43:21 2007] ereport.cpu.amd.bu.l2t_par ena=7a83db8bc8500401 detector=[ version=0 scheme="hc" hc-list=[...] ] bank-status=b60000000002017a bank-number=2 addr=5a0c addr-valid=1 ip=0 privileged=1
ereport.cpu.amd.bu.l2t_par ena=7a83db9517700401
The system now panics and then reboots
panic[cpu1]/thread=fffffe800032fc80: Unrecoverable Machine-Check Exception
dumping to /dev/dsk/c0t0d0s1, offset 860356608,
SUNW-MSG-ID: AMD-8000-67, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Jan 5 10:11:10 MET 2007
PLATFORM: Sun Fire X4200 Server, CSN: 0000000000 , HOSTNAME: z-app1.vpv.no1.asap-asp.net
SOURCE: eft, REV: 1.16
EVENT-ID: bc534eb7-ca58-ecbf-b225-ddbb79045d8d
DESC: The number of errors associated with this CPU has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-67 for more information.
RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU. Use fmdump -v -u to identify the module.
#>fmdump -v -u bc534eb7-ca58-ecbf-b225-ddbb79045d8d
TIME                 UUID                                 SUNW-MSG-ID
Jan 05 10:11:10.6392 bc534eb7-ca58-ecbf-b225-ddbb79045d8d AMD-8000-67
100% fault.cpu.amd.l2cachetag
Problem in: hc:///motherboard=0/chip=1/cpu=0
   Affects: cpu:///cpuid=1
       FRU: hc:///motherboard=0/chip=1
Some programs and utilities
HERD
Hardware error report and decode
Installed as an RPM on top of SLES and Red Hat
Provided by Sun
Will report errors to the messages file and the service processor
Same command line options as mcelog
http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/HERD
mcelog
Linux kernels after 2.6.4 do not print recoverable machine check errors to the messages file or kernel log
Instead they are saved into /dev/mcelog
mcelog reads errors from /dev/mcelog and then deletes the entries
Typically run as a cron job
Eg /usr/sbin/mcelog >> /var/log/mce - note this is not collected by sysreport
Red Hat have implemented it as a daemon
See Red Hat advisory RHEA-2006-0134-7
mcat
Runs on windows machines
AMD utility to decode machine check status
Decodes Windows event log events
Can be fed status, bank and address to decode errors reported on other machines
Download from AMD
http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_871_9033,00.html
Newisys decoder
Utility provided by Newisys to identify the failing DIMM for V20z/40z http://systems-tsc/twiki/bin/view/Products/ProdTroubleshootingV20z
Can be used, with extreme care, on other Rev E systems to decode the NorthBridge status; if the memory DIMM used on the system is the same as Stinger, it can be used to help confirm the DIMM.
X64 Memory Replacement Policy
X64 Memory Replacement Policy
Why we expect memory to fail: a proportion of memory will experience transient correctable memory errors that will not reoccur, due to the physics of memory chips
Also, analysis has shown that in general memory does not degrade, ie correctable errors do not degenerate into uncorrectable errors
https://onestop/qco/x86dimm/index_x86dimm.shtml
FIN 102195
Three rules to change DIMMs
I can't count
UE failure reported by BIOS/POST
Solaris 10 U2 - change a DIMM pair when the system tells you
Any UE from systems not running Solaris that you are confident originates from memory
24 errors from a DIMM in 24 hours
Glossary of terms
Glossary of terms
EDAC - Error Detection and Correction: the term used by the Linux community for the project to handle and identify hardware-based errors; formerly known as Bluesmoke
ECC - Error Correcting Code. In Opteron chipkill mode, 16 bits stored in memory along with the 128 bits of data. These bits are created by generating parity from various data bits in the data word.
Glossary of terms
Syndrome - in Opteron chipkill mode, a 16-bit value (4 hexadecimal digits) which can identify the type of error and the failing bits within a nybble. The syndrome is generated by comparing (exclusive OR) the ECC code generated on the write with the ECC code generated on the read.
Rank - for the purposes of this TOI, a set of memory chips which need a separate chip select signal to select them, eg dual ranked DIMMs need two chip select signals sent from the CPU. DIMM interleaving is done between ranks.
Glossary of terms
TLB - Translation Lookaside Buffer: a cache used to map virtual addresses to real addresses.
Sun Microsystems, Inc.
Sun Confidential: Internal Only