
CPU and Memory Events

[email protected]

Topics

CPU architecture

Error reporting banks

Types of errors and handling

Addressing memory discussion and example

Examples of various error messages

Utilities and programs

X64 DIMM replacement guidelines

CPU Architecture

Opteron Processor Overview

Dual Core Opteron

Cache and Memory

Cache Organisation

Cache Details

L1 64KB per core, 2-way set associative

L1 Data cache protected by ECC

L1 Instruction cache protected by parity

L2 cache 16-way set associative

L2 1MB per core, both data and instructions

L2 Protected by ECC

Least Recently Used (LRU) replacement algorithm

Translation Lookaside Buffer

L1 32 Entries

L1 Fully associative

L2 512 Entries

L2 4-way associative

Traditional Northbridge

Opteron Northbridge

On processor die (Node)

Up to 3 HyperTransport link interfaces

Memory controller

Interface to memory

Interface to CPU cores

ECC errors are detected and corrected here

On dual core Nodes shared between CPUs

Opteron server overview

Rev E CPUs DDR1 memory

Rev F CPUs (M2 systems) DDR2 memory

4 DIMM slots per CPU (at present)

Servers utilise both memory channels in parallel, allowing a 128-bit access to memory plus 16 ECC bits

Chipkill mode (able to correct up to 4 bits in error if the bits lie within nybble boundaries)

Capability to address up to 1TB

Error Reporting Banks

Opteron Error Reporting Banks

Bank 0 Data cache(DC)

Bank 1 Instruction Cache(IC)

Bank 2 Bus Unit (BU)

Bank 3 Load/Store Unit (LS)

Bank 4 Northbridge(NB)

Error Reporting Bank Registers

Machine check control register (MCi_CTL)

Error reporting control register mask (MCi_CTL_MASK)

Machine check status register (MCi_STATUS)

Machine check address register (MCi_ADDR)

Role of registers

MCi_CTL allows control over what errors will be reported

MCi_CTL_MASK allows additional control over the errors reported

MCi_STATUS - where error information gets reported, e.g. syndrome, type of error

MCi_ADDR - physical address of the failure; important in memory errors (Northbridge - bank 4)

Decoding MCi Status Registers

First discover which CPU or Node is reporting the error and which error bank is reporting

The decode of the status register is dependent on the failing bank

To decode the error: often the OS, or a package on the OS, will do much of the work for you

If you have a Windows system available then consider using MCAT (machine check analysis tool) http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_871_9033,00.html

Decoding MCi Status Registers Cont

Utilities on the web, e.g. parsemce - use with caution

Use Infodocs 78336, 82833

Manually use the BIOS and Kernel Developer's Guides (make sure you use the correct one - note Rev F has a different guide) http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_739_9003,00.html

Open a collaboration task
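
To get a feel for what a manual decode involves, below is a minimal Python sketch (an illustration, not a Sun or AMD tool) that pulls the architecturally defined fields out of an MCi_STATUS value. The VAL/OVER/UC/EN/MISCV/ADDRV/PCC bits and the low 16-bit MCA error code are the standard machine-check layout; bit 46 (corrected ECC) matches the mcelog output shown later in this TOI. Syndrome and extended-error fields are bank- and revision-specific, so decode those against the correct BKDG for the failing bank.

FLAGS = {
    63: "VAL   - register contains valid error information",
    62: "OVER  - error overflow (a previous error was lost)",
    61: "UC    - uncorrected error",
    60: "EN    - error reporting enabled",
    59: "MISCV - MCi_MISC contains additional information",
    58: "ADDRV - MCi_ADDR contains the failing address",
    57: "PCC   - processor context corrupt",
    46: "CECC  - corrected ECC error",
}

def decode_status(status):
    # Print the common status bits, then the low-order error code fields.
    print(f"MCi_STATUS = {status:#018x}")
    for bit in sorted(FLAGS, reverse=True):
        if status & (1 << bit):
            print(f"  bit {bit}: {FLAGS[bit]}")
    print(f"  MCA error code (bits 15:0)        = {status & 0xffff:#06x}")
    print(f"  model-specific field (bits 31:16) = {(status >> 16) & 0xffff:#06x}")

# Northbridge (bank 4) status taken from the Suse mcelog example later in this deck
decode_status(0xd422400040080a13)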

CHIPKILL + SYNDROMES

In the Opteron world chipkill is the ability to correct up to 4 contiguous memory bits

128 data bits + 16 ECC bits = 144 bits

Single symbol correction double symbol detection

1 failing x4 memory chip can generate 16 separate syndromes

Syndromes can identify failing bit or bits within word

Syndromes will tell you which DIMM in a DIMM pair is failing - they will not identify the DIMM pair or the associated CPU

Portion of chipkill syndrome table (128 bit memory word)

64 bit memory word

You may see this on workstations

Configurations with only 1 DIMM

64 bits + 8 bits ECC

Can only correct single bits

Detect double bit errors

Syndrome is 8 bits

64 bit word ECC syndrome table
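
As a reminder of how an ECC syndrome pinpoints a failing bit, here is a toy Python sketch of a small single-error-correcting Hamming code. It is a hand-rolled illustration of the principle only (syndrome = stored check bits XOR recomputed check bits, and a non-zero syndrome identifies the bad bit); the Opteron's 144-bit chipkill code is far wider and additionally detects double symbol errors.

def hamming_positions(codeword_bits):
    # XOR together the (1-based) positions of all set bits; this is the syndrome.
    syndrome = 0
    for pos, bit in enumerate(codeword_bits, start=1):
        if bit:
            syndrome ^= pos
    return syndrome

def encode(data_bits):
    # Place data at non-power-of-two positions, then set the power-of-two
    # check positions so the whole codeword XORs to zero.
    n = 15                      # toy Hamming(15,11) codeword
    code = [0] * (n + 1)        # index 0 unused, positions 1..15
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):     # not a power of two -> data position
            code[pos] = next(data, 0)
    parity = hamming_positions(code[1:])
    for i in range(4):          # positions 1, 2, 4, 8 are check bits
        code[1 << i] = (parity >> i) & 1
    return code[1:]

word = encode([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
assert hamming_positions(word) == 0     # clean word -> zero syndrome

word[6] ^= 1                            # simulate a flipped bit at position 7
syndrome = hamming_positions(word)
print("syndrome:", syndrome)            # prints 7, identifying the failing bit
word[syndrome - 1] ^= 1                 # correct it
assert hamming_positions(word) == 0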

Error Types and handling

Correctable ECC errors

BIOS will log to DMI/SEL during BIOS/POST

It is the responsibility of the OS to handle correctable errors

On V20z/V40z, nps reports errors to the SP if the threshold (2 errors in 6 hours) is exceeded. Note the threshold does not correspond to the DIMM replacement guidelines (CR 6494195, 6386838); NSV 2.4.0.24 will fix this

How, if, and where correctable ECC errors are reported is dependent on the type and revision of OS and what packages are installed.

Handling Uncorrectable errors

Two main methods.

Sync Flood - analogous to a SPARC fatal reset

Machine Check exception interrupt which the OS handles (panics)

Sync Flood

Sync Flooding is a HyperTransport method used to stop data propagation in the case of a serious error.

Device that detects the error initiates sync flood.

All others cease operation, and transmit sync flood packets.

Packets finally reach the South Bridge (eg nVidia CK8-04).

BIOS has pre-programmed the SB to trigger the system RESET signal when a sync flood is detected

System reboots

During Boot Block and POST, BIOS analyzes the related error bits in all Nodes and reports the Sync Flood reasons

First step in debugging - get hold of the SEL.

001 | 01/03/2007 | 21:43:00 | OEM #0x12 | | Asserted
2101 | OEM record e0 | 00000000040f0c0200400000f2
2201 | OEM record e0 | 01000000040000000000000000
2301 | 01/03/2007 | 21:43:15 | Memory | Uncorrectable ECC | Asserted | CPU 1 DIMM 0
2401 | 01/03/2007 | 21:43:15 | Memory | Memory Device Disabled | Asserted | CPU 1 DIMM 0
2501 | 01/03/2007 | 21:43:18 | Memory p1.d1.fail | Predictive Failure Asserted
2601 | 01/03/2007 | 20:43:12 | System Firmware Progress | Motherboard initialization | Asserted

Sync Flood example SEL

Another example of sync flood error - not so friendly.

1501 | 04/10/2007 | 04:18:02 | OEM #0x12 | | Asserted
1601 | OEM record e0 | 00004800001111002000000000
1701 | OEM record e0 | 10ab0000000810000006040012
1801 | OEM record e0 | 10ab0000001111002011110020
1901 | OEM record e0 | 1800000000f60000010005001b
1a01 | OEM record e0 | 180000000000000000dffe0000
1b01 | OEM record e0 | 1900000000f200002000020c0f
1c01 | OEM record e0 | 1a00000000f200001000020c0f
1d01 | OEM record e0 | 1b00000000f200003000020c0f
1e01 | OEM record e0 | 80004800001111032000000000

Machine check exception

For certain unrecoverable errors Machine Check Exceptions are generated

Generates an interrupt and the OS handles, or tries to handle, the error, e.g. panics.

Linux machine check exception example

CPU 0: Machine Check Exception: 0000000000000004
CPU 0: Machine Check Exception: 0000000000000004
Bank 0: b600000000000185 at 0000000000000940
Kernel panic: CPU context corrupt

The above is from kernel:

2.4.21-27.0.1.ELsmp #1 SMP

Machine check exception example Solaris

WARNING: MCE: Bank 2: error code 0x863, mserrcode = 0x0
Verifying DMI Pool Data ....
sched: #mc Machine check
pid=0, pc=0xfffffffffb8233ea, sp=0xfffffe8000293ad8, eflags=0x216
cr0: 8005003b cr4: 6f0
cr2: 8073c62 cr3: d3a7000 cr8: c
rdi: ffffffff812dadf0 rsi: ffffffff815f4df0 rdx: 1000
rcx: 42 r8: 1 r9: 1
rax: fffffe8000293c80 rbx: ffffffff81282e00 rbp: fffffe8000293b10
r10: 1 r11: 1 r12: 0
r13: ffffffff81282e00 r14: ffffffff81283318 r15: fffffe800025db40
fsb: ffffffff80000000 gsb: ffffffff81034000 ds: 43
es: 43 fs: 0 gs: 1c3
trp: 12 err: 0 rip: fffffffffb8233ea
cs: 28 rfl: 216 rsp: fffffe8000293ad8

Memory Addressing and Interleaving

Example of a DIMM layout

Contiguous addressing versus Interleaving

Contiguous - sequential addresses are allocated to the same rank of chips until its capacity is exhausted, then another rank of chips is addressed

Interleaving - contiguous addresses are switched between different ranks of memory

Performance benefit to interleaving

Good discussion at URL:

http://systems-tsc/twiki/pub/Products/SunFireX4100FaqPts/OpteronMemInterlvNotes.pdf
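
A toy Python sketch of the difference (illustrative only, with made-up sizes): with contiguous mapping one rank fills before the next is touched, while interleaving alternates ranks on a fixed boundary so sequential accesses are spread across ranks.

RANK_SIZE  = 8          # capacity of one rank, in arbitrary units
INTERLEAVE = 2          # interleave granularity, in the same units

def rank_contiguous(addr):
    return addr // RANK_SIZE

def rank_interleaved(addr):
    return (addr // INTERLEAVE) % 2

for addr in range(16):
    print(addr, "contiguous -> rank", rank_contiguous(addr),
          "| interleaved -> rank", rank_interleaved(addr))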

Interleaving

Memory DIMMs need to be the same size, and the number of DIMMs a power of 2

Interleave at DIMM level (dual rank)

Interleave at DIMM pair level

Interleave at node level (not so common)

BIOS parameters

Complicates mapping address to DIMM pair

Rev F DIMM Interleave Addresses

Example of addressing

X4100 2 CPUs

4 x 1GB DIMMs per CPU

Micron 18VDDF12872G-40BD3

Dual rank DIMM

8 x 64 Meg memory chips/side + ECC chip

Simplified addressing - no interleave

Possible 40 bits 0-39 to address 1TB

128-bit memory access, so the first 4 bits are the byte address and not used to address memory

Bits 4-14 Column address

Bits 15-16 Internal bank addressing

Bits 17-29 Row address

Bit 30 Chip select (other side of DIMM)

Bit 31 Chip select (other DIMM pair)

Bit 32 Selects other node
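
A Python sketch of the bit breakdown above (illustrative only - the real mapping depends on the DRAM CS base/mask registers the BIOS has programmed, so do not treat this as a general decoder):

def bits(value, hi, lo):
    # Extract bits hi..lo (inclusive) of value.
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_no_interleave(paddr):
    return {
        "byte in 128-bit access":            bits(paddr, 3, 0),
        "column":                            bits(paddr, 14, 4),
        "internal bank":                     bits(paddr, 16, 15),
        "row":                               bits(paddr, 29, 17),
        "chip select (other side of DIMM)":  bits(paddr, 30, 30),
        "chip select (other DIMM pair)":     bits(paddr, 31, 31),
        "node":                              bits(paddr, 32, 32),
    }

# Failing address from the Red Hat 3 Northbridge MCE example later in this deck
for field, value in decode_no_interleave(0x00000000cf31f8f0).items():
    print(f"{field:34s}: {value:#x}")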

Simplified addressing - interleave

Possible 40 bits 0-39 to address 1TB

128-bit memory access, so the first 4 bits are the byte address and not used to address memory

Bits 4-14 Column address

Bits 15-16 Internal bank addressing

Bit 17 Chip select (swapped with bit 30)

Bit 18 Chip select (swapped with bit 31)

Bits 19-31 Row address (bits 30, 31 swapped with bits 17 and 18)

Bit 32 Selects other node
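
The corresponding sketch for the interleaved layout above: the chip select bits move down to bits 17-18 and the row bits that were there move up to bits 30-31, so consecutive 128KB regions alternate between ranks. Again this only illustrates the slide's simplified example.

def decode_interleaved(paddr):
    return {
        "byte in 128-bit access":            paddr         & 0xf,
        "column":                            (paddr >> 4)  & 0x7ff,
        "internal bank":                     (paddr >> 15) & 0x3,
        "chip select (other side of DIMM)":  (paddr >> 17) & 0x1,
        "chip select (other DIMM pair)":     (paddr >> 18) & 0x1,
        "row":                               (paddr >> 19) & 0x1fff,   # bits 19-31
        "node":                              (paddr >> 32) & 0x1,
    }

print(decode_interleaved(0x00000000cf31f8f0))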

Memory/PCI Hole

Gap in memory left for legacy I/O devices and drivers that use 32-bit addressing - situated under 4GB (0xffffffff)

Can cause RAM to be unavailable

Opterons have the capability to map around the hole, allowing all of the installed RAM to be visible, but this means Node address ranges are altered.

This is known as memory hoisting

For memory hole discussion see URL: http://techfiles.de/dmelanchthon/files/memory_hole.pdf

Effect of the memory hole on address ranges

Actual values will depend on configuration, BIOS revision, etc.

Example is for an X4100 M2 with no HBAs installed, BIOS revision 0ABJX034, running Red Hat Enterprise Linux AS release 4 (Nahant Update 4)

Technique to discover memory ranges per CPU on Linux systems

cd /var/log

grep -i bootmem *

This is recorded in various files depending on the version/type of OS, most commonly in dmesg

Memory Hole address range without remapping

Node address range displayed at boot. Each Node has 4GB; node 0 has lost memory (a 4GB address range would be 0000000000000000-00000000ffffffff). The memory hole exists between dfffffff and ffffffff = 20000000

[root@va64-x4100f-gmp03 log]# pwd
/var/log
[root@va64-x4100f-gmp03 log]# grep -i Bootmem mess*
Bootmem setup node 0 0000000000000000-00000000dfffffff
Bootmem setup node 1 0000000100000000-00000001ffffffff

Address range with memory remapping around hole (hoisting)

In this case we do not lose the memory. RAM addressing is remapped around the memory hole, so the address range of node 0 grows by 20000000 and the base and limit of node 1 grow by 20000000

Bootmem setup node 0 0000000000000000-000000011fffffff
Bootmem setup node 1 0000000120000000-000000021fffffff
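
The arithmetic behind those two listings, as a small sketch (assuming 4GB of RAM per node and the 512MB hole from this example):

GB   = 0x40000000
HOLE = 0x20000000                        # e0000000..ffffffff

node0_limit = 4 * GB + HOLE - 1          # the hole is carved out of node 0's range,
                                         # so its limit grows by the hole size
node1_base  = node0_limit + 1
node1_limit = node1_base + 4 * GB - 1

print(f"node 0: {0:016x}-{node0_limit:016x}")            # ...-000000011fffffff
print(f"node 1: {node1_base:016x}-{node1_limit:016x}")   # 0000000120000000-000000021fffffff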

Some examples of error reporting

Red Hat 3 Update 2

kernel: CPU 0: Silent Northbridge MCE
kernel: Northbridge status 9443c100e3080a13
kernel: ECC syndrome bits e307
kernel: extended error chipkill ecc error
kernel: link number 0
kernel: dram scrub error
kernel: corrected ecc error
kernel: error address valid
kernel: error enable
kernel: previous error lost
kernel: error address 00000000cf31f8f0

Later Red Hat 3 example

kernel: CPU 3: Silent Northbridge MCE
kernel: Northbridge status d4194000:9b080a13
kernel: Error chipkill ecc error
kernel: ECC error syndrome 9b32
kernel: bus error local node response, request didn't time out
kernel: generic read
kernel: memory access, level generic
kernel: link number 0
kernel: corrected ecc error
kernel: error overflow
kernel: previous error lost
kernel: NB error address 0000000ef28df0d8

Example of Red Hat 3 GART error

CPU 3: Silent Northbridge MCE
Northbridge status a60000010005001b
processor context corrupt
error address valid
error uncorrected
previous error lost
GART TLB error generic level generic
error address 000000007ffe40f0
extended error gart error
link number 0
err cpu1
processor context corrupt
error address valid
error uncorrected
previous error lost
error address 000000007ffe40f0

Example of EDAC output

EDAC MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
EDAC MC0: CE page 0x1fe8e0, offset 0x128, grain 8, syndrome 0x3faf, row 3, channel 1, label "": k8_edac
EDAC MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)

MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 4 northbridge TSC e169139a35188
ADDR fa00f7f8
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 4044
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic'
STATUS d422400040080a13 MCGSTATUS 0

Suse mcelog example kernel 2.6.16.27

Further Suse mcelog example

MCE 31
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 1 instruction cache TSC 3e2dc434cdb5
ADDR fa378ac0
Instruction cache ECC error
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out instruction fetch mem transaction memory access, level generic'
STATUS d400400000000853 MCGSTATUS 0

ECC (non chipkill example)

CPU 2 4 northbridge TSC 3da2afa1102b
ADDR f9076000
Northbridge ECC error
ECC syndrome = 31
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out generic read mem transaction memory access, level generic'
STATUS d418c00000000a13 MCGSTATUS 0

Confusing EDAC example - note two MC numbers reporting.

eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error
eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
eaebe242 kernel: MC1: CE page 0x25a58c, offset 0x688, grain 8, syndrome 0xf4, row 0, channel 1, label "": k8_edac
eaebe242 kernel: MC1: CE - no information available: k8_edac Error Overflow set
eaebe242 kernel: EDAC k8 MC0: extended error code: ECC error
eaebe242 kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)

FMA information examples

This is the same error as the EDAC error example.

# fmdump -v -u 3dadae66-a6e0-67fc-ecf4-d9b7d46aea86
TIME                 UUID                                 SUNW-MSG-ID
Feb 18 15:42:41.1662 3dadae66-a6e0-67fc-ecf4-d9b7d46aea86 AMD-8000-3K
100% fault.memory.dimm_ck
Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=3
   Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=3
       FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=3

fmd: [ID 441519 daemon.error] SUNW-MSG-ID: AMD-8000-3K, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Sat Mar 10 00:52:13 MET 2007
PLATFORM: Sun Fire X4100 Server, CSN: 0606AN1288, HOSTNAME: siegert
SOURCE: eft, REV: 1.16
EVENT-ID: 13441a52-c465-629b-ca9d-fc77b0e66354
DESC: The number of errors associated with this memory module has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-3K for more information.
AUTO-RESPONSE: Pages of memory associated with this memory module are being removed from service as errors are reported.
IMPACT: Total system memory capacity will be reduced as pages are retired.
REC-ACTION: Schedule a repair procedure to replace the affected memory module. Use fmdump -v -u to identify the module.

# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K

# fmadm faulty
   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded mem:///motherboard=0/chip=0/memory-controller=0/dimm=1
         13441a52-c465-629b-ca9d-fc77b0e66354
-------- ----------------------------------------------------------------------

# fmdump -v -u 13441a52-c465-629b-ca9d-fc77b0e66354
TIME                 UUID                                 SUNW-MSG-ID
Mar 10 00:52:13.2822 13441a52-c465-629b-ca9d-fc77b0e66354 AMD-8000-3K
100% fault.memory.dimm_ck
Problem in: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1
   Affects: mem:///motherboard=0/chip=0/memory-controller=0/dimm=1
       FRU: hc:///motherboard=0/chip=0/memory-controller=0/dimm=1

Example of FMA detecting CPU error
Solaris handles the machine check exception and FMA information is available on reboot

SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
EVENT-TIME: 0x459d66e9.0xbf18650 (0x687a83db95e45)
i86pc, CSN: -, HOSTNAME:

SOURCE: SunOS, REV: 5.10 Generic_118855-14
DESC: Errors have been detected that require a reboot to ensure system integrity. See http://www.sun.com/msg/SUNOS-8000-0G for more information.
[Thu Jan 4 21:43:21 2007] AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
REC-ACTION: Save the error summary below in case telemetry cannot be saved
[Thu Jan 4 21:43:21 2007]
[Thu Jan 4 21:43:21 2007] ereport.cpu.amd.bu.l2t_par ena=7a83db8bc8500401 detector=[ version=0 scheme="hc" hc-list=[...] ] bank-status=b60000000002017a bank-number=2 addr=5a0c addr-valid=1 ip=0 privileged=1
ereport.cpu.amd.bu.l2t_par ena=7a83db9517700401

System now panics and then reboots

panic[cpu1]/thread=fffffe800032fc80: Unrecoverable Machine-Check Exception
dumping to /dev/dsk/c0t0d0s1, offset 860356608,

SUNW-MSG-ID: AMD-8000-67, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Jan 5 10:11:10 MET 2007
PLATFORM: Sun Fire X4200 Server, CSN: 0000000000, HOSTNAME: z-app1.vpv.no1.asap-asp.net
SOURCE: eft, REV: 1.16
EVENT-ID: bc534eb7-ca58-ecbf-b225-ddbb79045d8d
DESC: The number of errors associated with this CPU has exceeded acceptable levels. Refer to http://sun.com/msg/AMD-8000-67 for more information.
RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU. Use fmdump -v -u to identify the module.

#>fmdump -v -u bc534eb7-ca58-ecbf-b225-ddbb79045d8d
TIME                 UUID                                 SUNW-MSG-ID
Jan 05 10:11:10.6392 bc534eb7-ca58-ecbf-b225-ddbb79045d8d AMD-8000-67
100% fault.cpu.amd.l2cachetag

Problem in: hc:///motherboard=0/chip=1/cpu=0
   Affects: cpu:///cpuid=1
       FRU: hc:///motherboard=0/chip=1

Some programs and utilities

HERD

Hardware error report and decode

Installed as an RPM on top of SLES and Red Hat

Provided by Sun

Will report errors to the messages file and the service processor

Same command line options as mcelog

http://nsgtwiki.sfbay.sun.com/twiki/bin/view/Galaxy/HERD

mcelog

Linux kernels after 2.6.4 do not print recoverable machine check errors to messages file or kernel log

Instead they are saved into /dev/mcelog

mcelog reads errors from /dev/mcelog and then deletes the entries

Typically run as a cron job

Eg /usr/sbin/mcelog >> /var/log/mce - note this is not collected by sysreport

Red Hat have implemented this as a daemon

See Red Hat advisory RHEA-2006-0134-7

mcat

Runs on Windows machines

AMD utility to decode machine check status

Decodes Windows event log events

Can be fed status, bank and address to decode errors reported on other machines

Download from AMD

http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_871_9033,00.html

Newisys decoder

Utility provided by Newisys to identify the failing DIMM on V20z/V40z http://systems-tsc/twiki/bin/view/Products/ProdTroubleshootingV20z

Can be used with extreme care on other Rev E systems to decode the Northbridge status; if the memory DIMM used on the system is the same as Stinger's, it can be used to help confirm the DIMM.

X64 Memory Replacement Policy

X64 Memory Replacement Policy

Why we expect memory to fail: a proportion of memory will experience transient correctable memory errors that will not re-occur, due to the physics of memory chips

Analysis has also shown that, in general, memory does not degrade, i.e. correctable errors do not degenerate into uncorrectable errors

https://onestop/qco/x86dimm/index_x86dimm.shtml

FIN 102195


Three rules to change DIMMs (I can't count)

UE failure reported by BIOS/POST

Solaris 10 U2 - change a DIMM pair when the system tells you.

Any UE from systems not running Solaris that you are confident originates from memory

24 errors from a DIMM in 24 hours

Glossary of terms

Glossary of terms

EDAC - Error Detection and Correction; the term used by the Linux community for the project to handle and identify hardware-based errors, formerly known as Bluesmoke

ECC - Error Correcting Code. In Opteron chipkill mode, 16 bits stored in memory along with the 128 bits of data. These bits are created by generating parity from various data bits in the data word.

Glossary of terms

Syndrome - In Opteron chipkill mode, a 16-bit value (4 hexadecimal digits) which can identify the type of error and the failing bits within a nybble. The syndrome is generated by comparing (exclusive OR) the ECC code generated on the write with the ECC code generated on the read.

Rank - For the purposes of this TOI it can be considered as a set of memory chips which needs a separate chip select signal to select it, e.g. dual-ranked DIMMs need two chip select signals sent from the CPU. DIMM interleaving is done between ranks.

Glossary of terms

TLB - Translation Lookaside Buffer; a cache of address translations used to map virtual addresses to physical addresses.
