A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay

Min Xu, Rastislav Bodik, Mark D. Hill
http://www.cs.wisc.edu/multifacet

June 9th, 2003

Software bugs cost time & money
Hardware is getting cheaper
Use hardware to aid software debugging?


Xu et al. ISCA'03: Flight Data Recorder 2

Brief Overview

Approach: Full-system Record-Replay
  – Add H/W “Flight Data Recorder”
  – Target cache-coherence multiprocessor server
  – Enables S/W deterministic replay

Full-system Evaluation: Low Overhead
  – Piggyback on coherence protocol: little extra H/W
  – Non-trivial recording interval: 1 second
  – Negligible runtime overhead: less than 2%
  – Can be “Always On”

Outline

Overview
  – Why Deterministic Replay?
  – The Debugging Scenario
  – The Solution
Recording Multithreading (efficient recording)
Recording System State & I/O (efficient recording)
Evaluation (with full-system commercial workloads)
Conclusions

Why Deterministic Replay?

Software Bugs Happen in the Field
  – Differences between development & deployment
  – Data races (Web server, Database)
  – I/O interactions (OS, Device Driver)

Debugging Usually Happens in the Lab
  – Need to replay the buggy execution

Use Core Dump?
  – Captures the final application state
  – Not enough for “race” bugs

Need Better “Core Dump”
  – Enable faithful replay of the execution prior to the failure

The Debugging Scenario

[Figure: timeline of processors P1 to P4. Recorder: takes Checkpoints A, B, C and stores logs A, B, C while the system runs; on a crash it dumps the “core”. Replayer: reads Checkpoint B and replays from logs B and C.]

The Solution

Online Recorder (focus of this work)
  – Like airplane flight data recorder
  – “Always on” even on deployed system
  – H/W based (no change to S/W)
    • Transparent to S/W
    • Minimal performance impact

Offline Replayer (not emphasized in this work)
  – Post-mortem replay of pre-crash execution
  – Possibly on a different machine off-site
  – Based on existing technology
    • i.e. Simics full-system simulator

Outline

Overview
Recording Multithreading (efficient recording)
  – What to record?
  – An example
  – Practical recorder hardware
Recording System State & I/O (efficient recording)
Evaluation
Conclusions

What to Record?

Multithreading Problem
  – Record order of instruction interleaving

Assume Sequential Consistency (SC)
  – Accesses (appear to have) total order

Previous Record-Replay Approaches

InstantReplay ’87
  – Record order of memory accesses
  – Overhead may affect program behavior
Netzer ’93
  – Record optimal trace
  – Too expensive to keep track of all memory locations
Bacon & Goldstein ’91
  – Record memory bus transactions with hardware
  – High logging bandwidth
RecPlay ’00
  – Record only synchronizations
  – Not deterministic if the program has data races

Our Approach

Uses existing cache coherence hardware
  – Low overhead, does not affect program behavior
  – Works for programs with races
  – Adapts Netzer’s algorithm in hardware
  – Records only synchronization if the program is data-race free

An Example
  – Progressively refine the recording algorithm

Example: Record SC Order

Thread i            Thread j
4  Flag=1
5  X1:=5            15 $r1:=Flag
6  X2:=6            16 Bneq $r1,$r0,-1
7  Flag:=0          17 Nop
                    18 $r1:=Flag
                    19 Bneq $r1,$r0,-1
                    20 Nop
                    21 Y:=X1
                    22 Z:=X2

Arcs recorded for the same program when the full SC order is logged:
  i:4 → j:15
  j:15 → i:5
  i:5 → j:16
  j:16 → i:6
  i:6 → j:17
  j:17 → i:7
  i:7 → j:18

Need to add processor instruction count (IC)
The very same interleaving is recorded, but …

Example: Record Word Conflict Order

Arcs recorded for the same program (threads i and j) when only conflicting accesses to the same word are ordered:
  i:4 → j:15
  j:15 → i:7
  i:7 → j:18
  i:5 → j:21
  i:6 → j:22

Recording just word conflicts can enable deterministic replay
Hard to remember word accesses and too many arcs …
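The word-conflict arcs above can be derived mechanically from an SC interleaving. The sketch below is only an illustration of that idea in software, not the recorder hardware; the `Access` type, the function name, and the decision to leave out the writes to Y and Z (which conflict with nothing) are assumptions made for brevity.

```python
from typing import NamedTuple


class Access(NamedTuple):
    thread: str     # "i" or "j"
    ic: int         # instruction count within the thread
    word: str       # word touched by the instruction
    is_write: bool


def word_conflict_arcs(sc_order):
    """Return arcs ((src_thread, src_ic), (dst_thread, dst_ic)) between
    conflicting accesses to the same word made by different threads."""
    last_write = {}   # word -> most recent write
    readers = {}      # word -> reads since that write
    arcs = []
    for acc in sc_order:
        prev_w = last_write.get(acc.word)
        if acc.is_write:
            # Write after read: order every earlier reader in another thread.
            for r in readers.get(acc.word, []):
                if r.thread != acc.thread:
                    arcs.append(((r.thread, r.ic), (acc.thread, acc.ic)))
            # Write after write by another thread.
            if prev_w is not None and prev_w.thread != acc.thread:
                arcs.append(((prev_w.thread, prev_w.ic), (acc.thread, acc.ic)))
            last_write[acc.word] = acc
            readers[acc.word] = []
        else:
            # Read after write by another thread.
            if prev_w is not None and prev_w.thread != acc.thread:
                arcs.append(((prev_w.thread, prev_w.ic), (acc.thread, acc.ic)))
            readers.setdefault(acc.word, []).append(acc)
    return arcs


# Memory accesses of the slide's example, listed in one possible SC order.
EXAMPLE = [
    Access("i", 4, "Flag", True),
    Access("j", 15, "Flag", False),
    Access("i", 5, "X1", True),
    Access("i", 6, "X2", True),
    Access("i", 7, "Flag", True),
    Access("j", 18, "Flag", False),
    Access("j", 21, "X1", False),
    Access("j", 22, "X2", False),
]

arcs = word_conflict_arcs(EXAMPLE)
for src, dst in arcs:
    print(f"{src[0]}:{src[1]} -> {dst[0]}:{dst[1]}")
```

Tracking a last writer and a reader list per word is exactly the bookkeeping the slide calls hard to remember, which motivates moving to block granularity.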


Example: Record Block Conflict Order

Arcs recorded for the same program at cache-block granularity (X1 and X2 share a block):
  i:4 → j:15
  j:15 → i:7
  i:7 → j:18
  i:6 → j:21   (replaces the word-level arcs i:5 → j:21 and i:6 → j:22)

Need to remember last accessing IC in the cache

But, can we do better?

Example: Apply Transitive Reduction

Arcs before reduction (block conflict order, same program):
  i:4 → j:15
  j:15 → i:7
  i:7 → j:18
  i:6 → j:21

Arcs after transitive reduction:
  i:4 → j:15
  j:15 → i:7
  i:7 → j:18

The arc i:6 → j:21 is dropped because i:7 → j:18 already implies it (6 < 7 and 18 < 21).

Three arcs! No need to know syncs
Automatically records only synchronization for race-free programs

Practical Recorder Hardware

Processor
  – Instruction count
    • 4 bytes per processor
Cache
  – Last access instruction count
    • 6.25% space overhead
Coherence Controller
  – Vector of instruction counters
    • 3×4 bytes per processor for 4-way multiprocessor
Finite Cache, Out-of-Order, Prefetch, etc.
  – Recorder still applicable
  – Details in the paper
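The storage figures above can be checked with a few lines of arithmetic. The 64-byte cache block is an assumption (the slide does not state the block size); it is simply the size for which a 4-byte count per block gives 6.25%.

```python
IC_BYTES = 4        # from the slide: 4-byte instruction count
BLOCK_BYTES = 64    # ASSUMPTION: cache block size, not stated on the slide
PROCESSORS = 4      # from the slide: 4-way multiprocessor

# One last-access instruction count stored alongside every cache block.
cache_overhead = IC_BYTES / BLOCK_BYTES

# One counter per *other* processor in the coherence controller.
vic_bytes = (PROCESSORS - 1) * IC_BYTES

print(f"cache space overhead: {cache_overhead:.2%}")
print(f"instruction counter vector: {vic_bytes} bytes per processor")
```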

Further Details

At each processor j:
  – IC = inst count of last committed inst by j
  – VIC[P] = latest IC received from each proc
  – CIC[M] = IC of last load/store of block b in j’s L1

On commit, IC++; if load/store(b) then CIC[b] = IC

i sends arc start (i, CIC_i[b]) on coherence reply

On coherence reply receive, arc end is (j, IC+1)
  – If CIC_i[b] > VIC[i] then log arc; VIC[i] = CIC_i[b]

Example Transitive Reduction

(Same program as in the earlier examples: thread i runs ICs 4 to 7, thread j runs ICs 15 to 22.)

Logged arcs:
  i:4 → j:15
  j:15 → i:7
  i:7 → j:18

State when j reads X1 at IC 21:
  – At j: VIC[i] = 7
  – At i: VIC[j] = 15, CIC[{X1, X2}] = 6
  – Coherence traffic from i carries (i, 6); 6 < 7, so j ignores it and logs no arc
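The IC / VIC / CIC rules can be exercised in software to see the reduction happen. The sketch below is a toy model under stated assumptions: `ToyCoherence` is a hypothetical, heavily simplified invalidation protocol with unbounded caches, and the class and method names are invented for illustration. It is not the paper's hardware design.

```python
from collections import defaultdict


class Processor:
    def __init__(self, name, start_ic=0):
        self.name = name
        self.ic = start_ic            # IC: count of last committed instruction
        self.vic = defaultdict(int)   # VIC[p]: latest IC received from processor p
        self.cic = {}                 # CIC[b]: IC of last load/store to block b
        self.log = []                 # logged arcs ((src, ic), (dst, ic))

    def receive_reply(self, src, src_cic):
        # The arc ends at the instruction about to commit: IC + 1.
        if src_cic > self.vic[src]:
            self.log.append(((src, src_cic), (self.name, self.ic + 1)))
            self.vic[src] = src_cic
        # Otherwise an already logged arc implies this one, so it is skipped.

    def commit(self, block=None):
        self.ic += 1
        if block is not None:
            self.cic[block] = self.ic


class ToyCoherence:
    """Hypothetical protocol: one writer or several readers per block."""

    def __init__(self, procs):
        self.procs = {p.name: p for p in procs}
        self.writer = {}
        self.sharers = defaultdict(set)

    def access(self, name, block, is_write):
        me, w = self.procs[name], self.writer.get(block)
        if is_write:
            # Previous writer and all readers must respond (data or ack).
            responders = set(self.sharers[block])
            if w is not None:
                responders.add(w)
            responders.discard(name)
            self.writer[block] = name
            self.sharers[block] = set()
        else:
            hit = name in self.sharers[block] or w == name or w is None
            responders = set() if hit else {w}
            self.sharers[block].add(name)
        for r in sorted(responders):
            me.receive_reply(r, self.procs[r].cic[block])
        me.commit(block)


# Replay the slide's example; X1 and X2 share block "X".
p_i, p_j = Processor("i", start_ic=3), Processor("j", start_ic=14)
bus = ToyCoherence([p_i, p_j])
bus.access("i", "Flag", True)     # i:4  Flag=1
bus.access("j", "Flag", False)    # j:15 $r1:=Flag
bus.access("i", "X", True)        # i:5  X1:=5
p_j.commit()                      # j:16
bus.access("i", "X", True)        # i:6  X2:=6
p_j.commit()                      # j:17
bus.access("i", "Flag", True)     # i:7  Flag:=0
bus.access("j", "Flag", False)    # j:18 $r1:=Flag
p_j.commit(); p_j.commit()        # j:19, j:20
bus.access("j", "X", False)       # j:21 Y:=X1 (reply carries (i, 6), ignored)
bus.access("j", "X", False)       # j:22 Z:=X2 (hit, no coherence traffic)

print(p_i.log + p_j.log)
```

Under these assumptions the model logs three arcs and ends with VIC[i] = 7 at j and VIC[j] = 15 at i, which agrees with the state shown on the slide.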

Relaxation

Sending side
  – Exact: coherence reply sends (i, CIC[b])
  – Safe: send (i, x) for any x: CIC[b] ≤ x ≤ IC

Receiving side
  – Exact: on receive, add arc (x, IC’+1)
  – Safe: add arc (x, y) for any y: IC+1 ≤ y ≤ IC’+1

[Figure: two timelines. On the sender, x lies between CIC[b] and IC. On the receiver, y (e.g., IC+1) lies no later than IC’+1 (read b).]

Relevant to: Finite Caches, Speculative Processors

Outline

Overview
Recording Multithreading
Recording System State & I/O
  – SafetyNet checkpoint hardware
  – Interrupts, I/O, DMA
Evaluation
Conclusions

SafetyNet Checkpoint Hardware

Problem
  – Need the system state at the beginning of the “replay” interval
  – Logically take a snapshot of the system

Solution
  – Adapt SafetyNet [Sorin et al. ISCA ’02]
    • Processor checkpointing
    • Incremental memory logging
    • Slightly modified for longer interval

Recording I/O

Interrupts
  – Not exceptions
  – Record interrupt type & IC
Instruction I/O
  – Load: record values
  – Store: ignored
DMA
  – Record input values
  – Record ordering: as pseudo thread
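One way to picture the three I/O logs is as a small set of record types. The sketch below is purely illustrative: the `IOLog` class, its method names, and the tuple layouts are assumptions, and giving the DMA engine its own counter is only one reading of "as pseudo thread".

```python
from dataclasses import dataclass, field


@dataclass
class IOLog:
    interrupts: list = field(default_factory=list)  # (cpu, ic, interrupt type)
    io_loads: list = field(default_factory=list)    # (cpu, ic, value read)
    dma: list = field(default_factory=list)         # (dma_ic, address, data)
    dma_ic: int = 0  # ASSUMPTION: the DMA engine counts its writes like a thread

    def on_interrupt(self, cpu, ic, kind):
        # Interrupts are asynchronous, so both the type and the IC at which
        # they were taken must be recorded.
        self.interrupts.append((cpu, ic, kind))

    def on_io_load(self, cpu, ic, value):
        # The device may return a different value later, so log what was read.
        self.io_loads.append((cpu, ic, value))

    def on_io_store(self, cpu, ic, value):
        # Ignored: nothing is logged for I/O stores.
        pass

    def on_dma_write(self, address, data):
        # Log the input data; the counter lets DMA writes be ordered against
        # processor instructions in the same way as another thread's accesses.
        self.dma_ic += 1
        self.dma.append((self.dma_ic, address, data))
        return self.dma_ic


log = IOLog()
log.on_interrupt("P1", 120, "timer")
log.on_io_load("P1", 130, 0x2A)
log.on_io_store("P1", 131, 0xFF)
log.on_dma_write(0x1000, b"\x00\x01")
print(log)
```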

Outline

Overview
Recording Memory Races
Recording System State & I/O
Evaluation (with full-system commercial workloads)
  – An example system
  – Simulation methods
  – Runtime, log size
Conclusions

Target System

Commercial Server H/W
  – Sequentially consistent CC-NUMA
  – Full I/O: Interrupt, DMA, etc.
  – Simulation system (Simics + Memory Simulator)
    • 4-way in-order issue, 1 GHz, 4 processors
    • 128KB I/D L1, 4MB L2, MOSI directory protocol

Commercial Server S/W
  – Unmodified commercial server benchmarks
    • Apache, Slash, SPEC JBB, OLTP

An Example System

[Figure: block diagram of a CC-NUMA MP node. Components: Core, Cache(s), Cache Controller, Directory, Memory Banks, DMA Interface, Data Compressor (LZ77), Recorder Memory. Recorded streams: DMA content & order; interrupts, I/O; cache checkpoint; memory races; memory checkpoint.]

Runtime Overhead

Slowdown
  – Less than 2%
  – Statistically insignificant for 2 workloads
  – No problem “always on”

Slowdown causes
  – Extra traffic
  – Stall by buffer overflow
  – More blocking
  – Extra coherence message on some get-shared’s

[Figure: bar chart of runtime per transaction (normalized to base system, axis 0 to 100) for OLTP, JBB, APACHE, SLASH.]

Workloads
  – OLTP: database transactions (TPC-C on DB2)
  – JBB: server-side Java benchmark that models a 3-tier system
  – APACHE: static web server
  – SLASH: dynamic web server (slashdot message posting)

Log Size

1 – 1.33 Second Recording
  – Buffer: 35 MB (7%); Bandwidth: 25 MB/Second/Processor
Efficient Race Log
  – Longer recording is possible with better checkpoint scheme
Longer Recording
  – Using disk can get longer replay: 320 GB disk = ~3 hours recording

[Figure: stacked bar chart of log size (MB/Second/Processor, axis 0 to 60), uncompressed and compressed, for OLTP, JBB, APACHE, SLASH. Components: interrupt, input, DMA log; races log; checkpoint log.]

Conclusion

Low Overhead Deterministic Replay
  – Piggyback MP cache coherence hardware
  – Modest extra hardware
  – Modest overhead (less than 2% slowdown)
    • Minimal race recording with transitive reduction

Full-system Deterministic Replay
  – Evaluated with commercial workloads
  – Full-system recording (including OS, I/O)

Thank You

Questions?


Flight Data Recorder vs. ReEnact

                                   Flight Data Recorder   ReEnact
Target System                      CC-NUMA                TLS
Deterministic Replay?              Yes                    Yes*
Race-detection?                    No**                   Yes
Effective Interval (instructions)  >100,000,000           <100,000
Slowdown                           <2%                    Avg 5.8%
OS, I/O                            Yes                    No (extendable?)
Active during OS & I/O?            Yes                    No

* Need to disable TLS?
** Not in the recorder, but in the replayer

Scalability

More processors, more races log
  – Not a quadratic increase
  – e.g. 4p to 16p for 2x more log
Real systems have more I/O
  – But, also more memory available for log

Protocol Changes

Get IC count from source processor
  – WR: Piggyback IC count to DataResponse msg
  – WW: Piggyback IC count to DataResponse msg
  – RW: Piggyback IC count to InvalidateAck msg

Cache block Writeback
  – Snooping protocol
    • Eager IC update
    • Extra messages on interconnect
    • Not on critical path
  – Directory based protocol
    • Lazy IC update
    • Extra latency for cache misses

Replayer (Full-system Simulator)

Input data to the replayer
  – Checkpoint
  – Execution log
  – DMA log
  – I/O log
  – Exception log

Replay the execution
  – Load system checkpoint: registers, TLB, etc.
  – Replay the MP execution in the recorded partial order
  – Replay the I/O and exceptions
  – Proper device model needed to interpret system output
  – Memory inspection support
  – Step forward/backward (enhanced debugger features)
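Replaying "in partial order" means any interleaving is acceptable as long as every logged arc is respected. The sketch below shows one simple scheduler with that property; the function name, the dictionary-based interface, and the round-robin policy are assumptions, and a real replayer built on a full-system simulator would be far more involved.

```python
from collections import defaultdict


def replay_order(start_ic, end_ic, arcs):
    """Return a list of (thread, ic) steps consistent with the logged arcs.

    start_ic / end_ic map each thread to the first / last IC to replay.
    arcs is a list of ((src_thread, src_ic), (dst_thread, dst_ic)).
    """
    waits = defaultdict(list)              # destination step -> source steps
    for src, dst in arcs:
        waits[dst].append(src)

    done = {t: start_ic[t] - 1 for t in start_ic}   # last IC executed so far
    order = []
    progress = True
    while progress:
        progress = False
        for t in sorted(done):
            nxt = done[t] + 1
            # Run thread t until it finishes or hits an unsatisfied arc.
            while nxt <= end_ic[t] and all(
                done[s] >= s_ic for s, s_ic in waits[(t, nxt)]
            ):
                order.append((t, nxt))
                done[t] = nxt
                nxt += 1
                progress = True
    if any(done[t] < end_ic[t] for t in done):
        raise ValueError("arcs cannot be satisfied within the replay window")
    return order


# The three arcs left after transitive reduction in the earlier example.
ARCS = [
    (("i", 4), ("j", 15)),
    (("j", 15), ("i", 7)),
    (("i", 7), ("j", 18)),
]
order = replay_order({"i": 4, "j": 15}, {"i": 7, "j": 22}, ARCS)
print(order)
```

The interleaving produced this way generally differs from the one that occurred at record time, but every conflicting pair of accesses is ordered the same way, so each load returns the same value.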

Example: False Sharing

P1                  P2
31 Flag=1           14 Private2:=2
32 X1:=5            15 $r1:=Flag
33 X2:=6            16 Bneq $r1,$r0,-1
34 Flag:=0          17 Nop
35 Private1:=3      18 $r1:=Flag
                    19 Bneq $r1,$r0,-1
                    20 Nop
                    21 Y:=X1
                    22 Z:=X2

Arc annotations, in the form IC (source processor, source IC):
  15 (P1,31)
  34 (P2,15)
  18 (P1,34)
  21 (P1,32)
  22 (P1,33)

Annotations added in later steps of the animation:
  21 (P1,33)
  35 (P2,14)