Upload
reuben
View
49
Download
0
Embed Size (px)
DESCRIPTION
Offline Symbolic Analysis to Infer Total Store Order. Dongyoon Lee † , Mahmoud Said*, Satish Narayanasamy † , and Zijiang James Yang* University of Michigan, Ann Arbor † Western Michigan University *. Deterministic Replay. What is deterministic replay? - PowerPoint PPT Presentation
Citation preview
- 1 -
Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, and Zijiang James Yang*
University of Michigan, Ann Arbor †
Western Michigan University *
Offline Symbolic Analysis toInfer Total Store Order
- 2 -
Deterministic Replay
What is deterministic replay?• Record and reproduce non-deterministic events• Program input (interrupt, I/O, DMA, etc.)• Shared-memory dependencies
Deterministic replay uses1) Debugging • Reproducing concurrency bugs• Time-travel debugging
2) Heavyweight dynamic analysis3) Forensics
- 3 -
Traditional Deterministic Replay Systems
Write
ReadRead
Log shared memory dependencies
Checkpoint Memory and Register State
Log non-deterministic program input Interrupts, I/O values, DMA, etc.
Thread 1 Thread 2 Thread 3
- 4 -
Recording Shared-Memory Dependencies
Problem• Need to monitor every memory operation
Software-based replay systems• PinSEL (Intel), iDNA (Microsoft)
• ODR (Berkeley), PRES (UCSD)
• Respec/DoublePlay (Michigan)
Hardware-based replay systems• RTR/ReRun (Wisconsin)
• Strata (UCSD)
• DeLorean (UIUC)
• LReplay (CAS)
• Timetraveler (Purdue)
→ 10-100x→ Replay is not guaranteed→ 2x throughput overhead
→ Complex because precise recorder
→ Only few systems support relaxed consistency model
: Total Store Order
- 5 -
Hardware Complexity in Previous Solutions
Support for precisely logging shared-memory dependencies• Monitor and log cache coherence messages
Support for Total Store Order (TSO) model• Detect sequential consistency (SC) violation • Log SC-violating loads
Complexity• Require invasive changes to coherence sub-system• Complex to design and verify • 9 design bugs in coherence mechanism of AMD64
[Narayanasamy et al. ICCD’06]
RTR [Xu et al. ASPLOS’06]
- 6 -
Our Approach
Complexity-effective solution [MICRO’09]
• Do NOT record shared-memory dependencies at all• Infer shared-memory dependencies offline• Using Satisfiability Modulo Theory (SMT) solver
Contribution• Find a causal order compliant to Total Store Order• Bound search space under TSO
- 7 -
System Overview
Write
ReadRead
Log shared memory dependency
Checkpoint Memory and Registers
Log non-deterministic program inputInterrupts, I/O values, DMA, etc.
BugNet [ISCA’05]Load-based Hardware Recorder
Satisfiability-Modulo-Theory (SMT) solver reconstructs thread interleaving offline
Checkpoint Registers
Thread 1 Thread 2 Thread 3
- 8 -
Roadmap
• Motivation• Background: Load-based logging architecture [ISCA’05]
Write Read
Read
BugNet [ISCA’05]Load-based Hardware Recorder
Satisfiability-Modulo-Theory (SMT) solver reconstructs interleaving offline
Checkpoint Registers
T1 T2 T3
• Offline Symbolic Analysis• Bounding Search Space• Evaluation• Conclusion
- 9 -
Insight• Recording initial register state and values of loads is sufficient for
deterministic replay• Implicitly captures the program input from I/O, DMA, interrupts, etc.• Input and output of other instructions are reproduced during replay
Optimization• Record a load only if it is the first access to a memory location
Processor support• Recording data fetched on a cache miss captures first loads• Any first access to a location would result in a cache miss• May unnecessarily record data due to store misses, but that is OK
Load-based Logging Architecture [Narayanasamy et al, ISCA’05][Lee et al. MICRO’09]
- 10 -
Recording Cache Miss Data (First Loads)
ExecutionTime
Log file
First Load
Checkpoint• Register Values• Program Counter
Load A = 0
Load A = 0
<cnt1, 0>
Load B = 5 <cnt2, 5>
Store C = 1
On a store miss • Record old value – data before store update • New value – data after store update – can be reproduced deterministically
Cache Miss
Checkpoint
Record cache misses• <Memory count , Data>• Implicitly capture first loads
<cnt3, 0>
Deterministic Replay• Input and output (including address) of all instructions are replayed
- 11 -
Cache Miss Logging for Multithreaded Programs
Insight• Load-based recorder (initial register state + loads) for each thread is
sufficient for replaying that thread=> Recording cache miss data is sufficient for multithreaded programs=> No additional hardware support required for recording dependencies
Reason • Load dependent on a remote write will cause a cache miss to ensure
coherence=> Implicitly records load values dependent on remote writes
Effect• Can replay each thread in isolation (independent of other threads)
regardless of underlying consistency model
- 12 -
Replaying Each Thread Independently
Proc 1 Proc 2
Load A=0
Load A=0
Load A=
Store A=1
Invalidation
Cache Coherence• Invalidates cache block to gain exclusive permission
Log cache miss data• Implicitly records loads dependent on remote writes• No change to coherence mechanism
(1st, 0)
(3rd, 1)
Proc 1 LOG
(1st, 0)
Proc 2 LOG
Cache Miss
Cache BlockInvalidated
1Replay each threadindependent of others
- 13 -
Shared Memory Dependency
Thread 1 Thread 2
Store
Load
Store
Load
Store
Store
Store
SMT Solver for finding shared memory dependencies
Billions of instructions• Offline analysis may not scale
Final State : A, B, C
Strata for bounding search space
A
B
C
A
A
B
C?Old value x New value
Address
Store AOld value
Address
New value
- 14 -
Roadmap
• Motivation• Load-based Logging Architecture• Offline Symbolic Analysis
Write Read
Read
BugNet[ISCA’09]Load-based Hardware Recorder
Satisfiability-Modulo-Theory (SMT) solver reconstructs interleaving offline
Checkpoint Registers
T1 T2 T3
• Bounding Search Space• Evaluation• Conclusion
- 15 -
Offline Symbolic Analysis
Step 1) Encode Ordering Constraints• Coherence Constraints• Memory Model Constraints (SC, TSO)
x1
x2
x 3
x 4
x 5
y
y1
2 y3
Coherence Constraints(X1 < X2) AND(X2 < X4 OR X5 < X2) AND…Memory Model Constraints(Y1 < X1 < X2 < Y2) AND(X3 < X4 < X5 < Y3) AND….
Step 2) Find a valid casual order• Satisfiability Module Theory (SMT) Solver• Yices [Dutertre and Moura CAV’06]
x1
x2
x 3
x 4
x 5
y
y1
2 y3
Final State
- 16 -
Offline Symbolic Analysis
Step 1) Encode Ordering Constraints• Coherence Constraints• Memory Model Constraints (SC, TSO)
x1
x2
x 3
x 4
x 5
y
y1
2 y3
Coherence Constraints(X1 < X2) AND(X2 < X4 OR X5 < X2) AND…Memory Model Constraints(Y1 < X1 < X2 < Y2) AND(X3 < X4 < X5 < Y3) AND….
Step 2) Find a valid casual order• Satisfiability Module Theory (SMT) Solver• Yices [Dutertre and Moura CAV’06]
x1
x2
x 3
x 4
x 5
y
y1
2 y3
Final State
`
- 17 -
Encoding Coherence Constraints
Proc 1 Proc 2
x1
x2
x3
x4
x5
xFinal
Coherence Constraints( M→old == M→prev→new)
X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND
Old value x New value Address
- 18 -
Multiple Memory Locations
x1
x2
x3
x 4
x 5
xFinal
Coherence Constraints( M→old == M→prev→new)
X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND :Y1: Y1 < Y2 ANDY2: Y2 < Y3 AND
:
y
y1
2 y3
yFinal
Proc 1 Proc 2
Old value x New value Address
- 19 -
Offline Symbolic Analysis
Step 1) Encode Ordering Constraints• Coherence Constraints• Memory Model Constraints (SC, TSO)
x1
x2
x 3
x 4
x 5
y
y1
2 y3
Coherence Constraints(X1 < X2) AND(X2 < X4 OR X5 < X2) AND…Memory Model Constraints(Y1 < X1 < X2 < Y2) AND(X3 < X4 < X5 < Y3) AND….
Step 2) Find a valid casual order• Satisfiability Module Theory (SMT) Solver• Yices [Dutertre and Moura CAV’06]
x1
x2
x 3
x 4
x 5
y
y1
2 y3
Final State
`
- 20 -
Encoding Sequential Consistency Constraints
Coherence Constraints( M→old== M→prev→new)
X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND :Y1: Y1 < Y2 ANDY2: Y1 < Y2 < Y3 AND
:
Memory Model Constraints(under Sequential Consistency)
P1 : Y1 < X1 < X2 < Y2 AND P2 : X3 < X4 < X5 < Y3 AND
x1
x2
x3
x 4
x 5
xFinal
y
y1
2 y3
yFinal
Proc 1 Proc 2
- 21 -
Alice = 1 Bob = 1
r1 = r2 = 0, r3 = r4 = 1
P1 P2
r1 = Bob r2 = Alice
Alice = Bob = 0
Alice = Bob = Charlie = 0P1 P2
Alice = 1 Bob = 1
r1 = r2 = 0, r3 = 1r1 = Bob r2 = Alice
Total Store Order Relaxations
Store→LoadAlice = 1 Bob = 1
r1 = Bob r2 = Alicer1 = r2 = 0
P1 P2Alice = Bob = 0
Load →Load
LoadSB-Hit→Load
r3 = CharlieCharlie = 1
r3 = Alice r4= Bob
Stores Store Buffer Hit
Log memory countsof LoadSB-Hit
SC
TSO
SC
TSO
SC
TSO
1 1
00
- 22 -
Encoding Total Store Order Constraints
x1
x2
x3
x 4
x 5
xFinal
y1
y2 y3yFinal
P1 : Y1 < X1 < X2 < Y2 P2 : X3 < X4 < X5 < Y3
AND X1 < Y2 AND X3 < Y3
Stores Store Buffer Hit
Store
Load
LoadSB-Hit
Load
Derived TSO schedule
x1
x2
x3
x 4
xFinal
1y
y Final
x 5
Encoded TSO constraints SMT Solver
y2 y3
- 23 -
Roadmap
• Motivation• BugNet• Offline Symbolic Analysis• Bounding Search Space• Evaluation• Conclusion
- 24 -
Bounding Search Space using Strata
Proc 1 Proc 2
N msgs
N msgs
Final State
hint 1 hint 2
hint 3 hint 4
Strata Region 3
Strata Region 2
Strata Region 1
Final State
Initial State
Final State
Initial State
Final State
Strata• After every N broadcasted
coherence messages, each processor records “Strata Hints”
Strata regions are ordered
SMT Solver• Analyzes one Strata region at a time• Starts from the last Strata region • Initial state of a Strata region = Final state of preceding region
Coherencemessages
Stratum
Stratum
- 25 -
… remains in store-buffer
… executed, not committed yet
Strata Hint = Mem. Count + I-Store CountStrata hint• Every processor records• memory count• number of in-flight stores in the store buffer (called I-Store)
• Reconstruct Stratum offline based on the memory count • Move the pending stores and dependent loads to the next Strata region
{1,1} {1,1}Log{1,1}
Log{1,1}
< Recording Strata Hints >
Alice = 1 Bob = 1
r1 = Bob r2 = Alice
r1 = r2 = 0
P1 P2
r1 = Bob r2 = Alicer1 = r2 = 0
Alice = 1 Bob = 1P1 P2
I-Store Cnt.
Mem Cnt. Mem Cnt.
I-Store Cnt.Alice = Bob = 00
0
0
0
1
1 1
1
< Reconstructing Strata Offline>
- 26 -
Filtering Local, Read-only & Cache-hit Accesses
Load A
Store BLoad B
Store A
Load C
Load CLoad C Load C
Load C
Strata Region
Thread 1 Thread 2Local accesses• No shared-memory dependencies
Read-only accesses• Any order is valid
Intermediate cache-hit accesses• No interleaved accesses on successive
cache hits
Effectiveness• < 1% of memory accesses remain
Store DStore DLoad DStore D
… Miss… Hit… Hit… Hit
Store D
Load D
- 27 -
Roadmap
• Motivation• Load-based Logging Architecture• Offline Symbolic Analysis• Evaluation• Conclusion
- 28 -
EvaluationSimulation Framework• Simics: simulate multi-processor execution (2, 4, 8,16 cores) • Modified FeS2 : simulate Total Store Order• Fast-forward up to known synchronization points• Simulated for 500 million instructions
Benchmarks• SPLASH2: barnes, fmm, ocean• PARSEC2.0: blackscholes, bodytrack, x264• SPEComp: wupwise, swim• Servers: apache, mysql
Offline Symbolic Analysis• Yices SMT solver [Dutertre and Moura CAV’06]
- 29 -
Program Input and LoadSB-Hit Log Size
barne
sfm
moc
ean
black
scho
les
body
track
x264
wupwise
swim
apac
hemys
ql
avera
ge0
100
200
300
400
500
600
Log
Size
(MB/
sec)
192MB/sec
On average, 192 MB/sec (8 threads, TSO)• Dominated by cache miss logging• 2.45% of load instructions read their values from the store buffer
Program Input (Data & Instr. Cache Misses) LoadSB-Hit
- 30 -
Strata Hint Log Size
barne
sfm
moc
ean
black
scho
les
body
track
x264
wupwise
swim
apac
hemys
ql
avera
ge1
10
100
1000
10000
Stra
ta L
og S
ize
(KB/
sec)
On average, 1.2MB/sec (8 threads, TSO, b-bound 10)• 15% increase over SC to log the number of pending stores (I-Store)
No hardware support for precise shared-memory dependency logging
1.2MB/sec
SC TSO (with I-Stores)
- 31 -
Offline Analysis Overhead
barne
sfm
moc
ean
black
scho
les
body
track
x264
wupwise
swim
apac
hemys
ql
avera
ge10
100
1000
10000
Offl
ine
anal
ysis
ove
rhea
d(s
ecs/
one
seco
nd o
f pro
g. e
xec.
)
On average, 260 seconds/sec (8 threads, TSO, b-bound 10)• Overall, 30% more efficient than SC
One time cost before replay
260secs
SC TSO
- 32 -
Performance Overhead etc.
Performance Overhead• Performance is mainly dominated by cache miss logging• Additional logging overhead for LoadSBH and I-Store is small
• On average, <1% slowdown in IPC
Paper includes more results on• Comparisons between different Strata bounding schemes• Effectiveness of cache hit filtering• Scalability results on different number of processors• …
- 33 -
Conclusion
Reproducing concurrent executions is a huge challenge
Our proposal: a complexity-effective hardware solution for TSO• Record cache miss data and store buffer hits• Record Stratum (memory count + store buffer count) to bound search
space• Determine shared memory dependencies using offline symbolic analysis
Result (8 threads, TSO)• Performance overhead: less than 1%• Total log size: 193 MB/sec (Program input + Strata hints)• Offline analysis: 260 seconds/sec (30% more efficient than SC)
- 34 -
Thank you
- 35 -