1
SigRace: Signature-Based Data Race Detection
Abdullah Muzahid, Dario Suarez*, Shanxiang Qi & Josep Torrellas
Computer Science Department, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
*Universidad de Zaragoza, Spain
2
Debugging Multithreaded Programs
“Debugging a multithreaded program has a lot in
common with medieval torture methods”
-- Random quote found via Google search
3
Data Race
• Two threads access the same variable without intervening synchronization and at least one is a write
T1              T2
lock L
x++
unlock L
                x++
• Hard to detect and reproduce
• Common bug
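The slide's example can be reproduced as a short sketch (Python here, purely illustrative): T1 increments x under lock L while T2 increments it with no synchronization, so the two accesses race.

```python
import threading

x = 0
lock = threading.Lock()

def t1():
    global x
    with lock:       # T1 increments under lock L
        x += 1

def t2():
    global x
    x += 1           # T2 increments with no lock: races with T1

a, b = threading.Thread(target=t1), threading.Thread(target=t2)
a.start(); b.start(); a.join(); b.join()
# x is usually 2, but the unsynchronized access makes the outcome
# timing-dependent in general: hard to detect and reproduce
```

Because the interleaving varies from run to run, the bug may only surface once in thousands of executions, which is exactly why dynamic detectors are needed.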
4
Dynamic Data Race Detection
• Mainly two approaches:
  – Lockset: finds violations of locking discipline
  – Happened-Before: finds concurrent conflicting accesses
7
Happened-Before Approach
• Each thread keeps a vector clock; Epoch: sync to sync

Thread 0                    Thread 1
[0, 0]                      [0, 0]
a
b
Lock L
[1, 0]
Unlock L
[2, 0]
c                           d
                            Lock L
                            [2, 0] + [0, 1] = [2, 1]
                            e

• a, b happened before e
• c, d unordered
• If one of two unordered accesses writes x (x =) and the other reads it (= x): Data Race
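The ordering test behind the happened-before approach can be sketched with vector clocks; the epoch values below are the ones from the example, and the function names are illustrative.

```python
def happened_before(a, b):
    # Vector-clock ordering: a happened before b iff every component of a
    # is <= the corresponding component of b, and the clocks differ.
    return all(x <= y for x, y in zip(a, b)) and a != b

def unordered(a, b):
    # Neither epoch happened before the other: accesses in them may race
    return not happened_before(a, b) and not happened_before(b, a)

# Epochs from the example: Thread 0 reaches [2, 0] after Unlock L;
# Thread 1's epoch after Lock L merges [2, 0] and [0, 1] into [2, 1]
assert happened_before([1, 0], [2, 1])   # a, b happened before e
assert unordered([2, 0], [0, 1])         # c, d unordered
```

Conflicting accesses in two unordered epochs are exactly the pairs a detector must flag.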
24
Software Implementation
• Need to instrument every memory access
  – 10x–50x slowdown
  – Not suitable for production runs
25
Hardware Implementation
P1 (cache C1)            P2 (cache C2)
x: ts1                   x: ts2
• Caches tag each line with a timestamp (TS)
• On a write (WR) by P2, the coherence transaction carries ts2 to C1, where it is checked against ts1 for a race
31
Limitations of HW Approaches
• Modify cache and coherence protocol
• Perform checking at least on every coherence transaction
• Lose detection ability when a cache line is displaced or invalidated
35
Our Contributions
• SigRace: novel HW mechanism for race detection based on signatures
  – Simple HW: cache and coherence protocol are unchanged
  – Higher coverage than existing HW schemes: detects races even if the line is displaced/invalidated
  – Usable on-the-fly in production runs
• SigRace finds 150% more injected races than a state-of-the-art HW proposal
36
Outline
• Motivation
• Main Idea
• Implementation
• Results
• Conclusions
37
Main Idea
Address Signature + Happened-before
38
Hardware Address Signatures
[Ceze et al, ISCA06]
• Addresses of loads/stores (from the ROB) are permuted and accumulated into a fixed-size signature
• Logical AND for intersection
• Has false positives but no false negatives
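A minimal software model of such a signature, assuming a Bloom-filter encoding with toy hash mixers standing in for the hardware permutation:

```python
SIG_BITS = 64  # SigRace uses 2-Kbit signatures; 64 bits keeps the sketch small

def encode(addr, sig=0):
    # Hash the address to k = 2 bit positions and OR them into the signature
    # (the hardware uses permute/H3-style hashing; these mixers are stand-ins)
    for seed in (0x9E3779B9, 0x85EBCA6B):
        h = (addr * seed) & 0xFFFFFFFF
        h ^= h >> 16
        sig |= 1 << (h % SIG_BITS)
    return sig

def build(addrs):
    sig = 0
    for a in addrs:
        sig = encode(a, sig)
    return sig

def may_intersect(sig1, sig2):
    # Logical AND of the bit vectors: zero proves the address sets are
    # disjoint (no false negatives); nonzero may be a false positive.
    return (sig1 & sig2) != 0

w1 = build([0x100, 0x140])    # write signature of one thread
r2 = build([0x100])           # read signature of another thread
assert may_intersect(w1, r2)  # the shared address is always caught
```

A shared address always sets the same bits in both signatures, so a real conflict can never be missed; unrelated addresses can collide on bits, which is the source of false positives.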
41
Using Signatures for Race Detection
• Epoch: sync to sync, with a timestamp TS and R/W signatures
• Block: a fixed number of dynamic instructions (not a cache block, basic block, or atomic block)
• At the end of each block, the (Sig, TS) pair is sent to the Race Detection Module and the signature is cleared (Sig1, Sig2, ... under TS1; then Sig3 under TS2 after the sync)
• The Race Detection Module intersects incoming signatures (Sig ∩ Sig) against those of other threads
53
On Chip Race Detection Module (RDM)
• The RDM resides on chip, with one queue per processor (Q1 for P1, Q2 for P2, ...)
• Each queue entry holds an epoch's timestamp and signatures: (T, R, W)
• When P2 deposits a new entry (T2, R2, W2), the RDM checks it against the other queues' entries (TJ, RJ, WJ), newest first:
  If T2 & TJ unordered:
    R2 ∩ WJ
    W2 ∩ WJ
    W2 ∩ RJ
  Else stop (older entries are ordered too)
• Done in Background
• Intersections can report False Positives
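The check the RDM performs can be sketched with toy vector clocks and bit-vector signatures standing in for the hardware structures (rdm_check and the queue layout are illustrative, not the paper's exact design):

```python
from collections import deque

def hb(a, b):
    # Vector-clock happened-before: componentwise <= and clocks differ
    return all(x <= y for x, y in zip(a, b)) and a != b

def intersects(s1, s2):
    # Signature intersection is a bitwise AND
    return (s1 & s2) != 0

def rdm_check(entry, other_queue):
    """entry = (T2, R2, W2); other_queue holds (TJ, RJ, WJ), oldest first.
    Scan newest-to-oldest: if the timestamps are unordered, intersect
    R2 & WJ, W2 & WJ, W2 & RJ; the first ordered entry stops the scan,
    since every older entry is ordered as well."""
    t2, r2, w2 = entry
    suspects = []
    for tj, rj, wj in reversed(other_queue):
        if hb(tj, t2) or hb(t2, tj):
            break  # ordered: no race with this or any older entry
        if intersects(r2, wj) or intersects(w2, wj) or intersects(w2, rj):
            suspects.append((tj, rj, wj))  # possible race (may be false positive)
    return suspects

q1 = deque([([1, 0], 0b0010, 0b0100)], maxlen=16)  # P1's logged epoch (T1, R1, W1)
hits = rdm_check(([0, 1], 0b0100, 0b1000), q1)     # P2's epoch, unordered with [1, 0]
assert hits                                        # R2 overlaps W1: a suspect pair
```

Any nonzero intersection only names a pair of suspect epochs; re-execution is still needed to tell a true race from a signature collision.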
67
Re-execution
• Needed to:
  – Discard false positives
  – Identify the accesses involved
68
Support for Re-execution
• Save synchronization history in TS Log
– Timestamp at sync points
• Log inputs (interrupts, sys calls, etc)
• Take periodic checkpoints: ReVive [Prvulovic et al, ISCA02]
69
Modes of Operation
• Normal Execution
• Re-Execution: bring the program to just before the race
• Race Analysis: pinpoint the racy accesses or discard the false positive
70
SigRace Re-execution Mode
• Can be done on another machine
• Start from a periodic checkpoint of memory state
• Threads T0, T1, T2 re-execute from the checkpoint up to the epochs s1 and s2 whose signatures intersected (Conflict Sig ∩ Conflict Sig): the suspected Data Race
• Use the TS Log to reproduce the original synchronization order
76
SigRace Analysis Mode
• Re-execute the suspect epochs just past the race point
• Each load/store is intersected with the conflict signature (ld ∩ Conflict Sig); hits are logged
• Each thread continues to its next sync point, then the logs are compared
• Pinpoints racy addresses, or identifies and discards false positives
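Analysis mode can be modeled as a membership test of each re-executed access against the conflict signature; the hash and the analyze helper below are illustrative stand-ins, not the paper's exact mechanism.

```python
SIG_BITS = 64

def encode(addr):
    # Toy 2-hash Bloom encoding of a single address
    sig = 0
    for seed in (0x9E3779B9, 0x85EBCA6B):
        h = (addr * seed) & 0xFFFFFFFF
        h ^= h >> 16
        sig |= 1 << (h % SIG_BITS)
    return sig

def analyze(accesses, conflict_sig):
    # During re-execution of the racy epoch, test each ld/st address
    # against the conflict signature; hits are candidate racy accesses.
    return [(pc, addr) for pc, addr in accesses
            if encode(addr) & conflict_sig == encode(addr)]

conflict = encode(0x1000)                    # signature of the conflicting access
hits = analyze([(0x400, 0x1000), (0x404, 0x2000)], conflict)
assert (0x400, 0x1000) in hits               # the true racy address is found
```

Hits from hash collisions are then compared across the two threads' logs, which is how false positives get identified and discarded.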
83
Outline
• Motivation
• Main Idea
• Implementation
• Results
• Conclusions
84
New Instructions
• collect_on
  – Enable R and W address collection in current thread
• collect_off
  – Disable R and W address collection in current thread
• sync_reached
  – Dump TS, R and W to the RDM over the network
  – Clear signatures (R = Ø, W = Ø)
  – Update TS (TS → TS`)
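The three instructions can be modeled in software as a small collector; the class and method names mirror the instruction names but the structure is otherwise illustrative.

```python
class Collector:
    """Software model of the collection hardware: collect_on/collect_off
    gate address capture; sync_reached dumps (TS, R, W) and clears."""
    def __init__(self):
        self.enabled = False
        self.R = 0
        self.W = 0

    def collect_on(self):
        self.enabled = True

    def collect_off(self):
        self.enabled = False

    def access(self, sig_bits, is_write):
        # OR the (already-hashed) address bits into the R or W signature
        if not self.enabled:
            return
        if is_write:
            self.W |= sig_bits
        else:
            self.R |= sig_bits

    def sync_reached(self, ts, rdm_queue):
        rdm_queue.append((list(ts), self.R, self.W))  # dump TS, R and W
        self.R = self.W = 0                           # clear signatures

c = Collector(); q = []
c.collect_on()
c.access(0b01, is_write=False)
c.access(0b10, is_write=True)
c.sync_reached([1, 0], q)
assert q == [([1, 0], 0b01, 0b10)] and c.R == 0 and c.W == 0
```

Updating the timestamp itself happens in the sync library (GenerateTS), so this model leaves TS to the caller.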
90
Modifications in Sync Libraries
• Synchronization object: variable plus a timestamp field
• Unlock macro:
UNLOCK (‘{
  sync_reached
  $1.timestamp = TS
  AppendtoTSLog(TS)
  unlock($1.lock)
}’)
• Lock macro:
LOCK (‘{
  lock($1.lock)
  TS = GenerateTS (TS, $1.timestamp)
}’)
• Transparent to Application Code
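The macro changes can be mimicked in a software sketch; SyncObject, generate_ts, and the explicit tid argument are illustrative stand-ins for the paper's $1 object, GenerateTS, and per-thread state, and the vector-clock join rule here is a common convention that the paper's GenerateTS may refine.

```python
import threading

class SyncObject:
    """Lock augmented with a timestamp field, as in the modified library."""
    def __init__(self, nthreads):
        self.lock = threading.Lock()
        self.timestamp = [0] * nthreads

def generate_ts(ts, obj_ts, tid):
    # Join the acquirer's clock with the clock stored at the last release,
    # then advance the acquirer's own component
    joined = [max(a, b) for a, b in zip(ts, obj_ts)]
    joined[tid] += 1
    return joined

def LOCK(obj, ts, tid):
    obj.lock.acquire()
    return generate_ts(ts, obj.timestamp, tid)

def UNLOCK(obj, ts, tid, ts_log):
    # sync_reached would dump (TS, R, W) to the RDM and clear signatures here
    obj.timestamp = list(ts)      # $1.timestamp = TS
    ts_log.append(list(ts))       # AppendtoTSLog(TS)
    obj.lock.release()
    return ts

log = []
L = SyncObject(2)
ts0 = LOCK(L, [0, 0], 0)
ts0 = UNLOCK(L, ts0, 0, log)
ts1 = LOCK(L, [0, 0], 1)
assert ts1 == [1, 1]   # joined with [1, 0] from the release, own component advanced
```

Because all of this lives inside the lock/unlock macros, application code compiles and runs unchanged.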
102
Other Topics in Paper
• Easy to virtualize
• Queue Overflow
• Detailed HW structures
103
Outline
• Motivation
• Main Idea
• Implementation
• Results
• Conclusions
104
Experimental Setup
• PIN: binary instrumentation tool
• Benchmarks: SPLASH2, PARSEC
• Default parameters:
  – # of proc: 8
  – Signature size: 2 Kbits
  – Block size: 2,000 ins
  – Queue size: 16 entries
  – Checkpoint interval: 1 million ins
105
Race Detection Ability
• Three configurations:
  – SigRace Default
  – SigRace Ideal: stores every signature between 2 checkpoints
  – ReEnact [Prvulovic et al, ISCA03]: cache-based approach with a timestamp per word

App           | SigRace Ideal | SigRace Default | ReEnact
Cholesky      | 16            | 16              | 16
Barnes        | 11            | 11              | 6
Volrend       | 27            | 27              | 18
Ocean         | 1             | 1               | 1
Radiosity     | 15            | 15              | 12
Raytrace      | 4             | 4               | 3
Water-sp      | 8             | 4               | 2
Streamcluster | 13            | 12              | 13
Total         | 95            | 90              | 70

• More coverage than ReEnact
• Coverage comparable to the ideal configuration
109
Injected Races
• Removed one dynamic synchronization per run
• Each application runs 25 times, each with a different synchronization eliminated
• More overall coverage than ReEnact
  – 150% more coverage
114
Conclusions
• Proposed SigRace:
  – Simple HW: cache and coherence protocol are unchanged
  – Higher coverage than existing HW schemes: detects races even if the line is displaced/invalidated
  – Usable on-the-fly in production runs
• SigRace finds 150% more injected races than word-based ReEnact
115
SigRace: Signature-Based Data Race Detection
Abdullah Muzahid, Dario Suarez*, Shanxiang Qi & Josep Torrellas
Computer Science Department, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
*Universidad de Zaragoza, Spain
116
Back Up Slides
117
Execution Overhead
• No overhead in generating signatures (done in HW)
• Additional instructions are negligible
• Main overheads:
  – Checkpointing (ReVive: 6.3%)
  – Network traffic (63 bytes per 1,000 ins, compressed)
  – Re-execution (depends on false positives & race position); can be done offline
118
Network Traffic Overhead
• 63 bytes per 1,000 instructions (less than 1 cache line)
119
Re-execution Overhead
• Instructions re-executed until the first true data race is analyzed count as overhead
• Along the way, re-execution may also encounter many false-positive races
• Instructions re-executed to analyze only the true race count as true overhead
• Instructions re-executed to filter out the false positives count as false overhead
120
Re-execution Overhead
• 22% overhead: modest
121
False Positives
• Parallel Bloom filters with H3 hash functions
• 1.57% false-positive rate: low
122
Virtualization
123
Virtualization
• RDM uses as many queues as the number of threads
• Timestamp is accessed by thread id
• Thread id remains same even after migration
• Timestamps, flags, conflict signature are saved and restored at context switch
• The RDM intersects incoming signatures against the signatures of all other threads (even inactive ones)
• Threads can be re-executed without any scheduling constraints
124
Scalability
• For a small number of processors, scalability is not a problem
• The operation of the RDM can be pipelined
  – Simple, repetitive operation
• Network traffic (compressed messages) is around 63 bytes per thousand instructions
• Checkpointing is an issue