Upload
others
View
8
Download
0
Embed Size (px)
Citation preview
Demand-Driven Software Race
Detection using Hardware
Performance CountersPerformance Counters
Joseph L. Greathouse†, Zhiqiang Ma‡, Matthew I. Frank‡
Ramesh Peri‡, Todd Austin†
†University of Michigan ‡Intel Corporation
CSCADS
Aug 2, 2011
In spite of proposed hardware solutions
Concurrency Bugs Still Matter
Bulk Memory
Commits
2
Commits
TRANSACTIONAL
MEMORYAMD
ASF
?
Sun
Rock
?
Concurrency Bugs Matter NOW
if(ptr==NULL)
if(ptr==NULL)
TIME
Thread 2mylen=large
Thread 1mylen=small
len2=thread_local->mylen;
ptr=malloc(len2);Nov. 2010 OpenSSL Security Flaw
3
memcpy(ptr, data2, len2)
ptrLEAKED
TIME
∅
len1=thread_local->mylen;
ptr=malloc(len1);
memcpy(ptr, data1, len1)
Nov. 2010 OpenSSL Security Flaw
This Talk in One Sentence
Speed up software race detection
with existing hardware support.with existing hardware support.
4
Software Data Race Detection
� Add checks around every memory access
� Find inter-thread sharing events
� Synchronization between write-shared
accesses?
� No? Data race.
5
Thread 2mylen=large
Thread 1mylen=small
if(ptr==NULL)
len1=thread_local->mylen;
ptr=malloc(len1);
memcpy(ptr, data1, len1)
Example of Data Race Detection
ptr write-shared?Interleaved
Synchronization?
TIME
if(ptr==NULL)
len2=thread_local->mylen;len2=thread_local->mylen;
ptr=malloc(len2);ptr=malloc(len2);
memcpy(ptr, data2, len2)memcpy(ptr, data2, len2)
6
TIME
SW Race Detection is Slow
150
200
250
300
Race Detector Slowdown (x) Phoenix PARSEC
7
0
50
100
his
tog
ram
km
ea
ns
line
ar_
reg
rC
ma
trix
_m
ulC
pca
str
ing_
ma
tch
wo
rd_
co
un
t
Ge
oM
ea
n
bla
cksch
ole
s
bo
dytr
ack
face
sim
ferr
et
fre
qm
ine
raytr
ace
sw
ap
tio
ns
flu
ida
nim
ate
vip
s
x2
64
ca
nn
ea
l
de
du
p
str
ea
mclu
sC
Ge
oM
ea
n
Race Detector Slowdown (x
Goal of this Work
Accelerate Software Data Race Detection
Technique #1: Making it Fast
Demand-Driven Data Race Detection
Technique #2: Keeping it Real
Find sharing events with existing HW
8
Inter-thread Sharing is What’s Important
“Data races ... are failures in programs that access and
update shared data in critical sections” – Netzer & Miller, 1992
if(ptr==NULL)
len1=thread_local->mylen;
ptr=malloc(len1);
memcpy(ptr, data1, len1)
Thread-local data
NO SHARING
Shared data
NO INTER-THREAD
TIME
9
if(ptr==NULL)
memcpy(ptr, data1, len1)
len2=thread_local->mylen;len2=thread_local->mylen;
ptr=malloc(len2);ptr=malloc(len2);
memcpy(ptr, data2, len2)memcpy(ptr, data2, len2)
NO INTER-THREAD
SHARING EVENTS
TIME
Very Little Inter-Thread Sharing
Phoenix PARSEC
40
50
60
70
80
90
100
Sharing Events
1.5
2
2.5
3
Sharing Events
10
0
10
20
30
40
% Write-Sharing Events
0
0.5
1
% Write-Sharing Events
Technique 1: Demand-Driven Analysis
Multi-threadedApplication
SoftwareRace Detector
11
Local
Access
Inter-thread
sharing
Inter-thread Sharing Monitor
Inter-thread Sharing Monitor
� Check each memory op. for write-sharing
� Signal software race detector on sharing
� Possible to do in software
+ Can be built now with instrumentation
– Slow. May take as long as race detection
12
� Follow read/write sets of threads
Fast user-level faults
Sharing Monitor
Thread 1
WRITE Y
Ideal Hardware Sharing Detector
Thread 2
READ Y
Thread 1
WRITE Y
T1
R: ∅
W: ∅
T2
R: ∅
W: ∅
Thread 2
READ Y
T1
R: ∅
W: {Y}
W->R
Sharing
� Fast user-level faults
13
Multi-threadedApplication
SoftwareRace Detector
Inter-thread Sharing Monitor
Limitations of Existing Hardware
� Fast faults
� Solution: Enable detector for long periods of time
Multi-threadedApplication
SoftwareRace Detector
NO
� Read/write sets
� Solution:
14
Inter-thread Sharing Monitor
NO
SHARING
Technique 2: Hardware Sharing Detector
� Hardware Performance Counters
� Interrupt on cache coherency events
� Intel’s HITM event: W→R Data Sharing
S
M
S
IHITM
� Limitations of this method:
� SMT sharing can’t be counted
� Cache eviction
� Others in paper
15
M IHITM
Demand-Driven Analysis on Real HW
Execute
Instruction
Disable HITM AnalysisNO
NO
16
SW Race
Detection
Enable
Analysis
AnalysisInterrupt?
Sharing
Recently?
Enabled?
NOYES
YES
YES
Experimental Evaluation
� Modified Intel Inspector XE Race Detector
� Linux on 4-core Core i7, no Hyper-Threading
� Performance Tests:
� Phoenix Suite� Phoenix Suite
� PARSEC
� Accuracy Tests:
� Phoenix Suite
� PARSEC
� Pre-release version of RADBench
17
Simulation
Performance Difference
150
200
250
300
Race Detector Slowdown (x) Phoenix PARSEC
18
0
50
100
his
tog
ram
km
ea
ns
line
ar_
reg
rC
ma
trix
_m
ulC
pca
str
ing_
ma
tch
wo
rd_
co
un
t
Ge
oM
ea
n
bla
cksch
ole
s
bo
dytr
ack
face
sim
ferr
et
fre
qm
ine
raytr
ace
sw
ap
tio
ns
flu
ida
nim
ate
vip
s
x2
64
ca
nn
ea
l
de
du
p
str
ea
mclu
sC
Ge
oM
ea
n
Race Detector Slowdown (x
Performance Increases
8
10
12
14
16
18
20
driven Analysis
Speedup (x)
Phoenix PARSEC
51x
19
0
2
4
6
8
his
tog
ram
km
ea
ns
line
ar_
reg
rC
ma
trix
_m
ulC
pca
str
ing_
ma
tch
wo
rd_
co
un
t
Ge
oM
ea
n
bla
cksch
ole
s
bo
dytr
ack
face
sim
ferr
et
fre
qm
ine
raytr
ace
sw
ap
tio
ns
flu
ida
nim
ate
vip
s
x2
64
ca
nn
ea
l
de
du
p
str
ea
mclu
sC
Ge
oM
ea
n
Demand-driven Analysis
Speedup (x
Demand-Driven Analysis Accuracy
8
10
12
14
16
18
20
driven Analysis
Speedup (x)
1/1 2/4 3/3 4/4 3/3 4/4 4/42/4 4/4 4/42/4
Accuracy vs.
Continuous Analysis:
20
0
2
4
6
8
his
tog
ram
km
ea
ns
line
ar_
reg
rC
ma
trix
_m
ulC
pca
str
ing_
ma
tch
wo
rd_
co
un
t
Ge
oM
ea
n
bla
cksch
ole
s
bo
dytr
ack
face
sim
ferr
et
fre
qm
ine
raytr
ace
sw
ap
tio
ns
flu
ida
nim
ate
vip
s
x2
64
ca
nn
ea
l
de
du
p
str
ea
mclu
sC
Ge
oM
ea
n
Demand-driven Analysis
Speedup (x)
Continuous Analysis:
97%
Future Directions
� Better Performance
� Fast user-level faults
� Application specific hardware
� More Accuracy
� Better performance counters� Better performance counters
� Inform SW on cache evictions/misses
� Smooth transition to ideal hardware
� Combine sampling & demand-driven analysis
21