Instant Profiling: Instrumentation Sampling for Profiling Datacenter Applications
Hyoun Kyu Cho¹, Tipp Moseley², Richard Hank², Derek Bruening², Scott Mahlke¹
¹University of Michigan, ²Google
Datacenter Applications
• In 2010, US datacenters consumed 70~90 billion kWh*
• Datacenter application performance is critical
  • Profiling can help
http://googleblog.blogspot.com
*[Koomey`11]
Challenges for Datacenters
• Need to run on live traffic
• Difficult to isolate
• Overheads
  • Value profiling: 3.8x slowdown¹
  • Path profiling: 31%; edge profiling: 16%²
• Binary management
  • Many programs, multiple versions
[Figure: Traditional Profiling. Labels: Source Code, Instrumentation Build, Instrumented Binary, Input Data, Training Run, Profile Data]
¹[Calder`99] ²[Ball`96]
Google-Wide Profiling [Ren et al.`10]
Continuous profiling infrastructure for datacenters
Negligible overhead
• Sampling based
• Aggregated profiling overhead less than 0.01%
Limitations
• Relies heavily on Performance Monitoring Units
• Limited flexibility and portability
Goals
Unified profiling infrastructure for datacenters
• Flexible types of profile data
• Portable across heterogeneous datacenters
While maintaining
• Low overhead
• No added burden on binary management
Sampling + Dynamic Binary Instrumentation
Instrumentation Sampling
[Figure: layered stack of application, operating system, and hardware, with DynamoRIO [Bruening`04]. DynamoRIO labels: dispatch, instrumentation engine, client, code cache, context switch]
[Figure: the same stack with a shepherding thread added. Labels: start profiling, stop profiling, dispatch, instrumentation engine, client, code cache]
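The start/stop signalling in the figure can be pictured as a periodic schedule. The following is a minimal sketch, assuming one fixed-length profiling burst at the start of every period; the function name `profiling_windows` and its parameters are illustrative and do not come from the slides or from DynamoRIO.

```python
from typing import List, Tuple


def profiling_windows(burst_ms: float, period_ms: float,
                      total_ms: float) -> List[Tuple[float, float]]:
    """Return (start, stop) times in ms at which a hypothetical shepherding
    thread would send 'start profiling' and 'stop profiling'."""
    if burst_ms <= 0 or period_ms <= 0 or burst_ms > period_ms:
        raise ValueError("need 0 < burst_ms <= period_ms")
    windows = []
    start = 0.0
    while start < total_ms:
        # Clip the last burst so it never runs past the end of the run.
        stop = min(start + burst_ms, total_ms)
        windows.append((start, stop))
        start += period_ms
    return windows


if __name__ == "__main__":
    # Assumed example: 2 ms of profiling every 1 s, over a 4 s run.
    for start, stop in profiling_windows(2, 1000, 4000):
        print(f"start profiling at {start} ms, stop at {stop} ms")
```

Between windows the application would run uninstrumented, which is where the overhead savings of sampling come from.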
Problems with Basic Implementation
• Unbounded profiling periods due to fragment linking
• Latency degradation due to initial instrumentation
• Multi-threaded programs
S/W Code Cache Pre-population
[Figure: layered stack of application, operating system, and hardware, with a shepherding thread. Labels: dispatch, instrumentation engine, client, code cache]
• Still have latency degradation for initial instrumentation phases
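To see why a cold code cache hurts the first sampling phases, consider a toy cost model. This is a sketch under invented assumptions (a fixed time budget per phase, a fixed one-time cost to build an instrumented fragment, and a made-up block trace); it is not DynamoRIO's implementation and the numbers are not from the slides.

```python
from typing import Iterable, List, Sequence


def cumulative_samples(phases: Sequence[Sequence[str]], budget: int,
                       build_cost: int, exec_cost: int,
                       prepopulated: Iterable[str] = ()) -> List[int]:
    """Count block executions recorded per sampling phase, cumulatively.

    A block that is not yet in the code cache pays build_cost once before
    it can run; every execution pays exec_cost.
    """
    cache = set(prepopulated)
    totals = []
    collected = 0
    for phase in phases:
        remaining = budget
        for block in phase:
            cost = exec_cost if block in cache else build_cost + exec_cost
            if cost > remaining:
                break  # the profiling burst ran out of time
            remaining -= cost
            cache.add(block)
            collected += 1
        totals.append(collected)
    return totals


if __name__ == "__main__":
    phases = [["A", "B", "C", "A", "B", "C"]] * 3
    print(cumulative_samples(phases, budget=12, build_cost=4, exec_cost=1))
    print(cumulative_samples(phases, budget=12, build_cost=4, exec_cost=1,
                             prepopulated={"A", "B", "C"}))
```

In this model the cold run collects fewer samples in the first phase because most of the budget goes into building fragments, while the pre-populated run collects a full phase's worth from the start.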
Multithreaded Program Support
• Sampling makes it possible to miss thread operations
• Forces Instant Profiling’s signal handler on every thread
• Enumerates all threads and sends a profiling start signal to each thread
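The enumerate-and-signal step can be imitated at a high level. The sketch below is only an analogy: it uses Python threads and per-thread `threading.Event` flags as a stand-in for the per-thread signal, and the names `start_profiling_all` and `demo` are hypothetical. The slides do not say how the native implementation enumerates threads or which signal it uses.

```python
import threading
from typing import Dict, List


def start_profiling_all(start_flags: Dict[str, threading.Event]) -> List[str]:
    """Enumerate live threads and raise the 'start profiling' flag of each
    registered one. Returns the names of the threads that were signalled."""
    signalled = []
    for thread in threading.enumerate():
        flag = start_flags.get(thread.name)
        if flag is not None:
            flag.set()  # stand-in for sending the profiling start signal
            signalled.append(thread.name)
    return sorted(signalled)


def demo(num_workers: int = 3) -> List[str]:
    """Start workers, signal them all, and return those that began profiling."""
    start_flags = {f"worker-{i}": threading.Event() for i in range(num_workers)}
    profiled: List[str] = []
    lock = threading.Lock()

    def worker(name: str) -> None:
        # Each worker blocks until it is told to start profiling.
        if start_flags[name].wait(timeout=5):
            with lock:
                profiled.append(name)

    threads = [threading.Thread(target=worker, args=(name,), name=name)
               for name in start_flags]
    for t in threads:
        t.start()
    start_profiling_all(start_flags)
    for t in threads:
        t.join()
    return sorted(profiled)


if __name__ == "__main__":
    print(demo())
```

The point of the pattern is that no thread is skipped: every thread alive at the start of a sampling phase receives the start request, so thread operations are not missed.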
Experimental Setup
• 6-core Intel Xeon 2.67GHz w/ 12MB L3
• 12GB main memory
• Linux kernel 2.6.32
• gcc 4.4.3 w/ -O3
• SPEC INT2006, BigTable, Web search
• Edge profiling client
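For readers unfamiliar with the edge profiling client named in the setup: an edge profile counts how often control transfers from one basic block to another. The following is a minimal sketch over a recorded block trace; the trace and block names are made up for illustration, and a real DBI client would update counters from instrumented code rather than post-process a trace.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def edge_profile(block_trace: Iterable[Hashable]) -> Counter:
    """Count (source block, destination block) transitions in a trace."""
    counts: Counter = Counter()
    prev = None
    first = True
    for block in block_trace:
        if not first:
            counts[(prev, block)] += 1
        prev = block
        first = False
    return counts


if __name__ == "__main__":
    trace = ["A", "B", "A", "B", "C", "A", "B"]
    for (src, dst), n in sorted(edge_profile(trace).items()):
        print(f"{src} -> {dst}: {n}")
```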
Naïve Edge Profiling
[Figure: bar chart of slowdown (y-axis, 0 to 50) per benchmark: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, web search, bigtable, a.mean]
Profiling Overhead
[Figure: bar chart of normalized execution time (y-axis, 0.90 to 1.30) per benchmark: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, web search, bigtable, a.mean. Series: 2ms/4s, 1ms/1s, 2ms/1s, 4ms/1s, 2ms/250ms]
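The series labels can be turned into duty cycles to compare how aggressive each configuration is. The sketch below assumes that a label such as 2ms/4s means 2 ms of profiling per 4 s period; the slides do not spell this out, so that reading is an assumption, as are the helper names.

```python
import re

_UNIT_MS = {"ms": 1.0, "s": 1000.0}


def parse_duration_ms(text: str) -> float:
    """Parse strings like '2ms' or '4s' into milliseconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(match.group(1)) * _UNIT_MS[match.group(2)]


def duty_cycle(label: str) -> float:
    """Fraction of time spent profiling for a 'burst/period' label."""
    burst, period = label.split("/")
    return parse_duration_ms(burst) / parse_duration_ms(period)


if __name__ == "__main__":
    for label in ["2ms/4s", "1ms/1s", "2ms/1s", "4ms/1s", "2ms/250ms"]:
        print(f"{label}: {duty_cycle(label):.2%} of time profiled")
```

Under this reading the five configurations are ordered from the least to the most time spent instrumented, spanning 0.05% to 0.8%.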
S/W Code Cache Pre-population
[Figure: cumulative number of samples (y-axis, 0 to 3,500,000) vs. sampling phases (x-axis, 0 to 9). Series: w/ pre-population, w/o pre-population]
Profiling Accuracy
[Figure: bar chart of profiling accuracy (y-axis, 0 to 100) per benchmark: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, web search, bigtable, a.mean. Series: 2ms/4s, 1ms/1s, 2ms/1s, 4ms/1s, 2ms/250ms]
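The slides do not define the accuracy metric. One common way to compare a sampled edge profile against a full one is the overlap of their normalized edge frequencies. The sketch below assumes that definition purely for illustration and should not be read as the authors' metric; the example counts are invented.

```python
from typing import Dict, Hashable


def overlap_accuracy(full: Dict[Hashable, int],
                     sampled: Dict[Hashable, int]) -> float:
    """Percentage overlap between two profiles after normalizing each to 1."""
    full_total = sum(full.values())
    sampled_total = sum(sampled.values())
    if full_total == 0 or sampled_total == 0:
        return 0.0
    edges = set(full) | set(sampled)
    overlap = sum(
        min(full.get(e, 0) / full_total, sampled.get(e, 0) / sampled_total)
        for e in edges
    )
    return 100.0 * overlap


if __name__ == "__main__":
    full = {("A", "B"): 60, ("B", "C"): 30, ("C", "A"): 10}
    sampled = {("A", "B"): 5, ("B", "C"): 5}
    print(f"{overlap_accuracy(full, sampled):.1f}%")
```

With this definition a sampled profile scores 100 only when its relative edge frequencies match the full profile exactly, and an edge that was never sampled contributes nothing.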
Asymptotic Accuracy
[Figure: cumulative accuracy (y-axis, 0 to 100) vs. sampling phases (x-axis, 0 to 140). Series: bigtable, web search]
Conclusion
• Low-overhead, portable, flexible profiling needed
• Instant Profiling
  • Combines sampling and DBI
  • Pre-populates S/W code cache
  • Tunable tradeoff between overhead and information
  • Provides eventual profiling accuracy
• Less than 5% overhead, more than 80% accuracy for naïve edge profiling client