ISPASS 2011
Characterizing Multi-threaded Applications based on Shared-Resource Contention
Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa
Department of Computer Science
University of Virginia
Motivation
The number of cores doubles every 18 months
Expected: performance proportional to the number of cores
One of the bottlenecks is shared-resource contention
For multi-threaded workloads, contention is unavoidable
To reduce contention, it is necessary to understand where and how the contention is created
Shared Resource Contention in Chip-Multiprocessors
[Figure: Intel Quad Core Q9550 with four cores (C0-C3), four L1 caches, two L2 caches, and a Front-Side Bus to memory.]
Scenario 1: Multi-threaded applications with co-runner
[Figure: cores C0-C3 with L1 caches, L2 caches, and memory; legend: Application 1 Thread, Application 2 Thread.]
Scenario 2: Multi-threaded applications without co-runner
[Figure: cores C0-C3 with L1 caches, L2 caches, and memory; legend: Application Thread.]
Shared-Resource Contention
Intra-application contention: contention among threads from the same application (no co-runners)
Inter-application contention: contention among threads from the co-running applications
Contributions
A general methodology to evaluate a multi-threaded application’s performance
Intra-application contention
Inter-application contention
Contention in the memory-hierarchy shared resources
Characterizing applications facilitates better understanding of the application’s resource sensitivity
Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks
Outline
Motivation
Contributions
Methodology
Measuring intra-application contention
Measuring inter-application contention
Related Work
Summary
Methodology
Designed to measure both intra- and inter-application contention for a targeted shared resource
Targeted shared resources: L1-cache, L2-cache, Front-Side Bus (FSB)
Each application is run in two configurations
Baseline: threads do not share the targeted resource
Contention: threads share the targeted resource
Multiple instances of the targeted resource
Determine contention by comparing performance (gathering hardware performance counters’ values)
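As an illustration of the two configurations, the sketch below picks core sets for an L2-cache experiment. It is only a sketch under stated assumptions: the topology in `L2_GROUPS` (which cores sit behind which L2 cache) and the function name `l2_placements` are hypothetical and do not come from the slides.

```python
from typing import Dict, Sequence, Tuple

# ASSUMPTION (hypothetical topology, not taken from the slides):
# each tuple lists the cores that sit behind one L2 cache.
L2_GROUPS: Sequence[Tuple[int, ...]] = ((0, 1), (2, 3))


def l2_placements(
    groups: Sequence[Tuple[int, ...]], num_threads: int = 2
) -> Dict[str, Tuple[int, ...]]:
    """Pick core sets for the baseline and contention runs (L2-cache target)."""
    if num_threads > len(groups) or num_threads > len(groups[0]):
        raise ValueError("topology too small for the requested thread count")
    # Baseline: one thread per L2 cache, so no two threads share the target.
    baseline = tuple(group[0] for group in groups[:num_threads])
    # Contention: all threads run on cores behind the same L2 cache.
    contention = tuple(groups[0][:num_threads])
    return {"baseline": baseline, "contention": contention}


if __name__ == "__main__":
    print(l2_placements(L2_GROUPS))
```

On Linux, a chosen core set could then be applied to a process with `os.sched_setaffinity(pid, cores)` before starting the measurement; whether the original experiments pinned threads this way is not stated in the slides.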
Outline
Motivation
Contributions
Methodology
Measuring intra-application contention (See paper)
Measuring inter-application contention
Related Work
Summary
Measuring inter-application contention: L1-cache
[Figure: Baseline Configuration and Contention Configuration, each showing cores C0-C3 with L1 caches, L2 caches, and memory; legend: Application 1 Thread, Application 2 Thread.]
Measuring inter-application contention: L2-cache
[Figure: Baseline Configuration and Contention Configuration, each showing cores C0-C3 with L1 caches, L2 caches, and memory; legend: Application 1 Thread, Application 2 Thread.]
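Measuring inter-application contention: FSB
[Figure: Baseline Configuration on the eight-core platform, with cores C0, C2, C4, C6 in one group and C1, C3, C5, C7 in the other, each group with four L1 caches and two L2 caches, connected to memory; legend: Application 1 Thread, Application 2 Thread.]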
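Measuring inter-application contention: FSB
[Figure: Contention Configuration on the eight-core platform, with cores C0, C2, C4, C6 in one group and C1, C3, C5, C7 in the other, each group with four L1 caches and two L2 caches, connected to memory; legend: Application 1 Thread, Application 2 Thread.]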
PARSEC Benchmarks
Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)
Experimental platform
Platform 1: Yorkfield
Intel Quad core Q9550
32 KB L1-D and L1-I cache
6 MB L2-cache
2 GB memory
Common FSB
[Figure: Yorkfield block diagram. Cores C0-C3, each with an L1 cache and an L1 hardware prefetcher (HW-PF); two L2 caches, each with an L2 HW-PF and an FSB interface; the FSB connects to the Memory Controller Hub (Northbridge) and memory.]
Experimental platform
Platform 2: Harpertown
[Figure: Harpertown block diagram. Eight cores in two groups (C0, C2, C4, C6 and C1, C3, C5, C7), each core with an L1 cache and an L1 HW-PF; each group has two L2 caches, each with an L2 HW-PF and an FSB interface; each group's FSB connects to the Memory Controller Hub (Northbridge) and memory.]
Performance Analysis
Inter-application contention
For the i-th co-runner:
\[ \mathrm{PercentPerformanceDifference}_i = \frac{\mathrm{PerformanceBase}_i - \mathrm{PerformanceContend}_i}{\mathrm{PerformanceBase}_i} \times 100 \]
Absolute performance difference sum (APDS):
\[ \mathrm{APDS} = \sum_i \left| \mathrm{PercentPerformanceDifference}_i \right| \]
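The two quantities above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function names are made up here, and "performance" stands for whatever single measured value is compared between the baseline and contention runs.

```python
from typing import Sequence


def percent_performance_difference(base: float, contend: float) -> float:
    """(PerformanceBase - PerformanceContend) * 100 / PerformanceBase."""
    if base == 0:
        raise ValueError("baseline performance must be non-zero")
    return (base - contend) * 100.0 / base


def apds(bases: Sequence[float], contends: Sequence[float]) -> float:
    """Absolute performance difference sum over all co-runners."""
    if len(bases) != len(contends):
        raise ValueError("need one baseline and one contention value per co-runner")
    return sum(
        abs(percent_performance_difference(b, c))
        for b, c in zip(bases, contends)
    )


if __name__ == "__main__":
    # Hypothetical measurements for two co-runners (illustrative numbers only).
    print(apds([100.0, 200.0], [90.0, 210.0]))
```

Taking the absolute value before summing means that positive and negative differences from different co-runners do not cancel each other out.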
Inter-application contention: L1-cache, for Streamcluster
[Figure: bar chart "Inter-application L1-cache Contention". x-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264); y-axis: Performance Difference (%), from -8 to 8.]
Inter-application L1-cache contention: Streamcluster
[Figure: bar chart "Inter-application L1-cache Contention". x-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264); y-axis: Performance Difference (%), from -8 to 8.]
Inter-application contention: L1-cache
Inter-application contention: L2-cache
Inter-application contention: FSB
Characterization
Benchmarks      L1-cache   L2-cache       FSB
Blackscholes    none       none           none
Bodytrack       inter      inter          intra
Canneal         intra      inter          intra
Dedup           inter      intra, inter   intra, inter
Facesim         inter      inter          intra
Ferret          intra      intra, inter   intra
Fluidanimate    inter      inter          intra
Raytrace        none       none           intra
Streamcluster   inter      inter          intra
Swaptions       none       none           none
Vips            intra      inter          inter
X264            inter      intra, inter   intra
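One way to work with this characterization programmatically is to store it as a lookup table and query it by resource and contention label. The sketch below is illustrative only: the names `TABLE`, `RESOURCES`, and `benchmarks_labeled` are invented here, and the entries simply transcribe the labels from the characterization slide.

```python
from typing import Dict, List, Tuple

RESOURCES: Tuple[str, ...] = ("L1-cache", "L2-cache", "FSB")

# Labels transcribed from the characterization table, one tuple per benchmark
# in the order given by RESOURCES.
TABLE: Dict[str, Tuple[str, str, str]] = {
    "Blackscholes": ("none", "none", "none"),
    "Bodytrack": ("inter", "inter", "intra"),
    "Canneal": ("intra", "inter", "intra"),
    "Dedup": ("inter", "intra, inter", "intra, inter"),
    "Facesim": ("inter", "inter", "intra"),
    "Ferret": ("intra", "intra, inter", "intra"),
    "Fluidanimate": ("inter", "inter", "intra"),
    "Raytrace": ("none", "none", "intra"),
    "Streamcluster": ("inter", "inter", "intra"),
    "Swaptions": ("none", "none", "none"),
    "Vips": ("intra", "inter", "inter"),
    "X264": ("inter", "intra, inter", "intra"),
}


def benchmarks_labeled(label: str, resource: str) -> List[str]:
    """Return benchmarks whose entry for `resource` includes `label`."""
    col = RESOURCES.index(resource)
    return sorted(
        name for name, row in TABLE.items() if label in row[col].split(", ")
    )


if __name__ == "__main__":
    print(benchmarks_labeled("inter", "FSB"))
```

A query such as `benchmarks_labeled("none", "L1-cache")` lists the benchmarks with no recorded L1-cache contention, which is the kind of lookup a contention-aware scheduler might start from.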
Summary
The methodology generalizes contention analysis of multi-threaded applications
New approach to characterize applications
Useful for performance analysis of existing and future architectures or benchmarks
Helpful for creating new workloads of diverse properties
Provides insights for designing improved contention-aware scheduling methods
Related Work
Cache contention: Knauerhase et al., IEEE Micro 2008; Zhuravlev et al., ASPLOS 2010; Xie et al., CMP-MSI 2008; Mars et al., HiPEAC 2011
Characterizing parallel workloads: Jin et al., NASA Technical Report 2009
PARSEC benchmark suite: Bienia et al., PACT 2008; Bhadauria et al., IISWC 2009
Thank you!