Characterizing Multi-threaded Applications based on Shared-Resource Contention

ISPASS 2011


Tanima Dey, Wei Wang, Jack W. Davidson, Mary L. Soffa

Department of Computer Science, University of Virginia

Motivation

The number of cores doubles every 18 months
Expected: performance ∝ number of cores
One of the bottlenecks is shared-resource contention

For multi-threaded workloads, contention is unavoidable

To reduce contention, it is necessary to understand where and how the contention is created

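Shared Resource Contention in Chip-Multiprocessors

[Figure: Intel Quad Core Q9550. Four cores (C0 to C3), each with its own L1 cache; two L2 caches, each shared by a pair of cores; both L2 caches reach memory over the Front-Side Bus.]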

Scenario 1: Multi-threaded applications with co-runner

[Figure: four cores (C0 to C3) with L1 caches, two L2 caches, and memory. Legend: Application 1 thread.]



Scenario 2: Multi-threaded applications without co-runner

[Figure: four cores (C0 to C3) with L1 caches, two L2 caches, and memory. Legend: Application thread.]

Shared-Resource Contention

Intra-application contention: contention among threads from the same application (no co-runners)

Inter-application contention: contention among threads from the co-running application

Contributions

A general methodology to evaluate a multi-threaded application's performance under:
Intra-application contention
Inter-application contention
Contention in the memory-hierarchy shared resources

Characterizing applications facilitates better understanding of the application's resource sensitivity

Thorough performance analyses and characterization of multi-threaded PARSEC benchmarks

Outline

Motivation
Contributions
Methodology
Measuring intra-application contention
Measuring inter-application contention
Related Work
Summary

Methodology

Designed to measure both intra- and inter-application contention for a targeted shared resource:
L1-cache, L2-cache
Front Side Bus (FSB)

Each application is run in two configurations:
Baseline: threads do not share the targeted resource
Contention: threads share the targeted resource

Multiple instances of the targeted resource

Determine contention by comparing performance (gathering hardware performance counters' values)
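The two configurations can be illustrated with a minimal sketch for the L2-cache case. It assumes a quad-core chip in which cores 0 and 1 share one L2 cache and cores 2 and 3 share the other; that core numbering is an assumption for illustration and is not taken from the slides. The function name `two_thread_configurations` is hypothetical. On a real Linux machine the sharing map would have to be read from the system (for example from `/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list`) before pinning threads.

```python
# Assumed topology (illustrative only): each tuple lists the cores that
# share one L2 cache. Real core numbering varies by machine and OS.
ASSUMED_L2_GROUPS = [(0, 1), (2, 3)]


def two_thread_configurations(l2_groups):
    """Return (baseline, contention) core lists for two threads.

    Baseline: the two threads sit under different L2 caches, so they
    do not share the targeted resource.
    Contention: the two threads sit under the same L2 cache, so they
    share the targeted resource.
    """
    first, second = l2_groups[0], l2_groups[1]
    baseline = [first[0], second[0]]
    contention = [first[0], first[1]]
    return baseline, contention


baseline_cores, contention_cores = two_thread_configurations(ASSUMED_L2_GROUPS)
print("baseline cores:  ", baseline_cores)
print("contention cores:", contention_cores)
```

The resulting core lists could then be handed to an affinity mechanism such as `taskset -c` so that each run uses only the chosen cores; the same pattern would apply to the L1-cache or the FSB with a different sharing map.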

Outline

Motivation
Contributions
Methodology
Measuring intra-application contention (See paper)
Measuring inter-application contention
Related Work
Summary

Measuring inter-application contention: L1-cache

[Figure: Baseline Configuration and Contention Configuration, each drawn on four cores (C0 to C3) with L1 caches, two L2 caches, and memory.]

Measuring inter-application contention: L2-cache

[Figure: two diagrams, each drawn on four cores (C0 to C3) with L1 caches, two L2 caches, and memory.]

Measuring inter-application contention: FSB

[Figure: eight cores in two groups (C0, C2, C4, C6 and C1, C3, C5, C7), each group with four L1 caches and two L2 caches, connected to memory.]

Measuring intra-application contention: FSB

[Figure: eight cores in two groups (C0, C2, C4, C6 and C1, C3, C5, C7), each group with four L1 caches and two L2 caches, connected to memory.]

PARSEC Benchmarks

Application Domain    Benchmark(s)
Financial Analysis    Blackscholes (BS), Swaptions (SW)
Computer Vision       Bodytrack (BT)
Engineering           Canneal (CN)
Enterprise Storage    Dedup (DD)
Animation             Facesim (FA), Fluidanimate (FL)
Similarity Search     Ferret (FE)
Rendering             Raytrace (RT)
Data Mining           Streamcluster (SC)
Media Processing      Vips (VP), X264 (X2)

Experimental platform

Platform 1: Yorkfield
Intel Quad Core Q9550
32 KB L1-D and L1-I cache
6 MB L2-cache
2 GB memory
Common FSB

[Figure: cores C0 to C3, each with an L1 cache and an L1 hardware prefetcher (HW-PF); two L2 caches, each with an L2 HW-PF and an FSB interface; the FSB leads to the Memory Controller Hub (Northbridge), which connects to memory over the link labeled MB.]

Experimental platform

Platform 2: Harpertown

[Figure: eight cores in two groups (C0, C2, C4, C6 and C1, C3, C5, C7). Each core has an L1 cache and an L1 hardware prefetcher (HW-PF); each group has two L2 caches, each with an L2 HW-PF and an FSB interface. Each group has its own FSB to the Memory Controller Hub (Northbridge), which connects to memory over the link labeled MB.]

Performance Analysis: Inter-application contention

For the i-th co-runner:

$$\text{PercentPerformanceDifference}_i = \frac{\text{PerformanceBase}_i - \text{PerformanceContend}_i}{\text{PerformanceBase}_i} \times 100$$

Absolute performance difference sum:

$$\text{APDS} = \sum_i \left| \text{PercentPerformanceDifference}_i \right|$$
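The two formulas above can be sketched in a few lines of Python. The function names and the sample measurements are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the talk.

```python
def percent_performance_difference(base, contend):
    """Percent difference between baseline and contention performance.

    Follows the slide's definition: (base - contend) * 100 / base.
    """
    return (base - contend) * 100.0 / base


def apds(measurements):
    """Absolute performance difference sum over all co-runners.

    `measurements` is a list of (base, contend) pairs, one per co-runner.
    """
    return sum(abs(percent_performance_difference(b, c))
               for b, c in measurements)


# Made-up numbers for two co-runners (illustrative only).
sample = [(200.0, 190.0), (100.0, 104.0)]
for base, contend in sample:
    print(percent_performance_difference(base, contend))
print(apds(sample))
```

Taking the absolute value before summing means that positive and negative differences across co-runners do not cancel out in the total.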

Inter-application contention: L1-cache, for Streamcluster

[Figure: "Inter-application L1-cache Contention". X-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Swaptions, Vips, X264). Y-axis: performance difference (%), from -8 to 8.]

Inter-application L1-cache contention: Streamcluster

[Figure: "Inter-application L1-cache Contention". X-axis: co-running benchmarks (Blackscholes, Bodytrack, Canneal, Dedup, Facesim, Ferret, Fluidanimate, Raytrace, Streamcluster, Swaptions, Vips, X264). Y-axis: performance difference (%), from -8 to 8.]

Inter-application contention: L1-cache

Inter-application contention: L2-cache

Inter-application contention: FSB

Characterization

Benchmarks       L1-cache    L2-cache        FSB
Blackscholes     none        none            none
Bodytrack        inter       inter           intra
Canneal          intra       inter           intra
Dedup            inter       intra, inter    intra, inter
Facesim          inter       inter           intra
Ferret           intra       intra, inter    intra
Fluidanimate     inter       inter           intra
Raytrace         none        none            intra
Streamcluster    inter       inter           intra
Swaptions        none        none            none
Vips             intra       inter           inter
X264             inter       intra, inter    intra

Summary

The methodology generalizes contention analysis of multi-threaded applications
New approach to characterize applications
Useful for performance analysis of existing and future architectures or benchmarks
Helpful for creating new workloads of diverse properties
Provides insights for designing improved contention-aware scheduling methods

Related Work

Cache contention: Knauerhase et al., IEEE Micro 2008; Zhuravlev et al., ASPLOS 2010; Xie et al., CMP-MSI 2008; Mars et al., HiPEAC 2011

Characterizing parallel workloads: Jin et al., NASA Technical Report 2009

PARSEC benchmark suite: Bienia et al., PACT 2008; Bhadauria et al., IISWC 2009

Thank you!
