38
An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors Jiacheng Zhao Institute of Computing Technology, CAS In Conjunction with Prof. Jingling Xue, UNSW, Australia Sep 11, 2013

An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors

Jiacheng Zhao

Institute of Computing Technology, CAS

In Conjunction with Prof. Jingling Xue,

UNSW, Australia

Sep 11, 2013

Page 2: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Problem – Resource Utilization in Datacenters

How?

2013/9/11

ASPLOS’09 by David Meisner+

Page 3: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Problem – Resource Utilization in Datacenters

Applications

Core Core

L1 L1

Co-Runners

Core Core

L1 L1

Shared Cache

Memory Controller

2013/9/11

Co-located applications

Contention for shared cache, shared IMC, etc.

Negative and unpredictable interference

Two types of applications

Batch – No QoS guarantees

Latency Sensitive - Attain high QoS

Co-location is disabled

Low server utilization

Lacking the knowledge of interference

Page 4: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Problem – Resource Utilization in Datacenters

Co-located applications

Contention for shared cache, shared IMC, etc.

Negative and unpredictable interference

Two types of applications

Batch – No QoS guarantees

Latency Sensitive - Attain high QoS

Co-location is disabled

Low server utilization

Lacking the knowledge of interference

2013/9/11

Page 5: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Problem – Resource Utilization in Datacenters

Co-located applications

Contention for shared cache, shared IMC, etc.

Negative and unpredictable interference

Two types of applications

Batch – No QoS guarantees

Latency Sensitive - Attain high QoS

Co-location is disabled

Low server utilization

Lacking the knowledge of interference

2013/9/11

Figure: Task placement in datacenters

[Micro’11 by Jason Mars+]

Page 6: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Goals: Predicting the interference

Quantitatively predict the cross-core performance interference

Applicable for arbitrarily co-locations

Identify any “safe” co-locations

Deployable for datacenters

2013/9/11

Page 7: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Intuition – Mining a model from large training data

2013/9/11

Using machine learning approaches

Training Set

Page 8: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Motivation example

2013/9/11

𝑃𝐷𝑚𝑐𝑓 =

0.485𝑃𝑏𝑤 + 0.183𝑃𝑐𝑎𝑐ℎ𝑒 − 0.138, 𝑖𝑓 𝑃𝑏𝑤 < 3.20.706𝑃𝑏𝑤 + 1.725𝑃𝑐𝑎𝑐ℎ𝑒 − 0.220, 𝑖𝑓 3.2 ≤ 𝑃𝑏𝑤 ≤ 9.60.907𝑃𝑏𝑤 + 3.087𝑃𝑐𝑎𝑐ℎ𝑒 − 0.561, 𝑖𝑓 𝑃𝑏𝑤 > 9.6

Page 9: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Outline

Introduction

Our Key Observations

Our Approach – Two-Phase Approach

Experimental Results

Conclusion

2013/9/11

Page 10: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Key Observations

Observation 1: The function depends only on the pressure on shared

resources, regardless of individual pressures from one co-runner.

For an application A, PDA = f(Pcache, Pbw)

(Pcache, Pbw) = g(A1,A2,…,Am)

2013/9/11

Page 11: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Key Observations

Observation 2:

The function f is piecewise.

2013/9/11

Page 12: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Key Observations

Naively, we can create A’s prediction model using brute-force approach

BUT, we can NOT apply brute force approach for each application!

Thousands of applications in one datacenter

Frequent software updates

Different generations of processors

Even steps for one application is expensive

Observation 3:

The function form is platform-dependent and application independent

Only the coefficients are application-dependent

2013/9/11

Page 13: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Outline

Introduction

Our Key Observations

Our Approach - Two-Phase Approach

Experimental Results

Conclusion

2013/9/11

Page 14: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Approach - Two-Phase Approach

2013/9/11

Phase 1: Get the abstract model

Find a function form best suitable for all applications on a given platform

Phase 2: Instantiate the abstract model

Determine the application-specific coefficients (a11, etc.)

Training Applications

Co-running Trainer

Heavy – many training workloads

Run once for one platform

𝑃D =

𝑎11𝑃𝑏𝑤 + 𝑎12𝑃𝑐𝑎𝑐ℎ𝑒 + 𝑎13, 𝑠𝑢𝑏𝑑𝑜𝑚𝑎𝑖𝑛1𝑎21𝑃𝑏𝑤 + 𝑎22𝑃𝑐𝑎𝑐ℎ𝑒 + 𝑎23, 𝑠𝑢𝑏𝑑𝑜𝑚𝑎𝑖𝑛2𝑎31𝑃𝑏𝑤 + 𝑎32𝑃𝑐𝑎𝑐ℎ𝑒 + 𝑎33, 𝑠𝑢𝑏𝑑𝑜𝑚𝑎𝑖𝑛3

OneApplication

Co-running Trainer

Light-weighted, with a small number of trainings

Run once for one application

𝑃𝐷𝑚𝑐𝑓 =

0.49𝑃𝑏𝑤 + 0.18𝑃𝑐𝑎𝑐ℎ𝑒 − 0.13, 𝑃𝑏𝑤 < 3.20.71𝑃𝑏𝑤 + 1.73𝑃𝑐𝑎𝑐ℎ𝑒 − 0.22, 𝑜𝑡ℎ𝑒𝑟𝑠

0.91𝑃𝑏𝑤 + 3.09𝑃𝑐𝑎𝑐ℎ𝑒 − 0.56, 𝑃𝑏𝑤 > 9.6

Page 15: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Approach - Two-Phase Approach

2013/9/11

Page 16: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Approach - Two-Phase Approach

2013/9/11

Q1: What are selected as

application features

Q2: How?

Q3: What’s the cost of the training?

Page 17: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Approach – Some Key Points

2013/9/11

Q1: What are selected as application features?

Runtime profiles

Shared cache consumption

Bandwidth consumption

Page 18: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Approach – Some Key Points

2013/9/11

Q2: How to create the abstract model?

Regression analysis

Configurable

Each configuration

binding to a function form

Searching for the best function form for all applications in the training set

Page 19: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Our Approach – Some Key Points

2013/9/11

Q3: What’s the cost of the training when instantiation

Cover all sub-domains of the piecewise function, say S

Constant points for each sub-domain, say C

The constant depends on the form of abstraction model

C*S training runs in total

Usually C and S are small, our experience: C=4, S=3

Page 20: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Outline

Introduction

Our Key Observations

Our Approach - Two-Phase Approach

Experimental Results

Conclusion

2013/9/11

Page 21: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Experimental Results

2013/9/11

Accuracy of our two-phase regression approach

Prediction precision

Error analysis

Deployment in a datacenter

Utilization gained

QoS enforced and violated

Page 22: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Experimental Results

2013/9/11

Benchmarks:

SPEC2006

Nine real-world datacenter applications

Nlp-mt, openssl, openclas, MR-iindex, etc.

Platforms:

Intel quad-core Xeon E5506 (main)

Datacenter:

300 quad-core Xeon E5506

Page 23: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Some Predictor Function

2013/9/11

Page 24: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Prediction precision for SPEC Benchmarks

2013/9/11

Prediction Error: Average 0.2%, from 0.0% to 8.6%.

Page 25: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Prediction precision for datacenter applications

15 workloads for each datacenter applications

2013/9/11

Prediction Error: Average 0.3%, from 0.0% to 5%.

Page 26: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Error Distribution

2013/9/11

-4.00%

-3.00%

-2.00%

-1.00%

0.00%

1.00%

2.00%

3.00%

4.00%

Error Distribution

Page 27: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Prediction Efficiency

Precision

Two-Phase:

0.0~11.7%, Average: 0.40%

Brute-Force

0.0~10.1%, Average: 0.23%

Efficiency

co-running: ~200 12

2013/9/11

0%

10%

20%

30%

40%

50%

60%

70%

80%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20P

erf

orm

ance

De

grad

atio

nWorkload ID

Real Two-Phase Brute-Force

Page 28: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Benefits of piecewise predictor functions

2013/9/11

Page 29: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Benefits of piecewise predictor functions

2013/9/11

Page 30: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Deployment in a datacenter

2013/9/11

300 quad-core Xeon 1200 tasks when fully occupied

Applications Latency sensitive: Nlp-mt

machine translation

600 dedicated cores, 2/chip

Batch job

600 tasks, kmeans, MR

Our Purpose QoS policy

Issue batch jobs to idle cores

Page 31: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Cross-platform applicability

Six-core Intel Xeon

2013/9/11

0%

20%

40%

60%

80%

1 6 11 16 21 26 AVG

Per

form

ance

Deg

rad

atio

n

Workload ID

Real Predicted

Prediction Error: Average 0.1%, range from 0.0% to 10.2%

Page 32: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Cross-platform applicability

Quad-core AMD

2013/9/11

Prediction Error: Average 0.3%, range from 0.0% to 5.1%

0%

10%

20%

30%

40%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 AVG

Per

form

ance

Deg

rad

atio

n

Workload ID

Real Predicted

Page 33: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Outline

Introduction

Our Key Observations

Our Approach - Two-Phase Approach

Experimental Results

Conclusion

2013/9/11

Page 34: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Conclusion

An empirical model, based on our key observations

Using aggregated resource consumptions to create the predictor function, thus

working for arbitrarily co-locations

Piecewise is reasonable and effective

Breaking the model creation into two phases, for efficiency

2013/9/11

Page 35: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

2013/9/11

Page 36: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Backup slides

2013/9/11

How to make the training set representative?

Partition the space into grids

Sample for each grid

Page 37: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Backup slides

2013/9/11

How to do domain partitioning?

Specified in configuration file

Syntax: (shared resourcei, conditioni), e.g. (Pbw, equal(4))

Empirical knowledge to perform this task

#Aggregation#Pre-Processing: none, exp(2), log(2), pow(2)#mode: add, mul

#Domain Partitioning: {((Pbw), equal(4)), ((Pcache), equal(4)), ((Pcache, Pbw), equal(4, 4))},#Function: linear, polynomial(2)

Page 38: An Empirical Model for Predicting Cross-Core Performance ...conferences.inf.ed.ac.uk/pact2013/slides/JiachengZhao.pdf · Latency Sensitive - Attain high QoS ... Thousands of applications

Backup slides

Two sources of error:

Estimation for shared resources

consumption

L2 LinesIn

Phase behavior of applications

2013/9/11