Performance and Power Aware CMP Thread Allocation. Yaniv Ben-Itzhak, Prof. Israel Cidon, Dr. Avinoam Kolodny. Department of Electrical Engineering, Technion – Israel Institute of Technology.


Page 1: Performance and Power Aware CMP Thread Allocation

Performance and Power Aware CMP Thread Allocation

Yaniv Ben-Itzhak, Prof. Israel Cidon, Dr. Avinoam Kolodny
Department of Electrical Engineering, Technion – Israel Institute of Technology

Page 2: Performance and Power Aware CMP Thread Allocation

Performance and Power Aware CMP Thread Allocation

max Performance^α / Power

Thread Allocation

[Figure: threads are allocated across the CMP cores; each core has a private L1 cache, and all cores share an L2 cache]

Page 3: Performance and Power Aware CMP Thread Allocation

Performance-Power Trade-Off

- Performance maximization: use all the cores → high power consumption
- Power minimization: use a single core → low performance

[Figure: CMP layout legend: core, router, shared cache ($)]

Page 4: Performance and Power Aware CMP Thread Allocation

Performance Power Metric (PPM)

PPM = Performance^α / Power

α expresses the preferred tradeoff between performance and power:
less power (smaller α) ↔ more performance (larger α)
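As a concrete sketch of the metric (the performance, power, and α values in the assertions are arbitrary numbers, not results from the talk):

```python
def ppm(performance, power, alpha):
    """Performance Power Metric: performance raised to the tradeoff
    exponent alpha, divided by power. Larger alpha rewards performance
    more; alpha = 1 weighs performance and power equally."""
    return performance ** alpha / power
```

For a fixed power budget, increasing α raises the relative score of higher-performance allocations, which is exactly the knob the rest of the talk tunes.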

Page 5: Performance and Power Aware CMP Thread Allocation

Outline

Performance and Power Model

Thread Allocation

Numerical Results


Page 6: Performance and Power Aware CMP Thread Allocation

Simplified Performance Model
Single coarse-grain multi-threaded core

The model is an extension of Agarwal's model* for asymmetric threads.

For simplicity we assume:
- No sharing effect: the miss rate doesn't depend on the number of threads (holds for a small number of threads and a large private cache)
- The miss rate and total memory accesses don't vary over time
- No context-switch overhead

* "Performance tradeoffs in multithreaded processors", A. Agarwal, IEEE Transactions on Parallel and Distributed Systems, 1992

Page 7: Performance and Power Aware CMP Thread Allocation

Terminology
Single coarse-grain multi-threaded core

- Thread i runs δi clocks until it suffers an L1 cache miss.
- T: clocks to fetch from the shared cache: T = h·t + T_L2$, where h is the number of hops, t is the hop latency in clocks, and T_L2$ is the L2 cache access time.

[Figure: execution timeline. Thread 1 runs δ1 clocks and misses, thread 2 runs δ2 clocks and misses, and the core then idles until thread 1's cache response arrives, T clocks after its miss.]

Page 8: Performance and Power Aware CMP Thread Allocation

Memory Bound Case

When Σk δk ≤ T + maxk δk:

- Each thread gets executed every T + maxk δk clocks
- Core utilization: Σk δk / (T + maxk δk)
- Thread i performance: δi / (T + maxk δk)

[Figure: execution timeline. The core idles between the last thread's cache miss and the first cache response.]

Page 9: Performance and Power Aware CMP Thread Allocation

CPU Bound Case

When Σk δk ≥ T + maxk δk → saturation:

- Each thread is executed every Σk δk clocks
- Core utilization: 1
- Thread i performance: δi / Σk δk

[Figure: execution timeline with M threads. The core is always busy; each cache response returns before its thread's next execution slot.]
Page 10: Performance and Power Aware CMP Thread Allocation

Saturation

Saturation threshold: the number of allocated threads at which Σk δk = T + maxk δk, where T = h·t + T_L2$ (hops × hop latency + cache access time). Below the threshold each thread's performance is δi / (T + maxk δk); beyond it, per-thread performance falls as δi / Σk δk.

[Figure: performance per thread vs. number of threads for a 1-hop core, with the saturation threshold marked]

Page 11: Performance and Power Aware CMP Thread Allocation

Saturation

[Figure: the same plot with a 2-hop curve added; more hops mean a larger T, hence a higher saturation threshold]

Page 12: Performance and Power Aware CMP Thread Allocation

Saturation

[Figure: the same plot with curves for more hops; the saturation threshold grows with the core's distance from the shared cache]

Page 13: Performance and Power Aware CMP Thread Allocation

Power Model

Core power consumption:

Core Power = η·P_active + (1−η)·P_idle,  0 < η ≤ 1
Core Power = 0,  η = 0

where η is the core utilization, P_active is the power consumption of a fully utilized core, and P_idle is the idle core power consumption.
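The two-case power formula can be sketched directly (utilization stands in for η; the P_active and P_idle values used in the checks are made-up numbers):

```python
def core_power(utilization, p_active, p_idle):
    """Power drawn by one core: zero when the core is off, otherwise a
    linear blend of active and idle power weighted by utilization."""
    if utilization == 0:
        return 0.0
    return utilization * p_active + (1 - utilization) * p_idle
```

Note the discontinuity at zero utilization: powering a core on at all costs at least P_idle, which is what motivates the Minimum Utilization concept later.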

Page 14: Performance and Power Aware CMP Thread Allocation

Outline

Performance and Power Models

Thread Allocation

Numerical Results


Page 15: Performance and Power Aware CMP Thread Allocation

The Thread Allocation Problem

Given:
- A CMP topology composed of M identical cores
- P applications, each with Ti symmetric threads (1 ≤ i ≤ P)
- α: the preferred tradeoff between performance and power (α ≥ 1)

Find the thread allocation ni(c) (threads of application i on core c) which maximizes

PPM = (Average Thread Performance)^α / Power

For simplicity:
1) We assume that ni(c) is continuous
2) We discretize the result afterwards

Page 16: Performance and Power Aware CMP Thread Allocation

Minimum Utilization (MU)

Activating a core increases the power consumption by at least P_idle, so to justify operating a core, a corresponding increase in performance is required.

MU is the minimum utilization which justifies operating a core.

Reminder: Core Power = η·P_active + (1−η)·P_idle for 0 < η ≤ 1, and 0 for η = 0.

Page 17: Performance and Power Aware CMP Thread Allocation

Minimum Utilization (MU) Calculation

Compare the PPM values of two cases:
1) All threads executed by m over-saturated cores
2) Threads executed by m cores at exactly threshold saturation, with the (m+1)th core's utilization equal to MU

MU is the utilization at which PPM(1st case) = PPM(2nd case).

Page 18: Performance and Power Aware CMP Thread Allocation

Minimum Utilization (MU) Calculation

Example for m = 1:
- 1st case: all threads are executed by a single over-saturated core.
- 2nd case: the first core is at threshold saturation and the remaining threads are executed by the second core; power increases by P_idle.

[Figure: PPM (MIPS²/Power, ×10⁴) vs. threads. The curves cross where the first core is at threshold saturation and the second core's utilization equals MU, i.e. where PPM(1st case) = PPM(2nd case).]

Page 19: Performance and Power Aware CMP Thread Allocation

Minimum Utilization (MU) Calculation

Equating the PPM values of the two cases (m over-saturated cores vs. m threshold-saturated cores plus an (m+1)th core at utilization MU) gives:

((m + MU)/m)^α = (m·P_active + MU·P_active + (1−MU)·P_idle) / (m·P_active)

Page 20: Performance and Power Aware CMP Thread Allocation

Minimum Utilization (MU): Approximated Value and α Dependency

For large m, the balance equation yields the approximation:

MU ≈ P_idle / ((α−1)·P_active + P_idle)

[Figure: minimum utilization (%) vs. α]

- Smaller α: power is more important; operate a core only if it is highly utilized.
- Larger α: performance is more important; operate a core even if its utilization is low.

Less power ↔ more performance
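A quick numeric check of the approximation (a sketch based on the formula as reconstructed above; the P_active and P_idle values are made-up):

```python
def min_utilization(alpha, p_active, p_idle):
    """Approximate minimum utilization that justifies powering on
    another core (large-m approximation of the PPM balance equation)."""
    return p_idle / ((alpha - 1) * p_active + p_idle)
```

At α = 1 the formula demands 100% utilization before a core is worth switching on, and MU drops monotonically as α grows, matching the slide's qualitative picture.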

Page 21: Performance and Power Aware CMP Thread Allocation

The Thread Allocation Algorithm: Highlights

- Iterative: in each iteration, the threads with the highest cache miss rate are allocated to the core closest to the shared cache, up to at most threshold saturation.
- Threshold: a core is operated only if the MU threshold is achieved.

Iterative + Threshold → Algorithm: ITA

Page 22: Performance and Power Aware CMP Thread Allocation

Outline

Performance and Power Model

Thread Allocation Problem

Numerical Results


Page 23: Performance and Power Aware CMP Thread Allocation

How to evaluate ITA's PPM? Compare average PPM values of:
- ITA
- Equal Utilization
- Optimization algorithms

Scenarios: 2-8 cores and 2-8 applications, with parameters drawn from uniform distributions:
T_L2$ ~ U(10,30) cycles, δi ~ U(10,40) cycles, h ~ U(1,8) hops, U(1,40) cycles, Ti ~ U(1,25) threads, t ~ U(1,6)

Page 24: Performance and Power Aware CMP Thread Allocation

Equal Utilization Comparison

[Figure: PPM(ITA) / PPM(Equal Utilization) vs. number of applications and cores. Annotations: 3.6 cores used at (2,5), 4.7 cores, 7.2 cores, and 7.9 cores at (5,8).]

Average improvement of 47%.

Page 25: Performance and Power Aware CMP Thread Allocation

Comparison with Optimization Methods

ITA is compared against the best PPM of:
- Constrained nonlinear optimization
- Pattern search algorithm
- Genetic algorithm

These methods were run for 10,000x longer than ITA.

Page 26: Performance and Power Aware CMP Thread Allocation

Optimization Methods Comparison

[Figure: PPM(ITA) / PPM(Optimization Method) vs. number of applications and cores. Annotations: ITA 3.6 cores vs. Opt. 4.6 at (2,5); ITA 4.7 vs. 7.1; ITA 7.1 vs. 7.9; ITA 7.9 vs. 8 at (5,8).]

Average improvement of 9%.

Page 27: Performance and Power Aware CMP Thread Allocation

Summary
- Tunable performance-power metric
- Minimum Utilization concept
- An approach for low-computational-cost thread allocation on CMP

Future work:
- Extension for distributed caches: threads and data co-allocation
- Sharing effect consideration
- Heterogeneous CMPs

Page 28: Performance and Power Aware CMP Thread Allocation

Questions?


Page 29: Performance and Power Aware CMP Thread Allocation

Backup


Page 30: Performance and Power Aware CMP Thread Allocation

Performance Power Metric

Follows definitions used in logic circuit design: if E is the energy and t is the delay, Penzes & Martin* introduce the metric E·t^α, where α becomes larger as performance becomes more important.

PPM = Performance^α / Power

* "Energy-delay efficiency of VLSI computations", Penzes, P.I., Martin, A.J., 12th ACM Great Lakes Symposium on VLSI, 2002.

Page 31: Performance and Power Aware CMP Thread Allocation

Minimum Utilization (MU) Calculation, Cont.

The MU value depends on how many cores are already operating:

((m + MU)/m)^α = (m·P_active + MU·P_active + (1−MU)·P_idle) / (m·P_active)

For a large enough value of m, the MU value is constant, so an approximate constant value is reasonable (keep it simple).

[Figure: MU of the (m+1)th core vs. m, for α = 1, 1.5, 2, 2.5, 3]

Page 32: Performance and Power Aware CMP Thread Allocation

MU vs. P_idle/P_active

[Figure: minimum utilization (%) vs. P_idle/P_active, for α = 1, 1.2, 1.4, 1.6, 1.8, 2]

Page 33: Performance and Power Aware CMP Thread Allocation

Previous Work

Ding et al., Pennsylvania State University:
- Their work: thread allocation based on integer linear programming that maximizes a performance/power metric on single-threaded cores.
- My work: maximizes the tunable metric performance^α/power and deals with multi-threaded cores.

Miao et al., Jiaotong University:
- Their work: first maximizes performance and then minimizes power consumption, using genetic and greedy algorithms.
- My work: optimizes performance and power simultaneously.

Page 34: Performance and Power Aware CMP Thread Allocation

Previous Work Neglecting the Sharing Effect

Fedorova et al., "Chip multithreading systems need a new operating system scheduler":
- Its goal is to highly utilize the cores: it pairs high-IPC tasks with low-IPC tasks to reduce pipeline resource contention.
- Neglects the sharing effect among threads (similar to my research).
- Doesn't take into account the varying distances of cores from the L2 shared cache.
- Doesn't consider the power consumption.

Page 35: Performance and Power Aware CMP Thread Allocation

Discretization

There are many discretization methods; we use the Histogram Specification method (from image processing):

m_i^(1) = round(n_i^(1));  m_i^(c) = round(Σ_{k=1}^{c} n_i^(k)) − round(Σ_{k=1}^{c−1} n_i^(k)),  2 ≤ c ≤ M
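The cumulative-rounding rule can be sketched as follows (a sketch of the formula as reconstructed above; `n` holds one application's continuous per-core thread counts):

```python
def discretize(n):
    """Round a continuous per-core thread allocation to integers while
    preserving running totals: core c receives round(cumulative sum up
    to c) minus what the earlier cores already received."""
    m, prev = [], 0
    for c in range(1, len(n) + 1):
        cum = round(sum(n[:c]))
        m.append(cum - prev)
        prev = cum
    return m

# discretize([1.4, 1.4, 1.2]) -> [1, 2, 1]; the total of 4 threads is preserved
```

Rounding cumulative sums (rather than each entry independently) keeps the total thread count intact, so no threads are lost or invented by discretization.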

Page 36: Performance and Power Aware CMP Thread Allocation

Results Discretization Example

[Figure: number of threads vs. core hop distance, comparing the continuous (C) and discretized (D) allocations per core]

On average, results discretization reduces the PPM value by 5%.

Page 37: Performance and Power Aware CMP Thread Allocation

Results Discretization


Page 38: Performance and Power Aware CMP Thread Allocation

ITA flowchart:

1. Initialize: current core = the closest core to the shared cache; current application = the application with the highest miss rate.
2. Allocate threads of the current application on the current core, up to at most threshold saturation.
3. If all threads of the current application were allocated:
   - Last application → finish.
   - Otherwise, current application = the unallocated application with the highest cache miss rate; continue from step 2.
4. If threads remain and the current core is at the saturation threshold:
   - If this is the last core, or the unallocated threads would not achieve MU on the next available closer core → allocate all remaining threads over the already-operating cores (over-saturation) and finish.
   - Otherwise, current core = the next available core closest to the shared cache; continue from step 2.
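A minimal sketch of this loop, with simplified stand-ins for the model's tests: a fixed per-core thread count `core_capacity` plays the saturation threshold, and the MU check compares the next core's prospective load against that capacity (both are simplifications of the δ/T-based conditions above):

```python
def ita_allocate(apps, core_capacity, mu, num_cores):
    """Sketch of the ITA loop. apps: (miss_rate, threads) per
    application. Returns threads per operating core, closest to the
    shared cache first."""
    # Applications sorted by decreasing miss rate: their threads are
    # placed on the cores closest to the shared cache first.
    order = sorted(apps, key=lambda a: a[0], reverse=True)
    pending = [t for _, t in order]   # threads left per application
    cores = [0]
    for i in range(len(pending)):
        while pending[i] > 0:
            room = core_capacity - cores[-1]
            if room == 0:
                # Current core reached threshold saturation.
                remaining = sum(pending[i:])
                if (len(cores) == num_cores or
                        min(remaining, core_capacity) / core_capacity < mu):
                    # Next core would miss MU (or no core is left):
                    # over-saturate the already-operating cores instead.
                    for j in range(remaining):
                        cores[j % len(cores)] += 1
                    return cores
                cores.append(0)
                continue
            take = min(room, pending[i])
            cores[-1] += take
            pending[i] -= take
    return cores
```

For example, two applications totalling 12 threads with an 8-thread capacity and MU = 0.5 fill the closest core to saturation and open a second core for the remaining 4 threads; raising MU to 0.6 instead over-saturates the first core.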

Page 39: Performance and Power Aware CMP Thread Allocation

Time Complexity Comparison: Optimization Methods to ITA Ratio

ITA consumes on average 0.01%, and at most 2.5%, of the minimum computational effort required by the optimization methods, while outperforming the best of the optimization methods by 9%.

[Figure: ITA operations / minimum of optimization methods' operations, vs. number of applications and cores]

Page 40: Performance and Power Aware CMP Thread Allocation

δi = 1 / (rm,i · mri)

where rm,i is the ratio of memory access instructions out of the total instruction mix of thread i, and mri is the cache miss rate of thread i.
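This definition translates directly (the rates in the example comment are made-up values, and the function assumes roughly one instruction per clock):

```python
def delta_i(mem_instr_ratio, miss_rate):
    """Average clocks a thread runs between L1 misses: one miss occurs
    every 1/(memory-instruction ratio * miss rate) instructions."""
    return 1.0 / (mem_instr_ratio * miss_rate)

# e.g. 25% memory instructions with a 2% miss rate: one miss per 200 clocks
```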