Performance and Power Aware CMP Thread Allocation

Yaniv Ben-Itzhak, Prof. Israel Cidon, Dr. Avinoam Kolodny
Department of Electrical Engineering, Technion – Israel Institute of Technology
Performance and Power Aware CMP Thread Allocation

Goal: find the thread allocation that maximizes Performance^α / Power.

Two extremes:
- Performance maximization: use all the cores – high power consumption.
- Power minimization: use a single core – low performance.

[Figure: CMP layout – cores with private L1 caches, routers, and a shared L2 cache; threads are allocated across the cores.]
Performance Power Trade Off

Performance Power Metric (PPM), in short "PPM":

PPM = Performance^α / Power , α ≥ 1

α expresses the preferred tradeoff between performance and power:
less power ↔ more performance (smaller α ↔ larger α).
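Concretely, the metric can be written as a one-line function (a minimal sketch; the function name and the normalization of performance and power are illustrative):

```python
def ppm(performance, power, alpha):
    """Performance Power Metric: performance**alpha / power.

    alpha >= 1 tilts the metric toward performance; alpha = 1
    is the plain performance/power ratio.
    """
    return performance ** alpha / power

# Doubling performance at double power is neutral for alpha = 1,
# but a win for any alpha > 1.
```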
Outline
- Performance and Power Model
- Thread Allocation
- Numerical Results
Simplified Performance Model

Single coarse-grain multi-threaded core.

The model is an extension of Agarwal's model* for asymmetric threads.

For simplicity we assume:
- No sharing effect: the miss rate doesn't depend on the number of threads (holds for a small number of threads and a large private cache).
- The miss rate and the total number of memory accesses don't vary over time.
- No context-switch overhead.

* "Performance tradeoffs in multithreaded processors", A. Agarwal, IEEE Transactions on Parallel and Distributed Systems, 1992.
Terminology

Single coarse-grain multi-threaded core:
- Thread i runs δi clocks until it suffers an L1 cache miss.
- T – clocks to fetch from the shared cache: T = h·t + T_L2, where h is the number of hops to the shared cache, t is the hop latency (clocks), and T_L2 is the shared (L2) cache access time.

[Figure: execution timeline of two threads – thread 1 runs δ1 clocks and misses, thread 2 runs δ2 clocks and misses; each miss is served after T clocks, leaving idle time whenever T exceeds the other thread's run length.]
Memory Bound Case

When Σ_k δ_k ≤ max_k δ_k + T, the core has idle time. Each thread gets executed every max_k δ_k + T clocks:

Core utilization = Σ_k δ_k / (max_k δ_k + T)
Thread i performance = δ_i / (max_k δ_k + T)

[Figure: two-thread timeline showing idle time between each cache response and the next execution slot.]

CPU Bound Case

When Σ_k δ_k ≥ max_k δ_k + T, the core saturates. Each thread is executed every Σ_k δ_k clocks:

Core utilization = 1 → saturation
Thread i performance = δ_i / Σ_k δ_k
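Both cases collapse into a single expression, since the scheduling period is max(Σ_k δ_k, max_k δ_k + T). A minimal sketch of the per-core model (the function name is mine):

```python
def core_model(deltas, T):
    """Core utilization and per-thread performance for a single
    coarse-grain multithreaded core.

    deltas: clocks each thread runs between L1 misses (delta_i)
    T:      clocks to fetch from the shared cache (T = h*t + T_L2)
    """
    # Memory bound: period = max(deltas) + T (the core idles);
    # CPU bound:    period = sum(deltas)   (the core is saturated).
    period = max(sum(deltas), max(deltas) + T)
    utilization = sum(deltas) / period
    perf = [d / period for d in deltas]  # useful clocks per clock for thread i
    return utilization, perf

# Memory bound: two threads cannot hide a 20-clock miss latency,
# so utilization is (10 + 2) / (10 + 20) = 0.4.
u, p = core_model([10, 2], T=20)
```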
Performance Per Thread – Saturation Threshold

The saturation threshold is the number of threads k_sat at which Σ_k δ_k first reaches max_k δ_k + T.

[Figure: performance per thread vs. number of threads, plotted for 1 hop, 2 hops, and more hops from the shared cache. Per-thread performance is flat below the saturation threshold and decays beyond it; a larger hop distance (larger T) moves the saturation threshold to a higher thread count.]
Power Model

Core power consumption:
- P_active – power consumption of a fully utilized core
- P_idle – idle core power consumption

Core Power(ρ) = ρ·P_active + (1 − ρ)·P_idle , 0 < ρ ≤ 1
Core Power(0) = 0

where ρ is the core utilization.
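Under this model an operating core always pays at least P_idle, with the rest scaling linearly in utilization ρ (a minimal sketch; names are mine):

```python
def core_power(rho, p_active, p_idle):
    """Core power: rho*P_active + (1-rho)*P_idle for 0 < rho <= 1,
    and zero for a core that is switched off (rho == 0)."""
    if rho == 0:
        return 0.0
    return rho * p_active + (1 - rho) * p_idle
```

The jump from 0 to at least P_idle as soon as ρ > 0 is what makes opening an extra core a non-trivial decision, and it is what motivates the Minimum Utilization threshold.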
Outline
- Performance and Power Models
- Thread Allocation
- Numerical Results
The Thread Allocation Problem

Given:
- A CMP topology composed of M identical cores.
- P applications, each with T_i symmetric threads (1 ≤ i ≤ P).
- α – the preferred tradeoff between performance and power.

Find the thread allocation n_i^(c) (threads of application index i on core index c) which maximizes

PPM = (Average Thread Performance)^α / Power , α ≥ 1

For simplicity:
1) We assume that n_i^(c) is continuous.
2) The result is then discretized.
Minimum Utilization (MU)

Activating a core increases the power consumption by at least P_idle. To justify operating a core, a corresponding increase in performance is required.

MU is the minimum utilization which justifies operating a core.

Reminder: Core Power(ρ) = ρ·P_active + (1 − ρ)·P_idle for 0 < ρ ≤ 1, and 0 for ρ = 0.
Minimum Utilization (MU) Calculation

Compare the PPM values of two cases:
- 1st case: the threads are executed by m over-saturated cores.
- 2nd case: the threads are executed by m cores at exactly threshold saturation, while the (m+1)th core's utilization equals MU.

MU is the utilization at which PPM(1st case) = PPM(2nd case).

Example (m = 1):
- 1st case: all threads are executed by a single over-saturated core.
- 2nd case: the first core is at threshold saturation and the remaining threads are executed by the second core; power increases by P_idle.

[Figure: PPM (MIPS²/Power) vs. number of threads for both cases; the point where the curves meet defines the MU of the second core.]
Minimum Utilization (MU) Calculation (cont.)

With m over-saturated cores (1st case) versus m cores at threshold saturation plus an (m+1)th core at utilization MU (2nd case), equating the two PPM values gives:

m^α / (m·P_active) = (m + MU)^α / ((m + MU)·P_active + (1 − MU)·P_idle)
Minimum Utilization (MU) – Approximated Value and α Dependency

For large m, the MU equation reduces to the approximation

MU ≈ P_idle / ((α − 1)·P_active + P_idle)

Less power ↔ more performance:
- Smaller α: power is more important – operate a core only if it is highly utilized.
- Larger α: performance is more important – operate a core even if its utilization is low.

[Figure: Minimum Utilization (%) as a function of α, decreasing as α grows.]
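The exact MU can be found numerically from the PPM equality and checked against the large-m approximation (a minimal sketch; the bisection solver and all names are mine):

```python
def mu_exact(m, alpha, p_active, p_idle):
    """Solve PPM(m over-saturated cores) = PPM(m saturated cores plus
    one extra core at utilization MU) for MU, by bisection on (0, 1]."""
    def gap(mu):
        lhs = m ** alpha / (m * p_active)
        rhs = (m + mu) ** alpha / ((m + mu) * p_active + (1 - mu) * p_idle)
        return rhs - lhs  # > 0 once the extra core pays off
    lo, hi = 1e-9, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if gap(mid) > 0:
            hi = mid  # already worthwhile at mid, so MU is lower
        else:
            lo = mid
    return (lo + hi) / 2

def mu_approx(alpha, p_active, p_idle):
    """Large-m approximation: MU ~ P_idle / ((alpha-1)*P_active + P_idle)."""
    return p_idle / ((alpha - 1) * p_active + p_idle)
```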
The Thread Allocation Algorithm – Highlights

- Iterative: in each iteration, the threads with the highest cache miss rate are allocated to the core closest to the shared cache, until at most threshold saturation.
- Threshold: a core is operated only if the MU threshold is achieved.

Iterative Threshold Algorithm (ITA)
Outline
- Performance and Power Model
- Thread Allocation Problem
- Numerical Results
How to evaluate ITA's PPM? Compare average PPM values of ITA against Equal Utilization and optimization algorithms.

Scenarios: 2–8 cores and 2–8 applications, using the following distributions:
δ_i ~ U(10, 30) cycles, T_cache ~ U(10, 40) cycles, h ~ U(1, 8) hops, t ~ U(1, 40) cycles, T_i ~ U(1, 25) threads, α ~ U(1, 6)
Equal Utilization Comparison

Compared metric: PPM_ITA / PPM_EqualUtilization.

Average improvement of 47%.

[Figure: PPM ratio over (applications, cores) scenarios, annotated with the average number of operated cores: 3.6 cores at (2,5), 4.7 cores, 7.2 cores, and 7.9 cores at (5,8).]
Comparison with Optimization Methods

ITA is compared against the best PPM of:
- Constrained nonlinear optimization
- Pattern search algorithm
- Genetic algorithm

These methods were run for 10,000× longer than ITA.

Compared metric: PPM_ITA / max(PPM_OptimizationMethods).

Average improvement of 9%.

[Figure: PPM ratio over (applications, cores) scenarios, annotated with operated cores – (2,5): ITA 3.6 vs. Opt. 4.6; ITA 4.7 vs. Opt. 7.1; ITA 7.1 vs. Opt. 7.9; (5,8): ITA 7.9 vs. Opt. 8.]
Summary
- Tunable Performance Power Metric
- Minimum Utilization concept
- Approach for low-computational-cost thread allocation on CMP

Future work:
- Extension for distributed cache: threads and data co-allocation
- Sharing effect consideration
- Heterogeneous CMPs
Questions?

Backup
Performance Power Metric

Follows definitions used in logic circuit design: if E is the energy and t is the delay, Penzes & Martin* introduced the metric E·t^α, where α becomes larger as the performance becomes more important. Analogously, here PPM = Performance^α / Power.

* "Energy-delay efficiency of VLSI computations", Penzes, P.I., Martin, A.J., 12th ACM Great Lakes Symposium on VLSI, 2002.
Minimum Utilization (MU) Calculation – Cont.

The MU value depends on how many cores are already operating: the MU of the (m+1)th core is a function of m, via

m^α / (m·P_active) = (m + MU)^α / ((m + MU)·P_active + (1 − MU)·P_idle)

For a large enough value of m, the MU value becomes constant, so using an approximate constant value is reasonable (keep it simple).

[Figure: MU of the (m+1)th core vs. m, for α = 1, 1.5, 2, 2.5, 3.]
MU vs. P_idle/P_active

[Figure: Minimum Utilization (%) vs. P_idle/P_active, for α = 1, 1.2, 1.4, 1.6, 1.8, 2.]
Previous Work

Ding et al., Pennsylvania State University:
- Their work: thread allocation based on integer linear programming that maximizes a performance/power metric on single-threaded cores.
- My work: maximizes the tunable metric performance^α/power, and deals with multi-threaded cores.

Miao et al., Jiaotong University:
- Their work: first maximize performance and then minimize power consumption, using genetic and greedy algorithms.
- My work: optimizes performance and power simultaneously.
Previous Work – Neglecting the Sharing Effect

Fedorova et al., "Chip multithreading systems need a new operating system scheduler":
- Its goal is to highly utilize the cores; it tries to pair high-IPC tasks with low-IPC tasks in order to reduce pipeline resource contention.
- Neglects the sharing effect among threads (similar to my research).
- Doesn't take into account the varying distances of cores from the L2 shared cache, and doesn't consider power consumption.
Discretization

There are many discretization methods; we use the histogram-specification method (from image processing):

∀i: m_i^(1) = round(n_i^(1)) ; m_i^(c) = round(Σ_{k=1}^{c} n_i^(k)) − Σ_{k=1}^{c−1} m_i^(k) , 2 ≤ c ≤ M
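The cumulative-rounding rule can be sketched directly (a minimal sketch; the function name is mine): each core receives the rounded running total minus whatever was already handed out, so per-application totals are preserved.

```python
def discretize(n):
    """Round a continuous per-core allocation n[c] to integers m[c]
    via cumulative rounding: m[c] = round(sum(n[:c+1])) - sum(m[:c]).
    The integer running totals track the continuous ones within 0.5."""
    m, total = [], 0.0
    for nc in n:
        total += nc
        m.append(round(total) - sum(m))
    return m

# Plain per-entry rounding of [1.4, 1.4, 1.2] would allocate only
# 3 threads; cumulative rounding keeps the true total of 4:
# discretize([1.4, 1.4, 1.2]) -> [1, 2, 1]
```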
Results Discretization – Example

[Figure: number of threads vs. core hop distance, comparing the continuous allocation (C) with its discretized version (D).]

On average, results discretization reduces the PPM value by 5%.
ITA Flowchart

Initialize: current core = the core closest to the shared cache; current application = the application with the highest miss rate.

1. Allocate threads of the current application onto the current core, until at most threshold saturation.
2. If all the threads of the current application were allocated:
   - Yes, last application: finish.
   - Yes, not last application: current application = the unallocated application with the highest cache miss rate; go to step 1.
3. Otherwise the current core is at the saturation threshold:
   - If it is not the last core and all the unallocated threads achieve MU on the next available closer core: current core = the next available closer core to the shared cache; go to step 1.
   - Otherwise (last core, or MU not achieved): allocate all remaining threads over the already-operating cores (over-saturation); finish.
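The flow above can be sketched in code under the continuous model, approximating threshold saturation of a core as Σ_k δ_k = max_k δ_k + T (all names, and the exact saturation bookkeeping, are my simplifications, not the authors' implementation):

```python
def ita(apps, core_T, mu):
    """Iterative Threshold Algorithm: a simplified continuous sketch.

    apps:   (delta_i, T_i) pairs -- clocks between L1 misses and thread
            count per application; small delta_i means a high miss rate.
    core_T: shared-cache fetch latency T per core, ordered from the
            core closest to the shared cache outward.
    mu:     minimum utilization that justifies opening another core.
    Returns a per-core list of {application index: allocated threads}.
    """
    d_max = max(d for d, _ in apps)
    # Capacity of core c in run-length clocks: threshold saturation
    # is reached when sum_k(delta_k) == max_k(delta_k) + T.
    cap = [d_max + T for T in core_T]
    order = sorted(range(len(apps)), key=lambda i: apps[i][0])  # high miss rate first
    left = [float(t) for _, t in apps]
    alloc = [{} for _ in core_T]
    used = [0.0] * len(core_T)
    c = 0
    for i in order:
        delta = apps[i][0]
        while left[i] > 1e-9:
            fit = min(left[i], (cap[c] - used[c]) / delta)
            if fit > 0:
                alloc[c][i] = alloc[c].get(i, 0.0) + fit
                used[c] += fit * delta
                left[i] -= fit
            if left[i] <= 1e-9:
                break
            # Current core is at threshold saturation: open the next
            # core only if the remaining threads would utilize it >= MU.
            rest = sum(apps[j][0] * left[j] for j in range(len(apps)))
            if c + 1 < len(core_T) and rest / cap[c + 1] >= mu:
                c += 1
            else:
                # Spread the remainder over already-operating cores
                # (over-saturation), then stop growing.
                share = left[i] / (c + 1)
                for cc in range(c + 1):
                    alloc[cc][i] = alloc[cc].get(i, 0.0) + share
                left[i] = 0.0
    return alloc
```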
Time Complexity Comparison – Optimization Methods to ITA Ratio

ITA consumes on average 0.01%, and at most 2.5%, of the minimum computational effort required by the optimization methods, while outperforming the best of the optimization methods by 9%.

[Figure: ITA operations / minimum of optimization-methods operations, per (applications, cores) scenario.]
δ_i = 1 / (r_{m,i} · r_i)

where r_{m,i} is the ratio of memory-access instructions out of the total instruction mix of thread i, and r_i is the cache miss rate of thread i.
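As a worked example (assuming one instruction per clock, so clocks between misses equal instructions between misses; names are mine):

```python
def delta(mem_ratio, miss_rate):
    """Average clocks a thread runs between L1 misses: a miss occurs
    once every 1/(r_m * r) instructions, one instruction per clock."""
    return 1.0 / (mem_ratio * miss_rate)

# 30% memory-access instructions with a 2% miss rate:
# delta(0.3, 0.02) -> one L1 miss every ~167 clocks
```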