Performance and Power Aware CMP Thread Allocation


Yaniv Ben-Itzhak, Prof. Israel Cidon, Dr. Avinoam Kolodny
Department of Electrical Engineering, Technion – Israel Institute of Technology

Performance and Power Aware CMP Thread Allocation

Goal: a thread allocation that maximizes Performance^α / Power.

[Figure: a set of threads being allocated onto a CMP; each core has a private L1 cache, and a single L2 cache is shared by all cores.]

Performance maximization: use all the cores - high power consumption.
Power minimization: use a single core - low performance.

[Figure: CMP layout; each tile contains a core, a private cache ($), and a router; the shared cache sits at one edge of the chip.]

Performance-Power Trade-Off

Performance Power Metric (in short, PPM):

PPM = Performance^α / Power

α captures the preferred tradeoff between performance and power: a smaller α favors less power, a larger α favors more performance.
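As a quick illustration with hypothetical numbers: an allocation A with performance 2 and power 3 scores PPM = 2/3 at α = 1 but 4/3 at α = 2, while an allocation B with performance 1 and power 1 scores 1 at both; so B is preferred at α = 1 and A at α = 2 - increasing α shifts the preference toward performance.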

Outline

Performance and Power Model

Thread Allocation

Numerical Results


Simplified Performance Model - single coarse-grain multi-threaded core

The model is an extension of Agarwal's model* to asymmetric threads.
* "Performance tradeoffs in multithreaded processors", A. Agarwal, IEEE Transactions on Parallel and Distributed Systems, 1992.

For simplicity we assume:
No sharing effect - the miss rate does not depend on the number of threads (holds for a small number of threads and a large private cache).
The miss rate and the total number of memory accesses do not vary over time.
No context-switch overhead.

Thread i runs δi clocks until it suffers an L1 cache miss.
T - clocks to fetch from the shared cache.

Terminology - single coarse-grain multi-threaded core

[Timing diagram: thread 1 runs δ1 clocks and suffers an L1 cache miss, thread 2 then runs δ2 clocks and misses; the core idles until thread 1's response returns from the shared cache T clocks after its miss, and execution alternates 1, 2, 1, 2, ...]

T = h·t + TL2$, where h is the number of hops to the shared cache, t is the latency per hop (clocks), and TL2$ is the shared (L2) cache access time.
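For example, with illustrative numbers (not from the slides): a core h = 3 hops from the shared cache, with a hop latency of t = 4 clocks and an L2 access time of 10 clocks, sees T = 3·4 + 10 = 22 clocks per miss.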

Memory-Bound Case

When Σk δk ≤ maxk δk + T, the core has idle time:
Core utilization = Σk δk / (maxk δk + T)
Thread i performance = δi / (maxk δk + T)
Each thread is executed every maxk δk + T clocks.

[Timing diagram: thread 1 runs δ1 and misses, thread 2 runs δ2 and misses, and the core idles until thread 1's cache response arrives T clocks after its miss.]

CPU-Bound Case

When Σk δk ≥ maxk δk + T, there is no idle time:
Core utilization = 1
Thread i performance = δi / Σk δk
Each thread is executed every Σk δk clocks.
At Σk δk = maxk δk + T the core is exactly at the saturation threshold.

[Timing diagram: the core continuously alternates among the threads with no idle time.]

Performance per Thread

[Plots: per-thread performance versus the number of allocated threads, for a core 1 hop, 2 hops, and more hops away from the shared cache, with the saturation threshold marked. Below the threshold (memory-bound) each thread's performance is δi / (maxk δk + T); beyond it (CPU-bound) it falls as δi / Σk δk. A larger hop count increases T, which lowers the memory-bound performance and moves the saturation threshold to a larger number of threads.]
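A minimal sketch of the per-thread performance model above, assuming the hypothetical function name and parameter values shown (not from the slides):

def thread_performance(deltas, h, t, t_l2):
    # deltas: run lengths (clocks between L1 misses) of the threads on one core
    T = h * t + t_l2                       # round trip to the shared cache
    period = max(max(deltas) + T,          # memory-bound period (idle time)
                 sum(deltas))              # CPU-bound period (no idle time)
    utilization = sum(deltas) / period
    per_thread_perf = [d / period for d in deltas]
    return per_thread_perf, utilization

# Illustrative: two threads, 2 hops, 3 clocks/hop, 10-clock L2 access time.
perf, util = thread_performance([20, 12], h=2, t=3, t_l2=10)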

Power Model

Pactive - power consumption of a fully utilized core
Pidle - idle core power consumption

Core power consumption as a function of utilization η:
Core Power(η) = η·Pactive + (1 − η)·Pidle for 0 < η ≤ 1, and 0 for η = 0.
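A companion sketch of the power model and the PPM objective; the function names and the numbers below are placeholders, not values from the slides:

def core_power(utilization, p_active, p_idle):
    # A powered-off core consumes nothing; an operating core scales linearly
    # between idle and fully-active power with its utilization.
    if utilization == 0:
        return 0.0
    return utilization * p_active + (1 - utilization) * p_idle

def ppm(avg_thread_performance, power, alpha):
    # Performance Power Metric: performance^alpha / power.
    return avg_thread_performance ** alpha / power

# Illustrative: one saturated core plus one core at 40% utilization.
total_power = core_power(1.0, p_active=2.0, p_idle=0.5) + core_power(0.4, p_active=2.0, p_idle=0.5)
score = ppm(avg_thread_performance=0.3, power=total_power, alpha=2)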

Outline

Performance and Power Models

Thread Allocation

Numerical Results


The Thread Allocation Problem

Given:
A CMP topology composed of M identical cores.
P applications, each with Ti symmetric threads (1 ≤ i ≤ P).
α - the preferred tradeoff between performance and power.

Find the thread allocation ni(c) (the number of threads of application i allocated to core c) that maximizes the PPM.

For simplicity: (1) we treat ni(c) as continuous, and (2) discretize the result afterwards. The objective is (Average Thread Performance)^α / Power, with α ≥ 1.

Minimum Utilization (MU)

Activating a core increases the power consumption by at least Pidle, so an appropriate increase in performance is required to justify operating it. MU is the minimum utilization that justifies operating a core.

Reminder: Core Power(η) = η·Pactive + (1 − η)·Pidle for 0 < η ≤ 1, and 0 for η = 0.

Minimum Utilization (MU) Calculation

Compare the PPM value of two cases:
1st case: the threads are executed by m over-saturated cores.
2nd case: the threads are executed by m cores at exactly threshold saturation, and the (m+1)th core's utilization equals MU.
MU is defined by PPM 1st case = PPM 2nd case.

Example for m = 1:
1st case: all threads are executed by a single over-saturated core.
2nd case: the first core is at threshold saturation and the remaining threads are executed by the second core; the utilization of the second core equals MU, and the power increases by Pidle.

[Plot: PPM (MIPS²/Power) of the two cases; the point where PPM 1st case = PPM 2nd case defines MU.]

Minimum Utilization (MU) Calculation

Equating the PPM of the two cases (m over-saturated cores vs. m cores at threshold saturation plus an (m+1)th core at utilization MU) gives:

(1 + MU/m)^α = (m·Pactive + MU·(Pactive − Pidle) + Pidle) / (m·Pactive)

Minimum Utilization (MU) - Approximated Value and α Dependency

For a large number of already-operating cores m, MU approaches a constant:

MU ≈ Pidle / ((α − 1)·Pactive + Pidle)
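As a worked example with assumed numbers: taking Pidle/Pactive = 0.3 and α = 2, the approximation gives MU ≈ 0.3 / (1 + 0.3) ≈ 23%, whereas at α = 1 it gives MU = 100%, i.e. a new core is justified only when it can be fully utilized.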

Less power ↔ more performance (Performance^α / Power):
Small α - power is more important: operate a core only if it is highly utilized.
Large α - performance is more important: operate a core even if its utilization is low.

[Plot: Minimum Utilization (%) versus α.]

The Thread Allocation Algorithm - Highlights

Iterative: in each iteration, the threads with the highest cache miss rate are allocated to the core closest to the shared cache, until at most threshold saturation.
A core is operated only if the MU threshold is achieved.

Iterative Threshold Algorithm (ITA)

Outline

Performance and Power Model

Thread Allocation Problem

Numerical Results


How do we evaluate ITA's PPM? Compare the average PPM values of ITA, Equal Utilization, and optimization algorithms.

Scenarios: 2-8 cores and 2-8 applications, using the following distributions:

δi ~ U(10,30) cycles, TL2$ ~ U(10,40) cycles, h(c) ~ U(1,8) hops, t ~ U(1,40) cycles, Ti ~ U(1,25) threads, α ~ U(1,6)

Equal Utilization Comparison

[Bar plot: PPM ITA / PPM Equal Utilization across the (cores, applications) scenarios, annotated with the average number of cores operated (3.6, 4.7, 7.2, and 7.9 cores, depending on the scenario, e.g., at (2,5) and (5,8)).]

Average improvement of 47%.

Comparison with Optimization Methods

The best PPM of: constrained nonlinear optimization, a pattern search algorithm, and a genetic algorithm. These methods were run 10,000x longer than ITA.

Optimization Methods Comparison

[Bar plot: PPM ITA / PPM optimization methods versus the number of applications and cores. ITA also operates fewer cores on average, e.g., 3.6 vs. 4.6, 4.7 vs. 7.1, 7.1 vs. 7.9, and 7.9 vs. 8 cores across the (2,5) to (5,8) scenarios.]

Average improvement of 9% over the best (max) of the optimization methods.

Summary

Tunable Performance Power Metric.
Minimum Utilization concept.
An approach for low-computational-effort thread allocation on CMPs.

Future work:
Extension to distributed caches - threads and data co-allocation.
Sharing-effect consideration.
Heterogeneous CMPs.

Questions?

Backup

Performance Power Metric

Follows definitions used in logic circuit design: if E is the energy and t is the delay, Penzes & Martin introduce the metric E·t^α, where α becomes larger as performance becomes more important.*

* "Energy-delay efficiency of VLSI computations", P.I. Penzes and A.J. Martin, 12th ACM Great Lakes Symposium on VLSI, 2002.

PPM = Performance^α / Power

Minimum Utilization (MU) Calculation, Cont.

The MU value depends on how many cores are already operating. For a large enough value of m, the MU value is essentially constant, so using an approximate constant value is reasonable (keep it simple).

[Plot: MU of the (m+1)th core versus m, for α = 1, 1.5, 2, 2.5, 3, computed from (1 + MU/m)^α = (m·Pactive + MU·(Pactive − Pidle) + Pidle) / (m·Pactive).]

MU vs. Pidle/Pactive

[Plot: Minimum Utilization (%) versus Pidle/Pactive, for α = 1, 1.2, 1.4, 1.6, 1.8, 2.]

Previous Work

Ding et al. (Pennsylvania State University): thread allocation based on integer linear programming that maximizes a performance/power metric on single-threaded cores. My work: maximizes the tunable metric performance^α/power and deals with multi-threaded cores.

Miao et al. (Jiaotong University): first maximize performance and then minimize power consumption, using genetic and greedy algorithms. My work: optimizes performance and power simultaneously.

Previous Work - Neglecting the Sharing Effect

Fedorova et al., "Chip multithreading systems need a new operating system scheduler": its goal is to utilize the cores highly, pairing high-IPC tasks with low-IPC tasks in order to reduce pipeline resource contention; like my research, it neglects the sharing effect among threads. However, it does not take into account the varying distances of cores from the shared L2 cache, and it does not consider power consumption.

Discretization

There are many discretization methods; we use the histogram specification method (from image processing):

For every application i: mi(1) = round(ni(1)); mi(c) = round(Σk=1..c ni(k)) − round(Σk=1..c−1 ni(k)), 2 ≤ c ≤ M.

Results Discretization Example

[Figure: number of threads per core versus core hop distance, comparing the continuous allocation (C) with its discretized version (D), obtained by mi(1) = round(ni(1)) and mi(c) = round(Σk=1..c ni(k)) − round(Σk=1..c−1 ni(k)) for 2 ≤ c ≤ M.]

On average, results discretization reduces the PPM value by 5%.
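A small sketch of the cumulative-rounding step described above, under my reading of the histogram-specification formula; the function name and sample values are illustrative:

def discretize(n_per_core):
    # Round a continuous per-core allocation of one application to integers
    # while preserving the rounded running totals (and thus the overall sum).
    m, prev = [], 0
    running = 0.0
    for n in n_per_core:
        running += n
        r = round(running)
        m.append(r - prev)
        prev = r
    return m

print(discretize([2.4, 1.3, 0.3]))   # -> [2, 2, 0], total preserved at 4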

Results Discretization


ITA Flowchart

Initialize: current core = the core closest to the shared cache; current application = the application with the highest miss rate.

1. Allocate threads of the current application to the current core, until at most threshold saturation.
2. If all threads of the current application were allocated: if it was the last application, finish; otherwise set current application = the unallocated application with the highest cache miss rate and return to step 1.
3. Otherwise the current core is at the saturation threshold. If it is not the last core and the unallocated threads achieve MU on the next available core (in order of increasing distance from the shared cache), set current core = that core and return to step 1.
4. Otherwise (last core, or MU not achieved), allocate all remaining threads over the already-operating cores (over-saturation) and finish.
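A compact sketch of the ITA flow above; it is a reconstruction under this model, and the MU test and the final over-saturation step are simplified (leftover threads are placed on the current core rather than spread over all operating cores):

def ita(apps, cores, capacity, mu):
    # apps: (miss_rate, thread_count) per application; cores: ordered by
    # increasing distance from the shared cache; capacity[c]: threads that
    # bring core c to threshold saturation; mu: minimum-utilization threshold.
    apps = sorted(apps, key=lambda a: a[0], reverse=True)   # highest miss rate first
    remaining = [t for _, t in apps]
    alloc = {c: 0 for c in cores}
    ci = 0
    for ai in range(len(apps)):
        while remaining[ai] > 0:
            free = capacity[cores[ci]] - alloc[cores[ci]]
            take = min(free, remaining[ai])
            alloc[cores[ci]] += take
            remaining[ai] -= take
            if remaining[ai] == 0:
                break                                        # next application, same core
            left = sum(remaining)
            # the current core is saturated: open the next core only if the
            # leftover threads would utilize it by at least MU
            if ci + 1 < len(cores) and left >= mu * capacity[cores[ci + 1]]:
                ci += 1
            else:
                alloc[cores[ci]] += left                     # over-saturation
                return alloc
    return alloc

For instance, ita([(0.05, 6), (0.02, 4)], cores=[0, 1, 2], capacity={0: 5, 1: 5, 2: 5}, mu=0.4) fills cores 0 and 1 to saturation and leaves core 2 off.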

Time Complexity Comparison

ITA consumes on average 0.01%, and at most 2.5%, of the minimum computational effort required by the optimization methods, while outperforming the best of them by 9%.

[Plot: ITA operations / minimum of the optimization methods' operations, versus the number of applications and cores.]

Terminology

δi = 1 / (rm,i · mi)

rm,i - ratio of memory-access instructions out of the total instruction mix of thread i
mi - cache miss rate of thread i
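For example, with illustrative numbers: a thread whose instruction mix is 30% memory accesses with a 2% miss rate gives δ = 1/(0.3 · 0.02) ≈ 167 instructions between L1 misses (≈ 167 clocks at one instruction per clock).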
