Performance and Power Aware CMP Thread Allocation

Yaniv Ben-Itzhak, Prof. Israel Cidon, Dr. Avinoam Kolodny
Department of Electrical Engineering, Technion – Israel Institute of Technology
Performance and Power Aware CMP Thread Allocation

Goal: find the thread allocation that maximizes Performance^α / Power.

Two extremes:
- Performance maximization: use all the cores – high power consumption.
- Power minimization: use a single core – low performance.

[Figure: CMP layout – cores with private L1 caches, routers, and a shared L2 cache; threads are allocated across the cores.]
Performance Power Trade Off

Performance Power Metric (PPM), in short "PPM":

PPM = Performance^α / Power , α ≥ 1

α expresses the preferred tradeoff between performance and power:
less power ↔ more performance (smaller α ↔ larger α).
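Concretely, the metric can be written as a one-line function (a minimal sketch; the function name and the normalization of performance and power are illustrative):

```python
def ppm(performance, power, alpha):
    """Performance Power Metric: performance**alpha / power.

    alpha >= 1 tilts the metric toward performance; alpha = 1
    is the plain performance/power ratio.
    """
    return performance ** alpha / power

# Doubling performance at double power is neutral for alpha = 1,
# but a win for any alpha > 1.
```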
Outline
- Performance and Power Model
- Thread Allocation
- Numerical Results
Simplified Performance Model

Single coarse-grain multi-threaded core.

The model is an extension of Agarwal's model* for asymmetric threads.

For simplicity we assume:
- No sharing effect: the miss rate doesn't depend on the number of threads (holds for a small number of threads and a large private cache).
- The miss rate and the total number of memory accesses don't vary over time.
- No context-switch overhead.

* "Performance tradeoffs in multithreaded processors", A. Agarwal, IEEE Transactions on Parallel and Distributed Systems, 1992.
Terminology

Single coarse-grain multi-threaded core:
- Thread i runs δi clocks until it suffers an L1 cache miss.
- T – clocks to fetch from the shared cache: T = h·t + T_L2, where h is the number of hops to the shared cache, t is the hop latency (clocks), and T_L2 is the shared (L2) cache access time.

[Figure: execution timeline of two threads – thread 1 runs δ1 clocks and misses, thread 2 runs δ2 clocks and misses; each miss is served after T clocks, leaving idle time whenever T exceeds the other thread's run length.]
Memory Bound Case

When Σ_k δ_k ≤ max_k δ_k + T, the core has idle time. Each thread gets executed every max_k δ_k + T clocks:

Core utilization = Σ_k δ_k / (max_k δ_k + T)
Thread i performance = δ_i / (max_k δ_k + T)

[Figure: two-thread timeline showing idle time between each cache response and the next execution slot.]

CPU Bound Case

When Σ_k δ_k ≥ max_k δ_k + T, the core saturates. Each thread is executed every Σ_k δ_k clocks:

Core utilization = 1 → saturation
Thread i performance = δ_i / Σ_k δ_k
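Both cases collapse into a single expression, since the scheduling period is max(Σ_k δ_k, max_k δ_k + T). A minimal sketch of the per-core model (the function name is mine):

```python
def core_model(deltas, T):
    """Core utilization and per-thread performance for a single
    coarse-grain multithreaded core.

    deltas: clocks each thread runs between L1 misses (delta_i)
    T:      clocks to fetch from the shared cache (T = h*t + T_L2)
    """
    # Memory bound: period = max(deltas) + T (the core idles);
    # CPU bound:    period = sum(deltas)   (the core is saturated).
    period = max(sum(deltas), max(deltas) + T)
    utilization = sum(deltas) / period
    perf = [d / period for d in deltas]  # useful clocks per clock for thread i
    return utilization, perf

# Memory bound: two threads cannot hide a 20-clock miss latency,
# so utilization is (10 + 2) / (10 + 20) = 0.4.
u, p = core_model([10, 2], T=20)
```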
Performance Per Thread – Saturation Threshold

The saturation threshold is the number of threads k_sat at which Σ_k δ_k first reaches max_k δ_k + T.

[Figure: performance per thread vs. number of threads, plotted for 1 hop, 2 hops, and more hops from the shared cache. Per-thread performance is flat below the saturation threshold and decays beyond it; a larger hop distance (larger T) moves the saturation threshold to a higher thread count.]
Power Model

Core power consumption:
- P_active – power consumption of a fully utilized core
- P_idle – idle core power consumption

Core Power(ρ) = ρ·P_active + (1 − ρ)·P_idle , 0 < ρ ≤ 1
Core Power(0) = 0

where ρ is the core utilization.
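Under this model an operating core always pays at least P_idle, with the rest scaling linearly in utilization ρ (a minimal sketch; names are mine):

```python
def core_power(rho, p_active, p_idle):
    """Core power: rho*P_active + (1-rho)*P_idle for 0 < rho <= 1,
    and zero for a core that is switched off (rho == 0)."""
    if rho == 0:
        return 0.0
    return rho * p_active + (1 - rho) * p_idle
```

The jump from 0 to at least P_idle as soon as ρ > 0 is what makes opening an extra core a non-trivial decision, and it is what motivates the Minimum Utilization threshold.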
Outline
- Performance and Power Models
- Thread Allocation
- Numerical Results
The Thread Allocation Problem

Given:
- A CMP topology composed of M identical cores.
- P applications, each with T_i symmetric threads (1 ≤ i ≤ P).
- α – the preferred tradeoff between performance and power.

Find the thread allocation n_i^(c) (threads of application index i on core index c) which maximizes

PPM = (Average Thread Performance)^α / Power , α ≥ 1

For simplicity:
1) We assume that n_i^(c) is continuous.
2) The result is then discretized.
Minimum Utilization (MU)

Activating a core increases the power consumption by at least P_idle. To justify operating a core, a corresponding increase in performance is required.

MU is the minimum utilization which justifies operating a core.

Reminder: Core Power(ρ) = ρ·P_active + (1 − ρ)·P_idle for 0 < ρ ≤ 1, and 0 for ρ = 0.
Minimum Utilization (MU) Calculation

Compare the PPM values of two cases:
- 1st case: the threads are executed by m over-saturated cores.
- 2nd case: the threads are executed by m cores at exactly threshold saturation, while the (m+1)th core's utilization equals MU.

MU is the utilization at which PPM(1st case) = PPM(2nd case).

Example (m = 1):
- 1st case: all threads are executed by a single over-saturated core.
- 2nd case: the first core is at threshold saturation and the remaining threads are executed by the second core; power increases by P_idle.

[Figure: PPM (MIPS²/Power) vs. number of threads for both cases; the point where the curves meet defines the MU of the second core.]
Minimum Utilization (MU) Calculation (cont.)

With m over-saturated cores (1st case) versus m cores at threshold saturation plus an (m+1)th core at utilization MU (2nd case), equating the two PPM values gives:

m^α / (m·P_active) = (m + MU)^α / ((m + MU)·P_active + (1 − MU)·P_idle)
Minimum Utilization (MU) – Approximated Value and α Dependency

For large m, the MU equation reduces to the approximation

MU ≈ P_idle / ((α − 1)·P_active + P_idle)

Less power ↔ more performance:
- Smaller α: power is more important – operate a core only if it is highly utilized.
- Larger α: performance is more important – operate a core even if its utilization is low.

[Figure: Minimum Utilization (%) as a function of α, decreasing as α grows.]
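The exact MU can be found numerically from the PPM equality and checked against the large-m approximation (a minimal sketch; the bisection solver and all names are mine):

```python
def mu_exact(m, alpha, p_active, p_idle):
    """Solve PPM(m over-saturated cores) = PPM(m saturated cores plus
    one extra core at utilization MU) for MU, by bisection on (0, 1]."""
    def gap(mu):
        lhs = m ** alpha / (m * p_active)
        rhs = (m + mu) ** alpha / ((m + mu) * p_active + (1 - mu) * p_idle)
        return rhs - lhs  # > 0 once the extra core pays off
    lo, hi = 1e-9, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if gap(mid) > 0:
            hi = mid  # already worthwhile at mid, so MU is lower
        else:
            lo = mid
    return (lo + hi) / 2

def mu_approx(alpha, p_active, p_idle):
    """Large-m approximation: MU ~ P_idle / ((alpha-1)*P_active + P_idle)."""
    return p_idle / ((alpha - 1) * p_active + p_idle)
```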
The Thread Allocation Algorithm – Highlights

- Iterative: in each iteration, the threads with the highest cache miss rate are allocated to the core closest to the shared cache, until at most threshold saturation.
- Threshold: a core is operated only if the MU threshold is achieved.

Iterative Threshold Algorithm (ITA)
Outline
- Performance and Power Model
- Thread Allocation Problem
- Numerical Results
How to evaluate ITA's PPM? Compare average PPM values of ITA against Equal Utilization and optimization algorithms.

Scenarios: 2–8 cores and 2–8 applications, using the following distributions:
δ_i ~ U(10, 30) cycles, T_cache ~ U(10, 40) cycles, h ~ U(1, 8) hops, t ~ U(1, 40) cycles, T_i ~ U(1, 25) threads, α ~ U(1, 6)
Equal Utilization Comparison

Compared metric: PPM_ITA / PPM_EqualUtilization.

Average improvement of 47%.

[Figure: PPM ratio over (applications, cores) scenarios, annotated with the average number of operated cores: 3.6 cores at (2,5), 4.7 cores, 7.2 cores, and 7.9 cores at (5,8).]
Comparison with Optimization Methods

ITA is compared against the best PPM of:
- Constrained nonlinear optimization
- Pattern search algorithm
- Genetic algorithm

These methods were run for 10,000× longer than ITA.

Compared metric: PPM_ITA / max(PPM_OptimizationMethods).

Average improvement of 9%.

[Figure: PPM ratio over (applications, cores) scenarios, annotated with operated cores – (2,5): ITA 3.6 vs. Opt. 4.6; ITA 4.7 vs. Opt. 7.1; ITA 7.1 vs. Opt. 7.9; (5,8): ITA 7.9 vs. Opt. 8.]
Summary
- Tunable Performance Power Metric
- Minimum Utilization concept
- Approach for low-computational-cost thread allocation on CMP

Future work:
- Extension for distributed cache: threads and data co-allocation
- Sharing effect consideration
- Heterogeneous CMPs
Questions?

Backup
Performance Power Metric

Follows definitions used in logic circuit design: if E is the energy and t is the delay, Penzes & Martin* introduced the metric E·t^α, where α becomes larger as the performance becomes more important. Analogously, here PPM = Performance^α / Power.

* "Energy-delay efficiency of VLSI computations", Penzes, P.I., Martin, A.J., 12th ACM Great Lakes Symposium on VLSI, 2002.
Minimum Utilization (MU) Calculation – Cont.

The MU value depends on how many cores are already operating: the MU of the (m+1)th core is a function of m, via

m^α / (m·P_active) = (m + MU)^α / ((m + MU)·P_active + (1 − MU)·P_idle)

For a large enough value of m, the MU value becomes constant, so using an approximate constant value is reasonable (keep it simple).

[Figure: MU of the (m+1)th core vs. m, for α = 1, 1.5, 2, 2.5, 3.]
MU vs. P_idle/P_active

[Figure: Minimum Utilization (%) vs. P_idle/P_active, for α = 1, 1.2, 1.4, 1.6, 1.8, 2.]
Previous Work

Ding et al., Pennsylvania State University:
- Their work: thread allocation based on integer linear programming that maximizes a performance/power metric on single-threaded cores.
- My work: maximizes the tunable metric performance^α/power, and deals with multi-threaded cores.

Miao et al., Jiaotong University:
- Their work: first maximize performance and then minimize power consumption, using genetic and greedy algorithms.
- My work: optimizes performance and power simultaneously.
Previous Work – Neglecting the Sharing Effect

Fedorova et al., "Chip multithreading systems need a new operating system scheduler":
- Its goal is to highly utilize the cores; it tries to pair high-IPC tasks with low-IPC tasks in order to reduce pipeline resource contention.
- Neglects the sharing effect among threads (similar to my research).
- Doesn't take into account the varying distances of cores from the L2 shared cache, and doesn't consider power consumption.
Discretization

There are many discretization methods; we use the histogram-specification method (from image processing):

∀i: m_i^(1) = round(n_i^(1)) ; m_i^(c) = round(Σ_{k=1}^{c} n_i^(k)) − Σ_{k=1}^{c−1} m_i^(k) , 2 ≤ c ≤ M
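The cumulative-rounding rule can be sketched directly (a minimal sketch; the function name is mine): each core receives the rounded running total minus whatever was already handed out, so per-application totals are preserved.

```python
def discretize(n):
    """Round a continuous per-core allocation n[c] to integers m[c]
    via cumulative rounding: m[c] = round(sum(n[:c+1])) - sum(m[:c]).
    The integer running totals track the continuous ones within 0.5."""
    m, total = [], 0.0
    for nc in n:
        total += nc
        m.append(round(total) - sum(m))
    return m

# Plain per-entry rounding of [1.4, 1.4, 1.2] would allocate only
# 3 threads; cumulative rounding keeps the true total of 4:
# discretize([1.4, 1.4, 1.2]) -> [1, 2, 1]
```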
Results Discretization – Example

[Figure: number of threads vs. core hop distance, comparing the continuous allocation (C) with its discretized version (D).]

On average, results discretization reduces the PPM value by 5%.
ITA Flowchart

Initialize: current core = the core closest to the shared cache; current application = the application with the highest miss rate.

1. Allocate threads of the current application onto the current core, until at most threshold saturation.
2. If all the threads of the current application were allocated:
   - Yes, last application: finish.
   - Yes, not last application: current application = the unallocated application with the highest cache miss rate; go to step 1.
3. Otherwise the current core is at the saturation threshold:
   - If it is not the last core and all the unallocated threads achieve MU on the next available closer core: current core = the next available closer core to the shared cache; go to step 1.
   - Otherwise (last core, or MU not achieved): allocate all remaining threads over the already-operating cores (over-saturation); finish.
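The flow above can be sketched in code under the continuous model, approximating threshold saturation of a core as Σ_k δ_k = max_k δ_k + T (all names, and the exact saturation bookkeeping, are my simplifications, not the authors' implementation):

```python
def ita(apps, core_T, mu):
    """Iterative Threshold Algorithm: a simplified continuous sketch.

    apps:   (delta_i, T_i) pairs -- clocks between L1 misses and thread
            count per application; small delta_i means a high miss rate.
    core_T: shared-cache fetch latency T per core, ordered from the
            core closest to the shared cache outward.
    mu:     minimum utilization that justifies opening another core.
    Returns a per-core list of {application index: allocated threads}.
    """
    d_max = max(d for d, _ in apps)
    # Capacity of core c in run-length clocks: threshold saturation
    # is reached when sum_k(delta_k) == max_k(delta_k) + T.
    cap = [d_max + T for T in core_T]
    order = sorted(range(len(apps)), key=lambda i: apps[i][0])  # high miss rate first
    left = [float(t) for _, t in apps]
    alloc = [{} for _ in core_T]
    used = [0.0] * len(core_T)
    c = 0
    for i in order:
        delta = apps[i][0]
        while left[i] > 1e-9:
            fit = min(left[i], (cap[c] - used[c]) / delta)
            if fit > 0:
                alloc[c][i] = alloc[c].get(i, 0.0) + fit
                used[c] += fit * delta
                left[i] -= fit
            if left[i] <= 1e-9:
                break
            # Current core is at threshold saturation: open the next
            # core only if the remaining threads would utilize it >= MU.
            rest = sum(apps[j][0] * left[j] for j in range(len(apps)))
            if c + 1 < len(core_T) and rest / cap[c + 1] >= mu:
                c += 1
            else:
                # Spread the remainder over already-operating cores
                # (over-saturation), then stop growing.
                share = left[i] / (c + 1)
                for cc in range(c + 1):
                    alloc[cc][i] = alloc[cc].get(i, 0.0) + share
                left[i] = 0.0
    return alloc
```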
Time Complexity Comparison – Optimization Methods to ITA Ratio

ITA consumes on average 0.01%, and at most 2.5%, of the minimum computational effort required by the optimization methods, while outperforming the best of the optimization methods by 9%.

[Figure: ITA operations / minimum of optimization-methods operations, per (applications, cores) scenario.]
δ_i = 1 / (r_{m,i} · r_i)

where r_{m,i} is the ratio of memory-access instructions out of the total instruction mix of thread i, and r_i is the cache miss rate of thread i.
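As a worked example (assuming one instruction per clock, so clocks between misses equal instructions between misses; names are mine):

```python
def delta(mem_ratio, miss_rate):
    """Average clocks a thread runs between L1 misses: a miss occurs
    once every 1/(r_m * r) instructions, one instruction per clock."""
    return 1.0 / (mem_ratio * miss_rate)

# 30% memory-access instructions with a 2% miss rate:
# delta(0.3, 0.02) -> one L1 miss every ~167 clocks
```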