

DyPO: Dynamic Pareto-Optimal Configuration Selection for Heterogeneous MpSoCs

Ujjwal Gupta, Chetan A. Patil, Ganapati Bhat, Umit Y. Ogras
Arizona State University

Prabhat Mishra
University of Florida

Personalized Social Computing

1. Motivation

§ Smartphones have highly heterogeneous processors
§ Power and performance change as a function of:
  1. Configurations
  2. Workloads
§ A good power-performance tradeoff does not imply a good energy-performance tradeoff
§ Dynamically selecting the optimal configuration is challenging

Number of configuration knobs: 4004 for Samsung Exynos 5422

2. DyPO Overview and Illustration

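Offline

Step 1: Instrumentation
§ Multiple apps (A1, A2, A3, …) are instrumented with LLVM/Clang to produce instrumented apps
§ Each app runs from Start to End as a sequence of basic blocks (BB1, BB2, …, BB100M)
§ A BB_PAPI_read() call is placed before BB1 and after every 1M basic blocks:

  BB_PAPI_read()
  BB1, BB2, …, BB1M
  BB_PAPI_read()
  BB(1M + 1), BB(1M + 2), …, BB2M
  BB_PAPI_read()

Step 2: Characterization
§ Feature data ($x$): $x_1$ = #Instr.-retd., $x_2$ = #LLC-misses
§ Phase-1: $x_1$ = 10M, $x_2$ = 10k
§ Phase-2: $x_1$ = 10M, $x_2$ = 1k
§ Find the optimal config ($c_k^*$) of each phase. Configurations shown: {2L, 1 GHz, 3B, 1 GHz} and {4L, 2 GHz, 4B, 2 GHz}

Step 3: Classification
§ Find $f(p_k) = c_k^*$ (boundary), plotted over #LLC-misses and #instructions-retd.
§ Classifier: input ($x$), output ($c_k^*$), $c_k^* \approx f()$

Online

Step 4: Online Selection
§ New app: $\forall k$, for a new phase P, $C_{next} = f\{p_k\}$
§ Example: Phase-3 ($x_1$ = 10M, $x_2$ = 9k) is similar to Phase-1, so use the same optimal configuration as Phase-1
§ Classifier output shown: {2L, 1 GHz, 3B, 1 GHz}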
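The Phase-3 example can be made concrete with a small sketch. It is not the DyPO classifier: it swaps in a nearest-neighbor lookup in log-scaled feature space as a stand-in for "is similar to". The feature values are those of the illustration, while the pairing of Phase-1 and Phase-2 with the two configurations, and the names `CHARACTERIZED`, `log_distance` and `select_config`, are assumptions made for this example.

```python
import math

# Characterized phases: (x1 = #instr. retired, x2 = #LLC misses) -> configuration.
# Feature values follow the illustration; which configuration belongs to
# which phase is an ASSUMPTION made for this sketch.
CHARACTERIZED = {
    "Phase-1": ((10_000_000, 10_000), ("2L", "1GHz", "3B", "1GHz")),
    "Phase-2": ((10_000_000, 1_000), ("4L", "2GHz", "4B", "2GHz")),
}

def log_distance(a, b):
    """Euclidean distance between two feature vectors after log10 scaling."""
    return math.sqrt(sum((math.log10(u) - math.log10(v)) ** 2 for u, v in zip(a, b)))

def select_config(features):
    """Return the most similar characterized phase and its configuration."""
    name, (_, config) = min(
        CHARACTERIZED.items(),
        key=lambda item: log_distance(features, item[1][0]),
    )
    return name, config

# Phase-3 of a new app: x1 = 10M, x2 = 9k.
nearest, config = select_config((10_000_000, 9_000))
print(nearest, config)
```

With 9k LLC misses, Phase-3 lies much closer to Phase-1 (10k) than to Phase-2 (1k), so the lookup reuses the configuration stored for Phase-1.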

3. Characterization and Classification

Figure: Blackscholes application – 2T

Offline:
§ Solve for $J(p_k)$ to obtain the optimal config ($c_k^*$)
§ k-means clustering: feature data ($x_k$), optimal config ($c_k^*$)
§ Training: solve $l(\beta)$ offline to get the model coefficients $\beta$
§ Store $\beta$

At runtime:
§ Read new feature data ($x_k$) from H/W counters and CPU stats, and the stored $\beta$
§ Find $\Pr(C = c_k^* \mid X = x_k)$ to solve for $c_k^*$
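The offline training step can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a binary logistic model fitted by plain gradient ascent on the log-likelihood. The training set, the centered feature, the learning rate and the names `sigmoid` and `fit_logistic` are all hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable e^z / (1 + e^z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(samples, labels, lr=0.1, epochs=2000):
    """Fit beta by gradient ascent on the log-likelihood of a binary logistic model."""
    beta = [0.0] * len(samples[0])
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for x, y in zip(samples, labels):
            err = y - sigmoid(sum(b * v for b, v in zip(beta, x)))
            for j, v in enumerate(x):
                grad[j] += err * v
        beta = [b + lr * g / len(samples) for b, g in zip(beta, grad)]
    return beta

# HYPOTHETICAL training data: x = [1 (bias), centered log10(#LLC misses)].
# Label 1 means the phase's optimal configuration belongs to the class.
samples = [[1.0, -0.5], [1.0, -0.3], [1.0, 0.4], [1.0, 0.5]]
labels = [0, 0, 1, 1]

beta = fit_logistic(samples, labels)
predictions = [int(sigmoid(sum(b * v for b, v in zip(beta, x))) > 0.5) for x in samples]
print(beta, predictions)
```

Only the fitted coefficients need to be kept for use at runtime, which matches the "Store $\beta$" step.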

Characterization data

Configuration data per benchmark is collected by sweeping the big cores ($n_b$), the little cores ($n_l$) and the frequencies ($n_f$), and repeating the benchmark ($n_r$): $n_b \times n_l \times n_f \times n_r$.

$n_b$  Number of big cores
$n_l$  Number of little cores
$n_f$  Number of frequencies
$n_r$  Number of iterations
$f_l$  Little core frequency
$f_b$  Big core frequency

§ 3 iterations of each benchmark at different configurations
§ Profile 18 benchmarks, leading to 4467 distinct workload snippets
§ Time spent for one benchmark is about 1-2 hours
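A sweep of this kind can be sketched as a nested enumeration. The knob ranges below are assumptions that the poster does not state: 1 to 4 little cores, 0 to 4 big cores, and frequency levels in 100 MHz steps (13 for the little cluster, 19 for the big cluster). Under those assumptions the enumeration yields 4004 configurations, the figure quoted for the Exynos 5422. Whether the characterization runs visit every one of them is not stated here.

```python
from itertools import product

# ASSUMED knob ranges (not given on the poster).
LITTLE_CORES = range(1, 5)                 # at least one little core stays on
BIG_CORES = range(0, 5)                    # the big cluster may be fully off
LITTLE_FREQS_MHZ = range(200, 1401, 100)   # 13 levels
BIG_FREQS_MHZ = range(200, 2001, 100)      # 19 levels

def configurations():
    """Yield (n_l, f_l, n_b, f_b); f_b is None when no big core is active."""
    for n_l, f_l, n_b in product(LITTLE_CORES, LITTLE_FREQS_MHZ, BIG_CORES):
        if n_b == 0:
            yield (n_l, f_l, 0, None)
        else:
            for f_b in BIG_FREQS_MHZ:
                yield (n_l, f_l, n_b, f_b)

configs = list(configurations())

# Repeating the benchmark n_r times at every configuration.
n_r = 3
runs = [(config, iteration) for config in configs for iteration in range(n_r)]
print(len(configs), len(runs))
```

Treating the big-core frequency as irrelevant when no big core is active avoids counting equivalent settings more than once.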

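Objective:
$$J(p_k) = \min_{c_k} \left[ E(c_k, p_k) + \mu \, t_{exe}(c_k, p_k) \right]$$

Likelihood:
$$l(\beta) = \prod_{k \ge 1} \Pr(C = c_k^* \mid X = x_k)$$

Conditional Probability:
$$\Pr(C = c_k^* \mid X = x_k) = \frac{e^{\beta x_k}}{1 + e^{\beta x_k}}$$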
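The objective trades energy against execution time through the weight $\mu$. A small sketch with made-up numbers shows why the lowest-power configuration need not be the lowest-energy one. The configuration names and the power and time measurements are hypothetical, and energy is taken as average power multiplied by execution time.

```python
# HYPOTHETICAL measurements for one phase:
# configuration -> (average power in W, execution time in s).
MEASUREMENTS = {
    "1L@1GHz": (1.0, 10.0),
    "2B@1GHz": (2.0, 4.0),
    "4B@2GHz": (4.0, 3.0),
}

def energy(power_w, time_s):
    """Energy in joules from average power and execution time."""
    return power_w * time_s

def cost(config, mu):
    """E(c, p) + mu * t_exe(c, p) for one configuration."""
    power_w, time_s = MEASUREMENTS[config]
    return energy(power_w, time_s) + mu * time_s

def optimal_config(mu):
    """Configuration that minimizes the objective for a given mu."""
    return min(MEASUREMENTS, key=lambda config: cost(config, mu))

lowest_power = min(MEASUREMENTS, key=lambda config: MEASUREMENTS[config][0])
print(lowest_power, optimal_config(0.0), optimal_config(5.0))
```

With $\mu = 0$ the objective is pure energy, and the 2 W configuration wins because it finishes much sooner than the 1 W one. A larger $\mu$ shifts the choice toward the fastest configuration.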

At runtime:
  1. Read feature data ($x_k$)
  2. Compute the conditional probability for each class $i \in [1, N]$ from $\beta_i x_k$:
     $$\Pr(C = c_i^* \mid X = x_k) = \frac{e^{\beta_i x_k}}{1 + e^{\beta_i x_k}}$$
  3. Find the most likely class $i$: $\arg\max_i \Pr(C = c_i^* \mid X = x_k)$

Overhead of 20 $\mu$s

§ Use the model coefficient matrix $\beta$ stored in the platform (~282 bytes)
§ The conditional probability is computed at runtime for each of the $N$ classes
§ The class with the highest probability is assigned to the system
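The runtime steps above can be sketched in a few lines. The coefficient matrix below is hypothetical, as is the feature layout with a leading bias term; only the per-class formula and the argmax follow the text.

```python
import math

def sigmoid(z):
    """Numerically stable e^z / (1 + e^z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def class_probabilities(beta, x):
    """Pr(C = c_i | X = x) for every class i, one row of beta per class."""
    return [sigmoid(sum(b * v for b, v in zip(beta_i, x))) for beta_i in beta]

def most_likely_class(beta, x):
    """Index of the class with the highest conditional probability."""
    probs = class_probabilities(beta, x)
    return max(range(len(probs)), key=probs.__getitem__)

# HYPOTHETICAL coefficients for two classes.
# Feature vector: [1 (bias), log10(#instr. retired), log10(#LLC misses)].
BETA = [
    [-3.5, 0.0, 1.0],   # class 0: favored by many LLC misses
    [3.5, 0.0, -1.0],   # class 1: favored by few LLC misses
]

memory_heavy = [1.0, 7.0, 4.0]   # 10M instructions, 10k LLC misses
memory_light = [1.0, 7.0, 3.0]   # 10M instructions, 1k LLC misses
print(most_likely_class(BETA, memory_heavy), most_likely_class(BETA, memory_light))
```

Each decision costs one dot product and one exponential per class, which is in keeping with a small stored matrix and a low runtime overhead.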

4. Experiment Setup

§ Odroid-XU3 board
§ Exynos 5422
§ Octa-core CPU
§ Ubuntu OS
§ Linux Kernel v3.10
§ 18 Applications

Figure: software stack spanning Kernel Space and User Space. Labels: Sysfs interface, Power, Frequency, Core Config, Utilization, Perf Driver, CPU Governor, PMU, Benchmark, DyPO, PAPI Calls, Output Data For Analysis.

5. Experimental Results

Figure: two-level classifier (Level-1 and Level-2) over five configuration classes $c_i^*$.

Classifier Accuracy        Level 1   Level 2
60% Training Validation    99.9 %    92.7 %
5-Fold Cross Validation    99.9 %    80.5 %

6. Conclusions

§ Our technique successfully finds the Pareto-optimal configurations at runtime as a function of the workload [1]
§ DyPO-Energy achieves 25% and 55% gain in performance per watt (PPW) compared to Aalsaud*-Offline and Aalsaud*-ADA [2], respectively
§ Experiments show 93%, 81% and 6% larger PPW compared to the interactive, ondemand and powersave governors, respectively
§ The details can be found in [1].

Figure: Samsung Exynos 5422 with a big cluster and a LITTLE cluster of four cores each (Core 0 to Core 3), shown running a single-threaded application, a multi-threaded application, and multiple applications.

Acknowledgement

This work was supported partially by National Science Foundation (NSF) grants CNS-1526562 and CNS-1526687, and Semiconductor Research Corporation (SRC) task 2721.001.

References

[1] Ujjwal Gupta, Chetan Arvind Patil, Ganapati Bhat, Prabhat Mishra, and Umit Y. Ogras, “DyPO: Dynamic Pareto Optimal Configuration Selection for Heterogeneous MpSoCs,” in ACM Trans. on Embedded Comp. Sys. (ESWEEK Special Issue), October 2017.
[2] A. Aalsaud et al., “Power–Aware Performance Adaptation of Concurrent Applications in Heterogeneous Many-Core Systems,” in Proc. of the Intl. Symp. on Low Power Elec. and Design, 2016, pp. 368–373.