Motivation
System architecture
Program distillation
Experimental Analysis
Illusionist: Transforming Lightweight Cores into Aggressive Cores on Demand
A. Ansari, S. Feng, S. Gupta, J. Torrellas, S. Mahlke
HPCA - 2013
University of Illinois University of Michigan
June 28, 2013
Motivation
CMPs (w/ many lightweight cores) achieve good energy efficiency and throughput
Single-thread performance stagnates or even becomes worse
Asymmetric CMPs (w/ aggressive cores) can accelerate threads
  Threads need to be mapped/migrated efficiently to these cores
  Fixed at design time, so less flexible and less adaptive at runtime
Proposed solution: rather than using aggressive cores to accelerate threads, use the aggressive core to accelerate a large number of lightweight cores and provide the illusion that the whole chip is full of aggressive cores, if needed
The aggressive core executes distilled versions of programs to provide cache and branch hints to the respective lightweight core
Illusionist: High Level View
To maximize single-thread performance, aggressive cores should not be used to accelerate individual threads, but rather should redundantly execute threads running on the lightweight cores
Single-thread: 35% over a lightweight core (LWC); Throughput: 2x over an aggressive core (AC)
Acceleration Opportunities
Lightweight core: EV4, 2-issue in-order
Aggressive core: EV6, 6-issue out-of-order
With perfect hints, EV4 (OoO) can outperform EV6 (6-issue OoO)
Illusionist: Core Coupling Architecture
Base case architecture: connecting an aggressive and a
lightweight core together.
Illusionist: System Overview
CMP-level architecture: in a 44-core system, one AC is shared
among 10 LWCs arranged in two ring-based networks
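The core counts above can be checked with a short calculation. This is a sketch only: it assumes every core belongs to a cluster of exactly one AC plus its 10 LWCs, which the slide implies but does not state, and the function name cluster_breakdown is hypothetical.

```python
def cluster_breakdown(total_cores, lwc_per_ac):
    """Split a chip into clusters of one AC plus `lwc_per_ac` LWCs.

    Assumption (not stated on the slide): every core sits in such a cluster.
    Returns (clusters, aggressive_cores, lightweight_cores).
    """
    cores_per_cluster = lwc_per_ac + 1  # the shared AC plus its LWCs
    clusters, leftover = divmod(total_cores, cores_per_cluster)
    if leftover:
        raise ValueError("core count does not divide into whole clusters")
    return clusters, clusters, clusters * lwc_per_ac


print(cluster_breakdown(44, 10))  # (4, 4, 40)
```

Under that assumption, the 44-core system works out to 4 clusters, i.e. 4 ACs serving 40 LWCs.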
Program Distillation Techniques
Aggressive distillation: lets the AC generate hints for multiple LWCs through multithreading-based, time-multiplexed execution
Instruction removal:
  Keep back-slices of branches and memory instructions
  Sliding window: the top instruction is removed if it does not produce any value that drives branches or memory instructions inside the window (?)
  Highly biased branches: branches that are 90% biased are not helped by the AC; the LWC uses its own branch predictor for such branches
  Unnecessary cache hints: if loads and stores access addresses in the same cache line, the stores are removed
This analysis can be done offline; they do it using DynamoRIO, a dynamic, just-in-time compiler with an overhead of less than 1%
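Two of the removal rules above (highly biased branches and unnecessary cache hints) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the 64-byte line size, the profile and trace formats, and both function names are hypothetical, and the rule "drop a store if any load touches the same line" is one reading of the slide. The sliding-window rule is not modeled here.

```python
from collections import Counter

BIAS_THRESHOLD = 0.90  # from the slide: branches that are 90% biased
LINE_BYTES = 64        # assumed cache-line size; not given in the slides


def highly_biased_branches(outcomes):
    """outcomes: iterable of (branch_pc, taken) pairs from a profiling run.

    Returns the branch PCs whose dominant direction reaches the threshold;
    these would be left to the LWC's own branch predictor.
    """
    taken = Counter()
    total = Counter()
    for pc, was_taken in outcomes:
        total[pc] += 1
        taken[pc] += int(was_taken)
    biased = set()
    for pc, n in total.items():
        dominant = max(taken[pc], n - taken[pc])
        if dominant / n >= BIAS_THRESHOLD:
            biased.add(pc)
    return biased


def drop_redundant_stores(mem_ops):
    """mem_ops: list of (kind, address) with kind "load" or "store".

    Removes stores whose cache line is also touched by a load, since the
    load already produces the cache hint for that line.
    """
    loaded_lines = {addr // LINE_BYTES for kind, addr in mem_ops if kind == "load"}
    return [
        (kind, addr)
        for kind, addr in mem_ops
        if kind == "load" or addr // LINE_BYTES not in loaded_lines
    ]


profile = [(0x10, True)] * 19 + [(0x10, False)] + [(0x20, True)] * 6 + [(0x20, False)] * 4
print(highly_biased_branches(profile) == {0x10})  # True: 0x10 is 95% taken, 0x20 only 60%

ops = [("load", 0x1000), ("store", 0x1008), ("store", 0x2000), ("load", 0x1040)]
print(drop_redundant_stores(ops))  # the store to 0x1008 shares a line with the first load
```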
Phase-Based Program Selection
Idea is to predict the phases w/o actually running the program on both lightweight and aggressive cores
Then limit the dual-core execution to the most useful phases
A linear regression model is used to predict the IPC improvement in the next epoch based on L1 misses and branch mispredictions
The LWC which benefits most gets the AC to generate its hints
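The selection step can be sketched as a linear model over per-epoch counters, followed by an argmax over the LWCs. Everything numeric here is a placeholder: the coefficients, the per-1000-instruction normalization, and the names EpochCounters, predicted_ipc_gain and pick_lwc are assumptions for illustration, not values or interfaces from the paper.

```python
from dataclasses import dataclass


@dataclass
class EpochCounters:
    l1_misses_per_kinst: float           # L1 misses per 1000 instructions (assumed unit)
    branch_mispredicts_per_kinst: float  # mispredictions per 1000 instructions (assumed unit)


# Hypothetical regression coefficients, for illustration only.
INTERCEPT = 0.0
W_L1 = 0.010
W_BR = 0.020


def predicted_ipc_gain(c):
    """Linear model: predicted IPC improvement in the next epoch."""
    return INTERCEPT + W_L1 * c.l1_misses_per_kinst + W_BR * c.branch_mispredicts_per_kinst


def pick_lwc(counters_by_core):
    """Return the id of the LWC with the largest predicted gain."""
    return max(counters_by_core, key=lambda core: predicted_ipc_gain(counters_by_core[core]))


epoch = {
    0: EpochCounters(5.0, 1.0),
    1: EpochCounters(20.0, 8.0),
    2: EpochCounters(2.0, 0.5),
}
print(pick_lwc(epoch))  # 1: most misses and mispredictions, so largest predicted gain
```

Because the prediction uses only counters the LWC already collects, no trial run on the AC is needed to decide which LWC gets it next.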
Example of Distilled Program
Code from 179.art: the first statement is rare; the else clause is highly
biased, so it is made unconditional
Experimental Setup
Performance: Modified SimAlpha
Spec-2k with SimPoint
Power: Wattch, HotLeakage, and CACTI
Area: Synopsys toolchain + 90nm TSMC
Instructions Removed
Percentage of instructions removed from the original programs
On average, 76% of the instructions are removed when performing the analysis on a sliding window of 100K instructions.
Accuracy of Generated Hints
A larger window results in more aggressive instruction removal
This comes at a higher loss of accuracy
For a 10K window size (used in the final evaluation) the accuracy
is 79% compared to perfect program execution
Breakdown of Instructions
In most applications, the breakdowns are similar
Performance After Acceleration
On average, 35% better single-thread performance
Area-Neutral Comparison of Alternatives
Final system: After phase-based pruning
35% better single-thread performance compared to LWC
2X better throughput compared to AC
Summary
Instead of using aggressive cores to execute the bottleneck
threads, they are used to enable lightweight cores to achieve high
single-thread performance w/o sacrificing throughput
The aggressive core is shared among multiple lightweight cores
This requires aggressive program distillation
Programs are distilled by static, dynamic, and phase-based pruning
Finally, it is argued that using the AC to generate hints for the LWC is
a better option than off-loading the LWC’s work to the AC