Application Modeling and Hardware Partitioning Mechanisms for Resource Management
Sarah Bird, Henry Cook, Krste Asanovic, John Kubiatowicz, Dave Patterson
ParLab Arch and OS Groups
Resource Allocation Objectives
• Each partition receives a vector of basic resources dedicated to it
– Some number of processing elements (e.g., cores)
– A portion of physical memory
– A portion of shared cache memory
– A fraction of memory bandwidth
• Allocate the minimum resources necessary for each application's QoS requirements
[Figure: cells (Cell1, Cell2, Cell3) organized into a Resource Group containing a Resource Sub-Group.]
• Allocate remaining resources to meet some system-level objective
– Best performance
– Lowest energy
• Doesn't require application developers to worry about low-level resources
[Figure: Tessellation Kernel (Trusted) architecture. Cell-creation and resizing requests from users pass through Admission Control into the system-wide adaptation loop: the Policy Service (Microsoft/UPCRC), guided by global policies and user policies and preferences, exchanges major change requests (ACK/NACK) and minor changes with the STRG Validator & Resource Planner, which maintains the Space-Time Resource Graph (STRG, the current resources): all system resources divided among cells and cell groups, each holding a fraction of the resources. The planner draws on offline models and behavioral parameters plus online performance monitoring, model building, and prediction fed by performance counters. The Partition Mapping and Multiplexing Layer (Resource Plan Executor) drives the Partition Mechanism Layer (QoS enforcement, partition implementation, channel authenticator) over the partitionable hardware resources: cores, physical memory, memory bandwidth, cache/local store, disks, and NICs.]
APPLICATION MODELING: Techniques and Tradeoffs
Motivation
• Programmers are unlikely to know exactly how low-level resources affect performance
– Developers are concerned with application-level metrics
• e.g., frames/sec, requests/sec
– The operating system has to make decisions about resource quantities
• e.g., number of cores, cache slices, memory bandwidth
• Automatically constructing performance models is a good way to bridge the gap between application-level metrics and hardware resources
Model Building
• Collect data points of performance for specific resource allocation vectors
Li = PMi(r(0,i), r(1,i), …, r(n-1,i))
• Use multivariate regression techniques to fit a model to the data points
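A minimal sketch of this fitting step, with made-up sample points and numpy least squares standing in for the multivariate regression (the allocation vectors, measured performance values, and feature choices are all illustrative):

```python
import numpy as np

# Hypothetical sample points: allocation vectors r = (cores, cache, bandwidth)
# and a measured performance L for each; real values would come from profiling.
R = np.array([[1, 1, 1], [2, 1, 1], [4, 2, 1], [8, 2, 2],
              [2, 4, 2], [4, 4, 4], [8, 8, 4], [8, 4, 8]], dtype=float)
L = np.array([9.0, 5.2, 3.1, 1.9, 3.4, 2.2, 1.3, 1.5])

def linear_features(r):
    return np.hstack([np.ones((r.shape[0], 1)), r])         # [1, r0, r1, r2]

def quadratic_features(r):
    return np.hstack([np.ones((r.shape[0], 1)), r, r ** 2])  # + squared terms

# Fit L_i = PM_i(r(0,i), ..., r(n-1,i)) by least squares.
w_lin, *_ = np.linalg.lstsq(linear_features(R), L, rcond=None)
w_quad, *_ = np.linalg.lstsq(quadratic_features(R), L, rcond=None)

def predict(w, features, r):
    return float(features(np.atleast_2d(np.asarray(r, dtype=float))) @ w)

print(predict(w_quad, quadratic_features, [4, 2, 2]))
```

The fitted weight vector is the model PM_i: predicting performance for an untried allocation is one dot product, which is what makes these models cheap enough for an optimizer to query repeatedly.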
Linear and Quadratic
• Advantages
– Simple to build
– Work well with simple optimizers
• Disadvantages
– Potentially inaccurate
– Can't represent variable interaction

GPRS
• Advantages
– Very accurate
• Disadvantages
– Can overfit the data
– Computationally expensive to build
– Doesn't work with simple optimizers

KCCA
• Advantages
– Can represent all output metrics in one model
– Successful in the past
• Disadvantages
– Doesn't work with simple optimizers
Model Accuracy
Model Creation: Online vs. Offline Training
• Offline profiling options
– Profile applications in advance
• Distribute with the application (e.g., iTunes App Store or Android Market)
– Create application profiles in the cloud
• Record performance and resource statistics from users
• MSR is currently doing this to make performance models for app developers
• Online profiling options
– Install-time profiling
• The operating system tests out a variety of configurations
– Online refinement of models
• The operating system starts with a generic model
• Retrains the model with new information as the application runs
Performance Isolation
• Without performance isolation:
– An application's performance could vary widely as a result of concurrently running applications
– Models become inaccurate
– Different models of the application are required for each system load
• Performance predictability is an important component of application modeling
– Other advantages:
• Better-tuned, more efficient applications
• Easier to make QoS guarantees
RESOURCE ALLOCATION USING CONVEX OPTIMIZATION
An Example
Minimizing the Urgency of the System

La = PMa(r(0,a), r(1,a), …, r(n-1,a)) → Ua(La)
Lb = PMb(r(0,b), r(1,b), …, r(n-1,b)) → Ub(Lb)
…
Li = PMi(r(0,i), r(1,i), …, r(n-1,i)) → Ui(Li)

Each cell has a performance model PMi and an urgency function Ui. Continuously minimize the total urgency, subject to restrictions on the total amount of resources.

[Burton Smith (MSR), Operating System Resource Management (Keynote), IPDPS 2010]
Li = PM(r(0,i), r(1,i), …, r(n-1,i))
Ui(Li) = max(si · (Li − di), 0)

Urgency Function
• Reflects the importance of cell Ci to the user
• si: the slope
• di: the service requirement

Performance Model (PM)
• Expected to decrease with resources
• r(1,i): allocation of resource of type 1 to cell Ci

[Figure: urgency Ui(Li) versus performance Li: zero until the service requirement di, then rising with slope si.]
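The piecewise-linear urgency function above translates directly to code; the slope and service-requirement values used here are illustrative:

```python
def urgency(L, slope, deadline):
    """U_i(L_i) = max(s_i * (L_i - d_i), 0): zero while the service
    requirement d_i is met, rising linearly with slope s_i past it."""
    return max(slope * (L - deadline), 0.0)

# Illustrative numbers: a cell whose service requirement is d_i = 10
print(urgency(8.0, slope=2.0, deadline=10.0))   # requirement met -> 0.0
print(urgency(14.0, slope=2.0, deadline=10.0))  # 2 * (14 - 10) = 8.0
```

The max with zero means a cell that already meets its requirement exerts no pressure on the allocator, so spare resources flow to cells that are still behind.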
Performance-Aware Convex Optimization for Resource Allocation
• Advantages
– Convex optimization is a relatively inexpensive optimization problem with a single extreme point
– Urgency function slopes allow the system to express the relative priorities of applications
• Priorities change as a function of performance
– The urgency function intercept encapsulates QoS requirements
– An additional process can be used to represent system energy
• Caveat: very sensitive to the performance models of the applications
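A toy version of this optimization, with scipy's `minimize` standing in for a convex solver in the style of MATLAB's fmincon; the performance models (work divided by allocation), slopes, service requirements, and the 16-unit resource budget are all made-up assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Two cells share 16 units of one resource type. perf() is a hypothetical
# convex, decreasing performance model; slopes/deadlines are illustrative.
work = np.array([40.0, 24.0])
slope = np.array([2.0, 1.0])      # relative priorities of the two cells
deadline = np.array([4.0, 6.0])   # service requirements (urgency intercepts)

def perf(r, w):
    return w / r                   # execution time falls as allocation grows

def urgency(L, s, d):
    return max(s * (L - d), 0.0)   # U_i(L_i) = max(s_i * (L_i - d_i), 0)

def total_urgency(r):
    return sum(urgency(perf(ri, wi), si, di)
               for ri, wi, si, di in zip(r, work, slope, deadline))

# Continuously minimize total urgency subject to the total-resource budget.
res = minimize(total_urgency, x0=[8.0, 8.0],
               bounds=[(1.0, None), (1.0, None)],
               constraints=[{"type": "ineq",
                             "fun": lambda r: 16.0 - np.sum(r)}],
               method="SLSQP")
print(res.x, total_urgency(res.x))
```

With these numbers, both service requirements are satisfiable within the budget, so the solver can drive the total urgency to zero; tightening the budget forces it to trade the cells off according to their slopes.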
HARDWARE PARTITIONING: Approaches + Mechanisms
GSFm: Globally Synchronized Frames
Bandwidth Partitioning for the Memory Hierarchy
• Frame-Based QoS System
– Transactions are labeled with a frame number
– The head frame moves through the network with top priority
[Figure: Procs A–D inject transactions into per-frame virtual channels (VC 0/Fr0, VC 1/Fr1, VC 2/Fr2); a frame window over Frames 0–5 shifts forward as the head frame completes.]
with Jae Lee, MIT
GSFm: Globally Synchronized Frames
Bandwidth Partitioning for the Memory Hierarchy
• Uses source-side suppression
• Applications are given a bandwidth allocation per frame
– Credits per resource (memory channels, memory banks, network links, etc.)
– Memory transactions are charged for all possible resources
• A transaction is delayed into a future frame if the app doesn't have enough credits
[Figure: 16-tile chip (tiles 0–15) with two directories (dir 0, dir 1) and memory controllers (mem ctrl 0, mem ctrl 1) to off-chip DRAMs. Example credit charges for one memory transaction across the network message classes (c2hREQ, h2cRESP, h2cREQ, c2cRESP) and the memory resources (channel, bank). H: header-only message; P: header+payload message; T: memory transaction.]
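A toy model of the source-side suppression rule described above; the frame-window depth, credit values, and resource names are invented for illustration:

```python
class GsfSource:
    """Per-source GSFm bookkeeping: each frame in the window holds
    per-resource credits; a transaction is charged for every resource it
    might use, and lands in the first frame whose credits can cover it."""

    def __init__(self, credits_per_frame, num_frames=6):
        self.credits_per_frame = dict(credits_per_frame)
        self.frames = [dict(credits_per_frame) for _ in range(num_frames)]
        self.head = 0  # index of the head frame (advances on global sync)

    def inject(self, charges):
        """charges: {resource: cost}. Returns the frame the txn lands in."""
        for offset, frame in enumerate(self.frames):
            if all(frame.get(r, 0) >= c for r, c in charges.items()):
                for r, c in charges.items():
                    frame[r] -= c
                return self.head + offset  # offset > 0 means it was delayed
        raise RuntimeError("frame window full; stall the source")

    def reclaim_head(self):
        """Head frame completes: shift the window, open a fresh frame."""
        self.frames.pop(0)
        self.frames.append(dict(self.credits_per_frame))
        self.head += 1

src = GsfSource({"mem_channel": 2, "mem_bank": 2, "net_link": 4})
charge = {"mem_channel": 1, "mem_bank": 1, "net_link": 1}
print([src.inject(charge) for _ in range(5)])  # -> [0, 0, 1, 1, 2]
```

Once the two channel credits of frame 0 are spent, the third transaction spills into frame 1, which is exactly the suppression that bounds each source's bandwidth per frame.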
Advantages of GSFm
• Minimum bandwidth guarantees
– Flexible
– Differentiated
– Weighted sharing of the excess bandwidth
• Good utilization
– Early frame reclamation
– Use of excess bandwidth
• Minimal hardware requirements
– Reasonable area
– Distributed
[Figure: accepted throughput (flits/cycle/node), 0–0.06, per node.]
Cache Partitioning
• Way-Based
– Simple indexing
– Changes the replacement policy
– Reduced associativity
– Limited by the number of ways
– No locality for NUCA systems
• Bank-Based
– Locality in NUCA
– More complex indexing
– Requires a flush on reconfiguration
– Limited by the number of banks
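A sketch of the way-based scheme: each partition owns a bitmask of ways and the replacement policy may only evict within that mask (the way counts and LRU ordering below are hypothetical):

```python
NUM_WAYS = 8  # assumed associativity for this sketch

def way_mask(first_way, num_ways):
    """Contiguous way allocation as a bitmask, e.g. ways 6-7 -> 0b11000000."""
    return ((1 << num_ways) - 1) << first_way

def pick_victim(owned_mask, lru_order):
    """Evict the least-recently-used way among those the partition owns."""
    for way in lru_order:          # lru_order lists ways from LRU to MRU
        if owned_mask & (1 << way):
            return way
    raise ValueError("partition owns no ways")

cell_a = way_mask(0, 6)   # 6 ways -> effective associativity of 6
cell_b = way_mask(6, 2)   # 2 ways -> the reduced associativity noted above
print(bin(cell_a), bin(cell_b), pick_victim(cell_b, [1, 4, 7, 6, 0]))
```

Only the victim-selection logic changes (indexing stays simple), but a partition given few ways pays with low associativity, which is the tradeoff the slide lists.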
A SIMPLE EVALUATION: RAMP Gold Experiments
Experimental Platform
• RAMP Gold: FPGA-based simulator
• Target machine:
– 64 single-issue in-order cores @ 1 GHz
• Partitionable into sets of 8
– Private L1 instruction and data caches, 32 KB each
– Shared L2 cache: 8 MB, inclusive, 10 ns latency
• Partitionable into 8 slices using page coloring
– Memory bandwidth with a magic interconnect
• Partitionable into 3.4 GB/s units assigned to a set of cores
• ROS kernel code
– Microbenchmarks & PARSEC benchmarks
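The page-coloring slice partitioning of the shared L2 can be sketched as follows; only the 8-slice count comes from the slide, while the 4 KB page size and the use of the low physical-page-number bits as the color are assumptions of this sketch:

```python
# Page coloring: cache-index bits that lie above the page offset become the
# page "color"; granting a cell only frames of certain colors confines its
# data to the matching L2 slices, with no extra hardware needed.
PAGE_SIZE = 4096
NUM_COLORS = 8                     # 8 cache slices -> 3 color bits

def page_color(phys_addr):
    return (phys_addr // PAGE_SIZE) % NUM_COLORS

def allowed_page(phys_addr, cell_colors):
    """Allocator check: may this cell be given the frame at phys_addr?"""
    return page_color(phys_addr) in cell_colors

cell_colors = {0, 1}               # a cell granted 2 of 8 slices (1/4 of L2)
print(page_color(0x0000), page_color(0x9000),
      allowed_page(0x2000, cell_colors))  # -> 0 1 False
```

The cost of this software-only scheme is the flush/repage work when a cell's slice allocation changes, which is why the slides treat reconfiguration separately.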
Application Modeling
• Use 10 sample points
– 18.5% of the 54 possible allocations
– Selected using the Audze-Eglais Design of Experiments
• Create a model of the application
– Input: resources
– Output: performance
• Explore different types of models
– Linear
– Quadratic
– KCCA (machine learning)
– Genetically Programmed Response Surfaces (GPRS)
• Run all 54 allocations to test model accuracy
[Figure: latency vs. memory allocation.]
Scheduling Experiment
• Evaluate an objective function for a pair of benchmarks using the models
– Race to Halt: Min Max(Cycle1, Cycle2)
– Least Cycles: Min (Cycle1·Cores1 + Cycle2·Cores2)
– Lowest Energy: Min Σi (resource utilization_i · energy parameter_i)
– Using MATLAB's fmincon
• Run all 54 possible resource allocations for each pair of benchmarks
– Assumes that all resources must be allocated
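An exhaustive version of the Race-to-Halt and Least-Cycles objectives, splitting the 8 core sets of the target machine between two apps; the sublinear-speedup performance models and per-app work values are invented stand-ins for the fitted models:

```python
# Enumerate splits of 8 core sets between app1 and app2 and score each
# objective; cycles() is a made-up stand-in for the fitted models PM(r).
SPLITS = [(c, 8 - c) for c in range(1, 8)]   # core sets for (app1, app2)

WORK = {"app1": 96.0, "app2": 64.0}          # hypothetical per-app work

def cycles(app, core_sets):
    return WORK[app] / (core_sets ** 0.7)    # assumed sublinear speedup

def race_to_halt(c1, c2):
    # Min Max(Cycle1, Cycle2)
    return max(cycles("app1", c1), cycles("app2", c2))

def least_cycles(c1, c2):
    # Min (Cycle1*Cores1 + Cycle2*Cores2)
    return cycles("app1", c1) * c1 + cycles("app2", c2) * c2

def best(objective):
    return min(SPLITS, key=lambda split: objective(*split))

print(best(race_to_halt), best(least_cycles))  # -> (5, 3) (1, 7)
```

The two objectives disagree: Race to Halt balances finish times across the split, while Least Cycles concentrates resources where core-seconds are cheapest, which is why the experiment evaluates them separately.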
Spatial Partitioning Results
• Time-multiplexing is on average 2x worse than the best spatial partition
• However, the worst spatial partition is quite bad
• Naively dividing the machine in half is 1.75x worse than the best spatial partition
• The linear model is within 8% of optimal every time
• The quadratic model is within 3% of optimal every time
• KCCA does well in some cases but very poorly in others
[Figure: sum of cycles on all cores (normalized to best) for four benchmark pairs: Blackscholes and Streamcluster, Bodytrack and Streamcluster, Blackscholes and Fluidanimate, Loop Micro and Random Access Micro. Bars compare Best Spatial Partitioning, Time Multiplexing, Divide the Machine in Half, Prediction with Linear Model, Prediction with Quadratic Model, Prediction with KCCA Model, and Worst Spatial Partitioning (one bar reaching 7.15).]
Conclusions
• Simple application models show promise
– Still lots of challenges
• When to build?
• How to store?
• Variability
• Resource allocation using convex optimization is potentially very interesting
– Lots of parameters to tune
• How do we set the urgency functions?
– How does it compare with other options?
QUESTIONS?