Application Modeling and Hardware Partitioning Mechanisms for Resource Management
Sarah Bird, Henry Cook, Krste Asanovic, John Kubiatowicz, Dave Patterson
ParLab Arch and OS Groups
Resource Allocation Objectives
• Each partition receives a vector of basic resources dedicated to it
– Some number of processing elements (e.g., cores)
– A portion of physical memory
– A portion of shared cache memory
– A fraction of memory bandwidth
• Allocate the minimum resources necessary for each application's QoS requirements
[Figure: cells (Cell1, Cell2, Cell3) organized into a Resource Group containing a Resource Sub-Group.]
• Allocate remaining resources to meet some system-level objective
– Best performance
– Lowest energy
• Doesn't require application developers to worry about low-level resources
[Figure: Tessellation Kernel (Trusted) architecture. Cell-creation and resizing requests from users pass through Admission Control into the system-wide adaptation loop: the Policy Service (Microsoft/UPCRC), guided by global policies and user policies and preferences, exchanges major change requests (ACK/NACK) and minor changes with the STRG Validator & Resource Planner, which maintains the Space-Time Resource Graph (STRG, the current resources): all system resources divided among cells and cell groups, each holding a fraction of the resources. The planner draws on offline models and behavioral parameters plus online performance monitoring, model building, and prediction fed by performance counters. The Partition Mapping and Multiplexing Layer (Resource Plan Executor) drives the Partition Mechanism Layer (QoS enforcement, partition implementation, channel authenticator) over the partitionable hardware resources: cores, physical memory, memory bandwidth, cache/local store, disks, and NICs.]
APPLICATION MODELING: Techniques and Tradeoffs
Motivation
• Programmers are unlikely to know exactly how low-level resources affect performance
– Developers are concerned with application-level metrics
• e.g., frames/sec, requests/sec
– The operating system has to make decisions about resource quantities
• e.g., number of cores, cache slices, memory bandwidth
• Automatically constructing performance models is a good way to bridge the gap between application-level metrics and hardware resources
Model Building
• Collect data points of performance for specific resource allocation vectors
Li = PMi(r(0,i), r(1,i), …, r(n-1,i))
• Use multivariate regression techniques to fit a model to the data points
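A minimal sketch of this fitting step, with made-up sample points and numpy least squares standing in for the multivariate regression (the allocation vectors, measured performance values, and feature choices are all illustrative):

```python
import numpy as np

# Hypothetical sample points: allocation vectors r = (cores, cache, bandwidth)
# and a measured performance L for each; real values would come from profiling.
R = np.array([[1, 1, 1], [2, 1, 1], [4, 2, 1], [8, 2, 2],
              [2, 4, 2], [4, 4, 4], [8, 8, 4], [8, 4, 8]], dtype=float)
L = np.array([9.0, 5.2, 3.1, 1.9, 3.4, 2.2, 1.3, 1.5])

def linear_features(r):
    return np.hstack([np.ones((r.shape[0], 1)), r])         # [1, r0, r1, r2]

def quadratic_features(r):
    return np.hstack([np.ones((r.shape[0], 1)), r, r ** 2])  # + squared terms

# Fit L_i = PM_i(r(0,i), ..., r(n-1,i)) by least squares.
w_lin, *_ = np.linalg.lstsq(linear_features(R), L, rcond=None)
w_quad, *_ = np.linalg.lstsq(quadratic_features(R), L, rcond=None)

def predict(w, features, r):
    return float(features(np.atleast_2d(np.asarray(r, dtype=float))) @ w)

print(predict(w_quad, quadratic_features, [4, 2, 2]))
```

The fitted weight vector is the model PM_i: predicting performance for an untried allocation is one dot product, which is what makes these models cheap enough for an optimizer to query repeatedly.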
Linear and Quadratic
• Advantages
– Simple to build
– Work well with simple optimizers
• Disadvantages
– Potentially inaccurate
– Can't represent variable interaction

GPRS
• Advantages
– Very accurate
• Disadvantages
– Can overfit the data
– Computationally expensive to build
– Doesn't work with simple optimizers

KCCA
• Advantages
– Can represent all output metrics in one model
– Successful in the past
• Disadvantages
– Doesn't work with simple optimizers
Model Accuracy
Model Creation: Online vs. Offline Training
• Offline profiling options
– Profile applications in advance
• Distribute with the application (e.g., iTunes App Store or Android Market)
– Create application profiles in the cloud
• Record performance and resource statistics from users
• MSR is currently doing this to make performance models for app developers
• Online profiling options
– Install-time profiling
• The operating system tests out a variety of configurations
– Online refinement of models
• The operating system starts with a generic model
• Retrains the model with new information as the application runs
Performance Isolation
• Without performance isolation:
– An application's performance could vary widely as a result of concurrently running applications
– Models become inaccurate
– Different models of the application are required for each system load
• Performance predictability is an important component of application modeling
– Other advantages:
• Better-tuned, more efficient applications
• Easier to make QoS guarantees
RESOURCE ALLOCATION USING CONVEX OPTIMIZATION
An Example
Minimizing the Urgency of the System

La = PMa(r(0,a), r(1,a), …, r(n-1,a)) → Ua(La)
Lb = PMb(r(0,b), r(1,b), …, r(n-1,b)) → Ub(Lb)
…
Li = PMi(r(0,i), r(1,i), …, r(n-1,i)) → Ui(Li)

Each cell has a performance model PMi and an urgency function Ui. Continuously minimize the total urgency, subject to restrictions on the total amount of resources.

[Burton Smith (MSR), Operating System Resource Management (Keynote), IPDPS 2010]
Li = PM(r(0,i), r(1,i), …, r(n-1,i))
Ui(Li) = max(si · (Li − di), 0)

Urgency Function
• Reflects the importance of cell Ci to the user
• si: the slope
• di: the service requirement

Performance Model (PM)
• Expected to decrease with resources
• r(1,i): allocation of resource of type 1 to cell Ci

[Figure: urgency Ui(Li) versus performance Li: zero until the service requirement di, then rising with slope si.]
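The piecewise-linear urgency function above translates directly to code; the slope and service-requirement values used here are illustrative:

```python
def urgency(L, slope, deadline):
    """U_i(L_i) = max(s_i * (L_i - d_i), 0): zero while the service
    requirement d_i is met, rising linearly with slope s_i past it."""
    return max(slope * (L - deadline), 0.0)

# Illustrative numbers: a cell whose service requirement is d_i = 10
print(urgency(8.0, slope=2.0, deadline=10.0))   # requirement met -> 0.0
print(urgency(14.0, slope=2.0, deadline=10.0))  # 2 * (14 - 10) = 8.0
```

The max with zero means a cell that already meets its requirement exerts no pressure on the allocator, so spare resources flow to cells that are still behind.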
Performance-Aware Convex Optimization for Resource Allocation
• Advantages
– Convex optimization is a relatively inexpensive optimization problem with a single extreme point
– Urgency function slopes allow the system to express the relative priorities of applications
• Priorities change as a function of performance
– The urgency function intercept encapsulates QoS requirements
– An additional process can be used to represent system energy
• Caveat: very sensitive to the performance models of the applications
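A toy version of this optimization, with scipy's `minimize` standing in for a convex solver in the style of MATLAB's fmincon; the performance models (work divided by allocation), slopes, service requirements, and the 16-unit resource budget are all made-up assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Two cells share 16 units of one resource type. perf() is a hypothetical
# convex, decreasing performance model; slopes/deadlines are illustrative.
work = np.array([40.0, 24.0])
slope = np.array([2.0, 1.0])      # relative priorities of the two cells
deadline = np.array([4.0, 6.0])   # service requirements (urgency intercepts)

def perf(r, w):
    return w / r                   # execution time falls as allocation grows

def urgency(L, s, d):
    return max(s * (L - d), 0.0)   # U_i(L_i) = max(s_i * (L_i - d_i), 0)

def total_urgency(r):
    return sum(urgency(perf(ri, wi), si, di)
               for ri, wi, si, di in zip(r, work, slope, deadline))

# Continuously minimize total urgency subject to the total-resource budget.
res = minimize(total_urgency, x0=[8.0, 8.0],
               bounds=[(1.0, None), (1.0, None)],
               constraints=[{"type": "ineq",
                             "fun": lambda r: 16.0 - np.sum(r)}],
               method="SLSQP")
print(res.x, total_urgency(res.x))
```

With these numbers, both service requirements are satisfiable within the budget, so the solver can drive the total urgency to zero; tightening the budget forces it to trade the cells off according to their slopes.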
HARDWARE PARTITIONING: Approaches + Mechanisms
GSFm: Globally Synchronized Frames
Bandwidth Partitioning for the Memory Hierarchy
• Frame-Based QoS System
– Transactions are labeled with a frame number
– The head frame moves through the network with top priority
[Figure: Procs A–D inject transactions into per-frame virtual channels (VC 0/Fr0, VC 1/Fr1, VC 2/Fr2); a frame window over Frames 0–5 shifts forward as the head frame completes.]
with Jae Lee, MIT
GSFm: Globally Synchronized Frames
Bandwidth Partitioning for the Memory Hierarchy
• Uses source-side suppression
• Applications are given a bandwidth allocation per frame
– Credits per resource (memory channels, memory banks, network links, etc.)
– Memory transactions are charged for all possible resources
• A transaction is delayed into a future frame if the app doesn't have enough credits
[Figure: 16-tile chip (tiles 0–15) with two directories (dir 0, dir 1) and memory controllers (mem ctrl 0, mem ctrl 1) to off-chip DRAMs. Example credit charges for one memory transaction across the network message classes (c2hREQ, h2cRESP, h2cREQ, c2cRESP) and the memory resources (channel, bank). H: header-only message; P: header+payload message; T: memory transaction.]
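A toy model of the source-side suppression rule described above; the frame-window depth, credit values, and resource names are invented for illustration:

```python
class GsfSource:
    """Per-source GSFm bookkeeping: each frame in the window holds
    per-resource credits; a transaction is charged for every resource it
    might use, and lands in the first frame whose credits can cover it."""

    def __init__(self, credits_per_frame, num_frames=6):
        self.credits_per_frame = dict(credits_per_frame)
        self.frames = [dict(credits_per_frame) for _ in range(num_frames)]
        self.head = 0  # index of the head frame (advances on global sync)

    def inject(self, charges):
        """charges: {resource: cost}. Returns the frame the txn lands in."""
        for offset, frame in enumerate(self.frames):
            if all(frame.get(r, 0) >= c for r, c in charges.items()):
                for r, c in charges.items():
                    frame[r] -= c
                return self.head + offset  # offset > 0 means it was delayed
        raise RuntimeError("frame window full; stall the source")

    def reclaim_head(self):
        """Head frame completes: shift the window, open a fresh frame."""
        self.frames.pop(0)
        self.frames.append(dict(self.credits_per_frame))
        self.head += 1

src = GsfSource({"mem_channel": 2, "mem_bank": 2, "net_link": 4})
charge = {"mem_channel": 1, "mem_bank": 1, "net_link": 1}
print([src.inject(charge) for _ in range(5)])  # -> [0, 0, 1, 1, 2]
```

Once the two channel credits of frame 0 are spent, the third transaction spills into frame 1, which is exactly the suppression that bounds each source's bandwidth per frame.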
Advantages of GSFm
• Minimum bandwidth guarantees
– Flexible
– Differentiated
– Weighted sharing of the excess bandwidth
• Good utilization
– Early frame reclamation
– Use of excess bandwidth
• Minimal hardware requirements
– Reasonable area
– Distributed
[Figure: accepted throughput (flits/cycle/node), 0–0.06, per node.]
Cache Partitioning
• Way-Based
– Simple indexing
– Changes the replacement policy
– Reduced associativity
– Limited by the number of ways
– No locality for NUCA systems
• Bank-Based
– Locality in NUCA
– More complex indexing
– Requires a flush on reconfiguration
– Limited by the number of banks
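A sketch of the way-based scheme: each partition owns a bitmask of ways and the replacement policy may only evict within that mask (the way counts and LRU ordering below are hypothetical):

```python
NUM_WAYS = 8  # assumed associativity for this sketch

def way_mask(first_way, num_ways):
    """Contiguous way allocation as a bitmask, e.g. ways 6-7 -> 0b11000000."""
    return ((1 << num_ways) - 1) << first_way

def pick_victim(owned_mask, lru_order):
    """Evict the least-recently-used way among those the partition owns."""
    for way in lru_order:          # lru_order lists ways from LRU to MRU
        if owned_mask & (1 << way):
            return way
    raise ValueError("partition owns no ways")

cell_a = way_mask(0, 6)   # 6 ways -> effective associativity of 6
cell_b = way_mask(6, 2)   # 2 ways -> the reduced associativity noted above
print(bin(cell_a), bin(cell_b), pick_victim(cell_b, [1, 4, 7, 6, 0]))
```

Only the victim-selection logic changes (indexing stays simple), but a partition given few ways pays with low associativity, which is the tradeoff the slide lists.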
A SIMPLE EVALUATION: RAMP Gold Experiments
Experimental Platform
• RAMP Gold: FPGA-based simulator
• Target machine:
– 64 single-issue in-order cores @ 1 GHz
• Partitionable into sets of 8
– Private L1 instruction and data caches, 32 KB each
– Shared L2 cache: 8 MB, inclusive, 10 ns latency
• Partitionable into 8 slices using page coloring
– Memory bandwidth with a magic interconnect
• Partitionable into 3.4 GB/s units assigned to a set of cores
• ROS kernel code
– Microbenchmarks & PARSEC benchmarks
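The page-coloring slice partitioning of the shared L2 can be sketched as follows; only the 8-slice count comes from the slide, while the 4 KB page size and the use of the low physical-page-number bits as the color are assumptions of this sketch:

```python
# Page coloring: cache-index bits that lie above the page offset become the
# page "color"; granting a cell only frames of certain colors confines its
# data to the matching L2 slices, with no extra hardware needed.
PAGE_SIZE = 4096
NUM_COLORS = 8                     # 8 cache slices -> 3 color bits

def page_color(phys_addr):
    return (phys_addr // PAGE_SIZE) % NUM_COLORS

def allowed_page(phys_addr, cell_colors):
    """Allocator check: may this cell be given the frame at phys_addr?"""
    return page_color(phys_addr) in cell_colors

cell_colors = {0, 1}               # a cell granted 2 of 8 slices (1/4 of L2)
print(page_color(0x0000), page_color(0x9000),
      allowed_page(0x2000, cell_colors))  # -> 0 1 False
```

The cost of this software-only scheme is the flush/repage work when a cell's slice allocation changes, which is why the slides treat reconfiguration separately.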
Application Modeling
• Use 10 sample points
– 18.5% of the 54 possible allocations
– Selected using the Audze-Eglais Design of Experiments
• Create a model of the application
– Input: resources
– Output: performance
• Explore different types of models
– Linear
– Quadratic
– KCCA (machine learning)
– Genetically Programmed Response Surfaces (GPRS)
• Run all 54 allocations to test model accuracy
[Figure: latency vs. memory allocation.]
Scheduling Experiment
• Evaluate an objective function for a pair of benchmarks using the models
– Race to Halt: Min Max(Cycle1, Cycle2)
– Least Cycles: Min (Cycle1·Cores1 + Cycle2·Cores2)
– Lowest Energy: Min Σi (resource utilization_i · energy parameter_i)
– Using MATLAB's fmincon
• Run all 54 possible resource allocations for each pair of benchmarks
– Assumes that all resources must be allocated
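An exhaustive version of the Race-to-Halt and Least-Cycles objectives, splitting the 8 core sets of the target machine between two apps; the sublinear-speedup performance models and per-app work values are invented stand-ins for the fitted models:

```python
# Enumerate splits of 8 core sets between app1 and app2 and score each
# objective; cycles() is a made-up stand-in for the fitted models PM(r).
SPLITS = [(c, 8 - c) for c in range(1, 8)]   # core sets for (app1, app2)

WORK = {"app1": 96.0, "app2": 64.0}          # hypothetical per-app work

def cycles(app, core_sets):
    return WORK[app] / (core_sets ** 0.7)    # assumed sublinear speedup

def race_to_halt(c1, c2):
    # Min Max(Cycle1, Cycle2)
    return max(cycles("app1", c1), cycles("app2", c2))

def least_cycles(c1, c2):
    # Min (Cycle1*Cores1 + Cycle2*Cores2)
    return cycles("app1", c1) * c1 + cycles("app2", c2) * c2

def best(objective):
    return min(SPLITS, key=lambda split: objective(*split))

print(best(race_to_halt), best(least_cycles))  # -> (5, 3) (1, 7)
```

The two objectives disagree: Race to Halt balances finish times across the split, while Least Cycles concentrates resources where core-seconds are cheapest, which is why the experiment evaluates them separately.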
Spatial Partitioning Results
• Time-multiplexing is on average 2x worse than the best spatial partition
• However, the worst spatial partition is quite bad
• Naively dividing the machine in half is 1.75x worse than the best spatial partition
• The linear model is within 8% of optimal every time
• The quadratic model is within 3% of optimal every time
• KCCA does well in some cases but very poorly in others
[Figure: sum of cycles on all cores (normalized to best) for four benchmark pairs: Blackscholes and Streamcluster, Bodytrack and Streamcluster, Blackscholes and Fluidanimate, Loop Micro and Random Access Micro. Bars compare Best Spatial Partitioning, Time Multiplexing, Divide the Machine in Half, Prediction with Linear Model, Prediction with Quadratic Model, Prediction with KCCA Model, and Worst Spatial Partitioning (one bar reaching 7.15).]
Conclusions
• Simple application models show promise
– Still lots of challenges
• When to build?
• How to store?
• Variability
• Resource allocation using convex optimization is potentially very interesting
– Lots of parameters to tune
• How do we set the urgency functions?
– How does it compare with other options?
QUESTIONS?