Emerging multi/many-core architectures, targeting both High Performance Computing (HPC) and mobile devices, increase the interest in self-adaptive systems, where both applications and computational resources can smoothly adapt to changing working conditions. In these scenarios, an efficient Run-Time Resource Manager (RTRM) framework can provide valuable support in identifying the optimal trade-off between the Quality-of-Service (QoS) requirements of the applications and the time-varying availability of resources. This presentation introduces a new approach to the development of a system-wide RTRM featuring: a) hierarchical and distributed control, b) the exploitation of design-time information, c) a rich multi-objective optimization strategy and d) a portable and modular design based on a set of tunable policies. The framework is already available as an Open Source project, targeting a NUMA architecture and a new generation multi/many-core research platform. First tests show benefits for the execution of parallel applications, the scalability of the proposed multi-objective resource partitioning strategy, and the sustainability of the overheads introduced by the framework.
Exploiting Linux Control Groups for Effective Run-time Resource Management
P. Bellasi, G. Massari and W. Fornaciari
{bellasi, massari, fornacia}@elet.polimi.it
Speaker: Prof. William Fornaciari
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
Last revision: Jan 18, 2013
Exploiting Linux CGroups for Effective RTRM2
Introduction
Why Run-Time Resource Management?
Computing platforms convergence
- targeting both HPC and high-end embedded and mobile systems
- parallelism level ranging from a few to hundreds of PEs
- thanks to progress in silicon technology
Emerging new set of non-functional constraints
- thermal management, system reliability and fault-tolerance
- area and power are typical design issues
- embedded systems are losing exclusiveness
Effective resource management policies are required to properly exploit modern computing platforms
Introduction
How do we compare?
Different approaches targeting resource allocation
Linux scheduler extensions
- mostly based on adding new scheduler classes [2,4,7]
- force the adoption of a customized kernel
Virtualization
- hypervisor acting as a global system manager
- both commercial and open source solutions
  Commercial: e.g. OpenVZ, VServer, Montavista Linux; Open: e.g. KVM, Linux Containers
- require HW support on the target system
User-space approaches
- more portable solutions [3,6,11]
- mostly limited to CPU assignment
[2] Bini et al., “Resource management on multicore systems: The actors approach”. Micro 2011.
[3] Blagodurov and Fedorova, “User-level scheduling on numa multicore systems under linux”. Linux Symposium 2011.
[4] Fu and Wang, “Utilization-controlled task consolidation for power optimization in multi-core real-time systems”. RTCSA 2011.
[6] Hofmeyr et al., “Load balancing on speed”. PPoPP 2010.
[7] Li et al., “Efficient operating system scheduling for performance-asymmetric multi-core architectures”. SC 2007.
[11] Sondag and Rajan, “Phase-based tuning for better utilization of performance-asymmetric multicore processors”. CGO 2011.
More dynamic usage of Linux Control Groups to manage multiple resources with a portable and modular RTRM running in user-space
Resources Partitioning Mechanism
Why Linux Control Groups?
Standard Linux framework, since 2.6.24
- allows grouping tasks and binding them to a set of system resources
  e.g. CPUs, memory quota and I/O bandwidth
- resources can be either shared or exclusively assigned
Allows defining isolated execution environments
- light-weight virtualization, i.e. low run-time overhead
Mostly used for “quasi-static” configuration
- administrator's tool to shape the system usage
Increasing set of resource controllers
We use it in a dynamic way, as an effective mechanism to support RTRM
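The dynamic use of Control Groups can be illustrated with a few lines of Python. This is a minimal sketch, not BarbequeRTRM code: it assumes the cgroup-v1 cpuset controller (control files `cpuset.cpus`, `cpuset.mems` and `tasks`), and the helper names, the group name and the mount point mentioned in the comments are hypothetical.

```python
import os

def cpuset_settings(cpus, mems):
    """Map cgroup-v1 cpuset control files to the values for one partition."""
    return {
        "cpuset.cpus": ",".join(str(c) for c in cpus),
        "cpuset.mems": ",".join(str(m) for m in mems),
    }

def apply_partition(root, name, settings, pids=()):
    """Create (or update) the group `name` under `root` and attach `pids`.

    `root` is assumed to be a mounted cpuset hierarchy (for instance a
    hypothetical /sys/fs/cgroup/cpuset); any writable directory works
    for a dry run.
    """
    group = os.path.join(root, name)
    os.makedirs(group, exist_ok=True)
    for control_file, value in settings.items():
        with open(os.path.join(group, control_file), "w") as f:
            f.write(value)
    for pid in pids:
        # One PID per write, as expected by the v1 `tasks` file.
        with open(os.path.join(group, "tasks"), "a") as f:
            f.write(f"{pid}\n")
    return group
```

Calling `apply_partition` again with a different CPU list updates the same group in place, which is the difference between a dynamic usage and a one-off administrative configuration.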
The BarbequeRTRM
Overall View on Run-Time Resource Management
System-Wide RTRM
Coarse grained control on platform available resources:
- resource accounting, partitioning and abstraction
- high-level HW events handling, e.g. critical conditions, faults...
- manage applications priorities
- power/thermal “coarse tuning”
Application-Specific RTM
Fine grained control on application allocated resources:
- task ordering
- virtual processor assignment
- DVFS
- application parameters monitoring
[Figure: BarbequeRTRM architecture diagram. Labelled components: Dynamic Code Generation, Task Mapping, DDM, Critical Apps, Best-Effort Apps, RTLib, Res Accounting, Res Partitioning, Res Abstraction, MRAPI, Platform Proxy, Platform Driver, Platform Firmware; user-space and kernel layers; supported platforms. Legend: SW Interface (API), SW/HW Meta-data.]
[1] Bellasi et al., “A RTRM proposal for multi/many-core platforms and reconfigurable applications”. ReCoSoC 2012.
CGroups-based resources abstraction layer
Extend the advanced and efficient resources control capability offered by modern Linux kernels with suitable resources partitioning policies running in user-space
Experimental Setup
Hardware Platform and Workloads
Workloads: increasing number of concurrently running applications
- Bodytrack (BT) (PARSEC v2.1)
- modified to be run-time tunable and integrated with the BarbequeRTRM
  https://bitbucket.org/bosp/benchmarks-parsec
Platform: Quad-Core AMD Opteron 8378
- 4 core host partition, 3x4 CPUs accelerator partition
- running up to 2.8GHz, 16 Processing Elements (PE)
- CPUFreq and its on-demand policy
Goal: assess the framework's capability to efficiently manage resources in increasingly congested workload scenarios
[Figure: Linux host partition and CGroups-managed device partition]
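The platform split (a 4-core host partition plus 3 clusters of 4 CPUs, 16 PEs overall) can be written down as CPU id lists. The sketch below assumes contiguous PE numbering with the host taking the lowest ids; the actual numbering on the test machine is not stated, and the function name is hypothetical.

```python
def partition_pes(total_pes, host_pes, cluster_size):
    """Split PE ids into a host partition and equally sized managed clusters."""
    managed = list(range(host_pes, total_pes))
    if len(managed) % cluster_size:
        raise ValueError("managed PEs do not divide into whole clusters")
    clusters = [managed[i:i + cluster_size]
                for i in range(0, len(managed), cluster_size)]
    return list(range(host_pes)), clusters

# 16 PEs: 4 for the Linux host, 3 x 4 for the CGroups-managed partition.
host, clusters = partition_pes(total_pes=16, host_pes=4, cluster_size=4)
```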
Experimental Setup
Metrics Collection
Compare Bodytrack original vs integrated version
- using the same maximum number of threads
  the BBQ-managed version can reduce this number at run-time
- original version controlled by the Linux scheduler, integrated version managed by the BarbequeRTRM
Performance profiling using standard frameworks
- Linux perf framework to collect HW/SW performance counters
- IPMI interface for system-wide power consumption [W]
(*) The lower the better, for all metrics but the IPC
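A counter collection of this kind could be driven from a small script. The sketch below only builds a `perf stat` command line; the event list is an assumption chosen to match the metrics named in these slides (the exact events used in the experiments are not stated), and `./bodytrack` is a placeholder workload.

```python
def perf_stat_cmd(workload, events=("cycles", "instructions",
                                    "context-switches", "cpu-migrations",
                                    "page-faults", "branches")):
    """Build a `perf stat` invocation counting `events` for `workload`."""
    return ["perf", "stat", "-e", ",".join(events), "--"] + list(workload)

# Placeholder workload; pass the list to subprocess.run() to execute it.
cmd = perf_stat_cmd(["./bodytrack"])
```

IPC then follows as instructions divided by cycles from the reported counts.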
Results
Workload Burst Performance Comparison
[Figure: improvements [%], BBQ managed vs unmanaged, for 1-thread and 8-thread workloads. Panels: Completion Time, CPU Migrations, CTX Switches, Power [W].]
Statistics based on: 30 runs, 99% confidence interval
BBQ-managed apps are pinned to their assigned CPUs
- improved code execution efficiency, IPC: 1.080 => 1.235
High system congestion
- BBQ partially serializes the execution of concurrent workloads, IPC: 1.070 => 1.325
Up to x6 more energy efficient
> x1.3 faster
Reduced OS overhead, improved code efficiency
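The two IPC pairs reported on this slide can be turned into relative gains with a one-line formula. The snippet is only a worked calculation on the slide's own numbers; the function name is illustrative.

```python
def relative_gain(before, after):
    """Relative change of a higher-is-better metric, in percent."""
    return (after - before) / before * 100.0

# IPC values taken from the slide (unmanaged => BBQ managed).
ipc_pairs = [(1.080, 1.235), (1.070, 1.325)]
for before, after in ipc_pairs:
    print(f"IPC {before:.3f} => {after:.3f}: {relative_gain(before, after):+.1f}%")
```

This gives roughly +14.4% for the first pair and +23.8% for the second.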
Results
Benefits and Loss Comparison
[Figure: normalized speedups for all collected performance counters, 1 thread and 8 threads. A positive bar corresponds to an improvement, a negative bar to a deficiency of the managed application with respect to the original one.]
Same order of magnitude for “migrations” on lower congestion
“page faults” and “branch rate” always degraded because of code organization for BBQ integration
- loop-unrolling could not be applied, but... an improved integration has already been identified
Instruction stream optimization could be achieved by trading compile-time optimization with effective resources assignment
The BarbequeRTRM Framework
Conclusions & Future Works
New user-space approach to RTRM
- exploiting an advanced and efficient resources control framework offered by modern Linux kernels
- providing a tunable resources partitioning policy
Evaluated the effectiveness of Linux Control Groups
- to support mandatory resources assignment to concurrently running applications
- assessment using a benchmark from PARSEC v2.1, updated to be run-time tunable and integrated with our framework
More than 30% speed-up, up to x6 energy efficiency
- overall improved instruction stream optimization
- confirmed by many HW/SW performance counters
Main future activities: integrate more benchmarks and explore more compiler-friendly integrations
Thanks for your attention!
If you are interested, please check the project website for further information and keep up to date with the developments
http://bosp.dei.polimi.it
Backup Slides
The BarbequeRTRM
How do we compare?
[Table: resource manager proposals (StarPU, Binotto et al., Fu et al., ACTORS, SEEC, DistRM, BarbequeRTRM) compared on desirable properties (Multi-Objective, Heter. Platforms, Homog. Platforms, Reconf./Adapt., Mult. Resources, Clustered Resources, Control-Theory Model, Design-Time Exploitation, Portability)]
P. Bellasi et al., “A RTRM proposal for Multi/Many-Core platforms and reconfigurable applications”. 7th International Workshop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC'12), York, UK, 07/2012.
The Proposed Control Solution
Distributed Hierarchical Control
Different subsystems have their own control loop (CL)
- System-wide level (resources partitioning, system-wide optimization, ...)
- Application specific (application tuning, dynamic memory management, ...)
- Firmware/OS level (F/V control, thermal alarms, resource availability, ...)
FF closed CL using OP and AWM
Optimal: user defined goal functions, including overheads
Robust, Adaptive
Scheduling Policy
System-Wide Controller – Overall View
BBQ Validation Policy
- enforce certain control properties: energy budget, stability and robustness
- authorize resources synchronization
Scheduling Policy
System-Wide Controller – Inner-Loop “Scheduling”
BBQ Validation Policy
- enforce certain control properties: energy budget, stability and robustness
- authorize resources synchronization
YaMS
Scheduling Policy
System-Wide Controller – Inner-Loop Overheads
[Figure: inner-loop scheduling overheads]
Apps with 3 AWM, 3 Clusters => 9 configurations per application
BBQ running on NSJ, 4 CPUs @ 2.5GHz (max)
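The figure of 9 configurations per application is the Cartesian product of working modes and clusters. The sketch below enumerates it; the AWM and cluster names are hypothetical labels, and the helper names are not taken from the framework.

```python
from itertools import product

awms = ["AWM0", "AWM1", "AWM2"]                  # hypothetical working-mode names
clusters = ["cluster0", "cluster1", "cluster2"]  # hypothetical cluster names

def candidate_configs(awms, clusters):
    """All (working mode, cluster) pairs a scheduling policy could consider."""
    return list(product(awms, clusters))

def search_space(num_apps, awms, clusters):
    """Number of (application, configuration) pairs for `num_apps` apps."""
    return num_apps * len(awms) * len(clusters)

configs = candidate_configs(awms, clusters)
```

With these inputs `configs` holds 9 pairs, so the number of candidates grows linearly with the number of running applications.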
Scheduling Policy
YaMS - Scalability
[Figure: YaMS scalability, speedup; annotated values: +36%, +54%]
The BarbequeRTRM
The Barbeque OpenSource Project (BOSP)
Public GIT repository
- Framework dependencies: external libs, tools, ...
- Framework sources: BarbequeRTRM, RTLib
- Framework tools: PyGrill (loggrapher), ...
- Contributions: tutorials, demo
Based on (a customization of) the Android build system
- freely available for download and (automated) building
https://bitbucket.org/bosp
The BarbequeRTRM Framework
Why such a name?!?
Because of its “sweet analogy” with something everyone knows...
- Mixed Workload: sausages, steaks, chops and vegetables
- Resources: coals and grill
- QoS: how good the grill is
- Policy: the cooking recipe
- Priority: how thick the meat is, or how hungry you are
- Task mapping: the chef's secret
- Thermal Issues: burning the meat
- Reliability Issues: dropping the meat
- Overheads: cook fast and light
- Applications: the stuff to cook
Synchronization Policy
System-Wide Controller – Outer-Loop “Synchronization”
BBQ Validation Policy
- enforce certain control properties: energy budget, stability and robustness
- authorize resources synchronization
Synchronization Policy
System-Wide Controller – Outer-Loop Overheads
[Figure: outer-loop synchronization overheads; labelled contributions: CGroups PIL, application dependent]
min AWM 25% CPU Time, 3 Clusters x 4 CPUs => max 48 syncs
BBQ running on NSJ, 4 CPUs @ 2.5GHz (max)
Linux kernel 3.2
- Creation overheads: ~500ms
- Update overheads: ~100ms
(1/3 on quadcore i7)
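The upper bound of 48 synchronizations is consistent with a simple count, assuming each CPU can host as many applications as its smallest working mode allows (25% CPU time, hence 4 per CPU). That reading of the slide is an interpretation, and the function below is only a worked arithmetic sketch with a hypothetical name.

```python
def max_syncs(clusters, cpus_per_cluster, min_awm_cpu_pct):
    """Upper bound on applications to synchronize in one scheduling run."""
    apps_per_cpu = 100 // min_awm_cpu_pct  # 25% CPU time -> 4 apps per CPU
    return clusters * cpus_per_cluster * apps_per_cpu

bound = max_syncs(clusters=3, cpus_per_cluster=4, min_awm_cpu_pct=25)
```

Here `bound` evaluates to 3 x 4 x 4 = 48, matching the value on the slide.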
The BarbequeRTRM Framework
Power Optimizations
X86_64 NUMA machine: 3 Clusters x 4 CPUs
BBQ running on NSJ, 4 CPUs @ 800MHz
Initial experiments on congested workloads
- increasing running instances of Bodytrack (PARSEC)
- 3 AWM: [1,2,4] Threads
- system-wide power measurements via the standard IPMI interface
Power Gains: 2.3-3.7%
Time Gains: 338-625%