Efficient Multiprogramming for Multicores with SCAF Published in MICRO-46, December 2013. Published by Timothy Creech, Aparna Kotha and Rajeev Barua. Presented

Efficient Multiprogramming for Multicores with SCAF

Published in MICRO-46, December 2013. Authors: Timothy Creech, Aparna Kotha, and Rajeev Barua.

Presented by:

- Abhishek Mishra (13IS01F)

- Ankit Patidar (13IS03F)

What is in the presentation:

• Problem Statement

• Motivation behind solving the problem

• Introduction

• Existing solutions to the problem

• Problems faced by existing solutions

• Solution presented in the paper

• Performance Analysis

• Conclusion

Problem Statement:

• Current scenario: hardware is becoming increasingly parallel, and parallel applications are increasingly favored.

• Applications containing multiple malleable processes need sophisticated and flexible resource management.

Malleable processes: processes that can vary the number of threads they use while running.

• Before SCAF, no good strategy had been deployed for intelligently allocating hardware threads, which degraded program efficiency.

Motivation behind solving the problem:

• Running multiple parallelized programs on a many-core system quickly leaves the machine oversubscribed when there is no resource utilization strategy.

Oversubscribed: the number of computationally intensive threads exceeds the number of h/w contexts.

• Space sharing of cores is also a problem: even when there are more cores than applications, no criterion exists for dividing them.

Solutions:

> Time-share the h/w resources by context switching (a poor solution). Context switching incurs overhead and long waiting times, reducing system throughput.

> Use synchronization techniques such as spinlocks. Spinlocks require dedicated h/w for reasonable performance.

• A run-time allocation decision is needed, based on observed efficiency, without any paradigm change, program modification, profiling, or recompilation (approximating space sharing => LOAD BALANCING).

Introduction:

• Parallel efficiency of executed code: E = S/P, where executing the code in parallel on P h/w contexts yields a speedup S over serial execution.

MAIN TASK: maximize E for all processes in the system. Maximizing E => maximizing the speedup achieved, S.

• When efficiency is high, the average h/w context contributes a large speedup, so space sharing of threads is approximated (load balancing).

• Even for truly malleable parallel programs, it is hard to change the number of s/w threads at runtime, i.e. load balancing is not easy.

• SCAF (SCheduling and Allocation with Feedback) makes this decision "automatically" for malleable processes.
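The efficiency definition E = S/P can be checked with a few lines of Python. This is only an illustrative sketch: the function name `parallel_efficiency` and the timings are hypothetical and do not come from the paper.

```python
def parallel_efficiency(serial_time: float, parallel_time: float, contexts: int) -> float:
    """Return E = S / P, where S = serial_time / parallel_time."""
    if contexts < 1 or parallel_time <= 0:
        raise ValueError("need at least one context and a positive parallel time")
    speedup = serial_time / parallel_time  # S: speedup over serial execution
    return speedup / contexts              # E: speedup contributed per context


# Hypothetical timings: 120 s serially, 20 s on 8 hardware contexts.
e = parallel_efficiency(120.0, 20.0, 8)
print(e)  # 0.75
```

An efficiency of 1.0 would mean every context contributes a full unit of speedup; values below 1.0 indicate contexts that might do more useful work elsewhere.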

Introduction contd.

SCAF must satisfy the following requirements:

• Total system efficiency is optimized.

• No modification or recompilation is required.

• It is effective in both batch and real-time processing scenarios.

• System load from both kinds of processes (truly malleable and not truly malleable) is taken into account (since SCAF itself only accounts for malleable processes).

SCAF seeks to solve the "performance and administrative problems" related to executing multiple multithreaded applications on multicore machines.

Existing solutions to the problem:

1. Controlling oversubscription by modifying the system's thread package:

Tucker et al. modified a thread package used on the Encore Multimax and created a centralized daemon that suspends running threads when necessary, limiting their number to avoid oversubscription.

Disadvantages:

> No run-time performance measurements are taken into account => not good for malleable processes.

> It relies on a specific parallel paradigm in which the programmer has to create a queue of tasks to be executed by threads => manual intervention is needed.

By modifying only the system's thread package, many programs were supported, but things were not fully automated.

Reference: A. Tucker and A. Gupta, "Process control and scheduling issues for multiprogrammed shared-memory multiprocessors," in Proceedings of the twelfth ACM symposium on Operating systems principles, ser. SOSP '89.

Existing solutions to the problem contd.

2. Load balancing by creating explicit worker threads:

Arora et al. designed a strictly user-level, work-stealing thread scheduler. It creates a certain number of worker threads, which are allowed to "steal" work from one another in order to balance load.

Work-stealing: a model in which the programmer specifies all parallelism declaratively, and worker threads then pick up and execute the resulting work.

Disadvantages:

> Independent worker threads for each process lead to more worker threads than h/w contexts => oversubscription problem.

> It relies on the work-stealing programming model => does not take advantage of malleability.

> From an implementation point of view, it is quite hard to implement.

Reference: N. S. Arora, R. D. Blumofe, and C. G. Plaxton, "Thread scheduling for multiprogrammed multiprocessors," in Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures, ser. SPAA '98.
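To make the work-stealing idea concrete, here is a minimal single-threaded simulation. It is not the non-blocking algorithm of Arora et al.; it only illustrates the general pattern, assumed here, in which each worker takes tasks from one end of its own queue and an idle worker steals from the other end of the fullest queue. The function name `run_work_stealing` is hypothetical.

```python
from collections import deque
from typing import List


def run_work_stealing(queues: List[list]) -> List[int]:
    """Simulate rounds of work stealing; return tasks executed per worker."""
    workers = [deque(q) for q in queues]
    done = [0] * len(workers)
    while any(workers):  # an empty deque is falsy
        for i, own in enumerate(workers):
            if own:
                own.pop()  # owner takes the newest task from its own queue
                done[i] += 1
            else:
                # Idle worker: steal the oldest task from the fullest queue
                # and execute it immediately.
                victim = max(workers, key=len)
                if victim:
                    victim.popleft()
                    done[i] += 1
    return done


# One worker starts with all 8 tasks; the other three start idle.
print(run_work_stealing([list(range(8)), [], [], []]))  # [2, 2, 2, 2]
```

Note that every process running such a scheduler brings its own set of workers, which is how several work-stealing programs together can exceed the number of h/w contexts.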

Existing solutions to the problem contd.

3. Run-time allocation decision based system:

McFarland created a prototype system called "RSM", which includes a programming API. The RSM daemon attempts to make allocation decisions according to run-time observations of the work done by each process.

Processes are given larger allocations if they perform more useful work.

Disadvantages:

> RSM only considers the absolute IPC of each process, while SCAF considers the efficiency observed at run-time.

> Programs need to be recompiled.

Reference: D. J. McFarland, Exploiting Malleable Parallelism on Multicore Systems. Blacksburg, Va: University Libraries, Virginia Polytechnic Institute and State University, 2011.

SCAF vs. other related implementations:

Fig1: Feature comparison of related implementations.

Solution presented in the paper:

The paper presents the SCheduling and Allocation with Feedback (SCAF) system, which has the following characteristics:

• A drop-in runtime solution.

• Supports existing malleable applications.

• Decisions are based on the observed efficiency of processes.

• No paradigm change, no recompilation or profiling of processes.

The SCAF daemon estimates process efficiency purely at runtime using h/w counters. These efficiency estimates are later used in allocation decisions.

How does SCAF do the task?

To account for the efficiency of every process in the system, SCAF makes use of run-time profiling. Recall that parallel efficiency is E = S/P, where S = speedup and P = no. of h/w contexts.

PROBLEM: To calculate S, we must know the serial execution time. However, in a parallel program, serial measurements are not directly available. How can this serial execution time be found at low cost?

SOLUTION: The serial execution time is estimated by dynamically cloning the parallel process into a serial experimental process. The serial experiment runs concurrently with the parallel code for as long as the parallel code executes.

The parallel process runs on N-1 cores, and the serial process runs on 1 core, where N is the number of threads the parallel process has been allocated.

NOTE: This does not give the exact value, but it gives a good estimate of the serial execution time.
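One way to picture how the serial experiment turns into an efficiency estimate is to compare rates of progress. The sketch below assumes progress is measured as instructions retired per second from h/w counters and that the serial and parallel versions do the same total work; both are assumptions made for illustration, and the name `estimate_efficiency` and the sample rates are hypothetical.

```python
def estimate_efficiency(parallel_rate: float, serial_rate: float, threads: int) -> float:
    """Estimate E from progress rates of the parallel section and the serial clone.

    parallel_rate: aggregate instructions/second of the parallel section (assumed metric)
    serial_rate:   instructions/second of the serial experiment (assumed metric)
    threads:       number of threads the parallel section ran on
    """
    if serial_rate <= 0 or threads < 1:
        raise ValueError("need a positive serial rate and at least one thread")
    speedup = parallel_rate / serial_rate  # estimated S
    return speedup / threads               # estimated E = S / P


# Hypothetical: N = 9 allocated, so 8 threads run the parallel section
# while 1 core runs the serial experiment.
print(estimate_efficiency(6.0e9, 1.0e9, 8))  # 0.75
```

Whether the estimate should be normalized by N or by N-1 is not stated in these slides, so the thread count is left as an explicit parameter.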

Sharing Policies used by SCAF:

The following policies are implemented by the SCAF daemon:

• Minimizing the makespan, i.e. minimizing the total amount of time to complete all jobs.

• Equipartitioning, i.e. fair sharing of hardware resources (initially): Pj = N / k, where N = no. of h/w contexts available, k = no. of processes running, and Pj = threads allocated to process j.

• Maximizing the sum speedup, i.e. maximizing the total sum of speedups achieved by the running processes.

• Maximizing the sum speedup based on runtime feedback, i.e. SCAF clients maintain and report a single efficiency estimate per process, so that load balancing and intelligent allocation of resources can be done dynamically.
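Equipartitioning from the policy list above can be sketched as an even split of N contexts among k processes. How a remainder is handled when N is not divisible by k is not stated in the slides; giving the extra contexts to the first processes is an assumption, and the name `equipartition` is hypothetical.

```python
from typing import List


def equipartition(contexts: int, processes: int) -> List[int]:
    """Split `contexts` hardware contexts as evenly as possible among `processes`."""
    if processes < 1:
        raise ValueError("need at least one process")
    base, extra = divmod(contexts, processes)
    # Assumption: the first `extra` processes each receive one additional context.
    return [base + (1 if j < extra else 0) for j in range(processes)]


print(equipartition(16, 2))  # [8, 8]
print(equipartition(16, 3))  # [6, 5, 5]
```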

Efficiency estimate used by SCAF daemon

• The efficiency estimate allows the SCAF daemon to reason about how efficiently each process makes use of more cores relative to other processes.

• By cloning a serial experimental process, SCAF estimates the serial execution time of the program. The daemon uses the resulting efficiency estimate to build a simple speedup model.

In this model, Ej is the reported parallel efficiency from process j, and Pj' is the previous allocation for j.

NOTE: Cj is a constant factor that captures the most recent information about the resource usage of process j, i.e. feedback about process j.

Each round of feedback from a client provides useful information about the space sharing of processes and the distribution of load among hardware contexts.
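The model's equation did not survive in these slides, so the sketch below assumes a logarithmic form, Sj = 1 + Cj * log10(Pj), fitted from one observation (Ej, Pj'). The logarithmic form and the base-10 logarithm are assumptions, chosen because they reproduce, after rounding, the constants quoted in the foo/baz example in this presentation. The name `fit_speedup_constant` is hypothetical.

```python
import math


def fit_speedup_constant(efficiency: float, prev_threads: int) -> float:
    """Fit Cj in the assumed model S = 1 + C * log10(P) from one feedback round."""
    if prev_threads < 2:
        raise ValueError("log10(1) is 0, so at least two threads are needed")
    observed_speedup = efficiency * prev_threads  # S = E * P
    return (observed_speedup - 1.0) / math.log10(prev_threads)


c_foo = fit_speedup_constant(2 / 3, 12)  # foo: efficiency 2/3 on 12 threads
c_baz = fit_speedup_constant(3 / 8, 4)   # baz: efficiency 3/8 on 4 threads
print(round(c_foo, 2), round(c_baz, 2))  # 6.49 0.83
```

Rounded to whole numbers, these values are 6 and 1.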

Working of SCAF daemon:

Fig2: Parallel section with lightweight serial experiment.

Example:

From the given figure:

• No. of hardware contexts = 16

• 2 processes: "foo" and "baz".

• "foo" efficiency: 2/3 on 12 threads.

• "baz" efficiency: 3/8 on 4 threads.

Applying the speedup model, we get Cfoo = 6 and Cbaz = 1, and the new allocations are computed as Pfoo = 14 and Pbaz = 2.

If the resulting feedback indicates a good match with the predicted model, then the same model and solution are maintained and the allocation remains the same.
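The allocations in this example are consistent with dividing the 16 contexts in proportion to the constants. Under an assumed logarithmic speedup model Sj = 1 + Cj * log(Pj), maximizing the sum of speedups subject to the Pj summing to N gives Pj proportional to Cj. The sketch below applies that rule; the model form, the rounding scheme, and the name `allocate_by_constants` are assumptions for illustration, and the constants are assumed positive.

```python
from typing import Dict


def allocate_by_constants(constants: Dict[str, float], contexts: int) -> Dict[str, int]:
    """Divide `contexts` among processes in proportion to their constants Cj."""
    total = sum(constants.values())
    shares = {name: contexts * c / total for name, c in constants.items()}
    alloc = {name: int(share) for name, share in shares.items()}  # round down first
    leftover = contexts - sum(alloc.values())
    # Assumed rounding: leftover contexts go to the largest fractional parts.
    by_remainder = sorted(shares, key=lambda k: shares[k] - alloc[k], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


print(allocate_by_constants({"foo": 6, "baz": 1}, 16))  # {'foo': 14, 'baz': 2}
```

The exact shares are 16 * 6/7 ≈ 13.7 and 16 * 1/7 ≈ 2.3, which round to 14 and 2. The sketch does not enforce a minimum allocation per process.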

Fig3: Runtime feedback loop in SCAF
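The feedback loop can be sketched as repeated rounds of measure, fit, and reallocate, stopping once the allocation no longer changes. Everything here is illustrative: the logarithmic model Sj = 1 + Cj * log10(Pj) is an assumption, the `true_speedup` callables stand in for real client feedback, the constants are assumed positive, and `feedback_loop` is a hypothetical name.

```python
import math
from typing import Callable, Dict


def feedback_loop(true_speedup: Dict[str, Callable[[int], float]],
                  contexts: int, max_rounds: int = 10) -> Dict[str, int]:
    """Repeat measure -> fit -> reallocate until the allocation stops changing."""
    names = list(true_speedup)
    base, extra = divmod(contexts, len(names))
    alloc = {n: base + (1 if i < extra else 0) for i, n in enumerate(names)}  # equal start
    consts = {n: 1.0 for n in names}
    for _ in range(max_rounds):
        for n in names:
            p = alloc[n]
            if p > 1:  # log10(1) == 0, so a single thread gives no information
                observed = true_speedup[n](p)  # stands in for client feedback
                consts[n] = (observed - 1.0) / math.log10(p)
        total = sum(consts.values())
        shares = {n: contexts * consts[n] / total for n in names}
        new = {n: int(shares[n]) for n in names}
        leftover = contexts - sum(new.values())
        ranked = sorted(names, key=lambda k: shares[k] - new[k], reverse=True)
        for n in ranked[:leftover]:
            new[n] += 1
        if min(new.values()) < 1:
            raise ValueError("sketch assumes every process keeps at least one thread")
        if new == alloc:  # feedback matches the model: keep the allocation
            break
        alloc = new
    return alloc


# Hypothetical processes whose true speedup follows the assumed model exactly.
procs = {
    "foo": lambda p: 1.0 + 6.0 * math.log10(p),
    "baz": lambda p: 1.0 + 1.0 * math.log10(p),
}
print(feedback_loop(procs, 16))  # {'foo': 14, 'baz': 2}
```

Starting from an 8/8 split, the first round moves to 14/2, and the second round confirms the same allocation, so the loop stops.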

Performance Analysis of SCAF:

• SCAF was evaluated using the NAS NPB parallel benchmarks. When pairs of benchmarks were run concurrently on an 8-core Xeon system, 70% of the pairs saw improvements averaging 15% in the sum of speedups compared to equipartitioning.

• On a 64-context Sparc T2 processor, 57% of the benchmark pairs saw a similar 15% improvement over equipartitioning.

Performance Analysis of SCAF Contd.

Fig4(a): Results on a dual Intel Xeon E5410 with 8 hardware contexts.

Fig4(b): Results on a Sparc T2 processor with 64 hardware contexts.

Thank You!!!
