
Seminar "Parallel Computing", Summer term 2008

Seminar paper

Parallel Computing (703525)

Optimisation: Operating System Scheduling on multi-core architectures

Course instructor (Lehrveranstaltungsleiter): T. Fahringer

Name: Thomas Zangerl


Abstract

As multi-core architectures begin to emerge in every area of computing, operating system scheduling that takes the peculiarities of such architectures into account will become mandatory. Due to architectural differences from traditional multi-processors, such as shared caches, memory controllers and smaller cache sizes available per computational unit, it does not suffice to simply schedule tasks on multi-core processors in the same way as on SMP systems. Furthermore, current research motivates architectural changes in CPU design, such as multi-core processors with asymmetric core-performance and so-called many-core architectures that integrate up to 100 cores in one package. Such architectures will exhibit a fundamentally different behaviour with regard to shared resource utilization and performance of non-parallelizable code compared to current CPUs. It will be the responsibility of the operating system to spare the programmer as much platform-specific knowledge as possible and optimize overall performance by employing intelligent and configurable scheduling mechanisms.


Contents

Abstract
1 Introduction
1.1 Why use multi-core processors at all?
1.2 What's so different about multi-core scheduling?
2 OS process scheduling: state of the art
2.1 Linux scheduler
2.1.1 The Completely Fair Scheduler
2.1.2 Scheduling domains
2.2 Windows scheduler
2.3 Solaris scheduler
3 Ongoing research on the topic
3.1 Cache-Fairness
3.2 Balancing core assignment
3.3 Performance asymmetry
3.4 Scheduling on many-core architectures
4 Conclusion
5 References


1 Introduction

1.1 Why use multi-core processors at all?

In the last few years, multi-core CPUs have become a standard component in nearly all sorts of computers: not only servers and high-end workstations, but also desktop and laptop PCs for consumers, and even game consoles, nowadays usually come with CPUs with more than one core. This development is not surprising; already in 1965, Gordon Moore predicted that the number of transistors that can be cost-effectively built onto integrated circuits was going to double every year ([1]). In 1975, Moore corrected that assumption to a period of two years; nowadays this period is frequently assumed to be 18 months.

Moore's projection has more or less been accurate up to today, and consumers have gotten used to the constant speedup of computer hardware: buyers expect that a new computer shows a significant speedup over a two-year-old model (even though an increase in transistor density does not always lead to an equal increase in computing speed). For chip manufacturers, however, it has become increasingly difficult to keep up with Moore's law. In order to implement the exponential increase of integrated circuits, the transistor structures have to become steadily smaller. On the one hand, the extra transistors were used for the integration of more and more specialized instruction sets on CISC chips. On the other hand, smaller transistor sizes led to higher clock rates of the CPUs, because due to physical factors the gates in the transistors could perform faster state switches.

However, since electronic activity always produces heat as an unwanted by-product, the more transistors are packed together in a small CPU die area, the higher the resulting heat dissipation per unit area becomes ([2]). With the higher switching frequency, the electronic activity was performed in smaller intervals, and hence more and more heat dissipation emerged. The cooling of the processor components became a crucial factor in design considerations, and it became clear that an increasing clock frequency could no longer serve as the primary source of processor speedup.

Hence, there had to be a paradigm shift in order to still make applications run faster. On the one hand, the amazing abundance of transistors on processor chips was used to increase the cache sizes. This alone, however, would not result in an adequate performance gain, since it only helps memory-intensive applications to a certain degree. In order to effectively counteract the heat problem while making use of the small structures and high number of transistors on a chip, the notion of multi-core processors for consumer PCs was introduced.

Since the CMOS technology met its limits for the further increase of CPU clock frequency, and the number of transistors that could be integrated on a single die allowed for it, the idea emerged that multiple processing cores could be placed on a single processor die. In 2006, Intel released the Core™ microprocessor, a die package with two processor cores with their own level 1 caches and a shared level 2 cache ([3]). Also in 2006, AMD, the second major CPU manufacturer for the consumer market, released the Athlon™ X2, a processor with quite similar architecture to the Core platform, but additionally featuring the concept of also sharing a CPU-integrated memory controller among the cores ([4]).


Both architectures have been improved and sold with a range of consumer desktop and laptop computers, but also servers and workstations, up to today; therefore, the presence of multi-core processors in a large number of today's PCs can be assumed.

1.2 What's so different about multi-core scheduling?

One could assume that the scheduling process on such multi-core processors wouldn't differ much from conventional scheduling: intuitively, the run-queue would just have to be replaced by n run-queues, where n is the number of cores, and processes would simply be scheduled to the currently shortest run-queue (maybe with some additional process-priority treatment).

While that might seem reasonable, there are some properties of current multi-core architectures that speak strongly against such a naïve approach. First, in many multi-core architectures, each core manages its own level 1 cache (Figure 1). By just naïvely rescheduling interrupted processes to a shorter queue which belongs to another core (task migration), parts of the process's cache working set may be unnecessarily lost and the overall performance may slow down. This effect becomes even worse if the underlying architecture is not a multi-core but a NUMA system, where memory access can become very costly if the process is scheduled on the "wrong" node.

Figure 1: Typical multi-core architecture

A second important point is that the performance of the different cores in a multi-core system may be asymmetric ([5]). This effect can emerge for several reasons:


• Design considerations: Many slow cores can be used for increasing the throughput in parallel computation, while a few faster cores contribute to the efficient processing of costly tasks which cannot be parallelized ([6]). Even algorithms that are parallelizable contain parts that have to be executed sequentially, which will benefit from the higher speed of the fastest core. Hence, performance asymmetry has been shown to be a very efficient approach in multi-core architectures ([7]).

• Transistor failures: Some parts of the CPU may get damaged over time and become automatically disabled by the CPU. Since such components may fail in certain cores independently of the other cores, performance asymmetry may arise in symmetric cores over time ([5]).

• Power-saving policies: Different cores may switch to different P- or C-power-states at different times in order to save power. At different P-states, otherwise equal cores show a different clock frequency. If an OS scheduler manages to take this into account for processes not in need of all system resources, the system can remain more energy-efficient over the execution time, while giving away little or no performance at all ([8]).

Hence, performance asymmetry, the fact that various CPU components can be shared among cores, and non-uniform access to computation resources such as memory mandate the design of efficient multi-core scheduling mechanisms or scheduling frameworks at the operating system level.

Multi-core processors have gone mainstream, and while there may be the demand that they are used efficiently in terms of performance, the currently fashionable term Green IT also motivates the energy-efficient use of the CPU cores. Section 2 will explore how far current operating systems have evolved in support of the new architectures.


2 OS process scheduling: state of the art

2.1 Linux scheduler

2.1.1 The Completely Fair Scheduler

The Linux scheduler in versions prior to 2.6.23 performed its tasks with complexity O(1) by basically just using per-CPU run-queues and priority arrays ([9]). Kernel version 2.6.23, which was released on October 9, 2007, introduced the so-called Completely Fair Scheduler (CFS). The change in scheduler was mainly motivated by the failure of the old scheduler to correctly predict whether applications are interactive (I/O-bound) or CPU-intensive ([10]).

Therefore, the new scheduler has completely abandoned the notion of different kinds of processes and treats them all equally. The data structure of a red-black tree is used to align the tasks according to their "right" to use the processor resources for a predefined interval until context switch. The process positioned at the leftmost node in the data structure is entitled most to use the processor resources at the time it occupies that position in the tree. The position of a process in the tree is only dependent on the wait-time of the process in the run-queue (including the time the process is actually waiting for events) and the process priority ([11]). This concept is fairly simple, but works with all kinds of processes, especially interactive ones, since they get a boost just by being credited for their I/O waiting time.

However, the total scheduling complexity is increased to O(log n), where n is the number of processes in the run-queue, since at every context switch the process has to be reinserted into the red-black tree.
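To make the ordering rule concrete, the following C sketch models the CFS idea in very simplified form. It is an illustration, not kernel code: the task names, weights and slice length are invented, and a linear scan stands in for the kernel's red-black tree, which is what actually provides the O(log n) reinsertion mentioned above.

```c
#include <stdio.h>

/* Simplified model of CFS ordering: each task accumulates "virtual
 * runtime"; higher-weight (higher-priority) tasks accumulate it more
 * slowly. The kernel keeps tasks in a red-black tree keyed by this
 * value and always runs the leftmost node; this sketch scans a plain
 * array instead, which is O(n) but shows the same ordering rule. */
struct task {
    const char   *name;
    unsigned long vruntime; /* weighted CPU time consumed so far */
    unsigned int  weight;   /* derived from the nice level */
};

/* Pick the task entitled most to the CPU: smallest virtual runtime
 * (the "leftmost" task in the real tree). */
static struct task *pick_next(struct task *rq, int nr)
{
    struct task *next = &rq[0];
    for (int i = 1; i < nr; i++)
        if (rq[i].vruntime < next->vruntime)
            next = &rq[i];
    return next;
}

/* Charge a slice of real CPU time to a task, scaled by its weight. */
static void account(struct task *t, unsigned long slice_ns)
{
    t->vruntime += slice_ns * 1024 / t->weight;
}

int main(void)
{
    struct task rq[] = {
        { "io-bound",      0, 1024 }, /* just woke up: little vruntime */
        { "cpu-hog",  900000, 1024 },
        { "batch",    500000,  820 },
    };
    for (int round = 0; round < 4; round++) {
        struct task *t = pick_next(rq, 3);
        printf("running %s (vruntime=%lu)\n", t->name, t->vruntime);
        account(t, 300000); /* pretend it ran for 300 microseconds */
    }
    return 0;
}
```

The effect of the weighting is visible in the example: a task that just finished waiting for I/O has accumulated little virtual runtime and is therefore picked immediately, which is exactly the implicit interactivity boost described above.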

The scheduling algorithm itself has not been designed with special consideration of multi-core architectures. When Ingo Molnar, the designer of the scheduler, was asked what the implications for HT/SMP/NUMA architectures would be, he answered that there would inevitably be some effect, and if it is negative, he will fix it. He admits that the fairness approach can result in increased cache-coldness for processes in some situations ([12]).

However, the red-black trees of CFS are managed per run-queue ([13]), which assists in cooperation with the Linux load-balancer.

2.1.2 Scheduling domains

Linux load-balancing takes care of different cache models and computing architectures, but at the moment not necessarily of performance asymmetry. The underlying model of the Linux load balancer is the concept of scheduling domains, which was introduced in kernel version 2.6.7 due to the unsatisfying performance of Linux scheduling on SMP and NUMA systems in prior versions ([14]).

Basically, scheduling domains are hierarchical sets of computation units on which scheduling is possible; the scheduling domain architecture is constructed based on the actual hardware resources of a computing element ([9]). Scheduling domains contain lists with scheduling groups that share common properties.


For example, the way scheduling should be done on two logical processors of an HT system and on two physical processors of an SMP system is different: the HT "cores" share a common cache and memory hierarchy, therefore task migration is not a problem if one of the logical cores becomes idle. However, in an SMP or multi-core system, in which the cache or parts of the cache are administrated by each core separately, migrating tasks with a large working set may become problematic. This applies even more to NUMA machines, where different CPUs may be closer to or more remote from the memory the process is using. Therefore, all these architectures have to be treated differently.

The scheduling domain concept thus introduces scheduling domains, logical unions of computing resources that share common properties and that it is reasonable to treat equally, as well as CPU groups within these domains. Those groups contain hardware-addressable computing resources that are part of the domain, among which the balancer can try to even out the domain load.

Scheduling domains are hierarchically nested: there is a top-level domain containing all other domains of the physical system Linux is running on. Depending on the actual architecture, the sub-domains represent NUMA node groups, physical CPU groups, multi-core groups or SMT groups in a respective hierarchical nesting. This structure is built automatically based on the actual topology of the system, and for reasons of efficiency each CPU keeps a copy of every domain it belongs to. For example, a logical SMT processor that at the same time is a core in a physical multi-core processor on a NUMA node with multiple (SMP) processors would administer a total of four sched_domain structures, one for each level of parallel computing it is involved in ([15]).

Figure 2: Example hierarchy in the Linux scheduling domains
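As an illustration of this hierarchy, the C sketch below models nested domains in the spirit of the kernel's sched_domain structure. All fields and values are invented for the example; the real structure carries much more state (CPU masks, scheduling groups, balancing flags).

```c
#include <stdio.h>

/* Hypothetical miniature of the sched_domain idea: every level of
 * the topology (SMT siblings, cores in a package, packages, NUMA
 * nodes) becomes a domain with its own balancing policy, and each
 * domain points to its parent, so walking upward visits ever larger
 * sets of CPUs. The field values below are invented. */
struct sched_domain {
    const char          *level;            /* e.g. "SMT", "MC", "NUMA" */
    unsigned int         balance_interval; /* ms between balance runs  */
    int                  cheap_migration;  /* caches shared at this level? */
    struct sched_domain *parent;
};

int main(void)
{
    struct sched_domain numa = { "NUMA", 256, 0, NULL  };
    struct sched_domain phys = { "CPU",   64, 0, &numa };
    struct sched_domain mc   = { "MC",    16, 1, &phys };
    struct sched_domain smt  = { "SMT",    2, 1, &mc   };

    /* A logical SMT CPU keeps one domain per level it belongs to;
     * balancing runs often at the bottom and rarely at the top. */
    for (struct sched_domain *d = &smt; d; d = d->parent)
        printf("%-4s: balance every %3u ms, cheap migration: %s\n",
               d->level, d->balance_interval,
               d->cheap_migration ? "yes" : "no");
    return 0;
}
```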

Load-balancing takes place at scheduling domain level between the different groups. Each domain level is sensitive with respect to the constraints set by its properties regarding load balancing: for example, load balancing happens very often between logical simultaneous multithreading cores, but very rarely on the NUMA level, where remote memory access is costly. The scheduling domain for multi-core processors was added with kernel version 2.6.17 ([16]) and especially considers the shared last-level cache that multi-core architectures frequently possess. Hence, on an SMP machine with two multi-core packages, two tasks will be scheduled on different packages if both packages are currently idle, in order to make use of the overall larger cache.

In recent Linux kernels, the multi-core processor scheduling domain also offers support for energy-efficient scheduling, which can be used if e.g. the powersave governor is set in the cpufreq tool. Saving energy can be achieved by changing the P- and the C-states of the cores in the different packages. However, P-state transitions are made by adjusting the voltage and the frequency of a core, and since there is only one voltage regulator per socket on the mainboard, the P-state is dependent on the busiest core. So as long as any core in a package is busy, the P-state will be relatively low, which corresponds to a high frequency and voltage. While the P-states remain relatively fixed, the C-states can be manipulated. Adjusting the C-states means turning off parts of the registers, blocking interrupts to the processor, etc. ([17]), and can be done on each core independently. However, the shared cache features its own C-state regulator and will always stay in the lowest C-state that any of the cores has. Therefore, energy efficiency is often limited to adjusting the C-state of a non-busy core, while leaving the other C-states and the package's P-state low.

Linux scheduling within the multi-core domain with the powersave governor turned on will attempt to schedule multiple tasks on one physical package as long as this is feasible. This way, other multi-core packages will be allowed to transition into higher P- and C-states. The author of [9] claims that generally the performance impact will be relatively low, and that the performance-loss/power-saving trade-off will be rewarding if the energy-efficient scheduling approach is used.
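The two placement policies described above (spreading tasks across packages for cache capacity versus packing them onto one package for power savings) can be contrasted in a toy decision function; the package data, the simple busy-core count and the decision rule are hypothetical simplifications of the kernel's actual balancing logic.

```c
#include <stdio.h>

/* Toy contrast of the two policies: "performance" spreads tasks
 * across packages to gain cache capacity, "powersave" packs them
 * onto an already busy package so idle packages can stay in deep
 * P-/C-states. Package data and the decision rule are invented. */
struct package { const char *name; int busy_cores, total_cores; };

static const struct package *place(const struct package *p, int n,
                                   int powersave)
{
    const struct package *best = NULL;
    for (int i = 0; i < n; i++) {
        if (p[i].busy_cores == p[i].total_cores)
            continue;                       /* no free core here */
        if (!best
            || (powersave ? p[i].busy_cores > best->busy_cores  /* pack   */
                          : p[i].busy_cores < best->busy_cores)) /* spread */
            best = &p[i];
    }
    return best;
}

int main(void)
{
    struct package pkgs[] = { { "pkg0", 1, 2 }, { "pkg1", 0, 2 } };
    printf("performance policy -> %s\n", place(pkgs, 2, 0)->name);
    printf("powersave policy   -> %s\n", place(pkgs, 2, 1)->name);
    return 0;
}
```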

2.2 Windows scheduler

In Windows, scheduling is conducted on threads. The scheduler is priority-based, with priorities ranging from 0 to 31. Timeslices are allocated to threads in a round-robin fashion; these timeslices are assigned to highest-priority threads first, and only if no thread of a given priority is ready to run at a certain time may lower-priority threads receive the timeslice. However, if higher-priority threads become ready to run, the lower-priority threads are preempted.
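As a toy illustration of this mechanism, the sketch below keeps one FIFO ready queue per priority level and always dispatches from the highest non-empty level, rotating within it. Everything here is invented for the example; the real dispatcher additionally handles timeslice expiry, re-enqueueing, affinity and the priority boosts described below.

```c
#include <stdio.h>

/* Illustrative model, not the real Windows dispatcher: one FIFO
 * ready queue per priority level 0..31; the scheduler always serves
 * the highest non-empty level and rotates within it (round-robin). */
#define LEVELS 32
#define QCAP    8

struct ready_queues {
    int q[LEVELS][QCAP];             /* thread ids per priority level */
    int head[LEVELS], count[LEVELS];
};

static void enqueue(struct ready_queues *r, int prio, int tid)
{
    r->q[prio][(r->head[prio] + r->count[prio]++) % QCAP] = tid;
}

/* Return the next thread id to run, or -1 if nothing is ready. */
static int pick_next(struct ready_queues *r)
{
    for (int p = LEVELS - 1; p >= 0; p--) { /* highest priority first */
        if (r->count[p] == 0)
            continue;
        int tid = r->q[p][r->head[p]];
        r->head[p] = (r->head[p] + 1) % QCAP;
        r->count[p]--;
        return tid;
    }
    return -1;
}

int main(void)
{
    static struct ready_queues r;   /* zero-initialized */
    enqueue(&r, 8, 1);              /* normal thread       */
    enqueue(&r, 8, 2);              /* same priority level */
    enqueue(&r, 15, 3);             /* boosted thread      */
    for (int tid; (tid = pick_next(&r)) != -1; )
        printf("dispatch thread %d\n", tid);
    return 0;
}
```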

In addition to the base priority of a thread, Windows dynamically changes the priorities of low-prioritized threads in order to ensure "felt" responsiveness of the operating system. For example, the thread associated with the foreground window on the Windows desktop receives a priority boost. After such a boost, the thread priority gradually decays back to the base priority ([21]).

Scheduling on SMP systems is basically the same, except that Windows keeps the notion of a thread's processor affinity and an ideal processor for a thread. The ideal processor is the processor with, for example, the highest cache locality for a certain thread. However, if the ideal processor is not idle at the time of lookup, the thread may just run on another processor. In [21] and other sources, though, no explicit information is given on scheduling mechanisms specific to multi-core architectures.


2.3 Solaris scheduler

In the Solaris scheduler terminology, processes are called kernel- or user-mode threads, depending on the space in which they run. User threads don't only exist in the user space: whenever a user thread is created, a so-called lightweight process is set up that connects the user thread to a kernel thread. These kernel threads are the objects of scheduling. Solaris 10 offers a number of scheduling classes, which may co-exist on one system. The scheduling classes provide an adaptive model for the specific types of applications which are intended to run on top of the operating system. [18] mentions Solaris 10 scheduling classes for:

• Standard (timeshare) threads, whose priority may be adapted by the scheduler
• Interactive applications (the currently active window in the graphical user interface)
• Fair sharing of system resources (instead of priority-based sharing)
• Fixed-priority threads; the priority of threads scheduled with this scheduler does not vary over the scheduling time
• System (kernel) threads
• Real-time threads, with a fixed priority and time share; threads in this scheduling class may preempt system threads

The scheduling classes for timeshare, fixed-priority and fair sharing are not recommended for simultaneous use on a system, while other combinations of scheduling classes on a set of CPUs are possible.

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt to identify I/O-bound processes and provide them with a priority boost. Threads have a fixed time quantum they may use once they get the context, and they receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context. Fair-share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling. Different processes (actually collections of processes, or, in Solaris 10 terminology, projects) compete for quanta on a computing resource, and their position in that competition depends on how large the value they have been assigned is in relation to the total quanta number on the computing resource.
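The arithmetic of fair-share entitlement can be illustrated with a short example; the project names and share counts are made up, and the real FSS additionally adjusts priorities based on actual historical usage.

```c
#include <stdio.h>

/* Illustrative arithmetic only (names and share counts are made up):
 * under fair-share scheduling, a project's claim on a computing
 * resource is its assigned share count relative to the total shares
 * of all projects competing on that resource. */
struct project { const char *name; unsigned int shares; };

int main(void)
{
    struct project p[] = { { "web", 60 }, { "db", 30 }, { "batch", 10 } };
    unsigned int total = 0;

    for (int i = 0; i < 3; i++)
        total += p[i].shares;
    for (int i = 0; i < 3; i++)
        printf("project %-6s entitled to %5.1f%% of the CPU\n",
               p[i].name, 100.0 * p[i].shares / total);
    return 0;
}
```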

Solaris explicitly deals with the scenario that parts of the processor's resources may be shared, as is likely with typical multi-core processors. There is a kernel abstraction called "processor group" (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket). These groupings can be investigated by the dispatcher, e.g. in order to maintain logical CPU affinity for the purpose of cache-hotness where it is reasonable. Quite similarly to the concept of Linux's scheduling domains, Solaris 10 tries to simultaneously achieve load balancing on multiple levels (for example, if there are physical CPUs with multiple cores and SMT) ([20]).


3 Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics, many of which are orthogonal (e.g. maximizing fairness and throughput). The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling. Sections 3.1 and 3.2 summarize proposals to improve fairness and load-balancing on current multi-core architectures, while sections 3.3 and 3.4 concentrate on approaches for scheduling on promising new computing architectures, such as multi-core processors with asymmetric performance and many-core CPUs.

3.1 Cache-Fairness

Several studies (e.g. [22], [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level 2 cache and thus slow down threads running on the other core that uses the same cache. The situation is unsatisfactory for several reasons: first, it can lead to unpredictable execution times and throughput, and second, scheduling priorities may lose their effectiveness because of threads running on cores with aggressive "co-runners" (i.e. threads running on another core in the same package). Figure 3 shows such a scenario: thread B uses the larger part of the shared cache and may thus negatively influence the cycles per instruction that thread A achieves during its CPU time share.

L2 cache misses are more costly than L1 cache misses, because the latency to memory is higher than to the next cache level; however, it is mostly the L2 cache that is shared among different cores. The authors of [22] try to mitigate the above-mentioned effects by introducing a cache-fairness-aware scheduler for the Solaris 10 operating system.

Figure 3: Unfair cache utilization by thread B

In their scheduling algorithm, the threads on a system are grouped into a best-effort class and a cache-fair class. Best-effort threads are penalized for the sake of performance stability of cache-fair threads if necessary, but not vice versa; however, care is taken that this does not result in inadequate discrimination of best-effort threads. Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners, at the expense of these co-runners if they are best-effort threads. Figure 4 illustrates that process.


Figure 4: Restoring fairness by adjusting timeshares

In order to compute the quantum that the thread is entitled to, the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances, namely the fair L2 cache miss rate, the fair CPI rate and the fair number of instructions (during a quantum). All those values are based on the assumption of fair circumstances, so the difference to the actual values is computed, and a quantum extension is calculated which should serve as "compensation".

Those calculations are done once for new cache-fair class threads; their L2 cache miss rates and other data are measured with different co-runners. Subsequently, the dispatcher periodically selects best-effort threads from which it takes parts of their quanta and assigns them as "compensation" to the cache-fair threads. New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase; instead, whenever new combinations of cache-fair threads with co-runners occur, analytical input is mined from the CPU's hardware counters.
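A hedged sketch of this compensation arithmetic is given below; the identifiers and constants are invented, and the actual algorithm in [22] derives the fair values from an analytical model rather than taking them as inputs.

```c
#include <stdio.h>

/* Sketch of the compensation idea (identifiers and constants are
 * invented): estimate how many instructions the thread would have
 * completed per quantum at its fair CPI, compare with what it
 * actually completed, and convert the shortfall into extra quantum
 * time, to be taken from best-effort threads. */
struct counters {
    double fair_cpi;   /* model estimate under fair cache sharing */
    double actual_cpi; /* measured via hardware counters          */
    double quantum_ns; /* nominal timeslice                       */
    double clock_ghz;
};

static double compensation_ns(const struct counters *c)
{
    double cycles       = c->quantum_ns * c->clock_ghz; /* cycles per quantum */
    double fair_instr   = cycles / c->fair_cpi;
    double actual_instr = cycles / c->actual_cpi;

    if (actual_instr >= fair_instr)
        return 0.0; /* ran at or above its fair rate: no compensation */
    /* time needed to finish the missing instructions at the actual CPI */
    return (fair_instr - actual_instr) * c->actual_cpi / c->clock_ghz;
}

int main(void)
{
    /* a thread suffering from a cache-intensive co-runner */
    struct counters c = { 1.2, 1.8, 10e6, 2.0 };
    printf("extend quantum by %.2f ms\n", compensation_ns(&c) / 1e6);
    return 0;
}
```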

The authors of [22] state that, according to their measurements, the penalties on best-effort threads are low, while the algorithm actually enforces priority better than standard scheduling and improves fairness. These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor. The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package. The execution times under these circumstances are compared with the times of threads running with co-runners with low cache requirements. This comparison shows differences of up to 37% in execution time between the two scenarios on a system with a standard scheduler, while with the cache-fair scheduler the variability decreases to 7%. At the same time, however, measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8% for some threads (while some even experience a speed-up).

3.2 Balancing core assignment

Fedorova et al. ([27]) argue that during operating system scheduling, several aspects have to be considered besides optimal performance. It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (i.e. unpredictability of a task's completion time) and, as a consequence, insufficient priority enforcement. It could be part of the operating system scheduler's job to ensure that the jobs are evenly assigned to cores. In the approach described in [27], this is done by using a self-tuning scheduling algorithm based on a per-core benefit function. This benefit function is based on three input components:


1) The normalized core preference of a thread, which is based on the instructions per cycle that a thread j can achieve on a certain core i (IPC_ij), normalized by max_k(IPC_kj) (where k is an arbitrary CPU/core)

2) The cache-affinity, a value which is 1 if the thread j was scheduled on core i within a tuneable time period and 0 otherwise

3) The average cache investment of a thread on a core, which is determined by inspecting the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j. For each core, a benefit value B_i^0 is computed that represents the case of no thread migrations taking place. For each thread k on a core, an updated benefit value B_i^(-k) is computed for the hypothetical scenario that thread k is migrated away from core i. Of course the benefit will increase, since fewer threads are executed on the core. But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core, which would influence the benefit value of the target core. Therefore, the updated benefit value of any other system core j, to which the thread in question would be migrated, also has to be computed; it is called B_j^(+k). The hypothetical migration of thread k from core i to core j becomes reality if

B_i^(-k) + B_j^(+k) > B_i^0 + B_j^0 + a*D_CAB + b*D_RTF

where D_CAB represents a system-wide balance constraint, while D_RTF ensures per-job response-time fairness (i.e. the slowdown that results for the thread in question from the thread migration does not exceed some maximum value).
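Reduced to code, the decision rule looks like the following sketch; the benefit values are taken as given here, whereas in [27] they result from the three components listed above, and the weights a and b are unspecified placeholders.

```c
#include <stdio.h>

/* Sketch of the migration test from [27]: the benefit computations
 * themselves are stubbed out and only the decision rule is shown;
 * a, b, D_cab and D_rtf are placeholders for the balance and
 * response-time-fairness constraints described above. */
struct migration_case {
    double Bi_0, Bj_0; /* benefits of cores i and j without migration */
    double Bi_minus_k; /* benefit of core i with thread k taken away  */
    double Bj_plus_k;  /* benefit of core j with thread k added       */
};

static int should_migrate(const struct migration_case *m,
                          double a, double D_cab, double b, double D_rtf)
{
    return m->Bi_minus_k + m->Bj_plus_k
         > m->Bi_0 + m->Bj_0 + a * D_cab + b * D_rtf;
}

int main(void)
{
    struct migration_case m = { 0.40, 0.35, 0.55, 0.45 };
    printf("migrate thread k: %s\n",
           should_migrate(&m, 1.0, 0.10, 1.0, 0.05) ? "yes" : "no");
    return 0;
}
```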

These two constants, together with the criteria included in the benefit function itself (most notably cache-affinity), should help to ensure that the self-tuning fulfils the three goals of optimal performance, core-assignment balance and response-time fairness.

However, the authors have not yet actually implemented the scheduling modification in Linux, and hence the statements on its effectiveness remain somewhat theoretical.

3.3 Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU. For example, it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized, while multiple slower cores come into play when parallel code parts are executed. By keeping the cores for parallel execution slower than the core(s) for serial execution, die area and cost can be saved ([24]), while power consumption may be reduced. [25] closely analyzes the impact of performance asymmetry on the average speedup of an application as the number of cores increases. The paper concludes that "[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)". Hence, performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures. [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications, but don't scale much worse with highly parallel programs (see Figure 5).


Figure 5: Comparison of speedup with SMP and AMP using highly parallel programs (left), moderately parallel programs (middle) and highly sequential programs (right). (Image source: http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm)

Apart from explicit design, performance asymmetry can also occur in initially symmetric multi-core processors, through either power-saving mechanisms (increasing the C-state) or failure of transistors that leads to disabling of parts of the core's components.

Problems arise from the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly. Therefore, the operating system scheduler should support processor performance asymmetry, which is currently not the case for most schedulers. However, it would be imaginable to see this as a Linux scheduling domain in the future.

[5] describes AMPS, an approach to how a performance-asymmetry-aware scheduler for Linux could look. Basically, the scheduler consists of three components: an asymmetry-specific load balancing, a faster-core-first scheduler, and a migration mechanism specifically for NUMA machines that will not be covered in detail here. The scheduling mechanism tries to achieve better performance, fairness (with respect to the thread priorities) and repeatability of performance results. In order to conduct load-balancing, the core performance is assessed in a first step: AMPS measures core performance at boot time using benchmarks, and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1. The scaled load of a core is then the run-queue length of the core divided by its performance quantifier. If this scaled load is within a maximum and a minimum threshold, the system is considered to be load-balanced; otherwise, threads will be migrated. Through the scaling factor, faster cores will receive more workload than slower ones.
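A minimal sketch of the scaled-load test follows, with invented run-queue lengths, performance quantifiers and thresholds:

```c
#include <stdio.h>

/* Sketch of the AMPS-style scaled load ([5]): each core's load is
 * its run-queue length divided by a performance quantifier measured
 * at boot (slowest core = 1). The thresholds are invented. */
struct core {
    const char *name;
    int         rq_len; /* run-queue length */
    double      perf;   /* performance quantifier, >= 1.0 */
};

static double scaled_load(const struct core *c)
{
    return c->rq_len / c->perf;
}

int main(void)
{
    struct core cores[] = { { "fast0", 6, 2.0 },
                            { "slow0", 1, 1.0 },
                            { "slow1", 5, 1.0 } };
    const double lo = 2.0, hi = 4.0; /* assumed balance thresholds */

    for (int i = 0; i < 3; i++) {
        double l = scaled_load(&cores[i]);
        printf("%s: scaled load %.1f -> %s\n", cores[i].name, l,
               l < lo ? "underloaded"
                      : l > hi ? "overloaded, migrate away" : "balanced");
    }
    return 0;
}
```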

Besides the load-balancing, cores with lower scaled load are preferred in thread placement. Whenever new threads are scheduled, the new scaled load that would apply if the thread was scheduled to one specific core is computed. The thread is then scheduled to the core with the least new scaled load; in case of a tie, faster cores are preferred. Additionally, the load-balancer may migrate threads even if the source core of the migration becomes idle by doing so. AMPS only migrates threads if the new scaled load of the destination core does not exceed the scaled load on the source core. This way, the benefits of available fast cores can be fully exploited while not overburdening them. It could be expected that frequent core migration results in performance loss through cache misses (e.g. in the level 2 cache); however, experimental results in [5] reveal no excessive performance loss from the fact that task migration among cores occurs more often than with standard schedulers.

Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on an AMP with two fast and six slow cores. (Picture taken from [5])

Instead, performance on SMP systems measurably improves (see Figure 6; standard scheduler speed would be 1, the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).

3.4 Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds), in order to prolong the time the industry can keep up with Moore's prediction. Such CPU architectures are going to require novel approaches in thread scheduling, since the available shared memory per core is potentially smaller, while main memory access time increases and single-threaded performance decreases. Even more so than with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for them; and the wide range of application scenarios that have to be considered for such chips. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28]. The runtime environment was built from scratch and is independent of operating system kernels; hence, the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors. For programmability, it provides high-level transactions (instead of locks), and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour).

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters, P, Q and T, which respectively denote the number of (logical) processors, task queues and threading abstractions. P and Q change the scheduling behaviour from strict affinity to work stealing; T can be used to specify different behaviour for different threads. This way, the concept of scheduler domains can be realized.

It is notable that the scheduling system does not use pre-emption; instead, threads are expected to yield at certain times. This design choice has been motivated by the authors' belief that pre-emption stems from the need for time-sharing expensive and fast resources, which will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. Barriers are designed to avoid busy waiting: for example, a thread yields once it has reached the barrier, but won't be re-scheduled until all other threads have reached the barrier. With a pre-emptive scheduling mechanism, the thread would receive the context from time to time just to check whether other threads have reached the barrier; with the integrated barrier support based on the co-operative scheduling approach used in McRT, this won't happen.
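A toy model of such a co-operative barrier is sketched below; it only shows the yield-and-skip principle and leaves out the dispatch loop and all concurrency concerns of a real runtime like McRT.

```c
#include <stdio.h>

/* Toy model of a co-operative (non-pre-emptive) barrier: a thread
 * that reaches the barrier marks itself blocked and yields; the
 * runtime simply never selects blocked threads until the last
 * arrival releases them. No real dispatching is modelled. */
enum state { RUNNABLE, BLOCKED };

struct mc_barrier { int waiting, parties; };
struct uthread    { int id; enum state st; };

/* Returns 1 if the caller was the last to arrive (releasing all). */
static int barrier_arrive(struct mc_barrier *b, struct uthread *t,
                          struct uthread *all, int n)
{
    if (++b->waiting < b->parties) {
        t->st = BLOCKED; /* yield; the scheduler skips us from now on */
        return 0;
    }
    for (int i = 0; i < n; i++) /* last arrival wakes everyone up */
        all[i].st = RUNNABLE;
    b->waiting = 0;
    return 1;
}

int main(void)
{
    struct uthread th[3] = { { 0, RUNNABLE }, { 1, RUNNABLE }, { 2, RUNNABLE } };
    struct mc_barrier b = { 0, 3 };

    for (int i = 0; i < 3; i++)
        printf("thread %d %s\n", th[i].id,
               barrier_arrive(&b, &th[i], th, 3) ? "released the barrier"
                                                 : "blocked at the barrier");
    return 0;
}
```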

The client-side adaptor, e.g. for OpenMP, promises to directly translate many OpenMP constructs to the McRT API.

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with a 32 KByte L1 cache shared among 4 simultaneously multithreaded cores, which form one physical core; 32 such physical cores share a 2 MByte L2 cache and a 4 MByte off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and which resource it is waiting for. Only if the resource becomes available will the CPU execute the thread. Hence, threads that are waiting on barriers and locks don't consume system resources. It is important to consider that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT.


Figure 7: Speedup of XviD encoding in the McRT framework, compared to single-core performance. (Picture from [28])

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, and the curve labelled "linear" models the ideal speedup).

However, the encoding process was explicitly tweaked for the many-core scenario: under the condition that only very little fast memory exists per logical core, parallelization wasn't conducted at frame level; instead, single frames were partitioned into parts, which were encoded in parallel using OpenMP.

The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].


4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moore's law can stay valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduler domains of Linux and Solaris add some urgently needed OS flexibility, because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be quite easily integrated into the load-balancing hierarchy. However, it must be argued that in the future, probably more will have to be done in the scheduling algorithms themselves; scheduler domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers won't scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems, and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and possibly the varying, or simply bad, single-threaded performance that is going to be a characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were "designed to run on 1, 2, maybe 4 processors" (see http://www.news.com/8301-10784_3-9722524-7.html), but wouldn't scale beyond that. He seems to be perfectly right when he says that future versions of Windows will have to be fundamentally redesigned.

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. The advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels, and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents the way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer/user with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance. [29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT; experimental assessments which could testify to its performance, although planned, haven't been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Achieving fairness and repeatability on today's available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 3.1 and 3.2. The first approach is justified by a number of experimental results that show that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement. Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

The algorithm mentioned in section 3.2 hasn't been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description, one can infer that it takes a lot of computation and thread migration to ensure the load balance, and it would be interesting to see the overhead from the computations and the cache misses imposed by the mechanism on the system. Without any experimental data, those figures are hard to assess.


5 References

[1] G. E. Moore, "Cramming more components onto integrated circuits", Electronics, Volume 38, Number 8, 1965

[2] "Why Parallel Processing?", http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2+Why+Parallel+Processing.htm

[3] O. Wechsler, "Inside Intel® Core™ Microarchitecture", Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf

[4] "Key Architectural Features: AMD Athlon™ X2 Dual-Core Processors", http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html

[5] Li et al., "Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures", in International Conference on High Performance Computing, Networking, Storage and Analysis, 2007

[6] Balakrishnan et al., "The Impact of Performance Asymmetry in Emerging Multicore Architectures", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506-517, June 2005

[7] M. Annavaram, E. Grochowski and J. Shen, "Mitigating Amdahl's law through EPI throttling", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298-309, June 2005

[8] V. Pallipadi and S. B. Siddha, "Processor Power Management features and Process Scheduler: Do we need to tie them together?", in LinuxConf Europe 2007

[9] S. B. Siddha, "Multi-core and Linux Kernel", http://oss.intel.com/pdf/mclinux.pdf

[10] http://kernelnewbies.org/Linux_2_6_23

[11] http://lwn.net/Articles/230574

[12] J. Andrews, "Linux: The Completely Fair Scheduler", http://kerneltrap.org/node/8059

[13] A. Kumar, "Multiprocessing with the Completely Fair Scheduler", http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

[14] Kernel 2.6.7 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7

[15] Scheduling domains, http://lwn.net/Articles/80911

[16] Kernel 2.6.17 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17


[17] T. Kidd, "C-states, C-states and even more C-states", http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Scheduling, http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html

[19] Solaris manpage FSS(7), http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS

[20] Eric Saxe, "CMT and Solaris Performance", http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris

[21] MSDN section on Windows scheduling, http://msdn.microsoft.com/en-us/library/ms685096%28VS.85%29.aspx

[22] A. Fedorova, M. Seltzer and M. D. Smith, "Cache-Fair Thread Scheduling for Multicore Processors", Technical Report TR-17-06, Harvard University, Oct. 2006

[23] S. Kim, D. Chandra and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004

[24] D. Menasce and V. Almeida, "Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures", in Proceedings of the 4th International Conference on Supercomputing, June 1990

[25] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era", in IEEE Computer, 2008

[26] B. Crepps, "Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing", Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

[27] A. Fedorova, D. Vengerov and D. Doucette, "Operating System Scheduling on Heterogeneous Core Systems", to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007

[28] B. Saha et al., "Enabling Scalability and Performance in a Large Scale CMP Environment", in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007

[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson, "Thread Scheduling for Multi-Core Platforms", in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 2: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

Abstract

As multi-core architectures begin to emerge in every area of computing operating system scheduling that takes the peculiarities of such architectures into account will become mandatory Due to architectural differences to traditional multi-processors such as shared caches memory controllers and smaller cache sizes available per computational unit it does not suffice to simply schedule tasks on multi-core processors in the same way as on SMP systemsFurthermore current research motivates architectural changes in CPU design such as multi-core processors with asymmetric core-performance and so called many-core architectures that integrate up to 100 cores in one package Such architectures will exhibit a fundamentally different behaviour with regard to shared resource utilization and performance of non-parallelizable code compared to current CPUs It will be the responsibility of the operating system to spare the programmer as much platform-specific knowledge as possible and optimize overall performance by employing intelligent and configurable scheduling mechanisms

2

Seminar ldquoParallel Computingldquo Summer term 2008

Abstract21 Introduction 4

11 Why use multi-core processors at all 4 12 Whatrsquos so different about multi-core scheduling 5

2 OS process scheduling state of the art 7 21 Linux scheduler 7

211 The Completely Fair Scheduler 7 212 Scheduling domains 7

22 Windows scheduler 9 23 Solaris scheduler 10

3 Ongoing research on the topic 11 31 Cache-Fairness 11 32 Balancing core assignment 12 33 Performance asymmetry 13 34 Scheduling on many-core architectures 15

4 Conclusion 18 5 References 20

3

Seminar ldquoParallel Computingldquo Summer term 2008

1 Introduction

11Why use multi-core processors at all

In the last few years multi-core CPUs have become a standard component in nearly all sorts of computers ndash not only servers and high-end workstations but also desktop and laptop PCs for consumers and even game consoles nowadays usually come with CPUs with more than one core This development is not surprising already in 1965 Gordon Moore predicted that the number of transistors that can be cost-effectively built onto integrated circuits were going to double every year ([1]) In 1975 Moore corrected that assumption to a period of two years nowadays this period is frequently assumed to be 18 months)

Moores projection has more or less been accurate up to today and consumers have gotten used to the constant speedup of computer hardware ndash it is expected by buyers that a new computer shows a significant speedup to a two years older model (even though an increase in transistor density does not always lead to an equal in increase in computing speed) For chip manufacturers it has become increasingly difficult however to keep up with Moores law In order to implement the exponential increase of integrated circuits the transistor structures have to become steadily smaller On the one hand the extra transistors were used for the integration of more and more specialized instruction sets on CISC chips On the other hand smaller transistor sizes led to higher clock rates of the CPUs because due to physical factors the gates in the transistors could perform faster state switches

However since electronic activity always produces heat as an unwanted by-product the more transistors are packed together in a small CPU die area the higher the resulting heat dissipation per unit area becomes ([2]) With the higher switching frequency the electronic activity was performed in smaller intervals and hence more and more heat-dissipation emerged The cooling of the processor components became more and more a crucial factor in design considerations and it became clear that the increasing clock frequency could no longer serve as the primary reason for processor speedup

Hence there had to be a shift in paradigm in order to still make applications run faster on the one hand the amazing abundance of transistors on processor chips was used to increase the cache sizes This alone however would not result in an adequate performance gain since it only helps memory intensive applications to a certain degree In order to effectively counteract the heat problem while making use of the small structures and high number of transistors on a chip the notion of multi-core processors for consumer PCs was introduced

Since the CMOS technology met its limits for the further increase of CPU clock frequency and the number of transistors that could be integrated on a single die allowed for it the idea emerged that multiple processing cores could be placed in a single processor dieIn 2006 Intel released the Coretrade microprocessor a die package with two processor cores with their own level 1 caches and a shared level 2 cache ([3]) Also in 2006 AMD the second major CPU manufacturer for the consumer market released the Athlontrade X2 a processor with quite similar architecture to the Core platform but additionally featuring the concept of also sharing a CPU-integrated memory-controller among the cores ([[4])

4

Seminar ldquoParallel Computingldquo Summer term 2008

Both architectures have been improved and sold with a range of consumer desktop and laptop computers - but also servers and workstations - up to today therefore the presence of multi-core processors in a large number of todays PCs can be assumed

12Whatrsquos so different about multi-core scheduling

One could assume that the scheduling process on such multi-core processors wouldnrsquot differ much from conventional scheduling ndash intuitively the run-queue would just have to be replaced by n run-queues where n is the number of cores and processes would simply be scheduled to the currently shortest run-queue (with some additional process-priority treatment maybe)

While that might seem reasonable there are some properties of current multi-core architectures that speak strongly against such a naiumlve approach First in many multi core architectures each core manages its own level 1 cache (Figure 1) By just naiumlvely rescheduling interrupted processes to a shorter queue which belongs to another core (task migration) parts of the processes cache working set may become unnecessarily lost and the overall performance may slow down This effect becomes even worse if the underlying architecture is not a multi-core but a NUMA system where memory access can become very costly if the process is scheduled on the ldquowrongrdquo node

Figure 1 Typical multi-core architecture

A second important point is that the performance of different cores in a multi-core system may be asymmetric regarding the performance of the different cores ([5]) This effect can emerge due to several reasons

5

Seminar ldquoParallel Computingldquo Summer term 2008

bull Design considerations Many slow cores can be used for increasing the throughput in parallel computation while a few faster cores contribute to the efficient processing of costly tasks which can not be parallelized ([6]) Even algorithms that are parallelizable contain parts that have to be executed sequentially which will benefit from the higher speed of the fastest core Hence performance-asymmetry has been shown to be a very efficient approach in multi-core architectures ([7])

bull Transistor failures Some parts of the CPU may get damaged over time and become automatically disabled by the CPU Since such components may fail in certain cores independently from the other cores performance asymmetry may arise in symmetric cores over time ([5])

bull Power-saving policies Different cores may switch to different P- or C-power-states at different times in order to save power At different P-states equal cores show a different clock-frequency If an OS scheduler manages to take this into account for processes not in need of all system resources the system can remain more energy-efficient over the execution time while giving away only little or no performance at all ([8])

Hence performance asymmetry the fact that various CPU components can be shared among cores and non-uniform access to computation resources such as memory mandate the design of efficient multi-core scheduling mechanisms or scheduling frameworks at the operating system level

Multi-core processors have gone mainstream and while there may be the demand that they are efficiently used in terms of performance the currently fashionable term Green-IT also motivates the energy-efficient use of the CPU coresSection 2 will explore how far current operating systems have evolved in support of the new architectures

6

Seminar ldquoParallel Computingldquo Summer term 2008

2 OS process scheduling: state of the art

2.1 Linux scheduler

2.1.1 The Completely Fair Scheduler

The Linux scheduler in versions prior to 2.6.23 performed its tasks in O(1) complexity, basically just using per-CPU run-queues and priority arrays ([9]). Kernel version 2.6.23, which was released on October 9, 2007, introduced the so-called completely fair scheduler (CFS). The change of scheduler was mainly motivated by the failure of the old scheduler to correctly predict whether applications are interactive (I/O-bound) or CPU-intensive ([10]).

Therefore, the new scheduler has completely abandoned the notion of different kinds of processes and treats them all equally. The data structure of a red-black tree is used to align the tasks according to their "right" to use the processor resources for a predefined interval until the next context switch. The process positioned at the leftmost node in the data structure is the one most entitled to use the processor resources at the time it occupies that position in the tree. The position of a process in the tree is only dependent on the wait time of the process in the run-queue (including the time the process is actually waiting for events) and on the process priority ([11]). This concept is fairly simple, but it works with all kinds of processes, especially interactive ones, since they get a boost just by being credited for their I/O waiting time.

However, the total scheduling complexity is increased to O(log n), where n is the number of processes in the run-queue, since at every context switch the process has to be reinserted into the red-black tree.
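The ordering idea behind CFS can be illustrated with a hedged sketch (the entitlement key is simplified here to a priority-weighted wait time, and a linear scan over an array stands in for the kernel's red-black tree and its O(log n) operations):

    #include <stddef.h>

    /* Hypothetical task record following the description above,
     * not the kernel's actual task_struct. */
    struct task {
        long long wait_ns; /* time waited in the run-queue, incl. event waits */
        int weight;        /* derived from the process priority */
    };

    /* The "leftmost" task is the one most entitled to the CPU; it is
     * approximated here as the largest priority-weighted wait time. */
    size_t pick_leftmost(const struct task *t, size_t n)
    {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (t[i].wait_ns * t[i].weight >
                t[best].wait_ns * t[best].weight)
                best = i;
        return best;
    }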

The scheduling algorithm itself has not been designed with special consideration for multi-core architectures. When Ingo Molnar, the designer of the scheduler, was asked what the implications for HT/SMP/NUMA architectures would be, he answered that there would inevitably be some effect, and that if it is negative, he will fix it. He admits that the fairness approach can result in increased cache-coldness for processes in some situations ([12]).

However, the red-black trees of CFS are managed per run-queue ([13]), which assists in cooperation with the Linux load-balancer.

2.1.2 Scheduling domains

Linux load balancing takes care of different cache models and computing architectures, but at the moment not necessarily of performance asymmetry. The underlying model of the Linux load balancer is the concept of scheduling domains, which was introduced in kernel version 2.6.7 due to the unsatisfying performance of Linux scheduling on SMP and NUMA systems in prior versions ([14]).

Basically, scheduling domains are hierarchical sets of computation units on which scheduling is possible; the scheduling domain architecture is constructed based on the actual hardware resources of a computing element ([9]). Scheduling domains contain lists of scheduling groups that share common properties.


For example, the way scheduling should be done on the two logical processors of an HT system differs from scheduling on two physical processors of an SMP system: the HT "cores" share a common cache and memory hierarchy, so task migration is not a problem if one of the logical cores becomes idle. However, in an SMP or multi-core system in which the cache, or parts of the cache, is administrated by each core separately, migrating tasks with a large working set may become problematic. This applies even more to NUMA machines, where different CPUs may be closer to or more remote from the memory the process is using. Therefore, all these architectures have to be treated differently.

The scheduling domain concept introduces scheduling domains – logical unions of computing resources that share common properties and that it is reasonable to treat equally – and CPU groups within these domains. Those groups contain hardware-addressable computing resources that are part of the domain, among which the balancer can try to even out the domain load.

Scheduling domains are hierarchically nested – there is a top-level domain containing all other domains of the physical system Linux is running on. Depending on the actual architecture, the sub-domains represent NUMA node groups, physical CPU groups, multi-core groups or SMT groups in a respective hierarchical nesting. This structure is built automatically based on the actual topology of the system, and for reasons of efficiency each CPU keeps a copy of every domain it belongs to. For example, a logical SMT processor that at the same time is a core in a physical multi-core processor on a NUMA node with multiple (SMP) processors would administer a total of 4 sched_domain structures, one for each level of parallel computing it is involved in ([15]).

Figure 2: Example hierarchy of the Linux scheduling domains
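A strongly simplified model of such a nested hierarchy might look as follows (field names are invented for this sketch and do not mirror the kernel's actual sched_domain and sched_group definitions):

    /* One balancing level (SMT, multi-core, SMP or NUMA). */
    struct sg_sketch {
        unsigned long cpu_mask;  /* CPUs belonging to this group */
        struct sg_sketch *next;  /* circular list of sibling groups */
    };

    struct sd_sketch {
        struct sd_sketch *parent;     /* NULL at the top-level domain */
        struct sg_sketch *groups;     /* groups to balance load between */
        unsigned balance_interval_ms; /* grows towards the NUMA level */
    };

A CPU involved in SMT, multi-core, SMP and NUMA parallelism would then hold a chain of four such levels, with the balancing interval growing towards the top.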

Load balancing takes place at the scheduling-domain level, between the different groups. Each domain level is sensitive to the constraints set by its properties regarding load balancing. For example, load balancing happens very often between logical simultaneous multithreading cores, but very rarely on the NUMA level, where remote memory access is costly.

The scheduling domain for multi-core processors was added with kernel version 2.6.17 ([16]) and especially considers the shared last-level cache that multi-core architectures frequently possess. Hence, on an SMP machine with two multi-core packages, two tasks will be scheduled on different packages if both packages are currently idle, in order to make use of the overall larger cache.

In recent Linux kernels, the multi-core processor scheduling domain also offers support for energy-efficient scheduling, which can be used if e.g. the powersave governor is set in the cpufreq tool. Saving energy can be achieved by changing the P- and the C-states of the cores in the different packages. However, P-state transitions are made by adjusting the voltage and the frequency of a core, and since there is only one voltage regulator per socket on the mainboard, the P-state is dependent on the busiest core. So as long as any core in a package is busy, the P-state will be relatively low, which corresponds to a high frequency and voltage. While the P-states remain relatively fixed, the C-states can be manipulated. Adjusting the C-states means turning off parts of the registers, blocking interrupts to the processor, etc. ([17]), and can be done on each core independently. However, the shared cache features its own C-state regulator and will always stay in the lowest C-state that any of the cores has. Therefore, energy efficiency is often limited to adjusting the C-state of a non-busy core while leaving the other C-states and the package's P-state low.

Linux scheduling within the multi-core domain with the powersave governor turned on will attempt to schedule multiple tasks on one physical package as long as that is feasible. This way, other multi-core packages are allowed to transition into higher P- and C-states. The author of [9] claims that generally the performance impact will be relatively low and that the performance loss/power saving trade-off will be rewarding if the energy-efficient scheduling approach is used.
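The placement side of this policy can be sketched roughly as follows (a hedged illustration of the consolidation idea only; types, names and the selection logic are invented here and are far simpler than the kernel's):

    #include <stdbool.h>
    #include <stddef.h>

    struct pkg { size_t ncores; const bool *core_busy; };

    static bool pkg_busy(const struct pkg *p)
    {
        for (size_t i = 0; i < p->ncores; i++)
            if (p->core_busy[i]) return true;
        return false;
    }

    /* Prefer an idle core in an already-busy package, so that fully
     * idle packages may stay in deep C-states and low-power P-states. */
    long pick_package_powersave(const struct pkg *pkgs, size_t n)
    {
        long fallback = -1;
        for (size_t i = 0; i < n; i++) {
            bool has_idle_core = false;
            for (size_t c = 0; c < pkgs[i].ncores; c++)
                if (!pkgs[i].core_busy[c]) { has_idle_core = true; break; }
            if (!has_idle_core) continue;            /* package is full  */
            if (pkg_busy(&pkgs[i])) return (long)i;  /* consolidate here */
            if (fallback < 0) fallback = (long)i;    /* idle package     */
        }
        return fallback;
    }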

2.2 Windows scheduler

In Windows, scheduling is conducted on threads. The scheduler is priority-based, with priorities ranging from 0 to 31. Timeslices are allocated to threads in a round-robin fashion; these timeslices are assigned to the highest-priority threads first, and only if no thread of a given priority is ready to run at a certain time may lower-priority threads receive the timeslice. However, if higher-priority threads become ready to run, the lower-priority threads are preempted.

In addition to the base priority of a thread, Windows dynamically changes the priorities of low-prioritized threads in order to ensure "felt" responsiveness of the operating system. For example, the thread associated with the foreground window on the Windows desktop receives a priority boost. After such a boost, the thread priority gradually decays back to the base priority ([21]).
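The boost-and-decay behaviour can be sketched as follows (constants and names are illustrative only, not Windows' actual values or APIs):

    struct win_thread {
        int base_priority;    /* 0..31, fixed */
        int current_priority; /* >= base_priority after a boost */
    };

    /* Temporary boost, e.g. when the thread's window is brought to
     * the foreground. */
    void boost(struct win_thread *t, int amount)
    {
        t->current_priority = t->base_priority + amount;
        if (t->current_priority > 31)
            t->current_priority = 31;
    }

    /* After each quantum, a boosted priority decays one level until
     * it reaches the base priority again. */
    void decay_one_quantum(struct win_thread *t)
    {
        if (t->current_priority > t->base_priority)
            t->current_priority--;
    }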

Scheduling on SMP systems is basically the same, except that Windows keeps the notion of a thread's processor affinity and of an ideal processor for a thread. The ideal processor is the processor with, for example, the highest cache locality for a certain thread. However, if the ideal processor is not idle at the time of lookup, the thread may just run on another processor. In [21] and other sources, no explicit information is given on scheduling mechanisms specific to multi-core architectures, though.


2.3 Solaris scheduler

In the Solaris scheduler terminology, processes are called kernel- or user-mode threads, depending on the space in which they run. User threads don't only exist in the user space – whenever a user thread is created, a so-called lightweight process is set up that connects the user thread to a kernel thread. These kernel threads are the objects of scheduling.

Solaris 10 offers a number of scheduling classes, which may co-exist on one system. The scheduling classes provide an adaptive model for the specific types of applications which are intended to run on top of the operating system. [18] mentions Solaris 10 scheduling classes for:

• Standard (timeshare) threads, whose priority may be adapted by the scheduler
• Interactive applications (the currently active window in the graphical user interface)
• Fair sharing of system resources (instead of priority-based sharing)
• Fixed-priority threads; the priority of threads scheduled with this scheduler does not vary over the scheduling time
• System (kernel) threads
• Real-time threads, with a fixed priority and time share; threads in this scheduling class may preempt system threads

The scheduling classes for timeshare, fixed-priority and fair sharing are not recommended for simultaneous use on a system, while other combinations of scheduling classes on a set of CPUs are possible.

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt to identify I/O-bound processes and provide them with a priority boost. Threads have a fixed time quantum they may use once they get the context, and they receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context. Fair-share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling. Different processes (actually collections of processes, or, in Solaris 10 terminology, projects) compete for quanta on a computing resource, and their position in that competition depends on how large their assigned share value is in relation to the total number of quanta on the computing resource.
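The proportionality underlying fair-share scheduling is easy to state (a sketch with invented names, not a Solaris API):

    #include <stdio.h>

    /* A project's entitled fraction of the CPU quanta is its share
     * count relative to all shares active on the resource. */
    double project_cpu_fraction(unsigned my_shares, unsigned total_shares)
    {
        return total_shares ? (double)my_shares / total_shares : 0.0;
    }

    int main(void)
    {
        /* Two projects with 30 and 10 shares: entitled to 75% and
         * 25% of the quanta, respectively. */
        printf("%.2f\n", project_cpu_fraction(30, 40));
        printf("%.2f\n", project_cpu_fraction(10, 40));
        return 0;
    }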

Solaris explicitly deals with the scenario that parts of the processor's resources may be shared, as is likely with typical multi-core processors. There is a kernel abstraction called "processor group" (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket). These groupings can be investigated by the dispatcher, e.g. in order to maintain logical CPU affinity for the purpose of cache-hotness where it is reasonable. Quite similarly to the concept of Linux's scheduling domains, Solaris 10 tries to achieve load balancing simultaneously on multiple levels (for example if there are physical CPUs with multiple cores and SMT) ([20]).


3 Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics, many of which are orthogonal (e.g. maximizing fairness and throughput). The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling. Sections 3.1 and 3.2 summarize proposals to improve fairness and load balancing on current multi-core architectures, while sections 3.3 and 3.4 concentrate on approaches for scheduling on promising new computing architectures, such as multi-core processors with asymmetric performance and many-core CPUs.

3.1 Cache-Fairness

Several studies (e.g. [22], [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level 2 cache and thus slow down threads running on the other core that uses the same cache. The situation is unsatisfactory for several reasons: first, it can lead to unpredictable execution times and throughput, and second, scheduling priorities may lose their effectiveness because of threads running on cores with aggressive "co-runners" (i.e. threads running on another core in the same package). Figure 3 shows such a scenario: thread B uses the larger part of the shared cache and thus may negatively influence the cycles per instruction that thread A achieves during its CPU time share.

L2 cache misses are more costly than L1 cache misses, because the latency to main memory is higher than to the next cache level; however, it is mostly the L2 cache that is shared among different cores. The authors of [22] try to mitigate the above-mentioned effects by introducing a cache-fairness-aware scheduler for the Solaris 10 operating system.

Figure 3: Unfair cache utilization by thread B

In their scheduling algorithm, the threads on a system are grouped into a best-effort class and a cache-fair class. Best-effort threads are penalized for the sake of performance stability of cache-fair threads if necessary, but not vice versa. However, care is taken that this does not result in inadequate discrimination of best-effort threads. Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners, at the expense of these co-runners if they are best-effort threads. Figure 4 illustrates that process.


Figure 4: Restoring fairness by adjusting timeshares

In order to compute the quantum that a thread is entitled to, the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances, namely the fair L2 cache miss rate, the fair CPI rate and the fair number of instructions (during a quantum). All those values are based on the assumption of fair circumstances, so the difference to the actual values is computed, and a quantum extension is calculated which should serve as "compensation".

Those calculations are done once for new cache-fair class threads – their L2 cache miss rates and other data are measured with different co-runners. Subsequently, the dispatcher periodically selects best-effort threads from which it takes parts of their quanta and assigns them as "compensation" to the cache-fair threads. New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase; instead, whenever new combinations of cache-fair threads with co-runners occur, analytical input is mined from the CPU's hardware counters.
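The compensation step could look roughly like this (a hedged sketch: the fair CPI is assumed to come from the analytical model and the actual CPI from the hardware counters; the function name and the exact conversion are invented for illustration):

    /* Extra cycles a cache-fair thread should receive so that it
     * completes as many instructions as it would have completed
     * under fair cache sharing. */
    long long quantum_extension_cycles(long long quantum_cycles,
                                       double fair_cpi,   /* modelled */
                                       double actual_cpi) /* measured */
    {
        if (actual_cpi <= fair_cpi)
            return 0; /* ran at least as fast as the fair baseline */
        double fair_instr   = quantum_cycles / fair_cpi;
        double actual_instr = quantum_cycles / actual_cpi;
        /* run the missed instructions at the thread's actual rate */
        return (long long)((fair_instr - actual_instr) * actual_cpi);
    }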

The authors of [22] state that, according to their measurements, the penalties on best-effort threads are low, while the algorithm actually enforces priority better than standard scheduling and improves fairness. These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor. The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package. The execution times under these circumstances are compared with the times of threads running with co-runners with low cache requirements. This comparison shows differences of up to 37% in execution time between the two scenarios on a system with a standard scheduler; when using the cache-fair scheduler, the variability decreases to 7%. At the same time, however, measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8% for some threads (while some even experience a speed-up).

3.2 Balancing core assignment

Fedorova et al. ([27]) argue that during operating system scheduling several aspects have to be considered besides optimal performance. It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (i.e. unpredictability of a task's completion time) and, as a consequence, insufficient priority enforcement. It could be part of the operating system scheduler's job to ensure that the jobs are evenly assigned to cores. In the approach described in [27], this is done by using a self-tuning scheduling algorithm based on a per-core benefit function. This benefit function is based on three input components:


1) The normalized core preference of a thread, which is based on the instructions per cycle that a thread j can achieve on a certain core i (IPC_ij), normalized by max_k(IPC_kj) (where k is an arbitrary CPU/core)

2) The cache affinity, a value which is 1 if the thread j was scheduled on core i within a tuneable time period and 0 otherwise

3) The average cache investment of a thread on a core, which is determined by inspecting the hardware cache-miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j. For each core, a benefit value B_{i,0} is computed that represents the case of no thread migrations taking place. For each thread k on a core, an updated benefit value B_{i,-k} is computed for the hypothetical scenario that the thread is migrated onto another core. Of course the benefit will increase, since fewer threads are executed on the core. But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core, which would influence the benefit value of the target core. Therefore, the updated benefit value B_{j,+k} of any other system core j, to which the thread in question would be migrated, has to be computed as well. The hypothetical migration of thread k from core i to core j becomes reality if

    B_{i,-k} + B_{j,+k} > B_{i,0} + B_{j,0} + a * D_CAB + b * D_RTF

where D_CAB represents a system-wide balance constraint, while D_RTF ensures a per-job response-time fairness (i.e. the slowdown that results for the thread in question from the thread migration does not exceed some maximum value).

These two constants, together with the criteria included in the benefit function itself (most notably cache affinity), should help to ensure that the self-tuning fulfils the three goals of optimal performance, core-assignment balance and response-time fairness.
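Once all benefit values are known, the migration test itself is a one-liner (a sketch in the notation used above; computing the benefit values from the IPC, affinity and cache-investment terms is the actual work and is not shown here):

    #include <stdbool.h>

    bool should_migrate(double B_i_minus_k, /* source core without thread k */
                        double B_j_plus_k,  /* target core with thread k    */
                        double B_i_0, double B_j_0,
                        double a, double D_CAB, /* balance constraint       */
                        double b, double D_RTF) /* response-time fairness   */
    {
        return B_i_minus_k + B_j_plus_k >
               B_i_0 + B_j_0 + a * D_CAB + b * D_RTF;
    }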

However, the authors have not yet actually implemented the scheduling modification in Linux, and hence the statements on its effectiveness remain somewhat theoretical.

3.3 Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU. For example, it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized, while multiple slower cores come into play when parallel code parts are executed. By keeping the cores for parallel execution slower than the core(s) for serial execution, die area and cost can be saved ([24]) while power consumption may be reduced. [25] closely analyzes the impact of performance asymmetry on the average speedup of an application as the number of cores increases. The paper concludes that "[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)". Hence, performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures. [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications, but don't scale much worse with highly parallel programs (see Figure 5).


Figure 5: Comparison of speedup with SMP and AMP using highly parallel programs (left), moderately parallel programs (middle) and highly sequential programs (right). (Image source: http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm)

Apart from explicit design, performance asymmetry can also occur in initially symmetric multi-core processors through power-saving mechanisms (increasing the C-state) or through failures of transistors that lead to the disabling of parts of the core's components.

Problems arise from the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly. Therefore, the operating system scheduler should support processor performance asymmetry, which is currently not the case for most schedulers. However, it would be imaginable to see this as a Linux scheduling domain in the future.

[5] describes AMPS, an approach to how a performance-asymmetry-aware scheduler for Linux could look. Basically, the scheduler consists of three components: an asymmetry-specific load-balancing mechanism, a faster-core-first scheduler, and a migration mechanism specifically for NUMA machines that will not be covered in detail here. The scheduling mechanism tries to achieve better performance, fairness (with respect to the thread priorities) and repeatability of performance results.

In order to conduct load balancing, the core performance is assessed in a first step. AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1. The scaled load of a core is then the run-queue length of the core divided by its performance quantifier. If this scaled load is within a maximum and a minimum threshold, the system is considered load-balanced; otherwise threads will be migrated. Because of the scaling factor, faster cores will receive more workload than slower ones.

Besides the load balancing, cores with lower scaled load are preferred in thread placement. Whenever new threads are scheduled, the new scaled load that would apply if the thread were scheduled to one specific core is computed. The thread is then scheduled to the core with the least new scaled load; in case of a tie, faster cores are preferred. Additionally, the load balancer may migrate threads even if the source core of the migration can become idle by doing so. AMPS only migrates threads if the new scaled load of the destination core does not exceed the scaled load on the source core. This way, the benefits of available fast cores can be fully exploited while not overburdening them. It could be expected that frequent core migration results in performance loss through cache misses (e.g. in the level 1 cache); however, experimental results in [5] reveal no excessive performance loss from the fact that task migration among cores occurs more often than with standard schedulers.
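The placement rule can be sketched compactly (a hedged sketch: the scaled-load definition and the tie-breaking follow the description above, while the types and names are invented here):

    #include <stddef.h>

    struct amp_core {
        double perf;     /* 1.0 for the slowest core, >1.0 for faster ones */
        unsigned rq_len; /* current run-queue length */
    };

    static double scaled_load(const struct amp_core *c, unsigned extra)
    {
        return (c->rq_len + extra) / c->perf;
    }

    /* Place a new thread on the core with the least resulting scaled
     * load; on an exact tie, prefer the faster core. */
    size_t place_thread(const struct amp_core *cores, size_t n)
    {
        size_t best = 0;
        double best_load = scaled_load(&cores[0], 1);
        for (size_t i = 1; i < n; i++) {
            double l = scaled_load(&cores[i], 1);
            if (l < best_load ||
                (l == best_load && cores[i].perf > cores[best].perf)) {
                best = i;
                best_load = l;
            }
        }
        return best;
    }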

Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on an AMP system with two fast and six slow cores (picture taken from [5])

Instead, performance on SMP systems measurably improves (see Figure 6; standard scheduler speed would be 1, the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).

3.4 Scheduling on many-core architectures

Current research points to the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds), in order to prolong the time the industry can keep up with Moore's prediction. Such CPU architectures are going to require novel approaches to thread scheduling, since the available shared memory per core is potentially smaller, while main-memory access time increases and single-threaded performance decreases. Even more so than with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for them; and the wide range of application scenarios that have to be considered for such chips. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28]. The runtime environment was built from scratch and is independent of operating system kernels – hence the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors. For programmability it provides high-level transactions (instead of locks), and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour).

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters, P, Q and T, which respectively denote the number of (logical) processors, task queues and threading abstractions. P and Q change the scheduling behaviour from strict affinity to work stealing; T can be used to specify different behaviour for different threads. This way, the concept of scheduler domains can be realized.

It is notable that the scheduling system does not use pre-emption; instead, threads are expected to yield at certain times. This design choice has been motivated by the authors' belief that pre-emption stems from the need of time-sharing expensive and fast resources, which will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. Barriers are designed to avoid busy waiting – for example, a thread yields once it has reached the barrier, but won't be re-scheduled until all other threads have reached the barrier. With a pre-emptive scheduling mechanism, the thread would receive the context from time to time just to check whether the other threads have reached the barrier – with the integrated barrier support based on the co-operative scheduling approach used in McRT, this won't happen.
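A user-level approximation of such a co-operative barrier could look like this (a hedged sketch: in McRT the runtime would simply not re-queue a blocked thread until the barrier opens, whereas this portable version has the waiting thread yield in a loop via POSIX sched_yield()):

    #include <sched.h>
    #include <stdatomic.h>

    struct coop_barrier {
        atomic_int arrived;    /* threads that have reached the barrier */
        int total;             /* number of participating threads */
        atomic_int generation; /* bumped when the barrier opens */
    };

    void coop_barrier_wait(struct coop_barrier *b)
    {
        int gen = atomic_load(&b->generation);
        if (atomic_fetch_add(&b->arrived, 1) + 1 == b->total) {
            atomic_store(&b->arrived, 0);        /* re-arm for next round */
            atomic_fetch_add(&b->generation, 1); /* release the others */
        } else {
            while (atomic_load(&b->generation) == gen)
                sched_yield(); /* give the core away instead of spinning */
        }
    }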

The client-side adaptor, e.g. for OpenMP, promises to directly translate many OpenMP constructs to the McRT API.

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with 32 Kbyte L1 cache shared among 4 simultaneously multithreaded cores, which together form one physical core; 32 such physical cores share 2 Mbyte L2 cache and 4 Mbyte off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and to name the resource it is waiting for; only when the resource becomes available will the CPU execute the thread. Hence, threads that are waiting at barriers and locks don't consume system resources. It is important to keep in mind that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT.


Figure 7: Speedup of XviD encoding in the McRT framework, compared to single-core performance (picture taken from [28])

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, and the curve labelled "linear" models the ideal speedup).

However, the encoding process was explicitly tweaked for the many-core scenario – under the condition that only very little fast memory exists per logical core, parallelization wasn't conducted at the frame level; instead, single frames were partitioned into parts, which were encoded in parallel using OpenMP.

The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].


4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help prolong the time horizon within which Moore's law can stay valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduler domains of Linux and Solaris add some urgently needed OS flexibility – computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be integrated quite easily into the load-balancing hierarchy. However, it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves; scheduler domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers won't scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were "designed to run on 1, 2, maybe 4 processors" (see http://www.news.com/8301-10784_3-9722524-7.html) but wouldn't scale beyond. He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned.

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. The advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents the way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer/user with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance. [29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT; experimental assessments which could testify to its performance, although planned, haven't been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Achieving fairness and repeatability on today's available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 3.1 and 3.2. The first approach is justified by a number of experimental results which show that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement. Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

The algorithm mentioned in section 3.2 hasn't been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description, one can gather that it takes a lot of computations and thread migrations to ensure the load balance, and it would be interesting to see the overhead that these computations and the cache misses imposed by the mechanism put on the system. Without any experimental data, those figures are hard to assess.


5 References

[1] G. E. Moore, "Cramming more components onto integrated circuits", Electronics, Volume 38, Number 8, 1965

[2] "Why Parallel Processing?", http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2+Why+Parallel+Processing.htm

[3] O. Wechsler, "Inside Intel® Core™ Microarchitecture", Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf

[4] "Key Architectural Features: AMD Athlon™ X2 Dual-Core Processors", http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html

[5] Li et al., "Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures", in International Conference on High Performance Computing, Networking, Storage and Analysis, 2007

[6] Balakrishnan et al., "The Impact of Performance Asymmetry in Emerging Multicore Architectures", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506–517, June 2005

[7] M. Annavaram, E. Grochowski and J. Shen, "Mitigating Amdahl's law through EPI throttling", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298–309, June 2005

[8] V. Pallipadi, S. B. Siddha, "Processor Power Management features and Process Scheduler: Do we need to tie them together?", in LinuxConf Europe 2007

[9] S. B. Siddha, "Multi-core and Linux Kernel", http://oss.intel.com/pdf/mclinux.pdf

[10] http://kernelnewbies.org/Linux_2_6_23

[11] http://lwn.net/Articles/230574

[12] J. Andrews, "Linux: The Completely Fair Scheduler", http://kerneltrap.org/node/8059

[13] A. Kumar, "Multiprocessing with the Completely Fair Scheduler", http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

[14] Kernel 2.6.7 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7

[15] Scheduling domains, http://lwn.net/Articles/80911

[16] Kernel 2.6.17 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17

[17] T. Kidd, "C-states, C-states and even more C-states", http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Scheduling, http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html

[19] Solaris manpage FSS(7), http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS

[20] Eric Saxe, "CMT and Solaris Performance", http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris

[21] MSDN section on Windows scheduling, http://msdn.microsoft.com/en-us/library/ms685096(VS.85).aspx

[22] A. Fedorova, M. Seltzer and M. D. Smith, "Cache-Fair Thread Scheduling for Multicore Processors", Technical Report TR-17-06, Harvard University, Oct 2006

[23] S. Kim, D. Chandra and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004

[24] D. Menasce and V. Almeida, "Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures", in Proceedings of the 4th International Conference on Supercomputing, June 1990

[25] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era", in IEEE Computer, 2008

[26] B. Crepps, "Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing", Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

[27] A. Fedorova, D. Vengerov and D. Doucette, "Operating System Scheduling on Heterogeneous Core Systems", to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007

[28] B. Saha et al., "Enabling Scalability and Performance in a Large Scale CMP Environment", in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007

[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson, "Thread Scheduling for Multi-Core Platforms", in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 3: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

Abstract21 Introduction 4

11 Why use multi-core processors at all 4 12 Whatrsquos so different about multi-core scheduling 5

2 OS process scheduling state of the art 7 21 Linux scheduler 7

211 The Completely Fair Scheduler 7 212 Scheduling domains 7

22 Windows scheduler 9 23 Solaris scheduler 10

3 Ongoing research on the topic 11 31 Cache-Fairness 11 32 Balancing core assignment 12 33 Performance asymmetry 13 34 Scheduling on many-core architectures 15

4 Conclusion 18 5 References 20

3

Seminar ldquoParallel Computingldquo Summer term 2008

1 Introduction

11Why use multi-core processors at all

In the last few years multi-core CPUs have become a standard component in nearly all sorts of computers ndash not only servers and high-end workstations but also desktop and laptop PCs for consumers and even game consoles nowadays usually come with CPUs with more than one core This development is not surprising already in 1965 Gordon Moore predicted that the number of transistors that can be cost-effectively built onto integrated circuits were going to double every year ([1]) In 1975 Moore corrected that assumption to a period of two years nowadays this period is frequently assumed to be 18 months)

Moores projection has more or less been accurate up to today and consumers have gotten used to the constant speedup of computer hardware ndash it is expected by buyers that a new computer shows a significant speedup to a two years older model (even though an increase in transistor density does not always lead to an equal in increase in computing speed) For chip manufacturers it has become increasingly difficult however to keep up with Moores law In order to implement the exponential increase of integrated circuits the transistor structures have to become steadily smaller On the one hand the extra transistors were used for the integration of more and more specialized instruction sets on CISC chips On the other hand smaller transistor sizes led to higher clock rates of the CPUs because due to physical factors the gates in the transistors could perform faster state switches

However since electronic activity always produces heat as an unwanted by-product the more transistors are packed together in a small CPU die area the higher the resulting heat dissipation per unit area becomes ([2]) With the higher switching frequency the electronic activity was performed in smaller intervals and hence more and more heat-dissipation emerged The cooling of the processor components became more and more a crucial factor in design considerations and it became clear that the increasing clock frequency could no longer serve as the primary reason for processor speedup

Hence there had to be a shift in paradigm in order to still make applications run faster on the one hand the amazing abundance of transistors on processor chips was used to increase the cache sizes This alone however would not result in an adequate performance gain since it only helps memory intensive applications to a certain degree In order to effectively counteract the heat problem while making use of the small structures and high number of transistors on a chip the notion of multi-core processors for consumer PCs was introduced

Since the CMOS technology met its limits for the further increase of CPU clock frequency and the number of transistors that could be integrated on a single die allowed for it the idea emerged that multiple processing cores could be placed in a single processor dieIn 2006 Intel released the Coretrade microprocessor a die package with two processor cores with their own level 1 caches and a shared level 2 cache ([3]) Also in 2006 AMD the second major CPU manufacturer for the consumer market released the Athlontrade X2 a processor with quite similar architecture to the Core platform but additionally featuring the concept of also sharing a CPU-integrated memory-controller among the cores ([[4])

4

Seminar ldquoParallel Computingldquo Summer term 2008

Both architectures have been improved and sold with a range of consumer desktop and laptop computers - but also servers and workstations - up to today therefore the presence of multi-core processors in a large number of todays PCs can be assumed

12Whatrsquos so different about multi-core scheduling

One could assume that the scheduling process on such multi-core processors wouldnrsquot differ much from conventional scheduling ndash intuitively the run-queue would just have to be replaced by n run-queues where n is the number of cores and processes would simply be scheduled to the currently shortest run-queue (with some additional process-priority treatment maybe)

While that might seem reasonable there are some properties of current multi-core architectures that speak strongly against such a naiumlve approach First in many multi core architectures each core manages its own level 1 cache (Figure 1) By just naiumlvely rescheduling interrupted processes to a shorter queue which belongs to another core (task migration) parts of the processes cache working set may become unnecessarily lost and the overall performance may slow down This effect becomes even worse if the underlying architecture is not a multi-core but a NUMA system where memory access can become very costly if the process is scheduled on the ldquowrongrdquo node

Figure 1 Typical multi-core architecture

A second important point is that the performance of different cores in a multi-core system may be asymmetric regarding the performance of the different cores ([5]) This effect can emerge due to several reasons

5

Seminar ldquoParallel Computingldquo Summer term 2008

bull Design considerations Many slow cores can be used for increasing the throughput in parallel computation while a few faster cores contribute to the efficient processing of costly tasks which can not be parallelized ([6]) Even algorithms that are parallelizable contain parts that have to be executed sequentially which will benefit from the higher speed of the fastest core Hence performance-asymmetry has been shown to be a very efficient approach in multi-core architectures ([7])

bull Transistor failures Some parts of the CPU may get damaged over time and become automatically disabled by the CPU Since such components may fail in certain cores independently from the other cores performance asymmetry may arise in symmetric cores over time ([5])

bull Power-saving policies Different cores may switch to different P- or C-power-states at different times in order to save power At different P-states equal cores show a different clock-frequency If an OS scheduler manages to take this into account for processes not in need of all system resources the system can remain more energy-efficient over the execution time while giving away only little or no performance at all ([8])

Hence performance asymmetry the fact that various CPU components can be shared among cores and non-uniform access to computation resources such as memory mandate the design of efficient multi-core scheduling mechanisms or scheduling frameworks at the operating system level

Multi-core processors have gone mainstream and while there may be the demand that they are efficiently used in terms of performance the currently fashionable term Green-IT also motivates the energy-efficient use of the CPU coresSection 2 will explore how far current operating systems have evolved in support of the new architectures

6

Seminar ldquoParallel Computingldquo Summer term 2008

2 OS process scheduling state of the art

21 Linux scheduler

211The Completely Fair Scheduler

The Linux scheduler in versions prior to 2623 performed its tasks in complexity O(1) by basically just using per-CPU run-queues and priority arrays ([9]) Kernel version 2623 which was released on October 9 2007 introduced the so-called completely fair scheduler (CFS) The change in scheduler was mainly motivated by the failure of the old scheduler to correctly predict whether applications are interactive (IO-bound) or CPU-intensive ([10])

Therefore the new scheduler has completely abandoned the notion of different kinds of processes and treats them all equally The data-structure of a red-black tree is used to align the tasks according to their ldquorightrdquo to use the processor resources for a predefined interval until context-switch The process positioned at the leftmost node in the data structure is entitled most to use the processor resources at the time it occupies that position in the tree The position of a process in the tree is only dependent on the wait-time of the process in the run-queue (including the time the process is actually waiting for events) and the process priority ([11]) This concept is fairly simple but works with all kinds of processes especially interactive ones since they get a boost just by getting account for their IO-waiting time

However the total scheduling complexity is increased to O(log n) where n is the number of processes in the run-queue since at every context-switch the process has to be reinserted into the red-black tree

The scheduling algorithm itself has not been designed in special consideration of multi-core architectures When Ingo Molnar the designer of the scheduler was asked what the implications on HTSMPNUMA architectures would be he answered that there would inevitably be some effect and if it is negative he will fix it He admits that the fairness-approach can result in increased cache-coldness for processes in some situations ([12])

However the red-black trees of CFS are managed per runqueue ([13]) which assists in cooperation with the Linux load-balancer

212Scheduling domains

Linux load-balancing takes care of different cache models and computing architectures but at the moment not necessarily of performance asymmetry The underlying model of the Linux load balancer is the concept of scheduling domains which was introduced in Kernel version 267 due to the unsatisfying performance of Linux scheduling on SMP and NUMA systems in prior versions ([14])

Basically scheduling domains are hierarchical sets of computation units on which scheduling is possible the scheduling domain architecture is constructed based on the actual hardware resources of a computing element ([9]) Scheduling domains contain lists with scheduling groups that share common properties

7

Seminar ldquoParallel Computingldquo Summer term 2008

For example the way scheduling should be done on two logical processors of a HT-systems and two physical processors of a SMP system is different the HT ldquocoresrdquo share a common cache and memory hierarchy therefore task migration is not a problem if one of the logical cores becomes idle However in a SMP or multi-core system in which the cache or parts of the cache is administrated by each core separately migrating tasks with a large working set may become problematicThis applies even more to NUMA machine where different CPU may be closer or more remote to the memory the process is using Therefore all this architectures have to be treated differently

The scheduling domain concept introduces scheduling domains a logical union of computing resources that share common properties with whom it is reasonable to treat them equally and CPU groups within these domains Those groups contain hardware-addressable computing resources that are part of the domain on which the balancer can try to even the domain load out

Scheduling domains are hierarchically nested ndash there is a top-level domain containing all other domains of the physical system the Linux is running on Depending on the actual architecture the sub-domains represent NUMA node groups physical CPU groups multi-core groups or SMT groups in a respective hierarchical nesting This structure is built automatically based on the actual topology of the system and for reasons of efficiency each CPU keeps a copy of every domain it belongs to For example a logical SMT processor that at the same time is a core in a physical multi-core processor on a NUMA node with multiple (SMP) processors would totally administer 4 sched_domain structures one for each level of parallel computing it is involved in ([15])

Figure 2 Example hierarchy in the Linux scheduling domains

Load-balancing takes place at scheduling domain level between the different groups Each domain level is sensitive with respect to the constraints set by its properties regarding load balancing For example load balancing happens very often between logical simultaneous

8

Seminar ldquoParallel Computingldquo Summer term 2008

multithreading cores but very rarely on the NUMA level where remote memory access is costlyThe scheduling domain for multi-core processors was added with Kernel version 2617 ([16]) and especially considers the shared last level cache that multi-core architectures frequently possess Hence on a SMP machine with two multi-core packages two tasks will be scheduled on different packages if both packages are currently idle in order to make use of the overall larger cache

In recent Linux-kernels the multi-core processor scheduling domain also offers support for energy-efficient scheduling which can be used if eg the powersave governor is set in the cpufreq tool Saving energy can be achieved by changing the P- and the C-states of the cores in the different packages However P-states are transitions are made by adjusting the voltage and the frequency of a core and since there is only one voltage regulator per socket on the mainboard the P-state is dependent on the busiest core So as long as any core in a package is busy the P-state will be relatively low which corresponds to a high frequency and voltage While the P-states remain relatively fixed the C-states can be manipulated Adjusting the C-states means turning off parts of the registers blocking interrupts to the processor etc ([17]) and can be done on each core independently However the shared cache features its own C-state regulator and will always stay in the lowest C-state that any of the cores has Therefore energy-efficiency is often limited to adjusting the C-state of a non-busy core while leaving other C-states and the packagesrsquo P-state low

Linux scheduling within the multi-core domain with the powersave-governor turned on will attempt to schedule multiple tasks on one physical package as long as it is feasible This way other multi-core packages will be allowed to transition into higher P- and C-states The author of [9] claims that generally the performance impact will be relatively low and that the performance losspower saving trade-off will be rewarding if the energy-efficient scheduling approach is used

22 Windows scheduler

In Windows scheduling is conducted on threads The scheduler is priority-based with priorities ranging from 0 to 31 Timeslices are allocated to threads in a round-robin fashion these timeslices are assigned to highest priority threads first and only if know thread of a given priority is ready to run at a certain time lower priority threads may receive the timeslice However if higher-priority threads become ready to run the lower priority threads are preempted

In addition to the base priority of a thread Windows dynamically changes the priorities of low-prioritized threads in order to ensure ldquofeltrdquo responsiveness of the operating system For example the thread associated with the foreground window on the Windows desktop receives a priority boost After such a boost the thread-priority gradually decays back to the base priority ([21])

Scheduling on SMP-systems is basically the same except that Windows keeps the notion of a threadrsquos processor affinity and an ideal processor for a thread The ideal processor is the processor with for example the highest cache-locality for a certain thread However if the ideal processor is not idle at the time of lookup the thread may just run on another processorIn [21] and other sources no explicit information is given on scheduling mechanisms especially specific to multi-core architectures though

9

Seminar ldquoParallel Computingldquo Summer term 2008

23Solaris scheduler

In the Solaris scheduler terminology processes are called kernel- or user-mode threads dependant on the space in which they run User threads donrsquot only exist in the user space ndash whenever a user thread is created a so-called lightweight process is set up that connects the user thread to a kernel thread These kernel threads are object to schedulingSolaris 10 offers a number of scheduling classes which may co-exist on one system The scheduling classes provide an adaptive model for the specific types of applications which are intended to run on top of the operating system ([18]) mentions Solaris 10 scheduling classes for

bull Standard (timeshare) threads whose priority may be adapted by the schedulerbull Interactive applications (the currently active window in the graphical user interface)bull Fair sharing of system resources (instead of priority-based sharing)bull Fixed-priority threads The priority of threads scheduled with this scheduler does not

vary over the scheduling timebull System (kernel) threadsbull Real-time threads with a fixed priority and time share Threads in this scheduling

class may preempt system threads

The scheduling classes for timeshare fixed-priority and fair sharing are not recommended for simultaneous use on a system while other combinations of scheduling classes on a set of CPUs are possible

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt of trying to identify IO bound processes and providing them with a priority boost Threads have a fixed time quantum they may use once they get the context and receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context Fair share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling Different processes (actually collection of processes or in Solaris 10 terminology projects) compete for quanta on a computing resource and their position in that competition depends on how large the value they have been assigned is in relation to the total quanta number on the computing resource

Solaris explicitly deals with the scenario that parts of the processorrsquos resources may be shared as it is likely with typical multi-core processors There is a kernel abstraction called ldquoprocessor grouprdquo (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket) These groupings can be investigated by the dispatcher eg in order to maintain logical CPU affinity for the purpose of cache-hotness where it is reasonable Quite similar to the concept of Linuxrsquos scheduling domains Solaris 10 tries to simultaneously achieve load balancing on multiple levels (for example if there are physical CPUs with multiple cores and SMT) ([20])

10

Seminar ldquoParallel Computingldquo Summer term 2008

3 Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics, many of which are orthogonal (e.g. maximizing fairness and throughput). The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling. Sections 3.1 and 3.2 summarize proposals to improve fairness and load-balancing on current multi-core architectures, while sections 3.3 and 3.4 concentrate on approaches for scheduling on promising new computing architectures such as multi-core processors with asymmetric performance and many-core CPUs.

3.1 Cache-Fairness

Several studies (e.g. [22], [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level 2 cache and thus slow down threads running on the other core that uses the same cache. The situation is unsatisfactory for several reasons: First, it can lead to unpredictable execution times and throughput, and second, scheduling priorities may lose their effectiveness because of threads running on cores with aggressive "co-runners" (i.e. threads running on another core in the same package). Figure 3 shows such a scenario: Thread B uses the larger part of the shared cache and thus may negatively influence the cycles per instruction that thread A achieves during its CPU time share.

L2 cache misses are more costly than L1 cache misses, because a miss in the L2 cache requires an access to main memory, whose latency is much higher than that of the next cache level. However, it is mostly the L2 cache that is shared among different cores. The authors of [22] try to mitigate the above-mentioned effects by introducing a cache-fairness aware scheduler for the Solaris 10 operating system.

Figure 3: Unfair cache utilization by thread B

In their scheduling algorithm, the threads on a system are grouped into a best-effort class and a cache-fair class. Best-effort threads are penalized for the sake of performance stability of cache-fair threads if necessary, but not vice versa. However, care is taken that this does not result in inadequate discrimination of best-effort threads. Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners, at the expense of these co-runners if they are best-effort threads. Figure 4 illustrates that process.


Figure 4: Restoring fairness by adjusting timeshares

In order to compute the quantum that a thread is entitled to, the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances, namely the fair L2 cache miss rate, the fair CPI rate and the fair number of instructions (during a quantum). All those values are based on the assumption of fair circumstances, so the difference to the actual values is computed and a quantum extension is calculated, which should serve as "compensation".

Those calculations are done once for new cache-fair class threads – their L2 cache miss rates and other data are measured with different co-runners. Subsequently, the dispatcher periodically selects best-effort threads from which it takes parts of their quanta and assigns them as "compensation" to the cache-fair threads. New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase; instead, whenever new combinations of cache-fair threads with co-runners occur, analytical input is mined from the CPU's hardware counters.
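
The core of the compensation step is a proportionality argument: to let a thread retire the number of instructions it would have retired at its fair CPI, its quantum is stretched by the ratio of actual to fair CPI. A minimal sketch (the direct ratio formula is an assumption for illustration; [22] derives the fair values from an analytical model):

    def adjusted_quantum(base_quantum_ms, fair_cpi, actual_cpi):
        # stretch the quantum so the thread retires roughly the instruction
        # count it would have retired at its fair CPI
        if actual_cpi <= fair_cpi:
            return base_quantum_ms        # not suffering - no compensation
        return base_quantum_ms * actual_cpi / fair_cpi

    # a cache-fair thread whose CPI degraded from a fair 1.2 to 1.5 next to
    # an aggressive co-runner receives a proportionally longer time share:
    print(adjusted_quantum(10.0, fair_cpi=1.2, actual_cpi=1.5))   # -> 12.5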

The authors of [22] state that according to their measurements the penalties on best-effort threads are low, while the algorithm actually enforces priorities better than standard scheduling and improves fairness. These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor. The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package. The execution times under these circumstances are compared with the times of threads running with co-runners with low cache requirements. This comparison shows differences of up to 37% in execution time between the two scenarios on a system with a standard scheduler; using the cache-fair scheduler, the variability decreases to 7%. At the same time, however, measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8% for some threads (while some even experience a speed-up).

3.2 Balancing core assignment

Fedorova et al. ([27]) argue that during operating system scheduling, several aspects have to be considered besides optimal performance. It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (i.e. unpredictability of a task's completion time) and, as a consequence, insufficient priority enforcement. It could be part of the operating system scheduler's job to ensure that the jobs are evenly assigned to cores. In the approach described in [27], this is done by using a self-tuning scheduling algorithm based on a per-core benefit function. This benefit function is based on three input components:


1) The normalized core preference of a thread, which is based on the instructions per cycle that a thread j can achieve on a certain core i (IPC_{i,j}), normalized by max_k(IPC_{k,j}) (where k is an arbitrary CPU core)
2) The cache-affinity, a value which is 1 if the thread j was scheduled on core i within a tuneable time period, and 0 otherwise
3) The average cache investment of a thread on a core, which is determined by inspecting the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j. For each core, a benefit value B_{i,0} is computed that represents the case of no thread migrations taking place. For each thread k on a core, an updated benefit value B_{i,-k} is computed for the hypothetical scenario of migrating that thread onto another core. Of course the benefit will increase, since fewer threads are executed on the core. But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core, which would influence the benefit value of the target core. Therefore the updated benefit value of any other system core j, to which the thread in question would be migrated, also has to be computed; it is called B_{j,+k}. The hypothetical migration of thread k from core i to core j becomes reality if

B_{i,-k} + B_{j,+k} > B_{i,0} + B_{j,0} + a * D_CAB + b * D_RTF

where D_CAB represents a system-wide balance constraint, while D_RTF ensures a per-job response-time fairness (i.e. the slowdown that results for the thread in question from the thread migration does not exceed some maximum value).

These two constants, together with the criteria included in the benefit function itself (most notably cache-affinity), should help to ensure that the self-tuning fulfils the three goals of optimal performance, core-assignment balance and response-time fairness.
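
Expressed in code, the migration test reduces to a simple comparison (the numbers below are made up for illustration, and the benefit function producing the B values is not reproduced here):

    def should_migrate(B_i_minus_k, B_j_plus_k, B_i_0, B_j_0, a, D_CAB, b, D_RTF):
        # migrate thread k from core i to core j only if the hypothetical
        # placement beats the status quo plus the balance (D_CAB) and
        # response-time-fairness (D_RTF) constraint terms
        return B_i_minus_k + B_j_plus_k > B_i_0 + B_j_0 + a * D_CAB + b * D_RTF

    # example with made-up benefit values: the migration barely pays off
    print(should_migrate(5.0, 4.0, 3.5, 4.2, a=0.5, D_CAB=1.0, b=0.5, D_RTF=0.4))
    # -> True  (9.0 > 8.4)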

However, the authors have not yet actually implemented the scheduling modification in Linux, and hence the statements on its effectiveness remain somewhat theoretical.

3.3 Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU. For example, it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized, while multiple slower cores come into play when parallel code parts are executed. By keeping the cores for parallel execution slower than the core(s) for serial execution, die area and cost can be saved ([24]) while power consumption may be reduced. [25] closely analyzes the impact of performance asymmetry on the average speedup of an application as the number of cores increases. The paper concludes that "[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)". Hence performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures. [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications and don't scale much worse than symmetric ones with highly parallel programs (see Figure 5).

13

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 5: Comparison of speedup with SMP and AMP using highly parallel programs (left), moderately parallel programs (middle) and highly sequential programs (right)1

1 Image source: http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

Apart from explicit design choices, performance asymmetry can also occur in initially symmetric multi-core processors, caused either by power-saving mechanisms (increasing the C-state) or by transistor failures that lead to the disabling of parts of a core's components.

Problems arise from the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly. Therefore, the operating system scheduler should support processor performance asymmetry, which is currently not the case for most schedulers. However, it is conceivable that this will appear as a Linux scheduling domain in the future.

[5] describes AMPS, an approach showing how a performance-asymmetry aware scheduler for Linux could look. Basically, the scheduler consists of three components: an asymmetry-specific load-balancing mechanism, a faster-core-first scheduler, and a migration mechanism specifically for NUMA machines that will not be covered in detail here. The scheduling mechanism tries to achieve better performance, fairness (with respect to the thread priorities) and repeatability of performance results. In order to conduct load-balancing, the core performance is assessed in a first step: AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1. The scaled load of a core is then the run-queue length of the core divided by its performance quantifier. If this scaled load is within a maximum and a minimum threshold, the system is considered to be load-balanced; otherwise threads will be migrated. Through the scaling factor, faster cores will receive more workload than slower ones.

Besides the load-balancing, cores with lower scaled load are preferred in thread placement. Whenever new threads are scheduled, the new scaled load that would apply if the thread were scheduled to one specific core is computed. The thread is then scheduled to the core with the least new scaled load; in case of a tie, faster cores are preferred. Additionally, the load-balancer may migrate threads even if the source core of the migration may become idle as a result. AMPS only migrates threads if the new scaled load of the destination core does not exceed the scaled load on the source core. This way the benefits of the available fast cores can be fully exploited while not overburdening them. It could be expected that frequent core migration results in performance loss through cache misses (e.g. in the level 1 cache); however, experimental results in [5] reveal no excessive performance loss, even though task migration among cores occurs more often than with standard schedulers.
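
The placement rule can be sketched in a few lines (performance quantifiers, queue lengths and the tie-breaking detail are illustrative assumptions; AMPS derives the quantifiers from boot-time benchmarks):

    cores = {
        # core: [performance quantifier, run-queue length]
        "fast0": [2.0, 4],
        "fast1": [2.0, 5],
        "slow0": [1.0, 2],
        "slow1": [1.0, 3],
    }

    def place_new_thread():
        # pick the core with the least scaled load after placement;
        # ties are broken in favour of the faster core (higher quantifier)
        best = min(cores, key=lambda c: ((cores[c][1] + 1) / cores[c][0], -cores[c][0]))
        cores[best][1] += 1
        return best

    print(place_new_thread())   # -> 'fast0' (scaled load 2.5 beats 3.0 elsewhere)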

Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on an AMP with two fast and six slow cores2

2 Picture taken from [5]

Instead, performance on SMP systems measurably improves (see Figure 6; standard scheduler speed would be 1, the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).

3.4 Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of cores (tens to hundreds) in order to prolong the time the industry can keep up with Moore's prediction. Such CPU architectures are going to require novel approaches in thread scheduling, since the available shared memory per core is potentially smaller, while main memory access time increases and single-threaded performance decreases. Even more so than with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for them; and the wide range of application scenarios that have to be considered for such chips. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28]. The runtime environment was built from scratch and is independent of operating system kernels – hence the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adapters. For programmability, it provides high-level transactions (instead of locks), and the heterogeneity of application scenarios is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour).

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters, P, Q and T, which respectively denote the number of (logical) processors, task queues and threading abstractions. P and Q change the scheduling behaviour from strict affinity to work stealing; T can be used to specify different behaviour for different threads. This way, the concept of scheduler domains can be realized.
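
How the ratio of processors to task queues might shift behaviour between strict affinity and work stealing can be sketched as follows (a hypothetical toy model, not the McRT API):

    import random

    def make_queues(P, Q):
        # Q == P: one queue per processor (strict affinity);
        # Q == 1: one global queue effectively shared by all processors
        queues = [[] for _ in range(Q)]
        owner = {p: p % Q for p in range(P)}   # queue each processor draws from first
        return queues, owner

    def next_task(p, queues, owner):
        own = queues[owner[p]]
        if own:
            return own.pop(0)
        victims = [q for q in queues if q]     # otherwise steal from a non-empty queue
        if victims:
            return random.choice(victims).pop()
        return None

    queues, owner = make_queues(P=4, Q=4)
    queues[1].extend(["t1", "t2"])
    print(next_task(0, queues, owner))         # core 0 steals from queue 1 -> 't2'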

It is notable that the scheduling system does not use pre-emption; instead, threads are expected to yield at certain times. This design choice has been motivated by the authors' belief that pre-emption stems from the need to time-share expensive and fast resources, a need which will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. Barriers are designed to avoid busy waiting – for example, a thread yields once it has reached the barrier, but won't be re-scheduled until all other threads have reached the barrier. With a pre-emptive scheduling mechanism, the thread would receive the context from time to time just to check whether other threads have reached the barrier – with the integrated barrier support based on the co-operative scheduling approach used in McRT, this won't happen.
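
The essence of such a co-operative barrier is that an arriving thread is simply not made runnable again until the last arrival (an illustrative toy model with an explicit runnable list, not McRT code):

    class Barrier:
        def __init__(self, parties):
            self.parties = parties
            self.waiting = []                  # threads parked at the barrier

        def arrive(self, thread, runnable):
            runnable.remove(thread)            # yield: no re-scheduling, no busy wait
            self.waiting.append(thread)
            if len(self.waiting) == self.parties:
                runnable.extend(self.waiting)  # the last arrival releases everyone
                self.waiting.clear()

    runnable = ["A", "B", "C"]
    b = Barrier(parties=3)
    b.arrive("A", runnable)
    b.arrive("B", runnable)
    print(runnable)                            # -> ['C']  (A and B are parked)
    b.arrive("C", runnable)
    print(runnable)                            # -> ['A', 'B', 'C']  all released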

The client-side adapter, e.g. for OpenMP, promises to directly translate many OpenMP constructs to the McRT API.

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with 32 KByte L1 cache shared among 4 simultaneously multithreaded cores, which form one physical core; 32 such physical cores share 2 MByte L2 cache and 4 MByte off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and which resource it is waiting for; only when the resource becomes available will the CPU execute the thread again. Hence threads that are waiting at barriers and locks don't consume system resources. It is important to keep in mind that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT.


Figure 7: Speedup of XviD encoding in the McRT framework (compared to single-core performance)3

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, the curve labelled "linear" models the ideal speedup).

However, the encoding process was explicitly tweaked for the many-core scenario: under the condition that only very little fast memory exists per logical core, parallelization wasn't conducted at frame level – instead, single frames were partitioned into parts, which were encoded in parallel using OpenMP.

The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].

3 Picture is from [28]


4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moore's law can stay valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduling domains of Linux and Solaris add some urgently needed OS flexibility, because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be integrated quite easily into the load-balancing hierarchy. However, it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves; scheduling domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers won't scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were "designed to run on 1, 2, maybe 4 processors"4 but wouldn't scale beyond that. He seems to be perfectly right when he says that future versions of Windows will have to be fundamentally redesigned.

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. The advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels, and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents the way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer/user with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance. [29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT; experimental assessments which could testify to its performance, although planned, haven't been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Achieving fairness and repeatability on today's available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 3.1 and 3.2. The first approach is justified by a number of experimental results which show that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen whether the complexity of the scheduling approach, and the amount of overhead potentially introduced by it, justify that improvement. Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

4 See http://www.news.com/8301-10784_3-9722524-7.html

The algorithm mentioned in section 3.2 hasn't been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description one can infer that it takes a lot of computation and thread migrations to ensure the load balance, and it would be interesting to see the overhead that these computations and the cache misses caused by the mechanism impose on the system. Without any experimental data, those figures are hard to assess.


5 References

[1] G. E. Moore, "Cramming more components onto integrated circuits", Electronics, Volume 38, Number 8, 1965

[2] "Why Parallel Processing?", http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2+Why+Parallel+Processing.htm

[3] O. Wechsler, "Inside Intel® Core™ Microarchitecture", Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf

[4] "Key Architectural Features: AMD Athlon™ X2 Dual-Core Processors", http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html

[5] Li et al., "Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures", in International Conference on High Performance Computing, Networking, Storage and Analysis, 2007

[6] Balakrishnan et al., "The Impact of Performance Asymmetry in Emerging Multicore Architectures", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506–517, June 2005

[7] M. Annavaram, E. Grochowski and J. Shen, "Mitigating Amdahl's law through EPI throttling", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298–309, June 2005

[8] V. Pallipadi and S. B. Siddha, "Processor Power Management features and Process Scheduler: Do we need to tie them together?", in LinuxConf Europe 2007

[9] S. B. Siddha, "Multi-core and Linux Kernel", http://oss.intel.com/pdf/mclinux.pdf

[10] http://kernelnewbies.org/Linux_2_6_23

[11] http://lwn.net/Articles/230574/

[12] J. Andrews, "Linux: The Completely Fair Scheduler", http://kerneltrap.org/node/8059

[13] A. Kumar, "Multiprocessing with the Completely Fair Scheduler", http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

[14] Kernel 2.6.7 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7

[15] Scheduling domains, http://lwn.net/Articles/80911/

[16] Kernel 2.6.17 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17


[17] T. Kidd, "C-states, C-states and even more C-states", http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Scheduling, http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html

[19] Solaris manpage FSS(7), http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS

[20] Eric Saxe, "CMT and Solaris Performance", http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris

[21] MSDN section on Windows scheduling, http://msdn.microsoft.com/en-us/library/ms685096(VS.85).aspx

[22] A. Fedorova, M. Seltzer and M. D. Smith, "Cache-Fair Thread Scheduling for Multicore Processors", Technical Report TR-17-06, Harvard University, October 2006

[23] S. Kim, D. Chandra and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004

[24] D. Menasce and V. Almeida, "Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures", in Proceedings of the 4th International Conference on Supercomputing, June 1990

[25] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era", in IEEE Computer, 2008

[26] B. Crepps, "Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing", Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

[27] A. Fedorova, D. Vengerov and D. Doucette, "Operating System Scheduling on Heterogeneous Core Systems", to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007

[28] B. Saha et al., "Enabling Scalability and Performance in a Large Scale CMP Environment", in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007

[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson, "Thread Scheduling for Multi-Core Platforms", in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 4: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

1 Introduction

11Why use multi-core processors at all

In the last few years multi-core CPUs have become a standard component in nearly all sorts of computers ndash not only servers and high-end workstations but also desktop and laptop PCs for consumers and even game consoles nowadays usually come with CPUs with more than one core This development is not surprising already in 1965 Gordon Moore predicted that the number of transistors that can be cost-effectively built onto integrated circuits were going to double every year ([1]) In 1975 Moore corrected that assumption to a period of two years nowadays this period is frequently assumed to be 18 months)

Moores projection has more or less been accurate up to today and consumers have gotten used to the constant speedup of computer hardware ndash it is expected by buyers that a new computer shows a significant speedup to a two years older model (even though an increase in transistor density does not always lead to an equal in increase in computing speed) For chip manufacturers it has become increasingly difficult however to keep up with Moores law In order to implement the exponential increase of integrated circuits the transistor structures have to become steadily smaller On the one hand the extra transistors were used for the integration of more and more specialized instruction sets on CISC chips On the other hand smaller transistor sizes led to higher clock rates of the CPUs because due to physical factors the gates in the transistors could perform faster state switches

However since electronic activity always produces heat as an unwanted by-product the more transistors are packed together in a small CPU die area the higher the resulting heat dissipation per unit area becomes ([2]) With the higher switching frequency the electronic activity was performed in smaller intervals and hence more and more heat-dissipation emerged The cooling of the processor components became more and more a crucial factor in design considerations and it became clear that the increasing clock frequency could no longer serve as the primary reason for processor speedup

Hence there had to be a shift in paradigm in order to still make applications run faster on the one hand the amazing abundance of transistors on processor chips was used to increase the cache sizes This alone however would not result in an adequate performance gain since it only helps memory intensive applications to a certain degree In order to effectively counteract the heat problem while making use of the small structures and high number of transistors on a chip the notion of multi-core processors for consumer PCs was introduced

Since the CMOS technology met its limits for the further increase of CPU clock frequency and the number of transistors that could be integrated on a single die allowed for it the idea emerged that multiple processing cores could be placed in a single processor dieIn 2006 Intel released the Coretrade microprocessor a die package with two processor cores with their own level 1 caches and a shared level 2 cache ([3]) Also in 2006 AMD the second major CPU manufacturer for the consumer market released the Athlontrade X2 a processor with quite similar architecture to the Core platform but additionally featuring the concept of also sharing a CPU-integrated memory-controller among the cores ([[4])

4

Seminar ldquoParallel Computingldquo Summer term 2008

Both architectures have been improved and sold with a range of consumer desktop and laptop computers - but also servers and workstations - up to today therefore the presence of multi-core processors in a large number of todays PCs can be assumed

12Whatrsquos so different about multi-core scheduling

One could assume that the scheduling process on such multi-core processors wouldnrsquot differ much from conventional scheduling ndash intuitively the run-queue would just have to be replaced by n run-queues where n is the number of cores and processes would simply be scheduled to the currently shortest run-queue (with some additional process-priority treatment maybe)

While that might seem reasonable there are some properties of current multi-core architectures that speak strongly against such a naiumlve approach First in many multi core architectures each core manages its own level 1 cache (Figure 1) By just naiumlvely rescheduling interrupted processes to a shorter queue which belongs to another core (task migration) parts of the processes cache working set may become unnecessarily lost and the overall performance may slow down This effect becomes even worse if the underlying architecture is not a multi-core but a NUMA system where memory access can become very costly if the process is scheduled on the ldquowrongrdquo node

Figure 1 Typical multi-core architecture

A second important point is that the performance of different cores in a multi-core system may be asymmetric regarding the performance of the different cores ([5]) This effect can emerge due to several reasons

5

Seminar ldquoParallel Computingldquo Summer term 2008

bull Design considerations Many slow cores can be used for increasing the throughput in parallel computation while a few faster cores contribute to the efficient processing of costly tasks which can not be parallelized ([6]) Even algorithms that are parallelizable contain parts that have to be executed sequentially which will benefit from the higher speed of the fastest core Hence performance-asymmetry has been shown to be a very efficient approach in multi-core architectures ([7])

bull Transistor failures Some parts of the CPU may get damaged over time and become automatically disabled by the CPU Since such components may fail in certain cores independently from the other cores performance asymmetry may arise in symmetric cores over time ([5])

bull Power-saving policies Different cores may switch to different P- or C-power-states at different times in order to save power At different P-states equal cores show a different clock-frequency If an OS scheduler manages to take this into account for processes not in need of all system resources the system can remain more energy-efficient over the execution time while giving away only little or no performance at all ([8])

Hence performance asymmetry the fact that various CPU components can be shared among cores and non-uniform access to computation resources such as memory mandate the design of efficient multi-core scheduling mechanisms or scheduling frameworks at the operating system level

Multi-core processors have gone mainstream and while there may be the demand that they are efficiently used in terms of performance the currently fashionable term Green-IT also motivates the energy-efficient use of the CPU coresSection 2 will explore how far current operating systems have evolved in support of the new architectures

6

Seminar ldquoParallel Computingldquo Summer term 2008

2 OS process scheduling state of the art

21 Linux scheduler

211The Completely Fair Scheduler

The Linux scheduler in versions prior to 2623 performed its tasks in complexity O(1) by basically just using per-CPU run-queues and priority arrays ([9]) Kernel version 2623 which was released on October 9 2007 introduced the so-called completely fair scheduler (CFS) The change in scheduler was mainly motivated by the failure of the old scheduler to correctly predict whether applications are interactive (IO-bound) or CPU-intensive ([10])

Therefore the new scheduler has completely abandoned the notion of different kinds of processes and treats them all equally The data-structure of a red-black tree is used to align the tasks according to their ldquorightrdquo to use the processor resources for a predefined interval until context-switch The process positioned at the leftmost node in the data structure is entitled most to use the processor resources at the time it occupies that position in the tree The position of a process in the tree is only dependent on the wait-time of the process in the run-queue (including the time the process is actually waiting for events) and the process priority ([11]) This concept is fairly simple but works with all kinds of processes especially interactive ones since they get a boost just by getting account for their IO-waiting time

However the total scheduling complexity is increased to O(log n) where n is the number of processes in the run-queue since at every context-switch the process has to be reinserted into the red-black tree

The scheduling algorithm itself has not been designed in special consideration of multi-core architectures When Ingo Molnar the designer of the scheduler was asked what the implications on HTSMPNUMA architectures would be he answered that there would inevitably be some effect and if it is negative he will fix it He admits that the fairness-approach can result in increased cache-coldness for processes in some situations ([12])

However the red-black trees of CFS are managed per runqueue ([13]) which assists in cooperation with the Linux load-balancer

212Scheduling domains

Linux load-balancing takes care of different cache models and computing architectures but at the moment not necessarily of performance asymmetry The underlying model of the Linux load balancer is the concept of scheduling domains which was introduced in Kernel version 267 due to the unsatisfying performance of Linux scheduling on SMP and NUMA systems in prior versions ([14])

Basically scheduling domains are hierarchical sets of computation units on which scheduling is possible the scheduling domain architecture is constructed based on the actual hardware resources of a computing element ([9]) Scheduling domains contain lists with scheduling groups that share common properties

7

Seminar ldquoParallel Computingldquo Summer term 2008

For example the way scheduling should be done on two logical processors of a HT-systems and two physical processors of a SMP system is different the HT ldquocoresrdquo share a common cache and memory hierarchy therefore task migration is not a problem if one of the logical cores becomes idle However in a SMP or multi-core system in which the cache or parts of the cache is administrated by each core separately migrating tasks with a large working set may become problematicThis applies even more to NUMA machine where different CPU may be closer or more remote to the memory the process is using Therefore all this architectures have to be treated differently

The scheduling domain concept introduces scheduling domains a logical union of computing resources that share common properties with whom it is reasonable to treat them equally and CPU groups within these domains Those groups contain hardware-addressable computing resources that are part of the domain on which the balancer can try to even the domain load out

Scheduling domains are hierarchically nested ndash there is a top-level domain containing all other domains of the physical system the Linux is running on Depending on the actual architecture the sub-domains represent NUMA node groups physical CPU groups multi-core groups or SMT groups in a respective hierarchical nesting This structure is built automatically based on the actual topology of the system and for reasons of efficiency each CPU keeps a copy of every domain it belongs to For example a logical SMT processor that at the same time is a core in a physical multi-core processor on a NUMA node with multiple (SMP) processors would totally administer 4 sched_domain structures one for each level of parallel computing it is involved in ([15])

Figure 2 Example hierarchy in the Linux scheduling domains

Load-balancing takes place at scheduling domain level between the different groups Each domain level is sensitive with respect to the constraints set by its properties regarding load balancing For example load balancing happens very often between logical simultaneous

8

Seminar ldquoParallel Computingldquo Summer term 2008

multithreading cores but very rarely on the NUMA level where remote memory access is costlyThe scheduling domain for multi-core processors was added with Kernel version 2617 ([16]) and especially considers the shared last level cache that multi-core architectures frequently possess Hence on a SMP machine with two multi-core packages two tasks will be scheduled on different packages if both packages are currently idle in order to make use of the overall larger cache

In recent Linux-kernels the multi-core processor scheduling domain also offers support for energy-efficient scheduling which can be used if eg the powersave governor is set in the cpufreq tool Saving energy can be achieved by changing the P- and the C-states of the cores in the different packages However P-states are transitions are made by adjusting the voltage and the frequency of a core and since there is only one voltage regulator per socket on the mainboard the P-state is dependent on the busiest core So as long as any core in a package is busy the P-state will be relatively low which corresponds to a high frequency and voltage While the P-states remain relatively fixed the C-states can be manipulated Adjusting the C-states means turning off parts of the registers blocking interrupts to the processor etc ([17]) and can be done on each core independently However the shared cache features its own C-state regulator and will always stay in the lowest C-state that any of the cores has Therefore energy-efficiency is often limited to adjusting the C-state of a non-busy core while leaving other C-states and the packagesrsquo P-state low

Linux scheduling within the multi-core domain with the powersave-governor turned on will attempt to schedule multiple tasks on one physical package as long as it is feasible This way other multi-core packages will be allowed to transition into higher P- and C-states The author of [9] claims that generally the performance impact will be relatively low and that the performance losspower saving trade-off will be rewarding if the energy-efficient scheduling approach is used

22 Windows scheduler

In Windows scheduling is conducted on threads The scheduler is priority-based with priorities ranging from 0 to 31 Timeslices are allocated to threads in a round-robin fashion these timeslices are assigned to highest priority threads first and only if know thread of a given priority is ready to run at a certain time lower priority threads may receive the timeslice However if higher-priority threads become ready to run the lower priority threads are preempted

In addition to the base priority of a thread Windows dynamically changes the priorities of low-prioritized threads in order to ensure ldquofeltrdquo responsiveness of the operating system For example the thread associated with the foreground window on the Windows desktop receives a priority boost After such a boost the thread-priority gradually decays back to the base priority ([21])

Scheduling on SMP-systems is basically the same except that Windows keeps the notion of a threadrsquos processor affinity and an ideal processor for a thread The ideal processor is the processor with for example the highest cache-locality for a certain thread However if the ideal processor is not idle at the time of lookup the thread may just run on another processorIn [21] and other sources no explicit information is given on scheduling mechanisms especially specific to multi-core architectures though

9

Seminar ldquoParallel Computingldquo Summer term 2008

23Solaris scheduler

In the Solaris scheduler terminology processes are called kernel- or user-mode threads dependant on the space in which they run User threads donrsquot only exist in the user space ndash whenever a user thread is created a so-called lightweight process is set up that connects the user thread to a kernel thread These kernel threads are object to schedulingSolaris 10 offers a number of scheduling classes which may co-exist on one system The scheduling classes provide an adaptive model for the specific types of applications which are intended to run on top of the operating system ([18]) mentions Solaris 10 scheduling classes for

bull Standard (timeshare) threads whose priority may be adapted by the schedulerbull Interactive applications (the currently active window in the graphical user interface)bull Fair sharing of system resources (instead of priority-based sharing)bull Fixed-priority threads The priority of threads scheduled with this scheduler does not

vary over the scheduling timebull System (kernel) threadsbull Real-time threads with a fixed priority and time share Threads in this scheduling

class may preempt system threads

The scheduling classes for timeshare fixed-priority and fair sharing are not recommended for simultaneous use on a system while other combinations of scheduling classes on a set of CPUs are possible

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt of trying to identify IO bound processes and providing them with a priority boost Threads have a fixed time quantum they may use once they get the context and receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context Fair share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling Different processes (actually collection of processes or in Solaris 10 terminology projects) compete for quanta on a computing resource and their position in that competition depends on how large the value they have been assigned is in relation to the total quanta number on the computing resource

Solaris explicitly deals with the scenario that parts of the processorrsquos resources may be shared as it is likely with typical multi-core processors There is a kernel abstraction called ldquoprocessor grouprdquo (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket) These groupings can be investigated by the dispatcher eg in order to maintain logical CPU affinity for the purpose of cache-hotness where it is reasonable Quite similar to the concept of Linuxrsquos scheduling domains Solaris 10 tries to simultaneously achieve load balancing on multiple levels (for example if there are physical CPUs with multiple cores and SMT) ([20])

10

Seminar ldquoParallel Computingldquo Summer term 2008

3 Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics many of which are orthogonal (eg maximizing fairness and throughput) The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling Sections 31 and 32 summarize proposals to improve fairness and load-balancing on current multi-core architectures while sections 33 and 34 concentrate on approaches for scheduling on promising new computing architectures such as multi-core processors with asymmetric performance and many-core CPUs

31Cache-Fairness

Several studies (eg [22] [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level 2 cache and thus slow-up threads running on the other core that uses the same cacheThe situation is unsatisfactory due to several reasons First it can lead to unpredictable execution times and throughput and second scheduling priorities may loose their effectiveness because of threads running on cores with aggressive ldquoco-runnersrdquo (ie threads running on another core in the same package) Figure 3 shows such a scenario Thread B uses the larger part of the shared cache and thus maybe negatively influences the cycles per instruction that thread A achieves during its CPU time share

L2-cache-misses are more costly than L1-cache-misses because the latency to the memory is bigger than to the next cache-level However it is mostly the L2-cache that is shared among different coresThe authors of [22] try to mitigate the above mentioned effects by introducing a cache-fairness aware scheduler for the Solaris 10 operating system

Figure 3 Unfair cache utilization by thread B

In their scheduling algorithm the threads on a system are grouped into a best effort class and a cache-fair class Best effort threads are penalized for the sake of performance stability of cache-fair threads if necessary but not vide-versa However it is taken care that this does not result in inadequate discrimination of best effort threads Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners at the expense of these co-runners if they are best effort threads Figure 4 illustrates that process

11

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 4 Restoring fairness by adjusting timeshares

In order to compute the quantum that the thread is entitled to the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances namely the fair L2 cache miss rate the fair CPI rate and the fair number of instructions (during a quantum) All those values are based on the assumption of fair circumstances so the difference to the actual values is computed and a quantum extension is calculated which should serve as ldquocompensationrdquo

Those calculations are done once for new cache-fair class threads ndash their L2 cache miss rates and other data is measured with different co-runners Subsequently the dispatcher periodically selects best-effort threads from whom he takes parts of their quanta and assigns them as ldquocompensationrdquo to the cache-fair threads New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase but whenever new combinations of cache-fair threads with co-runners occur analytical input is mined from the CPUrsquos hardware counters

The authors of [22] state that according to their measurements the penalties on best-effort threads are low while the algorithm actually enforces priority better than standard scheduling and improves fairness These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package The execution times under these circumstances are compared with the times of threads running with co-runners with low-cache requirements This comparison shows differences of up to 37 in execution time between the two scenarios on a system with a standard scheduler while using the cache-fair scheduler the variability decreases to 7At the same time however measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8 for some threads (while some even experience a speed-up)

32Balancing core assignment

Fedorova et al ([27]) argue that during operating system scheduling several aspects have to be considered besides optimal performance It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (ie unpredictability of a tasks completion time) and as a consequence insufficient priority enforcement It could be part of the operating system scheduler to ensure that the jobs are evenly assigned to cores In the approach described in [27] this is done by using a self-tuning scheduling algorithm based on a per-core benefit function This benefit function is based on three input components

12

Seminar ldquoParallel Computingldquo Summer term 2008

1) The normalized core preference of a thread which is based on the instructions per cycle that a thread j can achieve on a certain core i ( ijIPC ) normalized by max(

kjIPC ) (where k is an arbitrary CPUcore)2) The cache-affinity a value which is 1 if the thread j was scheduled on core i within a

tuneable time period and 0 otherwise3) The average cache investment of a thread on a core which is determined by inspecting

the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j For each core a benefit function 0iB is computed that represents the case of no thread migrations taking place For each thread k on a core and updated benefit value for the hypothetical scenario of the migration of the thread onto another core minuskiB is computed Of course the benefit will increase since fewer threads are executed on the core But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core which would influence the benefit value of the target core Therefore also the updated benefit value of any other system core j to which the thread in question would be migrated to has to be computed and is called +kjB The hypothetical migration of thread k from core i to core j becomes reality if

RTFCABjikjki DbDaBBBB 00 +++gt+ +minus CABD represents a system-wide balance-constraint while RTFD ensures a per-job response-time-fairness (ie the slowdown that results for the thread in question from the thread-migration does not exceed some maximum value)

These two constants together with the criterions included in the benefit function itself (most notably cache-affinity) should help to ensure that the self-tuning fulfils the three goals of optimal performance core-assignment balance and response-time-fairness

However the authors have not yet actually implemented the scheduling modification in Linux and hence the statements on its effectiveness remain somehow theoretical

33Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU For example it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized while multiple slower cores come to play when parallel code parts are executed By keeping the cores for parallel execution slower than the core(s) for serial execution die-area and cost can be saved ([24]) while power consumption may be reduced[25] closely analyzes the impact of performance asymmetry on the average speedup of an application with increase of cores The paper concludes that ldquo[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)rdquo Hence performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications but donrsquot scale much worse with highly parallel programs (see Figure 5)

13

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 5 Comparison of speedup with SMP and AMP using highly parallel programs (left) moderately parallel programs (middle) and highly sequential programs (right)1

Apart from explicit design performance asymmetry can also occur in initially symmetric multi-core processors by either power-saving mechanisms (increasing the C-state) or failure of transistors that leads to disabling of parts of the corersquos components

Problems arise by the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly Therefore the operating system scheduler should support processor performance asymmetry which is currently not the case for most schedulers However it would be imaginable to see this as a Linux scheduling domain in the future

[5] describes AMPS an approach on how a performance-asymmetry aware scheduler for Linux could look like Basically the scheduler consists of three components An asymmetry-specific load balancing a faster-core first scheduler and a migration mechanism specifically for NUMA machines that will not be covered in detail here The scheduling mechanism tries to achieve a better performance fairness (with respect to the thread priorities) and repeatability of performance resultsIn order to conduct load-balancing the core performance is assessed in a first step AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1 Then the scaled load of a core is the run-queue length of a core divided by its performance quantifier If this scaled load is within a maximum and a minimum threshold then the system is considered to be load-balanced otherwise threads will be migrated By the scaling factor faster cores will receive more workload than slower ones

Besides the load-balancing cores with lower scaled load are preferred in thread placement Whenever new threads are scheduled the new scaled load that would apply if the thread was scheduled to one specific core is computed Then the thread is scheduled to the core with the least new scaled load in case of a tie faster cores are preferred Additionally the load-balancer may migrate threads even if the source core of migration can become idle by doing it AMPS only migrates threads if the new scaled load of the destination core does not exceed

1 Image source httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

14

Seminar ldquoParallel Computingldquo Summer term 2008

the scaled load on the source-core This way the benefits of available fast cores can be fully exploited while not overburdening themIt can be expected that frequent core migration results in performance loss by cache misses (eg in the level cache) However experimental results in [5] reveal no excessive performance loss by the fact that task migration among cores occurs more often than in standard schedulers

Figure 6 Speedup of the AMPS scheduler compared to Linux scheduler on AMP with two fast and six slow cores2

Instead performance on SMP systems measurably improves (see Figure 6 standard scheduler speed would be 1 median speedup is 116) while fairness and repeatability is preserved better than with standard Linux (there is less deviation in the execution times of different processes)

34Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moorersquos prediction Such CPU architectures are going to require novel approaches in thread scheduling since the available shared memory per core is potentially smaller while main memory access time increases and single-threaded performance decreases Even more so than with current CMP chips schedulers that treat such large scale CMP architectures just like SMP systems are going to fail with respect to performance

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for such chips; and the wide range of application scenarios that have to be considered for them. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28]. The runtime environment was built from scratch and is independent of operating system kernels; hence the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adapters. For programmability, McRT provides high-level transactions (instead of locks), and the heterogeneity of application scenarios is addressed by giving the user a choice among different runtime policies (which influence the scheduling behaviour).

The overall target of the scheduler is to maximize resource utilization and to provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters P, Q and T, which denote the number of (logical) processors, task queues and threading abstractions, respectively. P and Q shift the scheduling behaviour between strict affinity and work stealing; T can be used to specify different behaviour for different threads. This way, the concept of scheduler domains can be realized.
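
The paper does not spell out McRT’s API, so the struct and function names in the following sketch are invented; it only illustrates how the relation between P and Q shifts the behaviour. With one queue per processor, tasks stay where they are (strict affinity); with fewer queues than processors, several processors share a queue and an idle processor takes work from it.

#include <stdio.h>

/* Invented names; the paper does not give McRT's actual API. */
struct mcrt_config {
    int P;  /* number of (logical) processors used by the runtime */
    int Q;  /* number of task queues                              */
    int T;  /* number of threading abstractions                   */
};

/* With Q == P, every processor owns a private task queue and tasks
 * stay put (strict affinity). With Q < P, several processors share a
 * queue, and an idle processor takes ("steals") work from it. */
static const char *expected_behaviour(const struct mcrt_config *c)
{
    if (c->Q == c->P) return "strict affinity";
    if (c->Q == 1)    return "one global queue, full work sharing";
    return "grouped queues with work stealing";
}

int main(void)
{
    struct mcrt_config cfg = { .P = 32, .Q = 8, .T = 2 };
    printf("P=%d Q=%d -> %s\n", cfg.P, cfg.Q, expected_behaviour(&cfg));
    return 0;
}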

It is notable that the scheduling system does not use pre-emption; instead, threads are expected to yield at certain points. This design choice is motivated by the authors’ belief that pre-emption stems from the need to time-share expensive and fast resources, a need that will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. These barriers are designed to avoid busy waiting: a thread yields once it has reached the barrier, but won’t be re-scheduled until all other threads have reached it. With a pre-emptive scheduling mechanism, the thread would receive the context from time to time just to check whether the other threads have reached the barrier; with the integrated barrier support based on the co-operative scheduling approach used in McRT, this doesn’t happen.
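
A sketch of such a co-operatively scheduled barrier is shown below. All names are invented placeholders (in particular yield_to_scheduler()), since McRT’s real interfaces are not given in the paper; the fragment compiles on its own but needs a surrounding user-level scheduler to run.

#define MAX_WAITERS 64

enum thread_state { RUNNABLE, BLOCKED };

struct coop_barrier {
    int needed;                              /* threads that must arrive */
    int arrived;                             /* threads arrived so far   */
    enum thread_state *waiters[MAX_WAITERS]; /* state words of waiters   */
    int nwaiters;
};

/* Invented placeholder: hands control back to the user-level scheduler. */
extern void yield_to_scheduler(void);

/* A thread calls this when it reaches the barrier. It marks itself
 * BLOCKED and yields; a co-operative scheduler never selects BLOCKED
 * threads, so no cycles are spent polling the barrier. The last
 * thread to arrive marks all waiters RUNNABLE again. */
void barrier_arrive(struct coop_barrier *b, enum thread_state *self)
{
    if (++b->arrived == b->needed) {
        for (int i = 0; i < b->nwaiters; i++)
            *b->waiters[i] = RUNNABLE;
        b->nwaiters = 0;
        b->arrived = 0;
    } else {
        *self = BLOCKED;
        b->waiters[b->nwaiters++] = self;
        yield_to_scheduler();
    }
}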

The client-side adapter, e.g. for OpenMP, promises to translate many OpenMP constructs directly to the McRT API.

[28] also contains a large section with experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with a 32 KByte L1 cache shared among four simultaneously multithreaded cores, which together form one physical core; 32 such physical cores share a 2 MByte L2 cache and a 4 MByte off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and which resource it is waiting for; the CPU will only execute the thread again once that resource becomes available. Hence, threads that are waiting on barriers and locks don’t consume system resources. When viewing the experimental speedup results for McRT, it is important to bear in mind that such a mechanism does not exist on current physical processors.
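
To illustrate why waiters consume no resources in the simulation, a lock in the simulated environment might be built on such an instruction as follows. This is a hedged sketch: mwait_on() is an invented stand-in for the simulator’s MWait (the paper gives no interface for it), and the atomic builtins are GCC-specific.

/* Invented placeholder for the simulator's cost-free MWait: the
 * thread declares which resource blocks it and is only executed
 * again once that resource becomes available. No such cost-free
 * instruction exists on current physical processors. */
extern void mwait_on(volatile void *resource);

/* A lock acquired this way leaves waiting threads dormant instead
 * of spinning. __sync_lock_test_and_set is a GCC atomic builtin. */
static void sim_lock_acquire(volatile int *lock)
{
    while (__sync_lock_test_and_set(lock, 1))
        mwait_on(lock);
}

static void sim_lock_release(volatile int *lock)
{
    __sync_lock_release(lock);
}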


Figure 7 Speedup of XviD encoding in the McRT framework (compared to single-core performance)³

³ Picture is from [28]

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, and the curve labelled “linear” models the ideal speedup).

However, the encoding process was explicitly tweaked for the many-core scenario: given that only very little fast memory exists per logical core, parallelization wasn’t conducted at frame level; instead, single frames were partitioned into parts, which were encoded in parallel using OpenMP.

The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].


4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore’s law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help prolong the time horizon within which Moore’s law stays valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduler domains of Linux and Solaris add some urgently needed OS flexibility, because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be integrated into the load-balancing hierarchy quite easily. However, it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves; scheduler domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers won’t scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were “designed to run on 1, 2, maybe 4 processors”⁴ but wouldn’t scale beyond that. He seems to be perfectly right when he says that future versions of Windows will have to be fundamentally redesigned.

⁴ See http://www.news.com/8301-10784_3-9722524-7.html

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. The advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels, and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents the way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to explain intelligibly why pre-emptive scheduling is going to be obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer and user with a multitude of configuration options and programming decisions that are required for the framework to guarantee optimal performance. [29] introduces an easier-to-use thread scheduling mechanism based on the McRT efforts; experimental assessments that could testify to its performance, although planned, have not been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Achieving fairness and repeatability on today’s available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 3.1 and 3.2. The first approach is justified by a number of experimental results which show that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen whether the complexity of the scheduling approach and the amount of overhead it potentially introduces justify that improvement. Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

The algorithm mentioned in section 3.2 hasn’t been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description one can gather that it takes a lot of computation and thread migration to ensure the load balance, and it would be interesting to see the overhead that these computations and the cache misses imposed by the mechanism place on the system. Without any experimental data, those figures are hard to assess.


5 References

[1] G. E. Moore, “Cramming more components onto integrated circuits”, Electronics, Volume 38, Number 8, 1965

[2] “Why Parallel Processing”, http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2+Why+Parallel+Processing.htm

[3] O. Wechsler, “Inside Intel® Core™ Microarchitecture”, Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf

[4] “Key Architectural Features: AMD Athlon™ X2 Dual-Core Processors”, http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html

[5] Li et al., “Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures”, in: International Conference on High Performance Computing, Networking, Storage and Analysis, 2007

[6] Balakrishnan et al., “The Impact of Performance Asymmetry in Emerging Multicore Architectures”, in: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506–517, June 2005

[7] M. Annavaram, E. Grochowski and J. Shen, “Mitigating Amdahl’s law through EPI throttling”, in: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298–309, June 2005

[8] V. Pallipadi, S. B. Siddha, “Processor Power Management features and Process Scheduler: Do we need to tie them together?”, in: LinuxConf Europe 2007

[9] S. B. Siddha, “Multi-core and Linux Kernel”, http://oss.intel.com/pdf/mclinux.pdf

[10] http://kernelnewbies.org/Linux_2_6_23

[11] http://lwn.net/Articles/230574

[12] J. Andrews, “Linux: The Completely Fair Scheduler”, http://kerneltrap.org/node/8059

[13] A. Kumar, “Multiprocessing with the Completely Fair Scheduler”, http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

[14] Kernel 2.6.7 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7

[15] Scheduling domains, http://lwn.net/Articles/80911

[16] Kernel 2.6.17 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17

[17] T. Kidd, “C-states, C-states and even more C-states”, http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Scheduling, http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html

[19] Solaris manpage FSS(7), http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS

[20] Eric Saxe, “CMT and Solaris Performance”, http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris

[21] MSDN section on Windows scheduling, http://msdn.microsoft.com/en-us/library/ms685096(VS.85).aspx

[22] A. Fedorova, M. Seltzer and M. D. Smith, “Cache-Fair Thread Scheduling for Multicore Processors”, Technical Report TR-17-06, Harvard University, Oct. 2006

[23] S. Kim, D. Chandra and Y. Solihin, “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture”, in: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004

[24] D. Menasce and V. Almeida, “Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures”, in: Proceedings of the 4th International Conference on Supercomputing, June 1990

[25] M. D. Hill and M. R. Marty, “Amdahl’s Law in the Multicore Era”, IEEE Computer, 2008

[26] B. Crepps, “Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing”, Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

[27] A. Fedorova, D. Vengerov and D. Doucette, “Operating System Scheduling on Heterogeneous Core Systems”, to appear in: Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007

[28] B. Saha et al., “Enabling Scalability and Performance in a Large Scale CMP Environment”, in: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007

[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson, “Thread Scheduling for Multi-Core Platforms”, in: Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007


Page 5: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

Both architectures have been improved and sold with a range of consumer desktop and laptop computers - but also servers and workstations - up to today therefore the presence of multi-core processors in a large number of todays PCs can be assumed

12Whatrsquos so different about multi-core scheduling

One could assume that the scheduling process on such multi-core processors wouldnrsquot differ much from conventional scheduling ndash intuitively the run-queue would just have to be replaced by n run-queues where n is the number of cores and processes would simply be scheduled to the currently shortest run-queue (with some additional process-priority treatment maybe)

While that might seem reasonable there are some properties of current multi-core architectures that speak strongly against such a naiumlve approach First in many multi core architectures each core manages its own level 1 cache (Figure 1) By just naiumlvely rescheduling interrupted processes to a shorter queue which belongs to another core (task migration) parts of the processes cache working set may become unnecessarily lost and the overall performance may slow down This effect becomes even worse if the underlying architecture is not a multi-core but a NUMA system where memory access can become very costly if the process is scheduled on the ldquowrongrdquo node

Figure 1 Typical multi-core architecture

A second important point is that the performance of different cores in a multi-core system may be asymmetric regarding the performance of the different cores ([5]) This effect can emerge due to several reasons

5

Seminar ldquoParallel Computingldquo Summer term 2008

bull Design considerations Many slow cores can be used for increasing the throughput in parallel computation while a few faster cores contribute to the efficient processing of costly tasks which can not be parallelized ([6]) Even algorithms that are parallelizable contain parts that have to be executed sequentially which will benefit from the higher speed of the fastest core Hence performance-asymmetry has been shown to be a very efficient approach in multi-core architectures ([7])

bull Transistor failures Some parts of the CPU may get damaged over time and become automatically disabled by the CPU Since such components may fail in certain cores independently from the other cores performance asymmetry may arise in symmetric cores over time ([5])

bull Power-saving policies Different cores may switch to different P- or C-power-states at different times in order to save power At different P-states equal cores show a different clock-frequency If an OS scheduler manages to take this into account for processes not in need of all system resources the system can remain more energy-efficient over the execution time while giving away only little or no performance at all ([8])

Hence performance asymmetry the fact that various CPU components can be shared among cores and non-uniform access to computation resources such as memory mandate the design of efficient multi-core scheduling mechanisms or scheduling frameworks at the operating system level

Multi-core processors have gone mainstream and while there may be the demand that they are efficiently used in terms of performance the currently fashionable term Green-IT also motivates the energy-efficient use of the CPU coresSection 2 will explore how far current operating systems have evolved in support of the new architectures

6

Seminar ldquoParallel Computingldquo Summer term 2008

2 OS process scheduling state of the art

21 Linux scheduler

211The Completely Fair Scheduler

The Linux scheduler in versions prior to 2623 performed its tasks in complexity O(1) by basically just using per-CPU run-queues and priority arrays ([9]) Kernel version 2623 which was released on October 9 2007 introduced the so-called completely fair scheduler (CFS) The change in scheduler was mainly motivated by the failure of the old scheduler to correctly predict whether applications are interactive (IO-bound) or CPU-intensive ([10])

Therefore the new scheduler has completely abandoned the notion of different kinds of processes and treats them all equally The data-structure of a red-black tree is used to align the tasks according to their ldquorightrdquo to use the processor resources for a predefined interval until context-switch The process positioned at the leftmost node in the data structure is entitled most to use the processor resources at the time it occupies that position in the tree The position of a process in the tree is only dependent on the wait-time of the process in the run-queue (including the time the process is actually waiting for events) and the process priority ([11]) This concept is fairly simple but works with all kinds of processes especially interactive ones since they get a boost just by getting account for their IO-waiting time

However the total scheduling complexity is increased to O(log n) where n is the number of processes in the run-queue since at every context-switch the process has to be reinserted into the red-black tree

The scheduling algorithm itself has not been designed in special consideration of multi-core architectures When Ingo Molnar the designer of the scheduler was asked what the implications on HTSMPNUMA architectures would be he answered that there would inevitably be some effect and if it is negative he will fix it He admits that the fairness-approach can result in increased cache-coldness for processes in some situations ([12])

However the red-black trees of CFS are managed per runqueue ([13]) which assists in cooperation with the Linux load-balancer

212Scheduling domains

Linux load-balancing takes care of different cache models and computing architectures but at the moment not necessarily of performance asymmetry The underlying model of the Linux load balancer is the concept of scheduling domains which was introduced in Kernel version 267 due to the unsatisfying performance of Linux scheduling on SMP and NUMA systems in prior versions ([14])

Basically scheduling domains are hierarchical sets of computation units on which scheduling is possible the scheduling domain architecture is constructed based on the actual hardware resources of a computing element ([9]) Scheduling domains contain lists with scheduling groups that share common properties

7

Seminar ldquoParallel Computingldquo Summer term 2008

For example the way scheduling should be done on two logical processors of a HT-systems and two physical processors of a SMP system is different the HT ldquocoresrdquo share a common cache and memory hierarchy therefore task migration is not a problem if one of the logical cores becomes idle However in a SMP or multi-core system in which the cache or parts of the cache is administrated by each core separately migrating tasks with a large working set may become problematicThis applies even more to NUMA machine where different CPU may be closer or more remote to the memory the process is using Therefore all this architectures have to be treated differently

The scheduling domain concept introduces scheduling domains a logical union of computing resources that share common properties with whom it is reasonable to treat them equally and CPU groups within these domains Those groups contain hardware-addressable computing resources that are part of the domain on which the balancer can try to even the domain load out

Scheduling domains are hierarchically nested ndash there is a top-level domain containing all other domains of the physical system the Linux is running on Depending on the actual architecture the sub-domains represent NUMA node groups physical CPU groups multi-core groups or SMT groups in a respective hierarchical nesting This structure is built automatically based on the actual topology of the system and for reasons of efficiency each CPU keeps a copy of every domain it belongs to For example a logical SMT processor that at the same time is a core in a physical multi-core processor on a NUMA node with multiple (SMP) processors would totally administer 4 sched_domain structures one for each level of parallel computing it is involved in ([15])

Figure 2 Example hierarchy in the Linux scheduling domains

Load-balancing takes place at scheduling domain level between the different groups Each domain level is sensitive with respect to the constraints set by its properties regarding load balancing For example load balancing happens very often between logical simultaneous

8

Seminar ldquoParallel Computingldquo Summer term 2008

multithreading cores but very rarely on the NUMA level where remote memory access is costlyThe scheduling domain for multi-core processors was added with Kernel version 2617 ([16]) and especially considers the shared last level cache that multi-core architectures frequently possess Hence on a SMP machine with two multi-core packages two tasks will be scheduled on different packages if both packages are currently idle in order to make use of the overall larger cache

In recent Linux-kernels the multi-core processor scheduling domain also offers support for energy-efficient scheduling which can be used if eg the powersave governor is set in the cpufreq tool Saving energy can be achieved by changing the P- and the C-states of the cores in the different packages However P-states are transitions are made by adjusting the voltage and the frequency of a core and since there is only one voltage regulator per socket on the mainboard the P-state is dependent on the busiest core So as long as any core in a package is busy the P-state will be relatively low which corresponds to a high frequency and voltage While the P-states remain relatively fixed the C-states can be manipulated Adjusting the C-states means turning off parts of the registers blocking interrupts to the processor etc ([17]) and can be done on each core independently However the shared cache features its own C-state regulator and will always stay in the lowest C-state that any of the cores has Therefore energy-efficiency is often limited to adjusting the C-state of a non-busy core while leaving other C-states and the packagesrsquo P-state low

Linux scheduling within the multi-core domain with the powersave-governor turned on will attempt to schedule multiple tasks on one physical package as long as it is feasible This way other multi-core packages will be allowed to transition into higher P- and C-states The author of [9] claims that generally the performance impact will be relatively low and that the performance losspower saving trade-off will be rewarding if the energy-efficient scheduling approach is used

22 Windows scheduler

In Windows scheduling is conducted on threads The scheduler is priority-based with priorities ranging from 0 to 31 Timeslices are allocated to threads in a round-robin fashion these timeslices are assigned to highest priority threads first and only if know thread of a given priority is ready to run at a certain time lower priority threads may receive the timeslice However if higher-priority threads become ready to run the lower priority threads are preempted

In addition to the base priority of a thread Windows dynamically changes the priorities of low-prioritized threads in order to ensure ldquofeltrdquo responsiveness of the operating system For example the thread associated with the foreground window on the Windows desktop receives a priority boost After such a boost the thread-priority gradually decays back to the base priority ([21])

Scheduling on SMP-systems is basically the same except that Windows keeps the notion of a threadrsquos processor affinity and an ideal processor for a thread The ideal processor is the processor with for example the highest cache-locality for a certain thread However if the ideal processor is not idle at the time of lookup the thread may just run on another processorIn [21] and other sources no explicit information is given on scheduling mechanisms especially specific to multi-core architectures though

9

Seminar ldquoParallel Computingldquo Summer term 2008

23Solaris scheduler

In the Solaris scheduler terminology processes are called kernel- or user-mode threads dependant on the space in which they run User threads donrsquot only exist in the user space ndash whenever a user thread is created a so-called lightweight process is set up that connects the user thread to a kernel thread These kernel threads are object to schedulingSolaris 10 offers a number of scheduling classes which may co-exist on one system The scheduling classes provide an adaptive model for the specific types of applications which are intended to run on top of the operating system ([18]) mentions Solaris 10 scheduling classes for

bull Standard (timeshare) threads whose priority may be adapted by the schedulerbull Interactive applications (the currently active window in the graphical user interface)bull Fair sharing of system resources (instead of priority-based sharing)bull Fixed-priority threads The priority of threads scheduled with this scheduler does not

vary over the scheduling timebull System (kernel) threadsbull Real-time threads with a fixed priority and time share Threads in this scheduling

class may preempt system threads

The scheduling classes for timeshare fixed-priority and fair sharing are not recommended for simultaneous use on a system while other combinations of scheduling classes on a set of CPUs are possible

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt of trying to identify IO bound processes and providing them with a priority boost Threads have a fixed time quantum they may use once they get the context and receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context Fair share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling Different processes (actually collection of processes or in Solaris 10 terminology projects) compete for quanta on a computing resource and their position in that competition depends on how large the value they have been assigned is in relation to the total quanta number on the computing resource

Solaris explicitly deals with the scenario that parts of the processorrsquos resources may be shared as it is likely with typical multi-core processors There is a kernel abstraction called ldquoprocessor grouprdquo (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket) These groupings can be investigated by the dispatcher eg in order to maintain logical CPU affinity for the purpose of cache-hotness where it is reasonable Quite similar to the concept of Linuxrsquos scheduling domains Solaris 10 tries to simultaneously achieve load balancing on multiple levels (for example if there are physical CPUs with multiple cores and SMT) ([20])

10

Seminar ldquoParallel Computingldquo Summer term 2008

3 Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics many of which are orthogonal (eg maximizing fairness and throughput) The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling Sections 31 and 32 summarize proposals to improve fairness and load-balancing on current multi-core architectures while sections 33 and 34 concentrate on approaches for scheduling on promising new computing architectures such as multi-core processors with asymmetric performance and many-core CPUs

31Cache-Fairness

Several studies (eg [22] [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level 2 cache and thus slow-up threads running on the other core that uses the same cacheThe situation is unsatisfactory due to several reasons First it can lead to unpredictable execution times and throughput and second scheduling priorities may loose their effectiveness because of threads running on cores with aggressive ldquoco-runnersrdquo (ie threads running on another core in the same package) Figure 3 shows such a scenario Thread B uses the larger part of the shared cache and thus maybe negatively influences the cycles per instruction that thread A achieves during its CPU time share

L2-cache-misses are more costly than L1-cache-misses because the latency to the memory is bigger than to the next cache-level However it is mostly the L2-cache that is shared among different coresThe authors of [22] try to mitigate the above mentioned effects by introducing a cache-fairness aware scheduler for the Solaris 10 operating system

Figure 3 Unfair cache utilization by thread B

In their scheduling algorithm the threads on a system are grouped into a best effort class and a cache-fair class Best effort threads are penalized for the sake of performance stability of cache-fair threads if necessary but not vide-versa However it is taken care that this does not result in inadequate discrimination of best effort threads Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners at the expense of these co-runners if they are best effort threads Figure 4 illustrates that process

11

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 4 Restoring fairness by adjusting timeshares

In order to compute the quantum that the thread is entitled to the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances namely the fair L2 cache miss rate the fair CPI rate and the fair number of instructions (during a quantum) All those values are based on the assumption of fair circumstances so the difference to the actual values is computed and a quantum extension is calculated which should serve as ldquocompensationrdquo

Those calculations are done once for new cache-fair class threads ndash their L2 cache miss rates and other data is measured with different co-runners Subsequently the dispatcher periodically selects best-effort threads from whom he takes parts of their quanta and assigns them as ldquocompensationrdquo to the cache-fair threads New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase but whenever new combinations of cache-fair threads with co-runners occur analytical input is mined from the CPUrsquos hardware counters

The authors of [22] state that according to their measurements the penalties on best-effort threads are low while the algorithm actually enforces priority better than standard scheduling and improves fairness These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package The execution times under these circumstances are compared with the times of threads running with co-runners with low-cache requirements This comparison shows differences of up to 37 in execution time between the two scenarios on a system with a standard scheduler while using the cache-fair scheduler the variability decreases to 7At the same time however measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8 for some threads (while some even experience a speed-up)

32Balancing core assignment

Fedorova et al ([27]) argue that during operating system scheduling several aspects have to be considered besides optimal performance It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (ie unpredictability of a tasks completion time) and as a consequence insufficient priority enforcement It could be part of the operating system scheduler to ensure that the jobs are evenly assigned to cores In the approach described in [27] this is done by using a self-tuning scheduling algorithm based on a per-core benefit function This benefit function is based on three input components

12

Seminar ldquoParallel Computingldquo Summer term 2008

1) The normalized core preference of a thread which is based on the instructions per cycle that a thread j can achieve on a certain core i ( ijIPC ) normalized by max(

kjIPC ) (where k is an arbitrary CPUcore)2) The cache-affinity a value which is 1 if the thread j was scheduled on core i within a

tuneable time period and 0 otherwise3) The average cache investment of a thread on a core which is determined by inspecting

the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j For each core a benefit function 0iB is computed that represents the case of no thread migrations taking place For each thread k on a core and updated benefit value for the hypothetical scenario of the migration of the thread onto another core minuskiB is computed Of course the benefit will increase since fewer threads are executed on the core But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core which would influence the benefit value of the target core Therefore also the updated benefit value of any other system core j to which the thread in question would be migrated to has to be computed and is called +kjB The hypothetical migration of thread k from core i to core j becomes reality if

RTFCABjikjki DbDaBBBB 00 +++gt+ +minus CABD represents a system-wide balance-constraint while RTFD ensures a per-job response-time-fairness (ie the slowdown that results for the thread in question from the thread-migration does not exceed some maximum value)

These two constants together with the criterions included in the benefit function itself (most notably cache-affinity) should help to ensure that the self-tuning fulfils the three goals of optimal performance core-assignment balance and response-time-fairness

However the authors have not yet actually implemented the scheduling modification in Linux and hence the statements on its effectiveness remain somehow theoretical

33Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU For example it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized while multiple slower cores come to play when parallel code parts are executed By keeping the cores for parallel execution slower than the core(s) for serial execution die-area and cost can be saved ([24]) while power consumption may be reduced[25] closely analyzes the impact of performance asymmetry on the average speedup of an application with increase of cores The paper concludes that ldquo[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)rdquo Hence performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications but donrsquot scale much worse with highly parallel programs (see Figure 5)

13

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 5 Comparison of speedup with SMP and AMP using highly parallel programs (left) moderately parallel programs (middle) and highly sequential programs (right)1

Apart from explicit design performance asymmetry can also occur in initially symmetric multi-core processors by either power-saving mechanisms (increasing the C-state) or failure of transistors that leads to disabling of parts of the corersquos components

Problems arise by the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly Therefore the operating system scheduler should support processor performance asymmetry which is currently not the case for most schedulers However it would be imaginable to see this as a Linux scheduling domain in the future

[5] describes AMPS an approach on how a performance-asymmetry aware scheduler for Linux could look like Basically the scheduler consists of three components An asymmetry-specific load balancing a faster-core first scheduler and a migration mechanism specifically for NUMA machines that will not be covered in detail here The scheduling mechanism tries to achieve a better performance fairness (with respect to the thread priorities) and repeatability of performance resultsIn order to conduct load-balancing the core performance is assessed in a first step AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1 Then the scaled load of a core is the run-queue length of a core divided by its performance quantifier If this scaled load is within a maximum and a minimum threshold then the system is considered to be load-balanced otherwise threads will be migrated By the scaling factor faster cores will receive more workload than slower ones

Besides the load-balancing cores with lower scaled load are preferred in thread placement Whenever new threads are scheduled the new scaled load that would apply if the thread was scheduled to one specific core is computed Then the thread is scheduled to the core with the least new scaled load in case of a tie faster cores are preferred Additionally the load-balancer may migrate threads even if the source core of migration can become idle by doing it AMPS only migrates threads if the new scaled load of the destination core does not exceed

1 Image source httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

14

Seminar ldquoParallel Computingldquo Summer term 2008

the scaled load on the source-core This way the benefits of available fast cores can be fully exploited while not overburdening themIt can be expected that frequent core migration results in performance loss by cache misses (eg in the level cache) However experimental results in [5] reveal no excessive performance loss by the fact that task migration among cores occurs more often than in standard schedulers

Figure 6 Speedup of the AMPS scheduler compared to Linux scheduler on AMP with two fast and six slow cores2

Instead performance on SMP systems measurably improves (see Figure 6 standard scheduler speed would be 1 median speedup is 116) while fairness and repeatability is preserved better than with standard Linux (there is less deviation in the execution times of different processes)

34Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moorersquos prediction Such CPU architectures are going to require novel approaches in thread scheduling since the available shared memory per core is potentially smaller while main memory access time increases and single-threaded performance decreases Even more so than with current CMP chips schedulers that treat such large scale CMP architectures just like SMP systems are going to fail with respect to performance

[28] identifies three major challenges in scheduling on many-core architectures namely the comparatively small cache sizes which render memory access costly the fact that non-specialized programmers will have to program code for them and the wide range of application scenarios that have to be considered for such chips The latter two challenges result from the projected general-purpose use of future many-core architectures

In order to deal with these challenges an OS-independent experimental runtime-environment called McRT is presented in [28] The runtime environment was built from scratch and is

2 Picture taken from [5]

15

Seminar ldquoParallel Computingldquo Summer term 2008

independent of operating system kernels ndash hence the performed operations occur at user-level The connection to the underlying operating system is established using so called host adapters while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors For programmability it provides high level transactions (instead of locks) and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour)

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability Basically the scheduler is configured using three parameters P Q and T which respectively denote the number of (logical) processors task queues and threading abstractions P and Q change the scheduling behaviour from strict affinity to work stealing T can be used to specify different behaviour for different threads This way the concept of scheduler domains can be realized

It is notable that the scheduling system does not use pre-emption instead threads are expected to yield at certain times This design choice has been motivated by the authorsrsquo belief that pre-emption stems from the need of time-sharing expensive and fast resources which will become obsolete with many-core architectures The runtime actively supports constructs such as barriers which are often needed in parallel programming Barriers are designed to avoid busy waiting ndash for example a thread yields once it has reached the barrier but wonrsquot be re-scheduled until all other threads have reached the barrierWith a pre-emptive scheduling mechanism the thread would receive the context from time to time just to check whether other threads have reached the barrier ndash with the integrated barrier support based on a co-operative scheduling approach used in McRT this wonrsquot happen

The client-side adaptor eg for OpenMP promises to directly translate many OpenMP constructs to the McRT API

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications such as the XviD MPEG4 encoder singular value decomposition (SVD) and neural networks (SOM) The results were gathered on a cycle-accurate many-core simulator with 32 Kbyte L1 cache shared among 4 simultaneous multithreaded cores which form one physical core 32 such physical cores share 2 Mbyte L2 cache and 4 Mbyte off-chip L3 cache The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and the resource it is waiting for Only if the resource becomes available the CPU will execute the thread Hence threads that are waiting barriers and locks donrsquot consume system resources It is important to consider that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT

16

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 7 Speedup of XviD encoding in the McRT framework (compared to single core performance)3

The experiments reveal that XviD encoding scales very well on McRT (Figure 7 1080p and 768p denote different video resolutions the curve labelled ldquolinearrdquo models the ideal speedup)

However the encoding process was explicitly tweaked for the many-core scenario ndash under the condition that only very little fast memory exists per logical core parallelization wasnrsquot conducted at frame level ndash instead single frames were partitioned into parts of frames which were encoded in parallel using OpenMP

The scalability of SVD and SOM is quite similar to that of XviD more figures and evaluations of different scheduling strategies can be found in [28]

3 Picture is from [28]

17

Seminar ldquoParallel Computingldquo Summer term 2008

4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moorersquos law in the future AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moorersquos law can stay valid Operating system schedulers are going to have to adapt to the changing underlying architectures

The scheduler domains of Linux and Solaris add some urgently needed OS-flexibility ndash because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be quite easily integrated into the load-balancing hierarchy however it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves But scheduler domains at least provide the required flexibility at the level of threads

Sadly it has to be said that current Windows schedulers wonrsquot scale with the number of cores or performance asymmetry at any rate Basically the Windows scheduler treats multi-core architecture like SMP systems and hence can not give proper care to the peculiarities of such CPUs like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures) Ty Carlson director of technical strategy at Microsoft even mentioned at a panel discussion that current Windows releases (including Vista) were ldquodesigned to run on 12 maybe 4 processorsrdquo4 but wouldnrsquot scale beyond He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned

Current research shows the road such a redesign could follow The approach described in section 33 seems to perform pretty well for multi-core processors with asymmetric performance The advantage is that the load balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adaption Experimental evaluation of the scheduling algorithm reveals promising results also with regard to fairness

Section 34 presents the way in which futuristic schedulers on upcoming many-core architectures could operate The runtime environment McRT makes use of interesting techniques and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures However their implementation is realized in user-space and burdens the programmeruser with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance[29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT experimental assessments which could testify on its performance although planned havenrsquot been conducted yet It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures since they might gain fundamentally in importance in the future

Achieving fairness and repeatability on todayrsquos available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 31 and 32 The first approach is justified by a number of experimental results that show that priorities are actually enforced much better than with conventional schedulers however it remains to be seen 4 See httpwwwnewscom8301-10784_3-9722524-7html

18

Seminar ldquoParallel Computingldquo Summer term 2008

whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level if possible

The algorithm mentioned in section 32 hasnrsquot been implemented yet so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice From the description one can figure that it takes a lot of computations and thread migrations in order to ensure the load-balance and it would be interesting to see the overhead from the computations and the cache-misses imposed by the mechanism on the system Without any experimental data those figures are hard to assess

19

Seminar ldquoParallel Computingldquo Summer term 2008

5 References

[1] G E Moore bdquoCramming more components onto integrated circuitsldquo Electronics Volume 38 Number 8 1965

[2] bdquoWhy Parallel Processingldquo httpwwwtccornelleduServicesEducationTopicsParallelConcepts2+Why+Parallel+Processinghtm

[3] O Wechsler bdquoInside Intelreg Coretrade Microarchitectureldquo Intel Technology Whitepaper

httpdownloadintelcomtechnologyarchitecturenew_architecture_06pd f

[4] bdquoKey Architectural Features AMD Athlontrade X2 Dual-Core Processorsldquo httpwwwamdcomus-enProcessorsProductInformation030_118_9485_130415E1304300html

[5] Li et al bdquoEfficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architecturesldquo In International conference on high performance computing networking storage and analysis 2007

[6] Balakrishnan et al bdquoThe Impact of Performance Asymmetry in Emerging Multicore Architecturesrdquo In Proceedings of the 32nd Annual International Symposium on Com-puter Architecture pages 506ndash517 June 2005

[7] M Annavaram E Grochowski and J Shen ldquoMitigating Amdahlrsquos law through EPI throttlingrdquo In Proceedings of the 32nd Annual International Symposium on ComputerArchitecture pages 298ndash309 June 2005

[8] V Pallipadi SB Siddha ldquoProcessor Power Management features and ProcessScheduler Do we need to tie them togetherrdquo In LinuxConf Europe 2007

[9] SB Siddha ldquoMulti-core and Linux Kernelrdquo httpossintelcompdfmclinuxpdf

[10] httpkernelnewbiesorgLinux_2_6_23

[11] httplwnnetArticles230574

[12] J Andrews ldquoLinux The Completely Fair Schedulerldquo httpkerneltraporgnode8059

[13] A Kumar ldquoMultiprocessing with the Completely Fair Schedulerrdquohttpwwwibmcomdeveloperworkslinuxlibraryl-cfsindexhtml

[14] Kernel 267 Changelog httpwwwkernelorgpublinuxkernelv26ChangeLog-267

[15] Scheduling domainshttplwnnetArticles80911

[16] Kernel 2617 Changeloghttpwwwkernelorgpublinuxkernelv26ChangeLog-2617

20

Seminar ldquoParallel Computingldquo Summer term 2008

[17] T Kidd ldquoC-states C-states and even more C-statesrdquo httpsoftwareblogsintelcom20080327update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Schedulinghttpwwwprincetonedu~unixSolaristroubleshootschedulehtml

[19] Solaris manpage FSS(7)httpdocssuncomappdocsdoc816-5177fss-7l=deampa=viewampq=FSS

[20] Eric Saxe ldquoCMT and Solaris Performancerdquohttpblogssuncomesaxeentrycmt_performance_enhancements_in_solaris

[21] MSDN section on Windows schedulinghttpmsdnmicrosoftcomen-uslibraryms68509628VS8529aspx

[22] A Fedorova M Seltzer and M D Smith ldquoCache-Fair Thread Scheduling for Multicore Processorsrdquo Technical Report TR-17-06 Harvard University Oct 2006

[23] S Kim D Chandra and Y Solihin ldquoFair Cache Sharing and Partitioning in a Chip Multiprocessor Architecturerdquo In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques 2004

[24] D Menasce and V Almeida ldquoCost-Performance Analysis of Heterogeneity in Supercomputer Architecturesrdquo In Proceedings of the 4th International Conference on Supercomputing June 1990

[25] M D Hill and M R Marty ldquoAmdahlrsquos Law in the Multicore Erardquo In IEEE Computer 2008

[26] B Crepps ldquoImproving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessingrdquo Intel Technology Magazine httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

[27] A Fedorova D Vengerov and D Doucette ldquoOperating System Scheduling on Heterogeneous Core Systemsrdquo to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures 2007

[28] B Saha et al ldquoEnabling Scalability and Performance in a Large Scale CMP Environmentrdquo Proceedings of the 2nd ACM SIGOPSEuroSys European Conference on Computer Systems 2007

[29] M Rajagopalan B T Lewis and T A Anderson ldquoThread Scheduling for Multi-Core Platformsrdquo in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems 2007

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 6: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

bull Design considerations Many slow cores can be used for increasing the throughput in parallel computation while a few faster cores contribute to the efficient processing of costly tasks which can not be parallelized ([6]) Even algorithms that are parallelizable contain parts that have to be executed sequentially which will benefit from the higher speed of the fastest core Hence performance-asymmetry has been shown to be a very efficient approach in multi-core architectures ([7])

bull Transistor failures Some parts of the CPU may get damaged over time and become automatically disabled by the CPU Since such components may fail in certain cores independently from the other cores performance asymmetry may arise in symmetric cores over time ([5])

bull Power-saving policies Different cores may switch to different P- or C-power-states at different times in order to save power At different P-states equal cores show a different clock-frequency If an OS scheduler manages to take this into account for processes not in need of all system resources the system can remain more energy-efficient over the execution time while giving away only little or no performance at all ([8])

Hence performance asymmetry the fact that various CPU components can be shared among cores and non-uniform access to computation resources such as memory mandate the design of efficient multi-core scheduling mechanisms or scheduling frameworks at the operating system level

Multi-core processors have gone mainstream and while there may be the demand that they are efficiently used in terms of performance the currently fashionable term Green-IT also motivates the energy-efficient use of the CPU coresSection 2 will explore how far current operating systems have evolved in support of the new architectures


2 OS process scheduling state of the art

2.1 Linux scheduler

2.1.1 The Completely Fair Scheduler

The Linux scheduler in versions prior to 2.6.23 performed its tasks in complexity O(1) by basically just using per-CPU run-queues and priority arrays ([9]). Kernel version 2.6.23, which was released on October 9, 2007, introduced the so-called Completely Fair Scheduler (CFS). The change of scheduler was mainly motivated by the failure of the old scheduler to correctly predict whether applications are interactive (I/O-bound) or CPU-intensive ([10]).

Therefore the new scheduler has completely abandoned the notion of different kinds of processes and treats them all equally. A red-black tree is used to order the tasks according to their “right” to use the processor resources for a predefined interval until the next context switch. The process positioned at the leftmost node in the data structure is entitled most to use the processor resources while it occupies that position in the tree. The position of a process in the tree depends only on the wait-time of the process in the run-queue (including the time the process is actually waiting for events) and on the process priority ([11]). This concept is fairly simple but works with all kinds of processes, especially interactive ones, since they get a boost simply by being credited for their I/O waiting time.

However, the total scheduling complexity increases to O(log n), where n is the number of processes in the run-queue, since at every context switch the process has to be reinserted into the red-black tree.
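To make the mechanism concrete, the following is a minimal sketch in C of how picking the leftmost node of an ordered tree yields the next task to run. All structures and names are invented for illustration; the real kernel keys its red-black tree on virtual runtimes and uses a far more elaborate implementation.

    /* Simplified CFS-style selection: tasks are ordered by a key that
     * grows with consumed CPU time and shrinks with time spent waiting,
     * so the leftmost (smallest-key) task is the most entitled to run. */
    #include <stdio.h>

    struct task {
        const char *name;
        long key;                      /* smaller key = more entitled */
        struct task *left, *right;
    };

    static struct task *insert(struct task *root, struct task *t) {
        if (root == NULL) return t;
        if (t->key < root->key) root->left  = insert(root->left, t);
        else                    root->right = insert(root->right, t);
        return root;
    }

    static struct task *pick_next(struct task *root) {
        while (root != NULL && root->left != NULL)
            root = root->left;         /* leftmost node */
        return root;
    }

    int main(void) {
        struct task a = {"A", 40, NULL, NULL};  /* consumed much CPU   */
        struct task b = {"B", 10, NULL, NULL};  /* waited on I/O       */
        struct task c = {"C", 25, NULL, NULL};
        struct task *root = NULL;
        root = insert(root, &a);
        root = insert(root, &b);
        root = insert(root, &c);
        printf("next task: %s\n", pick_next(root)->name);  /* "B" */
        return 0;
    }

Here a plain binary search tree stands in for the kernel's red-black tree; the self-balancing property only matters for guaranteeing the O(log n) bound, not for the selection logic.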

The scheduling algorithm itself has not been designed with special consideration of multi-core architectures. When Ingo Molnar, the designer of the scheduler, was asked what the implications on HT/SMP/NUMA architectures would be, he answered that there would inevitably be some effect, and that if it is negative, he will fix it. He admits that the fairness approach can result in increased cache-coldness for processes in some situations ([12]).

However, the red-black trees of CFS are managed per run-queue ([13]), which assists in cooperation with the Linux load-balancer.

2.1.2 Scheduling domains

Linux load-balancing takes care of different cache models and computing architectures, but at the moment not necessarily of performance asymmetry. The underlying model of the Linux load balancer is the concept of scheduling domains, which was introduced in kernel version 2.6.7 due to the unsatisfying performance of Linux scheduling on SMP and NUMA systems in prior versions ([14]).

Basically, scheduling domains are hierarchical sets of computation units on which scheduling is possible; the scheduling domain architecture is constructed based on the actual hardware resources of a computing element ([9]). Scheduling domains contain lists of scheduling groups that share common properties.


For example, the way scheduling should be done on two logical processors of an HT system differs from scheduling on two physical processors of an SMP system: the HT “cores” share a common cache and memory hierarchy, so task migration is not a problem if one of the logical cores becomes idle. However, in an SMP or multi-core system in which the cache, or parts of the cache, are administered by each core separately, migrating tasks with a large working set may become problematic. This applies even more to NUMA machines, where different CPUs may be closer to or more remote from the memory the process is using. Therefore, all these architectures have to be treated differently.

The scheduling domain concept introduces scheduling domains, logical unions of computing resources that share common properties and that it is reasonable to treat equally, as well as CPU groups within these domains. Those groups contain hardware-addressable computing resources that are part of the domain, among which the balancer can try to even out the domain load.

Scheduling domains are hierarchically nested: there is a top-level domain containing all other domains of the physical system Linux is running on. Depending on the actual architecture, the sub-domains represent NUMA node groups, physical CPU groups, multi-core groups or SMT groups in a respective hierarchical nesting. This structure is built automatically based on the actual topology of the system, and for reasons of efficiency each CPU keeps a copy of every domain it belongs to. For example, a logical SMT processor that at the same time is a core in a physical multi-core processor on a NUMA node with multiple (SMP) processors would administer a total of four sched_domain structures, one for each level of parallel computing it is involved in ([15]).

Figure 2: Example hierarchy in the Linux scheduling domains
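The nesting can be pictured with a small sketch; this is hypothetical C, not kernel code, and the level names and balance intervals are invented. Each CPU holds a chain of domains from the SMT level up to the NUMA level, and load balancing walks that chain with increasing reluctance:

    /* Hypothetical per-CPU scheduling-domain chain: balancing is
     * attempted often at the SMT level, less often at the multi-core
     * (MC) level, and only rarely at the NUMA level.                 */
    #include <stdio.h>

    struct sched_domain {
        const char *level;             /* "SMT", "MC" or "NUMA"        */
        unsigned long balance_interval_ms;
        struct sched_domain *parent;   /* next higher domain, or NULL  */
    };

    static void balance_walk(const struct sched_domain *sd,
                             unsigned long now_ms) {
        for (; sd != NULL; sd = sd->parent)
            if (now_ms % sd->balance_interval_ms == 0)
                printf("balancing groups at the %s level\n", sd->level);
    }

    int main(void) {
        struct sched_domain numa = {"NUMA", 1024, NULL};
        struct sched_domain mc   = {"MC",     64, &numa};
        struct sched_domain smt  = {"SMT",     4, &mc};
        balance_walk(&smt, 4096);      /* triggers all three levels */
        return 0;
    }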

Load-balancing takes place at the scheduling domain level between the different groups. Each domain level is sensitive to the constraints set by its properties regarding load balancing. For example, load balancing happens very often between logical simultaneous


multithreading cores, but very rarely on the NUMA level, where remote memory access is costly. The scheduling domain for multi-core processors was added with kernel version 2.6.17 ([16]) and especially considers the shared last-level cache that multi-core architectures frequently possess. Hence, on an SMP machine with two multi-core packages, two tasks will be scheduled on different packages if both packages are currently idle, in order to make use of the overall larger cache.

In recent Linux kernels, the multi-core processor scheduling domain also offers support for energy-efficient scheduling, which can be used if e.g. the powersave governor is set in the cpufreq tool. Saving energy can be achieved by changing the P- and the C-states of the cores in the different packages. However, P-state transitions are made by adjusting the voltage and the frequency of a core, and since there is only one voltage regulator per socket on the mainboard, the P-state is dependent on the busiest core. So as long as any core in a package is busy, the P-state will be relatively low, which corresponds to a high frequency and voltage. While the P-states remain relatively fixed, the C-states can be manipulated. Adjusting the C-states means turning off parts of the registers, blocking interrupts to the processor etc. ([17]), and can be done on each core independently. However, the shared cache features its own C-state regulator and will always stay in the lowest C-state that any of the cores has. Therefore, energy-efficiency is often limited to adjusting the C-state of a non-busy core while leaving the other C-states and the package's P-state low.

Linux scheduling within the multi-core domain with the powersave governor turned on will attempt to schedule multiple tasks on one physical package as long as this is feasible. This way, other multi-core packages are allowed to transition into higher P- and C-states. The author of [9] claims that generally the performance impact will be relatively low and that the performance-loss/power-saving trade-off will be rewarding if the energy-efficient scheduling approach is used.
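A hedged sketch of the placement policy just described: under the powersave governor, new work is packed onto a package that is already busy, so that fully idle packages can stay in their power-saving states. The topology and the decision function below are invented for illustration only.

    /* Sketch of a powersave-style placement decision: prefer a package
     * that is busy but not full, so idle packages remain in deep
     * C-states / low-power P-states. Not the actual kernel code.     */
    #include <stdio.h>

    #define PACKAGES      2
    #define CORES_PER_PKG 2

    static int busy[PACKAGES][CORES_PER_PKG] = {{1, 0}, {0, 0}};

    static int pick_core_powersave(void) {
        int pkg, core;
        for (pkg = 0; pkg < PACKAGES; pkg++) {
            int load = 0;
            for (core = 0; core < CORES_PER_PKG; core++)
                load += busy[pkg][core];
            if (load > 0 && load < CORES_PER_PKG)   /* busy, not full */
                for (core = 0; core < CORES_PER_PKG; core++)
                    if (!busy[pkg][core])
                        return pkg * CORES_PER_PKG + core;
        }
        return 0;  /* fall back (a performance policy would spread out) */
    }

    int main(void) {
        /* core 0 of package 0 is busy -> pack onto core 1, package 0 */
        printf("place new task on core %d\n", pick_core_powersave());
        return 0;
    }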

2.2 Windows scheduler

In Windows, scheduling is conducted on threads. The scheduler is priority-based, with priorities ranging from 0 to 31. Timeslices are allocated to threads in a round-robin fashion: they are assigned to the highest-priority threads first, and only if no thread of a given priority is ready to run at a certain time may lower-priority threads receive the timeslice. However, if higher-priority threads become ready to run, the lower-priority threads are preempted.

In addition to the base priority of a thread, Windows dynamically changes the priorities of low-prioritized threads in order to ensure “felt” responsiveness of the operating system. For example, the thread associated with the foreground window on the Windows desktop receives a priority boost. After such a boost, the thread priority gradually decays back to the base priority ([21]).

Scheduling on SMP systems is basically the same, except that Windows keeps the notion of a thread's processor affinity and an ideal processor for a thread. The ideal processor is the processor with, for example, the highest cache-locality for a certain thread. However, if the ideal processor is not idle at the time of lookup, the thread may just run on another processor. In [21] and other sources, no explicit information is given on scheduling mechanisms specific to multi-core architectures, though.
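The ideal-processor behaviour can be illustrated with a small sketch. The names, the affinity mask and the fallback policy are assumptions made for illustration, not the documented Windows API:

    /* Sketch of "ideal processor" selection as described above: run on
     * the ideal CPU if it is idle, otherwise fall back to any idle CPU
     * permitted by the thread's affinity mask.                        */
    #include <stdio.h>

    #define NCPUS 4

    static int cpu_idle[NCPUS] = {0, 1, 1, 0};   /* CPUs 1 and 2 idle */

    static int select_cpu(int ideal, unsigned affinity_mask) {
        int cpu;
        if ((affinity_mask & (1u << ideal)) && cpu_idle[ideal])
            return ideal;                        /* best cache locality */
        for (cpu = 0; cpu < NCPUS; cpu++)        /* any permitted idle  */
            if ((affinity_mask & (1u << cpu)) && cpu_idle[cpu])
                return cpu;
        return ideal;   /* none idle: queue on the ideal CPU anyway */
    }

    int main(void) {
        /* ideal CPU 0 is busy, affinity allows CPUs 0-3 -> picks CPU 1 */
        printf("thread runs on CPU %d\n", select_cpu(0, 0xF));
        return 0;
    }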


2.3 Solaris scheduler

In the Solaris scheduler terminology, processes are called kernel- or user-mode threads, depending on the space in which they run. User threads don't only exist in the user space: whenever a user thread is created, a so-called lightweight process is set up that connects the user thread to a kernel thread. These kernel threads are the objects of scheduling. Solaris 10 offers a number of scheduling classes, which may co-exist on one system. The scheduling classes provide an adaptive model for the specific types of applications which are intended to run on top of the operating system. [18] mentions Solaris 10 scheduling classes for:

• Standard (timeshare) threads, whose priority may be adapted by the scheduler
• Interactive applications (the currently active window in the graphical user interface)
• Fair sharing of system resources (instead of priority-based sharing)
• Fixed-priority threads; the priority of threads scheduled with this scheduler does not vary over the scheduling time
• System (kernel) threads
• Real-time threads, with a fixed priority and time share; threads in this scheduling class may preempt system threads

The scheduling classes for timeshare, fixed-priority and fair sharing are not recommended for simultaneous use on a system, while other combinations of scheduling classes on a set of CPUs are possible.

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt to identify I/O-bound processes and provide them with a priority boost. Threads have a fixed time quantum they may use once they get the context, and they receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context. Fair-share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling. Different processes (actually collections of processes, or in Solaris 10 terminology, projects) compete for quanta on a computing resource, and their position in that competition depends on how large the value they have been assigned is in relation to the total quanta number on the computing resource.
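As an illustration of the fair-share idea, the following sketch (with invented project names and share counts) computes each project's CPU entitlement from its shares relative to the total number of shares of active projects:

    /* Minimal fair-share entitlement sketch: a project's slice of the
     * CPU is its share count divided by the sum of all active shares. */
    #include <stdio.h>

    struct project { const char *name; int shares; };

    int main(void) {
        struct project p[] = {{"web", 30}, {"db", 60}, {"batch", 10}};
        int i, total = 0;
        for (i = 0; i < 3; i++) total += p[i].shares;
        for (i = 0; i < 3; i++)
            printf("%-6s gets %5.1f%% of the CPU\n",
                   p[i].name, 100.0 * p[i].shares / total);
        return 0;
    }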

Solaris explicitly deals with the scenario that parts of the processor's resources may be shared, as is likely with typical multi-core processors. There is a kernel abstraction called “processor group” (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket). These groupings can be investigated by the dispatcher, e.g. in order to maintain logical CPU affinity for the purpose of cache-hotness where it is reasonable. Quite similarly to the concept of Linux's scheduling domains, Solaris 10 tries to achieve load balancing simultaneously on multiple levels (for example if there are physical CPUs with multiple cores and SMT) ([20]).


3 Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics, many of which are orthogonal (e.g. maximizing fairness and throughput). The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling. Sections 3.1 and 3.2 summarize proposals to improve fairness and load-balancing on current multi-core architectures, while sections 3.3 and 3.4 concentrate on approaches for scheduling on promising new computing architectures such as multi-core processors with asymmetric performance and many-core CPUs.

3.1 Cache-Fairness

Several studies (e.g. [22], [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level-2 cache and thus slow down threads running on the other core that uses the same cache. The situation is unsatisfactory for several reasons: first, it can lead to unpredictable execution times and throughput, and second, scheduling priorities may lose their effectiveness because of threads running on cores with aggressive “co-runners” (i.e. threads running on another core in the same package). Figure 3 shows such a scenario: thread B uses the larger part of the shared cache and thus may negatively influence the cycles per instruction that thread A achieves during its CPU time share.

L2 cache misses are more costly than L1 cache misses because the latency to the memory is larger than to the next cache level; however, it is mostly the L2 cache that is shared among different cores. The authors of [22] try to mitigate the above-mentioned effects by introducing a cache-fairness-aware scheduler for the Solaris 10 operating system.

Figure 3: Unfair cache utilization by thread B

In their scheduling algorithm, the threads on a system are grouped into a best-effort class and a cache-fair class. Best-effort threads are penalized for the sake of performance stability of cache-fair threads if necessary, but not vice versa. However, care is taken that this does not result in inadequate discrimination of best-effort threads. Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners, at the expense of these co-runners if they are best-effort threads. Figure 4 illustrates that process.


Figure 4: Restoring fairness by adjusting timeshares

In order to compute the quantum that the thread is entitled to, the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances, namely the fair L2 cache miss rate, the fair CPI rate and the fair number of instructions (during a quantum). All those values are based on the assumption of fair circumstances, so the difference to the actual values is computed, and a quantum extension is calculated which should serve as “compensation”.

Those calculations are done once for new cache-fair class threads: their L2 cache miss rates and other data are measured with different co-runners. Subsequently, the dispatcher periodically selects best-effort threads from which it takes parts of their quanta and assigns them as “compensation” to the cache-fair threads. New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase, but whenever new combinations of cache-fair threads with co-runners occur, analytical input is mined from the CPU's hardware counters.
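A sketch of the compensation arithmetic, under the assumption that the fair CPI predicted by the model is compared against the measured CPI to size a quantum extension; all numbers are invented, and the actual model in [22] is more detailed:

    /* Hedged sketch: how many extra microseconds would a cache-fair
     * thread need to retire the instructions its fair-CPI model
     * predicted for one quantum?                                     */
    #include <stdio.h>

    int main(void) {
        double quantum_us = 10000.0;  /* nominal quantum: 10 ms        */
        double fair_cpi   = 1.2;      /* CPI predicted under fairness  */
        double actual_cpi = 1.5;      /* measured with aggressive
                                         co-runner                     */
        double cycles     = 2.0e7;    /* cycles spent in the quantum   */

        double fair_instr   = cycles / fair_cpi;
        double actual_instr = cycles / actual_cpi;
        /* extra time needed to retire the "missing" instructions */
        double extension_us =
            quantum_us * (fair_instr / actual_instr - 1.0);

        printf("quantum extension: %.0f us "
               "(taken from best-effort threads)\n", extension_us);
        return 0;
    }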

The authors of [22] state that according to their measurements, the penalties on best-effort threads are low, while the algorithm actually enforces priority better than standard scheduling and improves fairness. These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor. The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package. The execution times under these circumstances are compared with the times of threads running with co-runners with low cache requirements. This comparison shows differences of up to 37% in execution time between the two scenarios on a system with a standard scheduler; using the cache-fair scheduler, the variability decreases to 7%. At the same time, however, measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8% for some threads (while some even experience a speed-up).

3.2 Balancing core assignment

Fedorova et al. ([27]) argue that during operating system scheduling, several aspects have to be considered besides optimal performance. It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (i.e. unpredictability of a task's completion time) and, as a consequence, insufficient priority enforcement. It could be part of the operating system scheduler's job to ensure that the jobs are evenly assigned to cores. In the approach described in [27], this is done by using a self-tuning scheduling algorithm based on a per-core benefit function. This benefit function is based on three input components:


1) The normalized core preference of a thread, which is based on the instructions per cycle that a thread j can achieve on a certain core i ($IPC_{ij}$), normalized by $\max_k(IPC_{kj})$ (where k is an arbitrary CPU/core)
2) The cache-affinity, a value which is 1 if the thread j was scheduled on core i within a tuneable time period and 0 otherwise
3) The average cache investment of a thread on a core, which is determined by inspecting the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j. For each core, a benefit value $B_i^0$ is computed that represents the case of no thread migrations taking place. For each thread k on a core, an updated benefit value $B_i^{-k}$ is computed for the hypothetical scenario in which the thread is migrated away from core i. Of course the benefit will increase, since fewer threads are executed on the core. But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core, which would influence the benefit value of the target core. Therefore, the updated benefit value $B_j^{+k}$ of any other system core j, to which the thread in question would be migrated, has to be computed as well. The hypothetical migration of thread k from core i to core j becomes reality if

$$B_i^{-k} + B_j^{+k} > B_i^0 + B_j^0 + a \cdot D_{CAB} + b \cdot D_{RTF}$$

$D_{CAB}$ represents a system-wide balance constraint, while $D_{RTF}$ ensures per-job response-time fairness (i.e. the slowdown that results for the thread in question from the thread migration does not exceed some maximum value).

These two constants, together with the criteria included in the benefit function itself (most notably cache-affinity), should help to ensure that the self-tuning fulfils the three goals of optimal performance, core-assignment balance and response-time fairness.
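The migration test can be written down directly; the sketch below (with invented values, not the authors' implementation) makes the inequality executable:

    /* Illustrative migration test: thread k moves from core i to core
     * j only if the summed benefit after migration beats the status
     * quo plus the balance (D_CAB) and fairness (D_RTF) terms.       */
    #include <stdio.h>

    static double a = 1.0, b = 1.0;    /* weighting constants          */

    static int should_migrate(double Bi_minus_k, double Bj_plus_k,
                              double Bi0, double Bj0,
                              double D_cab, double D_rtf) {
        return Bi_minus_k + Bj_plus_k > Bi0 + Bj0 + a*D_cab + b*D_rtf;
    }

    int main(void) {
        /* core i is overloaded: removing k helps i more than adding
         * k hurts j                                                  */
        double Bi0 = 2.0, Bj0 = 3.0;   /* benefits without migration  */
        double Bi_minus_k = 3.5, Bj_plus_k = 2.4;
        double D_cab = 0.3, D_rtf = 0.2;  /* constraint penalties     */

        printf("migrate thread k? %s\n",
               should_migrate(Bi_minus_k, Bj_plus_k, Bi0, Bj0,
                              D_cab, D_rtf) ? "yes" : "no");
        /* 5.9 > 5.5 -> "yes" */
        return 0;
    }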

However, the authors have not yet actually implemented the scheduling modification in Linux, and hence the statements on its effectiveness remain somewhat theoretical.

3.3 Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU. For example, it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized, while multiple slower cores come into play when parallel code parts are executed. By keeping the cores for parallel execution slower than the core(s) for serial execution, die area and cost can be saved ([24]) while power consumption may be reduced. [25] closely analyzes the impact of performance asymmetry on the average speedup of an application as the number of cores increases. The paper concludes that “[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)”. Hence performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures. [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications while scaling hardly any worse with highly parallel programs (see Figure 5).


Figure 5: Comparison of speedup with SMP and AMP using highly parallel programs (left), moderately parallel programs (middle) and highly sequential programs (right). (Image source: http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm)
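To make the quoted claim tangible, the following worked example evaluates the speedup model of [25] for a chip of n = 64 base-core equivalents (BCEs), comparing a symmetric design built from 16-BCE cores with an asymmetric design that spends 16 BCEs on one big core and the rest on base cores. It assumes the paper's perf(r) = sqrt(r) resource model and a parallel fraction of 90%; the concrete numbers are chosen for illustration.

    /* Worked example of the symmetric vs. asymmetric speedup model
     * of Hill and Marty [25], with perf(r) = sqrt(r).                */
    #include <stdio.h>
    #include <math.h>

    static double sym(double f, double n, double r) {
        double p = sqrt(r);            /* perf of each r-BCE core     */
        return 1.0 / ((1.0 - f) / p + f * r / (p * n));
    }

    static double asym(double f, double n, double r) {
        double p = sqrt(r);            /* perf of the one big core    */
        return 1.0 / ((1.0 - f) / p + f / (p + n - r));
    }

    int main(void) {
        double f = 0.9, n = 64.0, r = 16.0;
        printf("symmetric : %.1f\n", sym(f, n, r));   /* ~12.3 */
        printf("asymmetric: %.1f\n", asym(f, n, r));  /* ~23.6 */
        return 0;
    }

Under these assumptions the asymmetric design nearly doubles the achievable speedup, which is exactly the effect the quotation describes.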

Apart from explicit design, performance asymmetry can also occur in initially symmetric multi-core processors, through either power-saving mechanisms (increasing the C-state) or failure of transistors that leads to disabling of parts of the core's components.

Problems arise from the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly. Therefore, the operating system scheduler should support processor performance asymmetry, which is currently not the case for most schedulers. However, it would be imaginable to see this as a Linux scheduling domain in the future.

[5] describes AMPS, an approach to how a performance-asymmetry-aware scheduler for Linux could look. Basically, the scheduler consists of three components: asymmetry-specific load balancing, a faster-core-first scheduler, and a migration mechanism specifically for NUMA machines that will not be covered in detail here. The scheduling mechanism tries to achieve better performance, fairness (with respect to the thread priorities) and repeatability of performance results. In order to conduct load-balancing, the core performance is assessed in a first step: AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1. The scaled load of a core is then the run-queue length of the core divided by its performance quantifier. If this scaled load lies between a minimum and a maximum threshold, the system is considered load-balanced; otherwise, threads will be migrated. Through the scaling factor, faster cores will receive more workload than slower ones.

Besides the load-balancing, cores with lower scaled load are preferred in thread placement. Whenever new threads are scheduled, the new scaled load that would apply if the thread were scheduled to one specific core is computed. The thread is then scheduled to the core with the least new scaled load; in case of a tie, faster cores are preferred. Additionally, the load-balancer may migrate threads even if the source core of the migration becomes idle by doing so. AMPS only migrates threads if the new scaled load of the destination core does not exceed


the scaled load on the source core. This way, the benefits of available fast cores can be fully exploited without overburdening them. It could be expected that frequent core migration results in performance loss through cache misses (e.g. in the level 2 cache); however, experimental results in [5] reveal no excessive performance loss from the fact that task migration among cores occurs more often than in standard schedulers.
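A hedged sketch of the scaled-load rule follows; the performance quantifiers and run-queue lengths are illustrative values, whereas AMPS itself derives the quantifiers from boot-time benchmarks:

    /* AMPS-style placement sketch: scaled load = run-queue length
     * divided by the core's performance quantifier; a new thread goes
     * to the core with the least resulting scaled load, ties broken
     * in favour of faster cores.                                     */
    #include <stdio.h>

    #define NCORES 4

    static double perf[NCORES]  = {2.0, 2.0, 1.0, 1.0}; /* 2 fast, 2 slow */
    static int    rqlen[NCORES] = {2, 3, 1, 1};

    static int place_thread(void) {
        int core, best = 0;
        double best_load = 1e9;
        for (core = 0; core < NCORES; core++) {
            double load = (rqlen[core] + 1) / perf[core]; /* if placed */
            if (load < best_load ||
                (load == best_load && perf[core] > perf[best])) {
                best_load = load;
                best = core;
            }
        }
        return best;
    }

    int main(void) {
        /* fast core 0: (2+1)/2.0 = 1.5 beats all others -> core 0 */
        printf("new thread placed on core %d\n", place_thread());
        return 0;
    }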

Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on an AMP system with two fast and six slow cores. (Picture taken from [5])

Instead, performance on AMP systems measurably improves (see Figure 6; standard scheduler speed would be 1, and the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).

3.4 Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moore's prediction. Such CPU architectures are going to require novel approaches in thread scheduling, since the available shared memory per core is potentially smaller, while main memory access time increases and single-threaded performance decreases. Even more so than with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for them; and the wide range of application scenarios that have to be considered for such chips. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28]. The runtime environment was built from scratch and is


independent of operating system kernels; hence the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors. For programmability, it provides high-level transactions (instead of locks), and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour).

The overall goal of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters, P, Q and T, which respectively denote the number of (logical) processors, task queues and threading abstractions. P and Q change the scheduling behaviour from strict affinity to work stealing; T can be used to specify different behaviour for different threads. This way, the concept of scheduler domains can be realized.

It is notable that the scheduling system does not use pre-emption; instead, threads are expected to yield at certain times. This design choice has been motivated by the authors' belief that pre-emption stems from the need to time-share expensive and fast resources, a need that will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. Barriers are designed to avoid busy waiting: for example, a thread yields once it has reached the barrier but won't be re-scheduled until all other threads have reached the barrier. With a pre-emptive scheduling mechanism, the thread would receive the context from time to time just to check whether other threads have reached the barrier; with the integrated barrier support based on the co-operative scheduling approach used in McRT, this won't happen.
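The co-operative barrier idea can be sketched as follows. This is toy code, not the McRT API: a round-robin loop stands in for the scheduler, and parked tasks are simply skipped instead of being polled.

    /* Co-operative barrier sketch: a task arriving at the barrier is
     * marked blocked and skipped by the scheduler loop until the last
     * arrival releases the whole group -- no busy waiting.           */
    #include <stdio.h>

    #define NTASKS 3

    static int waiting = 0;
    static int blocked[NTASKS]; /* 1 = parked at the barrier          */
    static int phase[NTASKS];   /* 0 = before barrier, 1 = after,
                                   2 = finished                       */

    static void barrier_arrive(int id) {
        if (++waiting == NTASKS) {      /* last arrival releases all  */
            int i;
            for (i = 0; i < NTASKS; i++) blocked[i] = 0;
            waiting = 0;
        } else {
            blocked[id] = 1;            /* park: scheduler skips us   */
        }
    }

    static void task_step(int id) {
        if (phase[id] == 0) {
            printf("task %d: pre-barrier work, arriving\n", id);
            phase[id] = 1;
            barrier_arrive(id);         /* may park the task          */
        } else {
            printf("task %d: post-barrier work\n", id);
            phase[id] = 2;
        }
    }

    int main(void) {
        int id, active = NTASKS;
        while (active > 0)              /* co-operative scheduler loop */
            for (id = 0; id < NTASKS; id++) {
                if (phase[id] == 2 || blocked[id]) continue;
                task_step(id);
                if (phase[id] == 2) active--;
            }
        return 0;
    }

All three tasks print their pre-barrier line exactly once, are parked until the last one arrives, and only then proceed to the post-barrier work, without ever being re-scheduled just to poll the barrier.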

The client-side adaptor, e.g. for OpenMP, promises to directly translate many OpenMP constructs to the McRT API.

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with a 32 KByte L1 cache shared among four simultaneously multithreaded cores, which form one physical core; 32 such physical cores share a 2 MByte L2 cache and a 4 MByte off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and which resource it is waiting for. Only if the resource becomes available will the CPU execute the thread. Hence, threads that are waiting on barriers and locks don't consume system resources. It is important to keep in mind that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT.


Figure 7: Speedup of XviD encoding in the McRT framework, compared to single-core performance. (Picture from [28])

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, and the curve labelled “linear” models the ideal speedup).

However, the encoding process was explicitly tweaked for the many-core scenario: under the condition that only very little fast memory exists per logical core, parallelization wasn't conducted at frame level; instead, single frames were partitioned into parts, which were encoded in parallel using OpenMP.

The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].


4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moore's law can stay valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduler domains of Linux and Solaris add some urgently needed OS flexibility, because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be integrated quite easily into the load-balancing hierarchy. However, it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves; scheduler domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers won't scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and possibly the varying, or simply bad, single-threaded performance that is going to be a characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were “designed to run on 1, 2, maybe 4 processors” (see http://www.news.com/8301-10784_3-9722524-7.html) but wouldn't scale beyond that. He seems to be perfectly right when he says that future versions of Windows will have to be fundamentally redesigned.

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. The advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents the way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer/user with a multitude of configuration options and programming decisions that are required for the framework to guarantee optimal performance. [29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT; experimental assessments which could testify to its performance, although planned, haven't been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Achieving fairness and repeatability on today's available multi-core architectures is the major design goal of the scheduling techniques detailed in sections 3.1 and 3.2. The first approach is justified by a number of experimental results showing that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen


whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement. Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

The algorithm mentioned in section 3.2 hasn't been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description, one can gather that it takes a lot of computations and thread migrations to ensure the load balance, and it would be interesting to see the overhead from the computations and the cache misses that the mechanism imposes on the system. Without any experimental data, those figures are hard to assess.


5 References

[1] G. E. Moore, “Cramming more components onto integrated circuits”, Electronics, Volume 38, Number 8, 1965

[2] “Why Parallel Processing?”, http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2+Why+Parallel+Processing.htm

[3] O. Wechsler, “Inside Intel® Core™ Microarchitecture”, Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf

[4] “Key Architectural Features: AMD Athlon™ X2 Dual-Core Processors”, http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html

[5] Li et al., “Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures”. In International Conference on High Performance Computing, Networking, Storage and Analysis, 2007

[6] Balakrishnan et al., “The Impact of Performance Asymmetry in Emerging Multicore Architectures”. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506–517, June 2005

[7] M. Annavaram, E. Grochowski and J. Shen, “Mitigating Amdahl's law through EPI throttling”. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298–309, June 2005

[8] V. Pallipadi and S. B. Siddha, “Processor Power Management features and Process Scheduler: Do we need to tie them together?” In LinuxConf Europe, 2007

[9] S. B. Siddha, “Multi-core and Linux Kernel”, http://oss.intel.com/pdf/mclinux.pdf

[10] http://kernelnewbies.org/Linux_2_6_23

[11] http://lwn.net/Articles/230574

[12] J. Andrews, “Linux: The Completely Fair Scheduler”, http://kerneltrap.org/node/8059

[13] A. Kumar, “Multiprocessing with the Completely Fair Scheduler”, http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

[14] Kernel 2.6.7 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7

[15] Scheduling domains, http://lwn.net/Articles/80911

[16] Kernel 2.6.17 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17


[17] T. Kidd, “C-states, C-states and even more C-states”, http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Scheduling, http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html

[19] Solaris manpage FSS(7), http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS

[20] Eric Saxe, “CMT and Solaris Performance”, http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris

[21] MSDN section on Windows scheduling, http://msdn.microsoft.com/en-us/library/ms685096(VS.85).aspx

[22] A. Fedorova, M. Seltzer and M. D. Smith, “Cache-Fair Thread Scheduling for Multicore Processors”, Technical Report TR-17-06, Harvard University, Oct 2006

[23] S. Kim, D. Chandra and Y. Solihin, “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture”. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004

[24] D. Menasce and V. Almeida, “Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures”. In Proceedings of the 4th International Conference on Supercomputing, June 1990

[25] M. D. Hill and M. R. Marty, “Amdahl's Law in the Multicore Era”. In IEEE Computer, 2008

[26] B. Crepps, “Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing”, Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

[27] A. Fedorova, D. Vengerov and D. Doucette, “Operating System Scheduling on Heterogeneous Core Systems”, to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007

[28] B. Saha et al., “Enabling Scalability and Performance in a Large Scale CMP Environment”. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007

[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson, “Thread Scheduling for Multi-Core Platforms”. In Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 7: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

2 OS process scheduling state of the art

21 Linux scheduler

211The Completely Fair Scheduler

The Linux scheduler in versions prior to 2623 performed its tasks in complexity O(1) by basically just using per-CPU run-queues and priority arrays ([9]) Kernel version 2623 which was released on October 9 2007 introduced the so-called completely fair scheduler (CFS) The change in scheduler was mainly motivated by the failure of the old scheduler to correctly predict whether applications are interactive (IO-bound) or CPU-intensive ([10])

Therefore the new scheduler has completely abandoned the notion of different kinds of processes and treats them all equally The data-structure of a red-black tree is used to align the tasks according to their ldquorightrdquo to use the processor resources for a predefined interval until context-switch The process positioned at the leftmost node in the data structure is entitled most to use the processor resources at the time it occupies that position in the tree The position of a process in the tree is only dependent on the wait-time of the process in the run-queue (including the time the process is actually waiting for events) and the process priority ([11]) This concept is fairly simple but works with all kinds of processes especially interactive ones since they get a boost just by getting account for their IO-waiting time

However the total scheduling complexity is increased to O(log n) where n is the number of processes in the run-queue since at every context-switch the process has to be reinserted into the red-black tree

The scheduling algorithm itself has not been designed in special consideration of multi-core architectures When Ingo Molnar the designer of the scheduler was asked what the implications on HTSMPNUMA architectures would be he answered that there would inevitably be some effect and if it is negative he will fix it He admits that the fairness-approach can result in increased cache-coldness for processes in some situations ([12])

However the red-black trees of CFS are managed per runqueue ([13]) which assists in cooperation with the Linux load-balancer

212Scheduling domains

Linux load-balancing takes care of different cache models and computing architectures but at the moment not necessarily of performance asymmetry The underlying model of the Linux load balancer is the concept of scheduling domains which was introduced in Kernel version 267 due to the unsatisfying performance of Linux scheduling on SMP and NUMA systems in prior versions ([14])

Basically scheduling domains are hierarchical sets of computation units on which scheduling is possible the scheduling domain architecture is constructed based on the actual hardware resources of a computing element ([9]) Scheduling domains contain lists with scheduling groups that share common properties

7

Seminar ldquoParallel Computingldquo Summer term 2008

For example the way scheduling should be done on two logical processors of a HT-systems and two physical processors of a SMP system is different the HT ldquocoresrdquo share a common cache and memory hierarchy therefore task migration is not a problem if one of the logical cores becomes idle However in a SMP or multi-core system in which the cache or parts of the cache is administrated by each core separately migrating tasks with a large working set may become problematicThis applies even more to NUMA machine where different CPU may be closer or more remote to the memory the process is using Therefore all this architectures have to be treated differently

The scheduling domain concept introduces scheduling domains a logical union of computing resources that share common properties with whom it is reasonable to treat them equally and CPU groups within these domains Those groups contain hardware-addressable computing resources that are part of the domain on which the balancer can try to even the domain load out

Scheduling domains are hierarchically nested ndash there is a top-level domain containing all other domains of the physical system the Linux is running on Depending on the actual architecture the sub-domains represent NUMA node groups physical CPU groups multi-core groups or SMT groups in a respective hierarchical nesting This structure is built automatically based on the actual topology of the system and for reasons of efficiency each CPU keeps a copy of every domain it belongs to For example a logical SMT processor that at the same time is a core in a physical multi-core processor on a NUMA node with multiple (SMP) processors would totally administer 4 sched_domain structures one for each level of parallel computing it is involved in ([15])

Figure 2 Example hierarchy in the Linux scheduling domains

Load-balancing takes place at scheduling domain level between the different groups Each domain level is sensitive with respect to the constraints set by its properties regarding load balancing For example load balancing happens very often between logical simultaneous

8

Seminar ldquoParallel Computingldquo Summer term 2008

multithreading cores but very rarely on the NUMA level where remote memory access is costlyThe scheduling domain for multi-core processors was added with Kernel version 2617 ([16]) and especially considers the shared last level cache that multi-core architectures frequently possess Hence on a SMP machine with two multi-core packages two tasks will be scheduled on different packages if both packages are currently idle in order to make use of the overall larger cache

In recent Linux-kernels the multi-core processor scheduling domain also offers support for energy-efficient scheduling which can be used if eg the powersave governor is set in the cpufreq tool Saving energy can be achieved by changing the P- and the C-states of the cores in the different packages However P-states are transitions are made by adjusting the voltage and the frequency of a core and since there is only one voltage regulator per socket on the mainboard the P-state is dependent on the busiest core So as long as any core in a package is busy the P-state will be relatively low which corresponds to a high frequency and voltage While the P-states remain relatively fixed the C-states can be manipulated Adjusting the C-states means turning off parts of the registers blocking interrupts to the processor etc ([17]) and can be done on each core independently However the shared cache features its own C-state regulator and will always stay in the lowest C-state that any of the cores has Therefore energy-efficiency is often limited to adjusting the C-state of a non-busy core while leaving other C-states and the packagesrsquo P-state low

Linux scheduling within the multi-core domain with the powersave-governor turned on will attempt to schedule multiple tasks on one physical package as long as it is feasible This way other multi-core packages will be allowed to transition into higher P- and C-states The author of [9] claims that generally the performance impact will be relatively low and that the performance losspower saving trade-off will be rewarding if the energy-efficient scheduling approach is used

22 Windows scheduler

In Windows scheduling is conducted on threads The scheduler is priority-based with priorities ranging from 0 to 31 Timeslices are allocated to threads in a round-robin fashion these timeslices are assigned to highest priority threads first and only if know thread of a given priority is ready to run at a certain time lower priority threads may receive the timeslice However if higher-priority threads become ready to run the lower priority threads are preempted

In addition to the base priority of a thread Windows dynamically changes the priorities of low-prioritized threads in order to ensure ldquofeltrdquo responsiveness of the operating system For example the thread associated with the foreground window on the Windows desktop receives a priority boost After such a boost the thread-priority gradually decays back to the base priority ([21])

Scheduling on SMP-systems is basically the same except that Windows keeps the notion of a threadrsquos processor affinity and an ideal processor for a thread The ideal processor is the processor with for example the highest cache-locality for a certain thread However if the ideal processor is not idle at the time of lookup the thread may just run on another processorIn [21] and other sources no explicit information is given on scheduling mechanisms especially specific to multi-core architectures though

9

Seminar ldquoParallel Computingldquo Summer term 2008

23Solaris scheduler

In the Solaris scheduler terminology processes are called kernel- or user-mode threads dependant on the space in which they run User threads donrsquot only exist in the user space ndash whenever a user thread is created a so-called lightweight process is set up that connects the user thread to a kernel thread These kernel threads are object to schedulingSolaris 10 offers a number of scheduling classes which may co-exist on one system The scheduling classes provide an adaptive model for the specific types of applications which are intended to run on top of the operating system ([18]) mentions Solaris 10 scheduling classes for

bull Standard (timeshare) threads whose priority may be adapted by the schedulerbull Interactive applications (the currently active window in the graphical user interface)bull Fair sharing of system resources (instead of priority-based sharing)bull Fixed-priority threads The priority of threads scheduled with this scheduler does not

vary over the scheduling timebull System (kernel) threadsbull Real-time threads with a fixed priority and time share Threads in this scheduling

class may preempt system threads

The scheduling classes for timeshare fixed-priority and fair sharing are not recommended for simultaneous use on a system while other combinations of scheduling classes on a set of CPUs are possible

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt of trying to identify IO bound processes and providing them with a priority boost Threads have a fixed time quantum they may use once they get the context and receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context Fair share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling Different processes (actually collection of processes or in Solaris 10 terminology projects) compete for quanta on a computing resource and their position in that competition depends on how large the value they have been assigned is in relation to the total quanta number on the computing resource

Solaris explicitly deals with the scenario that parts of the processorrsquos resources may be shared as it is likely with typical multi-core processors There is a kernel abstraction called ldquoprocessor grouprdquo (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket) These groupings can be investigated by the dispatcher eg in order to maintain logical CPU affinity for the purpose of cache-hotness where it is reasonable Quite similar to the concept of Linuxrsquos scheduling domains Solaris 10 tries to simultaneously achieve load balancing on multiple levels (for example if there are physical CPUs with multiple cores and SMT) ([20])

10

Seminar ldquoParallel Computingldquo Summer term 2008

3 Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics many of which are orthogonal (eg maximizing fairness and throughput) The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling Sections 31 and 32 summarize proposals to improve fairness and load-balancing on current multi-core architectures while sections 33 and 34 concentrate on approaches for scheduling on promising new computing architectures such as multi-core processors with asymmetric performance and many-core CPUs

31Cache-Fairness

Several studies (eg [22] [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level 2 cache and thus slow-up threads running on the other core that uses the same cacheThe situation is unsatisfactory due to several reasons First it can lead to unpredictable execution times and throughput and second scheduling priorities may loose their effectiveness because of threads running on cores with aggressive ldquoco-runnersrdquo (ie threads running on another core in the same package) Figure 3 shows such a scenario Thread B uses the larger part of the shared cache and thus maybe negatively influences the cycles per instruction that thread A achieves during its CPU time share

L2-cache-misses are more costly than L1-cache-misses because the latency to the memory is bigger than to the next cache-level However it is mostly the L2-cache that is shared among different coresThe authors of [22] try to mitigate the above mentioned effects by introducing a cache-fairness aware scheduler for the Solaris 10 operating system

Figure 3 Unfair cache utilization by thread B

In their scheduling algorithm the threads on a system are grouped into a best effort class and a cache-fair class Best effort threads are penalized for the sake of performance stability of cache-fair threads if necessary but not vide-versa However it is taken care that this does not result in inadequate discrimination of best effort threads Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners at the expense of these co-runners if they are best effort threads Figure 4 illustrates that process

11

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 4 Restoring fairness by adjusting timeshares

In order to compute the quantum that the thread is entitled to the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances namely the fair L2 cache miss rate the fair CPI rate and the fair number of instructions (during a quantum) All those values are based on the assumption of fair circumstances so the difference to the actual values is computed and a quantum extension is calculated which should serve as ldquocompensationrdquo

Those calculations are done once for new cache-fair class threads ndash their L2 cache miss rates and other data is measured with different co-runners Subsequently the dispatcher periodically selects best-effort threads from whom he takes parts of their quanta and assigns them as ldquocompensationrdquo to the cache-fair threads New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase but whenever new combinations of cache-fair threads with co-runners occur analytical input is mined from the CPUrsquos hardware counters

The authors of [22] state that according to their measurements the penalties on best-effort threads are low while the algorithm actually enforces priority better than standard scheduling and improves fairness These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package The execution times under these circumstances are compared with the times of threads running with co-runners with low-cache requirements This comparison shows differences of up to 37 in execution time between the two scenarios on a system with a standard scheduler while using the cache-fair scheduler the variability decreases to 7At the same time however measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8 for some threads (while some even experience a speed-up)

32Balancing core assignment

Fedorova et al ([27]) argue that during operating system scheduling several aspects have to be considered besides optimal performance It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (ie unpredictability of a tasks completion time) and as a consequence insufficient priority enforcement It could be part of the operating system scheduler to ensure that the jobs are evenly assigned to cores In the approach described in [27] this is done by using a self-tuning scheduling algorithm based on a per-core benefit function This benefit function is based on three input components

12

Seminar ldquoParallel Computingldquo Summer term 2008

1) The normalized core preference of a thread which is based on the instructions per cycle that a thread j can achieve on a certain core i ( ijIPC ) normalized by max(

kjIPC ) (where k is an arbitrary CPUcore)2) The cache-affinity a value which is 1 if the thread j was scheduled on core i within a

tuneable time period and 0 otherwise3) The average cache investment of a thread on a core which is determined by inspecting

the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j For each core a benefit function 0iB is computed that represents the case of no thread migrations taking place For each thread k on a core and updated benefit value for the hypothetical scenario of the migration of the thread onto another core minuskiB is computed Of course the benefit will increase since fewer threads are executed on the core But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core which would influence the benefit value of the target core Therefore also the updated benefit value of any other system core j to which the thread in question would be migrated to has to be computed and is called +kjB The hypothetical migration of thread k from core i to core j becomes reality if

RTFCABjikjki DbDaBBBB 00 +++gt+ +minus CABD represents a system-wide balance-constraint while RTFD ensures a per-job response-time-fairness (ie the slowdown that results for the thread in question from the thread-migration does not exceed some maximum value)

These two constants together with the criterions included in the benefit function itself (most notably cache-affinity) should help to ensure that the self-tuning fulfils the three goals of optimal performance core-assignment balance and response-time-fairness

However the authors have not yet actually implemented the scheduling modification in Linux and hence the statements on its effectiveness remain somehow theoretical

33Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU For example it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized while multiple slower cores come to play when parallel code parts are executed By keeping the cores for parallel execution slower than the core(s) for serial execution die-area and cost can be saved ([24]) while power consumption may be reduced[25] closely analyzes the impact of performance asymmetry on the average speedup of an application with increase of cores The paper concludes that ldquo[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)rdquo Hence performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications but donrsquot scale much worse with highly parallel programs (see Figure 5)


Figure 5: Comparison of speedup with SMP and AMP using highly parallel programs (left), moderately parallel programs (middle) and highly sequential programs (right). (Image source: http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm)

Apart from explicit design choices, performance asymmetry can also occur in initially symmetric multi-core processors, caused either by power-saving mechanisms (increasing the C-state) or by failure of transistors that leads to disabling of parts of a core's components.

Problems arise from the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly. Therefore, the operating system scheduler should support processor performance asymmetry, which is currently not the case for most schedulers. However, it would be imaginable to see this as a Linux scheduling domain in the future.

[5] describes AMPS, an approach to how a performance-asymmetry aware scheduler for Linux could look. Basically, the scheduler consists of three components: asymmetry-specific load balancing, a faster-core-first scheduler, and a migration mechanism specifically for NUMA machines that will not be covered in detail here. The scheduling mechanism tries to achieve better performance, fairness (with respect to the thread priorities) and repeatability of performance results. In order to conduct load balancing, the core performance is assessed in a first step: AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1. The scaled load of a core is then the run-queue length of the core divided by its performance quantifier. If this scaled load lies between a minimum and a maximum threshold, the system is considered load-balanced; otherwise threads will be migrated. Through the scaling factor, faster cores will receive more workload than slower ones.

Besides the load balancing, cores with lower scaled load are preferred in thread placement. Whenever new threads are scheduled, the new scaled load that would apply if the thread were scheduled to one specific core is computed. The thread is then scheduled to the core with the least new scaled load; in case of a tie, faster cores are preferred. Additionally, the load balancer may migrate threads even if the source core of the migration becomes idle by doing so. AMPS only migrates threads if the new scaled load of the destination core does not exceed the scaled load on the source core. This way the benefits of available fast cores can be fully exploited without overburdening them. It could be expected that frequent core migration results in performance loss through cache misses (e.g. in the last-level cache). However, experimental results in [5] reveal no excessive performance loss from the fact that task migration among cores occurs more often than with standard schedulers.
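A toy version of this scaled-load bookkeeping is sketched below. The struct layout, the threshold values and the function names are assumptions made for illustration; they are not taken from the AMPS implementation in [5].

    #include <stdio.h>

    #define NCORES 8

    struct core {
        int    runqueue_len;   /* number of runnable threads          */
        double perf;           /* performance quantifier, slowest = 1 */
    };

    /* scaled load as defined by AMPS: run-queue length divided by the
     * core's performance quantifier */
    static double scaled_load(const struct core *c) {
        return c->runqueue_len / c->perf;
    }

    /* the system counts as balanced when every scaled load lies between
     * the two thresholds (the values used in main are arbitrary) */
    static int is_balanced(const struct core c[], int n, double lo, double hi) {
        for (int i = 0; i < n; i++) {
            double l = scaled_load(&c[i]);
            if (l < lo || l > hi)
                return 0;
        }
        return 1;
    }

    /* place a new thread on the core whose scaled load would be smallest
     * afterwards; on a tie, prefer the faster core (faster-core first) */
    static int place_thread(const struct core c[], int n) {
        int best = 0;
        double best_load = (c[0].runqueue_len + 1) / c[0].perf;
        for (int i = 1; i < n; i++) {
            double l = (c[i].runqueue_len + 1) / c[i].perf;
            if (l < best_load || (l == best_load && c[i].perf > c[best].perf)) {
                best = i;
                best_load = l;
            }
        }
        return best;
    }

    int main(void) {
        /* two fast cores and six slow ones, as in the Figure 6 setup */
        struct core cores[NCORES] = {
            {4, 2.0}, {3, 2.0}, {2, 1.0}, {2, 1.0},
            {1, 1.0}, {2, 1.0}, {2, 1.0}, {1, 1.0},
        };
        printf("balanced: %s\n",
               is_balanced(cores, NCORES, 1.0, 2.5) ? "yes" : "no");
        int t = place_thread(cores, NCORES);
        printf("new thread goes to core %d (perf %.1f)\n", t, cores[t].perf);
        return 0;
    }

The tie-breaking rule in place_thread is what encodes the faster-core-first policy described above.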

Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on an AMP system with two fast and six slow cores (picture taken from [5])

Instead, performance on AMP systems measurably improves (see Figure 6; the standard scheduler speed would be 1, and the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).

3.4 Scheduling on many-core architectures

Current research points to the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moore's prediction. Such CPU architectures are going to require novel approaches to thread scheduling, since the available shared memory per core is potentially smaller, while main-memory access time increases and single-threaded performance decreases. Even more than with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for them; and the wide range of application scenarios that have to be considered for such chips. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28]. The runtime environment was built from scratch and is independent of operating system kernels; hence the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors. For programmability it provides high-level transactions (instead of locks), and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour).

The overall target of the scheduler is to maximize resource utilization and to provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters, P, Q and T, which respectively denote the number of (logical) processors, task queues and threading abstractions. P and Q change the scheduling behaviour from strict affinity to work stealing; T can be used to specify different behaviour for different threads. This way, the concept of scheduler domains can be realized.
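How the ratio of P to Q shifts the behaviour between affinity and work stealing can be illustrated with a small single-threaded model. The data structures and names below are invented for illustration and have nothing to do with the actual McRT API; T, the threading abstraction, is not modelled.

    #include <stdio.h>

    #define P 4              /* logical processors                    */
    #define Q 2              /* task queues: Q == P gives strict      */
                             /* per-processor affinity, Q == 1 one    */
                             /* shared queue for all processors       */
    #define MAXTASKS 16

    static int queues[Q][MAXTASKS];
    static int qlen[Q];

    static void submit(int task) {
        int q = task % Q;                 /* spread tasks over the queues */
        queues[q][qlen[q]++] = task;
    }

    /* a processor first drains "its" queue (affinity), then falls back
     * to taking work from the other queues (work stealing) */
    static int next_task(int proc) {
        int own = proc % Q;
        if (qlen[own] > 0)
            return queues[own][--qlen[own]];
        for (int q = 0; q < Q; q++)
            if (qlen[q] > 0)
                return queues[q][--qlen[q]];
        return -1;                        /* no work left anywhere */
    }

    int main(void) {
        for (int t = 0; t < 8; t++)
            submit(t);
        for (int p = 0; p < P; p++) {     /* one simulated dispatch round */
            int t = next_task(p);
            if (t >= 0)
                printf("processor %d runs task %d\n", p, t);
        }
        return 0;
    }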

It is notable that the scheduling system does not use pre-emption; instead, threads are expected to yield at certain times. This design choice has been motivated by the authors' belief that pre-emption stems from the need for time-sharing expensive and fast resources, a need which will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. The barriers are designed to avoid busy waiting: a thread yields once it has reached the barrier, but won't be re-scheduled until all other threads have reached the barrier. With a pre-emptive scheduling mechanism, the thread would receive the context from time to time just to check whether other threads have reached the barrier; with the integrated barrier support based on the co-operative scheduling approach used in McRT, this won't happen.
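McRT's barriers live inside its user-level co-operative runtime, so they cannot be reproduced directly on a conventional pre-emptive OS. The closest everyday analogue of "no busy waiting" there is a barrier built on a condition variable, where early arrivals sleep instead of polling. The following self-contained pthreads sketch shows that idea; it is an analogy, not McRT code.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  all_arrived = PTHREAD_COND_INITIALIZER;
    static int arrived = 0;

    /* a blocking barrier: threads that arrive early sleep on the
     * condition variable instead of burning cycles polling */
    static void barrier_wait(void) {
        pthread_mutex_lock(&lock);
        if (++arrived == NTHREADS)
            pthread_cond_broadcast(&all_arrived);
        else
            while (arrived < NTHREADS)
                pthread_cond_wait(&all_arrived, &lock);
        pthread_mutex_unlock(&lock);
    }

    static void *worker(void *arg) {
        long id = (long)arg;
        printf("thread %ld reached the barrier\n", id);
        barrier_wait();
        printf("thread %ld continues\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }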

The client-side adaptor, e.g. the one for OpenMP, promises to directly translate many OpenMP constructs to the McRT API.

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with 32 Kbyte of L1 cache shared among 4 simultaneously multithreaded cores, which together form one physical core; 32 such physical cores share 2 Mbyte of L2 cache and 4 Mbyte of off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and which resource it is waiting for; only when the resource becomes available will the CPU execute the thread again. Hence threads that are waiting at barriers and locks don't consume system resources. It is important to keep in mind that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT.


Figure 7: Speedup of XviD encoding in the McRT framework, compared to single-core performance (picture from [28])

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, and the curve labelled "linear" models the ideal speedup).

However, the encoding process was explicitly tweaked for the many-core scenario: under the condition that only very little fast memory exists per logical core, parallelization wasn't conducted at frame level; instead, single frames were partitioned into parts of frames, which were encoded in parallel using OpenMP.
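Schematically, such intra-frame parallelization can be expressed as an OpenMP loop over slices of one frame. The frame layout, slice count and function names below are invented for illustration and do not reflect the actual XviD code.

    #include <stdio.h>

    #define HEIGHT 64
    #define WIDTH  64
    #define SLICES 8          /* parts per frame; must divide HEIGHT evenly */

    /* stand-in for the real per-slice encoding work */
    static void encode_rows(unsigned char frame[HEIGHT][WIDTH],
                            int first_row, int last_row) {
        for (int r = first_row; r < last_row; r++)
            for (int c = 0; c < WIDTH; c++)
                frame[r][c] ^= 0x5a;      /* dummy transformation */
    }

    int main(void) {
        static unsigned char frame[HEIGHT][WIDTH];

        /* partition a single frame into slices and encode the slices in
         * parallel, instead of encoding whole frames in parallel */
        #pragma omp parallel for
        for (int s = 0; s < SLICES; s++)
            encode_rows(frame, s * (HEIGHT / SLICES),
                        (s + 1) * (HEIGHT / SLICES));

        printf("encoded one frame in %d parallel slices\n", SLICES);
        return 0;
    }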

The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].


4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help to prolong the time horizon within which Moore's law can stay valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduler domains of Linux and Solaris add some urgently needed OS flexibility, because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be integrated quite easily into the load-balancing hierarchy. However, it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves; scheduler domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers won't scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were "designed to run on 1, 2, maybe 4 processors" (see http://www.news.com/8301-10784_3-9722524-7.html) but wouldn't scale beyond that. He seems to be perfectly right when he says that future versions of Windows will have to be fundamentally redesigned.

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. The advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels, and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents the way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to explain intelligibly why pre-emptive scheduling is going to be obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer/user with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance. [29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT; experimental assessments which could testify to its performance, although planned, haven't been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Fairness and repeatability on today's available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 3.1 and 3.2. The first approach is justified by a number of experimental results which show that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement. Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

The algorithm mentioned in section 3.2 hasn't been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description one can gather that it takes a lot of computations and thread migrations to ensure the load balance, and it would be interesting to see the overhead that these computations and the cache misses imposed by the mechanism put on the system. Without any experimental data those figures are hard to assess.


5 References

[1] G. E. Moore, "Cramming more components onto integrated circuits", Electronics, Volume 38, Number 8, 1965

[2] "Why Parallel Processing?", http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2+Why+Parallel+Processing.htm

[3] O. Wechsler, "Inside Intel® Core™ Microarchitecture", Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf

[4] "Key Architectural Features: AMD Athlon™ X2 Dual-Core Processors", http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html

[5] Li et al., "Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures", in International Conference on High Performance Computing, Networking, Storage and Analysis, 2007

[6] Balakrishnan et al., "The Impact of Performance Asymmetry in Emerging Multicore Architectures", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506-517, June 2005

[7] M. Annavaram, E. Grochowski and J. Shen, "Mitigating Amdahl's law through EPI throttling", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298-309, June 2005

[8] V. Pallipadi and S. B. Siddha, "Processor Power Management features and Process Scheduler: Do we need to tie them together?", in LinuxConf Europe, 2007

[9] S. B. Siddha, "Multi-core and Linux Kernel", http://oss.intel.com/pdf/mclinux.pdf

[10] http://kernelnewbies.org/Linux_2_6_23

[11] http://lwn.net/Articles/230574

[12] J. Andrews, "Linux: The Completely Fair Scheduler", http://kerneltrap.org/node/8059

[13] A. Kumar, "Multiprocessing with the Completely Fair Scheduler", http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

[14] Kernel 2.6.7 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7

[15] Scheduling domains, http://lwn.net/Articles/80911

[16] Kernel 2.6.17 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17

[17] T. Kidd, "C-states, C-states and even more C-states", http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states/

[18] Solaris 10 Process Scheduling, http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html

[19] Solaris manpage FSS(7), http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS

[20] Eric Saxe, "CMT and Solaris Performance", http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris

[21] MSDN section on Windows scheduling, http://msdn.microsoft.com/en-us/library/ms685096(VS.85).aspx

[22] A. Fedorova, M. Seltzer and M. D. Smith, "Cache-Fair Thread Scheduling for Multicore Processors", Technical Report TR-17-06, Harvard University, Oct 2006

[23] S. Kim, D. Chandra and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004

[24] D. Menasce and V. Almeida, "Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures", in Proceedings of the 4th International Conference on Supercomputing, June 1990

[25] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era", in IEEE Computer, 2008

[26] B. Crepps, "Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing", Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

[27] A. Fedorova, D. Vengerov and D. Doucette, "Operating System Scheduling on Heterogeneous Core Systems", to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007

[28] B. Saha et al., "Enabling Scalability and Performance in a Large Scale CMP Environment", in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007

[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson, "Thread Scheduling for Multi-Core Platforms", in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007

Fedorova et al ([27]) argue that during operating system scheduling several aspects have to be considered besides optimal performance It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (ie unpredictability of a tasks completion time) and as a consequence insufficient priority enforcement It could be part of the operating system scheduler to ensure that the jobs are evenly assigned to cores In the approach described in [27] this is done by using a self-tuning scheduling algorithm based on a per-core benefit function This benefit function is based on three input components

12

Seminar ldquoParallel Computingldquo Summer term 2008

1) The normalized core preference of a thread which is based on the instructions per cycle that a thread j can achieve on a certain core i ( ijIPC ) normalized by max(

kjIPC ) (where k is an arbitrary CPUcore)2) The cache-affinity a value which is 1 if the thread j was scheduled on core i within a

tuneable time period and 0 otherwise3) The average cache investment of a thread on a core which is determined by inspecting

the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j For each core a benefit function 0iB is computed that represents the case of no thread migrations taking place For each thread k on a core and updated benefit value for the hypothetical scenario of the migration of the thread onto another core minuskiB is computed Of course the benefit will increase since fewer threads are executed on the core But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core which would influence the benefit value of the target core Therefore also the updated benefit value of any other system core j to which the thread in question would be migrated to has to be computed and is called +kjB The hypothetical migration of thread k from core i to core j becomes reality if

RTFCABjikjki DbDaBBBB 00 +++gt+ +minus CABD represents a system-wide balance-constraint while RTFD ensures a per-job response-time-fairness (ie the slowdown that results for the thread in question from the thread-migration does not exceed some maximum value)

These two constants together with the criterions included in the benefit function itself (most notably cache-affinity) should help to ensure that the self-tuning fulfils the three goals of optimal performance core-assignment balance and response-time-fairness

However the authors have not yet actually implemented the scheduling modification in Linux and hence the statements on its effectiveness remain somehow theoretical

33Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU For example it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized while multiple slower cores come to play when parallel code parts are executed By keeping the cores for parallel execution slower than the core(s) for serial execution die-area and cost can be saved ([24]) while power consumption may be reduced[25] closely analyzes the impact of performance asymmetry on the average speedup of an application with increase of cores The paper concludes that ldquo[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)rdquo Hence performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications but donrsquot scale much worse with highly parallel programs (see Figure 5)

13

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 5 Comparison of speedup with SMP and AMP using highly parallel programs (left) moderately parallel programs (middle) and highly sequential programs (right)1

Apart from explicit design performance asymmetry can also occur in initially symmetric multi-core processors by either power-saving mechanisms (increasing the C-state) or failure of transistors that leads to disabling of parts of the corersquos components

Problems arise by the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly Therefore the operating system scheduler should support processor performance asymmetry which is currently not the case for most schedulers However it would be imaginable to see this as a Linux scheduling domain in the future

[5] describes AMPS an approach on how a performance-asymmetry aware scheduler for Linux could look like Basically the scheduler consists of three components An asymmetry-specific load balancing a faster-core first scheduler and a migration mechanism specifically for NUMA machines that will not be covered in detail here The scheduling mechanism tries to achieve a better performance fairness (with respect to the thread priorities) and repeatability of performance resultsIn order to conduct load-balancing the core performance is assessed in a first step AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1 Then the scaled load of a core is the run-queue length of a core divided by its performance quantifier If this scaled load is within a maximum and a minimum threshold then the system is considered to be load-balanced otherwise threads will be migrated By the scaling factor faster cores will receive more workload than slower ones

Besides the load-balancing cores with lower scaled load are preferred in thread placement Whenever new threads are scheduled the new scaled load that would apply if the thread was scheduled to one specific core is computed Then the thread is scheduled to the core with the least new scaled load in case of a tie faster cores are preferred Additionally the load-balancer may migrate threads even if the source core of migration can become idle by doing it AMPS only migrates threads if the new scaled load of the destination core does not exceed

1 Image source httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

14

Seminar ldquoParallel Computingldquo Summer term 2008

the scaled load on the source-core This way the benefits of available fast cores can be fully exploited while not overburdening themIt can be expected that frequent core migration results in performance loss by cache misses (eg in the level cache) However experimental results in [5] reveal no excessive performance loss by the fact that task migration among cores occurs more often than in standard schedulers

Figure 6 Speedup of the AMPS scheduler compared to Linux scheduler on AMP with two fast and six slow cores2

Instead performance on SMP systems measurably improves (see Figure 6 standard scheduler speed would be 1 median speedup is 116) while fairness and repeatability is preserved better than with standard Linux (there is less deviation in the execution times of different processes)

34Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moorersquos prediction Such CPU architectures are going to require novel approaches in thread scheduling since the available shared memory per core is potentially smaller while main memory access time increases and single-threaded performance decreases Even more so than with current CMP chips schedulers that treat such large scale CMP architectures just like SMP systems are going to fail with respect to performance

[28] identifies three major challenges in scheduling on many-core architectures namely the comparatively small cache sizes which render memory access costly the fact that non-specialized programmers will have to program code for them and the wide range of application scenarios that have to be considered for such chips The latter two challenges result from the projected general-purpose use of future many-core architectures

In order to deal with these challenges an OS-independent experimental runtime-environment called McRT is presented in [28] The runtime environment was built from scratch and is

2 Picture taken from [5]

15

Seminar ldquoParallel Computingldquo Summer term 2008

independent of operating system kernels ndash hence the performed operations occur at user-level The connection to the underlying operating system is established using so called host adapters while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors For programmability it provides high level transactions (instead of locks) and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour)

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability Basically the scheduler is configured using three parameters P Q and T which respectively denote the number of (logical) processors task queues and threading abstractions P and Q change the scheduling behaviour from strict affinity to work stealing T can be used to specify different behaviour for different threads This way the concept of scheduler domains can be realized

It is notable that the scheduling system does not use pre-emption instead threads are expected to yield at certain times This design choice has been motivated by the authorsrsquo belief that pre-emption stems from the need of time-sharing expensive and fast resources which will become obsolete with many-core architectures The runtime actively supports constructs such as barriers which are often needed in parallel programming Barriers are designed to avoid busy waiting ndash for example a thread yields once it has reached the barrier but wonrsquot be re-scheduled until all other threads have reached the barrierWith a pre-emptive scheduling mechanism the thread would receive the context from time to time just to check whether other threads have reached the barrier ndash with the integrated barrier support based on a co-operative scheduling approach used in McRT this wonrsquot happen

The client-side adaptor eg for OpenMP promises to directly translate many OpenMP constructs to the McRT API

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications such as the XviD MPEG4 encoder singular value decomposition (SVD) and neural networks (SOM) The results were gathered on a cycle-accurate many-core simulator with 32 Kbyte L1 cache shared among 4 simultaneous multithreaded cores which form one physical core 32 such physical cores share 2 Mbyte L2 cache and 4 Mbyte off-chip L3 cache The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and the resource it is waiting for Only if the resource becomes available the CPU will execute the thread Hence threads that are waiting barriers and locks donrsquot consume system resources It is important to consider that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT

16

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 7 Speedup of XviD encoding in the McRT framework (compared to single core performance)3

The experiments reveal that XviD encoding scales very well on McRT (Figure 7 1080p and 768p denote different video resolutions the curve labelled ldquolinearrdquo models the ideal speedup)

However the encoding process was explicitly tweaked for the many-core scenario ndash under the condition that only very little fast memory exists per logical core parallelization wasnrsquot conducted at frame level ndash instead single frames were partitioned into parts of frames which were encoded in parallel using OpenMP

The scalability of SVD and SOM is quite similar to that of XviD more figures and evaluations of different scheduling strategies can be found in [28]

3 Picture is from [28]

17

Seminar ldquoParallel Computingldquo Summer term 2008

4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moorersquos law in the future AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moorersquos law can stay valid Operating system schedulers are going to have to adapt to the changing underlying architectures

The scheduler domains of Linux and Solaris add some urgently needed OS-flexibility ndash because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be quite easily integrated into the load-balancing hierarchy however it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves But scheduler domains at least provide the required flexibility at the level of threads

Sadly it has to be said that current Windows schedulers wonrsquot scale with the number of cores or performance asymmetry at any rate Basically the Windows scheduler treats multi-core architecture like SMP systems and hence can not give proper care to the peculiarities of such CPUs like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures) Ty Carlson director of technical strategy at Microsoft even mentioned at a panel discussion that current Windows releases (including Vista) were ldquodesigned to run on 12 maybe 4 processorsrdquo4 but wouldnrsquot scale beyond He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned

Current research shows the road such a redesign could follow The approach described in section 33 seems to perform pretty well for multi-core processors with asymmetric performance The advantage is that the load balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adaption Experimental evaluation of the scheduling algorithm reveals promising results also with regard to fairness

Section 34 presents the way in which futuristic schedulers on upcoming many-core architectures could operate The runtime environment McRT makes use of interesting techniques and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures However their implementation is realized in user-space and burdens the programmeruser with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance[29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT experimental assessments which could testify on its performance although planned havenrsquot been conducted yet It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures since they might gain fundamentally in importance in the future

Achieving fairness and repeatability on todayrsquos available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 31 and 32 The first approach is justified by a number of experimental results that show that priorities are actually enforced much better than with conventional schedulers however it remains to be seen 4 See httpwwwnewscom8301-10784_3-9722524-7html

18

Seminar ldquoParallel Computingldquo Summer term 2008

whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level if possible

The algorithm mentioned in section 32 hasnrsquot been implemented yet so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice From the description one can figure that it takes a lot of computations and thread migrations in order to ensure the load-balance and it would be interesting to see the overhead from the computations and the cache-misses imposed by the mechanism on the system Without any experimental data those figures are hard to assess

19

Seminar ldquoParallel Computingldquo Summer term 2008

5 References

[1] G E Moore bdquoCramming more components onto integrated circuitsldquo Electronics Volume 38 Number 8 1965

[2] bdquoWhy Parallel Processingldquo httpwwwtccornelleduServicesEducationTopicsParallelConcepts2+Why+Parallel+Processinghtm

[3] O Wechsler bdquoInside Intelreg Coretrade Microarchitectureldquo Intel Technology Whitepaper

httpdownloadintelcomtechnologyarchitecturenew_architecture_06pd f

[4] bdquoKey Architectural Features AMD Athlontrade X2 Dual-Core Processorsldquo httpwwwamdcomus-enProcessorsProductInformation030_118_9485_130415E1304300html

[5] Li et al bdquoEfficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architecturesldquo In International conference on high performance computing networking storage and analysis 2007

[6] Balakrishnan et al bdquoThe Impact of Performance Asymmetry in Emerging Multicore Architecturesrdquo In Proceedings of the 32nd Annual International Symposium on Com-puter Architecture pages 506ndash517 June 2005

[7] M Annavaram E Grochowski and J Shen ldquoMitigating Amdahlrsquos law through EPI throttlingrdquo In Proceedings of the 32nd Annual International Symposium on ComputerArchitecture pages 298ndash309 June 2005

[8] V Pallipadi SB Siddha ldquoProcessor Power Management features and ProcessScheduler Do we need to tie them togetherrdquo In LinuxConf Europe 2007

[9] SB Siddha ldquoMulti-core and Linux Kernelrdquo httpossintelcompdfmclinuxpdf

[10] httpkernelnewbiesorgLinux_2_6_23

[11] httplwnnetArticles230574

[12] J Andrews ldquoLinux The Completely Fair Schedulerldquo httpkerneltraporgnode8059

[13] A Kumar ldquoMultiprocessing with the Completely Fair Schedulerrdquohttpwwwibmcomdeveloperworkslinuxlibraryl-cfsindexhtml

[14] Kernel 267 Changelog httpwwwkernelorgpublinuxkernelv26ChangeLog-267

[15] Scheduling domainshttplwnnetArticles80911

[16] Kernel 2617 Changeloghttpwwwkernelorgpublinuxkernelv26ChangeLog-2617

20

Seminar ldquoParallel Computingldquo Summer term 2008

[17] T Kidd ldquoC-states C-states and even more C-statesrdquo httpsoftwareblogsintelcom20080327update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Schedulinghttpwwwprincetonedu~unixSolaristroubleshootschedulehtml

[19] Solaris manpage FSS(7)httpdocssuncomappdocsdoc816-5177fss-7l=deampa=viewampq=FSS

[20] Eric Saxe ldquoCMT and Solaris Performancerdquohttpblogssuncomesaxeentrycmt_performance_enhancements_in_solaris

[21] MSDN section on Windows schedulinghttpmsdnmicrosoftcomen-uslibraryms68509628VS8529aspx

[22] A Fedorova M Seltzer and M D Smith ldquoCache-Fair Thread Scheduling for Multicore Processorsrdquo Technical Report TR-17-06 Harvard University Oct 2006

[23] S Kim D Chandra and Y Solihin ldquoFair Cache Sharing and Partitioning in a Chip Multiprocessor Architecturerdquo In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques 2004

[24] D Menasce and V Almeida ldquoCost-Performance Analysis of Heterogeneity in Supercomputer Architecturesrdquo In Proceedings of the 4th International Conference on Supercomputing June 1990

[25] M D Hill and M R Marty ldquoAmdahlrsquos Law in the Multicore Erardquo In IEEE Computer 2008

[26] B Crepps ldquoImproving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessingrdquo Intel Technology Magazine httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

[27] A Fedorova D Vengerov and D Doucette ldquoOperating System Scheduling on Heterogeneous Core Systemsrdquo to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures 2007

[28] B Saha et al ldquoEnabling Scalability and Performance in a Large Scale CMP Environmentrdquo Proceedings of the 2nd ACM SIGOPSEuroSys European Conference on Computer Systems 2007

[29] M Rajagopalan B T Lewis and T A Anderson ldquoThread Scheduling for Multi-Core Platformsrdquo in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems 2007

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 9: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

multithreading cores but very rarely on the NUMA level where remote memory access is costlyThe scheduling domain for multi-core processors was added with Kernel version 2617 ([16]) and especially considers the shared last level cache that multi-core architectures frequently possess Hence on a SMP machine with two multi-core packages two tasks will be scheduled on different packages if both packages are currently idle in order to make use of the overall larger cache

In recent Linux-kernels the multi-core processor scheduling domain also offers support for energy-efficient scheduling which can be used if eg the powersave governor is set in the cpufreq tool Saving energy can be achieved by changing the P- and the C-states of the cores in the different packages However P-states are transitions are made by adjusting the voltage and the frequency of a core and since there is only one voltage regulator per socket on the mainboard the P-state is dependent on the busiest core So as long as any core in a package is busy the P-state will be relatively low which corresponds to a high frequency and voltage While the P-states remain relatively fixed the C-states can be manipulated Adjusting the C-states means turning off parts of the registers blocking interrupts to the processor etc ([17]) and can be done on each core independently However the shared cache features its own C-state regulator and will always stay in the lowest C-state that any of the cores has Therefore energy-efficiency is often limited to adjusting the C-state of a non-busy core while leaving other C-states and the packagesrsquo P-state low

Linux scheduling within the multi-core domain with the powersave-governor turned on will attempt to schedule multiple tasks on one physical package as long as it is feasible This way other multi-core packages will be allowed to transition into higher P- and C-states The author of [9] claims that generally the performance impact will be relatively low and that the performance losspower saving trade-off will be rewarding if the energy-efficient scheduling approach is used

22 Windows scheduler

In Windows scheduling is conducted on threads The scheduler is priority-based with priorities ranging from 0 to 31 Timeslices are allocated to threads in a round-robin fashion these timeslices are assigned to highest priority threads first and only if know thread of a given priority is ready to run at a certain time lower priority threads may receive the timeslice However if higher-priority threads become ready to run the lower priority threads are preempted

In addition to the base priority of a thread Windows dynamically changes the priorities of low-prioritized threads in order to ensure ldquofeltrdquo responsiveness of the operating system For example the thread associated with the foreground window on the Windows desktop receives a priority boost After such a boost the thread-priority gradually decays back to the base priority ([21])

Scheduling on SMP-systems is basically the same except that Windows keeps the notion of a threadrsquos processor affinity and an ideal processor for a thread The ideal processor is the processor with for example the highest cache-locality for a certain thread However if the ideal processor is not idle at the time of lookup the thread may just run on another processorIn [21] and other sources no explicit information is given on scheduling mechanisms especially specific to multi-core architectures though

9

Seminar ldquoParallel Computingldquo Summer term 2008

23Solaris scheduler

In the Solaris scheduler terminology processes are called kernel- or user-mode threads dependant on the space in which they run User threads donrsquot only exist in the user space ndash whenever a user thread is created a so-called lightweight process is set up that connects the user thread to a kernel thread These kernel threads are object to schedulingSolaris 10 offers a number of scheduling classes which may co-exist on one system The scheduling classes provide an adaptive model for the specific types of applications which are intended to run on top of the operating system ([18]) mentions Solaris 10 scheduling classes for

• Standard (timeshare) threads, whose priority may be adapted by the scheduler
• Interactive applications (the currently active window in the graphical user interface)
• Fair sharing of system resources (instead of priority-based sharing)
• Fixed-priority threads; the priority of threads scheduled with this scheduler does not vary over the scheduling time
• System (kernel) threads
• Real-time threads with a fixed priority and time share; threads in this scheduling class may preempt system threads

The scheduling classes for timeshare, fixed-priority and fair sharing are not recommended for simultaneous use on a system, while other combinations of scheduling classes on a set of CPUs are possible.

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt to identify I/O-bound processes and provide them with a priority boost. Threads have a fixed time quantum they may use once they get the context, and they receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context. Fair-share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling. Different processes (actually collections of processes, called projects in Solaris 10 terminology) compete for quanta on a computing resource, and their position in that competition depends on how large the value they have been assigned is in relation to the total number of quanta on the computing resource.
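To make the share arithmetic concrete, the following minimal sketch (an illustration, not Solaris code; the project names and share values are invented) computes the CPU entitlement that follows from assigned fair-share values:

#include <stdio.h>

int main(void)
{
    /* hypothetical projects and their assigned shares */
    const char *projects[] = { "batch", "webserver", "database" };
    int shares[] = { 10, 30, 60 };
    int total = 0;

    for (int i = 0; i < 3; i++)
        total += shares[i];
    /* entitlement = own shares relative to all active shares */
    for (int i = 0; i < 3; i++)
        printf("%-10s %3d shares -> %5.1f%% of the CPU\n",
               projects[i], shares[i], 100.0 * shares[i] / total);
    return 0;
}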

Solaris explicitly deals with the scenario that parts of the processor's resources may be shared, as is likely with typical multi-core processors. There is a kernel abstraction called "processor group" (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket). These groupings can be investigated by the dispatcher, e.g. in order to maintain logical CPU affinity for the purpose of cache-hotness where it is reasonable. Quite similarly to the concept of Linux's scheduling domains, Solaris 10 tries to achieve load balancing simultaneously on multiple levels (for example if there are physical CPUs with multiple cores and SMT) ([20]).

3 Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics, many of which are orthogonal (e.g. maximizing fairness and throughput). The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling. Sections 3.1 and 3.2 summarize proposals to improve fairness and load-balancing on current multi-core architectures, while sections 3.3 and 3.4 concentrate on approaches for scheduling on promising new computing architectures, such as multi-core processors with asymmetric performance and many-core CPUs.

3.1 Cache-Fairness

Several studies (e.g. [22], [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level 2 cache and thus slow down threads running on the other core that uses the same cache. The situation is unsatisfactory for several reasons: first, it can lead to unpredictable execution times and throughput, and second, scheduling priorities may lose their effectiveness because of threads running on cores with aggressive "co-runners" (i.e. threads running on another core in the same package). Figure 3 shows such a scenario: thread B uses the larger part of the shared cache and thus may negatively influence the cycles per instruction that thread A achieves during its CPU time share.

L2 cache misses are more costly than L1 cache misses, because the latency to main memory is much higher than to the next cache level; however, it is mostly the L2 cache that is shared among different cores. The authors of [22] try to mitigate the above-mentioned effects by introducing a cache-fairness-aware scheduler for the Solaris 10 operating system.

Figure 3: Unfair cache utilization by thread B

In their scheduling algorithm, the threads on a system are grouped into a best-effort class and a cache-fair class. Best-effort threads are penalized for the sake of performance stability of cache-fair threads if necessary, but not vice versa. However, care is taken that this does not result in inadequate discrimination of best-effort threads. Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners, at the expense of these co-runners if they are best-effort threads. Figure 4 illustrates that process.

Figure 4: Restoring fairness by adjusting timeshares

In order to compute the quantum that the thread is entitled to, the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances, namely the fair L2 cache miss rate, the fair CPI rate and the fair number of instructions (during a quantum). All those values are based on the assumption of fair circumstances, so the difference to the actual values is computed, and a quantum extension is calculated which should serve as "compensation".

Those calculations are done once for new cache-fair class threads: their L2 cache miss rates and other data are measured with different co-runners. Subsequently, the dispatcher periodically selects best-effort threads from which it takes parts of their quanta and assigns them as "compensation" to the cache-fair threads. New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase; instead, whenever new combinations of cache-fair threads with co-runners occur, analytical input is mined from the CPU's hardware counters.
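A minimal sketch of the compensation idea, under the assumption of a simple linear adjustment (the function names and the exact model are illustrative, not the precise algorithm of [22]): if the measured CPI of a cache-fair thread is worse than the estimated fair CPI, its quantum is extended so that it can still retire the fair number of instructions.

/* Quantum adjustment sketch: run long enough to retire the number of
 * instructions the thread would have retired at its fair CPI.
 * fair_cpi comes from the analytical model, actual_cpi from the
 * hardware performance counters. */
double adjusted_quantum(double base_quantum,
                        double fair_cpi,
                        double actual_cpi)
{
    if (actual_cpi <= fair_cpi)
        return base_quantum;       /* thread is not suffering: no change */
    /* instructions retired in a quantum scale with 1/CPI, so scale the
     * quantum by the CPI ratio to compensate the difference            */
    return base_quantum * (actual_cpi / fair_cpi);
}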

The authors of [22] state that according to their measurements the penalties on best-effort threads are low, while the algorithm actually enforces priorities better than standard scheduling and improves fairness. These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor. The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package. The execution times under these circumstances are compared with the times of threads running with co-runners with low cache requirements. This comparison shows differences of up to 37% in execution time between the two scenarios on a system with a standard scheduler; using the cache-fair scheduler, the variability decreases to 7%. At the same time, however, measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8% for some threads (while some even experience a speed-up).

3.2 Balancing core assignment

Fedorova et al. ([27]) argue that during operating system scheduling, several aspects have to be considered besides optimal performance. It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (i.e. unpredictability of a task's completion time) and, as a consequence, insufficient priority enforcement. It could be part of the operating system scheduler's job to ensure that the jobs are evenly assigned to cores. In the approach described in [27], this is done by using a self-tuning scheduling algorithm based on a per-core benefit function. This benefit function is based on three input components:

1) The normalized core preference of a thread, which is based on the instructions per cycle that a thread j can achieve on a certain core i (IPC_{i,j}), normalized by max_k(IPC_{k,j}) (where k is an arbitrary CPU/core)
2) The cache-affinity, a value which is 1 if the thread j was scheduled on core i within a tuneable time period and 0 otherwise
3) The average cache investment of a thread on a core, which is determined by inspecting the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j. For each core, a benefit value B_{i,0} is computed that represents the case of no thread migrations taking place. For each thread k on a core, an updated benefit value B_{i,-k} is computed for the hypothetical scenario that the thread is migrated onto another core. Of course the benefit will increase, since fewer threads are executed on the core. But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core, which would influence the benefit value of the target core. Therefore, the updated benefit value of any other system core j to which the thread in question would be migrated has to be computed as well; it is called B_{j,+k}. The hypothetical migration of thread k from core i to core j becomes reality if

B_{i,-k} + B_{j,+k} > B_{i,0} + B_{j,0} + a·D_{CAB} + b·D_{RTF}

D_{CAB} represents a system-wide balance constraint, while D_{RTF} ensures a per-job response-time fairness (i.e. the slowdown that results for the thread in question from the thread migration does not exceed some maximum value).

These two constants, together with the criteria included in the benefit function itself (most notably cache-affinity), should help to ensure that the self-tuning fulfils the three goals of optimal performance, core-assignment balance and response-time fairness.
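In code form, the migration test could look like the following sketch; the structure and all names are illustrative assumptions, since [27] states the condition only mathematically:

/* Benefit values for a candidate migration of thread k from core i to
 * core j; see the inequality above. */
typedef struct {
    double b_i_minus_k;  /* benefit of core i with thread k removed */
    double b_j_plus_k;   /* benefit of core j with thread k added   */
    double b_i_0;        /* benefit of core i without any migration */
    double b_j_0;        /* benefit of core j without any migration */
} benefit_t;

/* Migrate only if the combined benefit after migration beats the
 * status quo by more than the weighted balance (D_CAB) and
 * response-time-fairness (D_RTF) constraints. */
int should_migrate(benefit_t b,
                   double a, double d_cab,
                   double w, double d_rtf)
{
    return b.b_i_minus_k + b.b_j_plus_k >
           b.b_i_0 + b.b_j_0 + a * d_cab + w * d_rtf;
}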

However, the authors have not yet actually implemented the scheduling modification in Linux, and hence the statements on its effectiveness remain somewhat theoretical.

3.3 Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU. For example, it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized, while multiple slower cores come into play when parallel code parts are executed. By keeping the cores for parallel execution slower than the core(s) for serial execution, die area and cost can be saved ([24]) while power consumption may be reduced. [25] closely analyzes the impact of performance asymmetry on the average speedup of an application as the number of cores increases. The paper concludes that "[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)". Hence, performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures. [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications and don't scale much worse with highly parallel programs (see Figure 5).
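As a concrete illustration of the result from [25], the following sketch evaluates Hill and Marty's speedup model for an asymmetric chip with n base-core equivalents, r of which are fused into one large core whose sequential performance is modelled as perf(r) = sqrt(r); the parameter values chosen here are examples only.

#include <math.h>
#include <stdio.h>

/* Asymmetric-chip speedup as given in [25]: the serial fraction runs on
 * the large core, the parallel fraction on the large core plus the
 * n - r remaining base cores. */
static double asym_speedup(double f, double n, double r)
{
    double perf = sqrt(r);   /* performance of the r-resource core */
    return 1.0 / ((1.0 - f) / perf + f / (perf + n - r));
}

int main(void)
{
    /* example: 256 base-core equivalents, one 16-resource fast core */
    for (int i = 5; i <= 9; i++) {
        double f = i / 10.0;  /* parallelizable fraction of the program */
        printf("parallel fraction %.1f -> speedup %.1f\n",
               f, asym_speedup(f, 256.0, 16.0));
    }
    return 0;
}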

Figure 5: Comparison of speedup with SMP and AMP using highly parallel programs (left), moderately parallel programs (middle) and highly sequential programs (right). (Image source: http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm)

Apart from explicit design decisions, performance asymmetry can also occur in initially symmetric multi-core processors, through either power-saving mechanisms (increasing the C-state) or the failure of transistors that leads to the disabling of parts of a core's components.

Problems arise from the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly. Therefore, the operating system scheduler should support processor performance asymmetry, which is currently not the case for most schedulers. However, it would be imaginable to see this as a Linux scheduling domain in the future.

[5] describes AMPS, an approach to what a performance-asymmetry-aware scheduler for Linux could look like. Basically, the scheduler consists of three components: an asymmetry-specific load balancing, a faster-core-first scheduler, and a migration mechanism specifically for NUMA machines, which will not be covered in detail here. The scheduling mechanism tries to achieve better performance, fairness (with respect to the thread priorities) and repeatability of performance results. In order to conduct load-balancing, the core performance is assessed in a first step: AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1. The scaled load of a core is then the run-queue length of the core divided by its performance quantifier. If this scaled load lies between a minimum and a maximum threshold, the system is considered load-balanced; otherwise threads will be migrated. Through the scaling factor, faster cores will receive more workload than slower ones.

Besides the load-balancing, cores with lower scaled load are preferred in thread placement. Whenever new threads are scheduled, the new scaled load that would apply if the thread were scheduled to one specific core is computed. The thread is then scheduled to the core with the least new scaled load; in case of a tie, faster cores are preferred. Additionally, the load-balancer may migrate threads even if the source core of the migration becomes idle by doing so: AMPS only migrates threads if the new scaled load of the destination core does not exceed the scaled load on the source core. This way, the benefits of available fast cores can be fully exploited without overburdening them. It could be expected that frequent core migration results in performance loss through cache misses (e.g. in the last-level cache); however, experimental results in [5] reveal no excessive performance loss from the fact that task migration among cores occurs more often than in standard schedulers.

Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on an AMP system with two fast and six slow cores (picture taken from [5])

Instead, performance on SMP systems measurably improves (see Figure 6; standard scheduler speed would be 1, the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).
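A sketch of the scaled-load bookkeeping described above; the tie-breaking placement helper follows the textual description of [5], while all names and the data layout are illustrative:

#include <stddef.h>

typedef struct {
    int    runqueue_len;  /* number of runnable threads on the core    */
    double perf;          /* performance quantifier, slowest core = 1.0 */
} core_t;

/* Scaled load: run-queue length divided by the performance quantifier,
 * so a fast core appears less loaded for the same queue length. */
static double scaled_load(const core_t *c)
{
    return c->runqueue_len / c->perf;
}

/* Place a new thread on the core whose scaled load would stay lowest;
 * on a tie, prefer the faster core. */
static size_t pick_core(const core_t *cores, size_t n)
{
    size_t best = 0;
    double best_load = (cores[0].runqueue_len + 1) / cores[0].perf;
    for (size_t i = 1; i < n; i++) {
        double load = (cores[i].runqueue_len + 1) / cores[i].perf;
        if (load < best_load ||
            (load == best_load && cores[i].perf > cores[best].perf)) {
            best = i;
            best_load = load;
        }
    }
    return best;
}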

3.4 Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moore's prediction. Such CPU architectures are going to require novel approaches to thread scheduling, since the available shared memory per core is potentially smaller, while main memory access time increases and single-threaded performance decreases. Even more so than with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for them; and the wide range of application scenarios that have to be considered for such chips. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28]. The runtime environment was built from scratch and is independent of operating system kernels; hence the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors. For programmability, it provides high-level transactions (instead of locks), and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour).

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters, P, Q and T, which respectively denote the number of (logical) processors, task queues and threading abstractions. P and Q change the scheduling behaviour from strict affinity to work stealing; T can be used to specify different behaviour for different threads. This way, the concept of scheduler domains can be realized.

It is notable that the scheduling system does not use pre-emption; instead, threads are expected to yield at certain times. This design choice has been motivated by the authors' belief that pre-emption stems from the need to time-share expensive and fast resources, which will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. Barriers are designed to avoid busy waiting: for example, a thread yields once it has reached the barrier, but won't be re-scheduled until all other threads have reached the barrier. With a pre-emptive scheduling mechanism, the thread would receive the context from time to time just to check whether other threads have reached the barrier; with the integrated barrier support based on a co-operative scheduling approach used in McRT, this won't happen.
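The following pthreads-based sketch approximates such a co-operative barrier: an arriving thread yields instead of spinning hot or being preempted. This is a portable illustration of the idea, not the McRT implementation (which additionally avoids re-scheduling yielded threads altogether); a production version would use atomics for the generation counter.

#include <pthread.h>
#include <sched.h>

typedef struct {
    int             arrived;     /* threads that reached the barrier */
    int             total;       /* number of participating threads  */
    volatile int    generation;  /* bumped when the barrier opens    */
    pthread_mutex_t lock;
} coop_barrier_t;   /* init: arrived = 0, generation = 0, total = N,
                       lock = PTHREAD_MUTEX_INITIALIZER */

void coop_barrier_wait(coop_barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    int gen = b->generation;
    if (++b->arrived == b->total) {    /* last arrival opens the barrier */
        b->arrived = 0;
        b->generation++;
        pthread_mutex_unlock(&b->lock);
        return;
    }
    pthread_mutex_unlock(&b->lock);
    while (b->generation == gen)       /* yield the CPU instead of spinning */
        sched_yield();
}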

The client-side adaptor, e.g. for OpenMP, promises to directly translate many OpenMP constructs to the McRT API.

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with 32 KByte L1 cache shared among 4 simultaneously multithreaded cores, which form one physical core; 32 such physical cores share 2 MByte of L2 cache and 4 MByte of off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and which resource it is waiting for; only when the resource becomes available will the CPU execute the thread. Hence, threads that are waiting on barriers and locks don't consume system resources. It is important to consider that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT.

Figure 7: Speedup of XviD encoding in the McRT framework, compared to single-core performance (picture from [28])

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, and the curve labelled "linear" models the ideal speedup).

However, the encoding process was explicitly tweaked for the many-core scenario: under the condition that only very little fast memory exists per logical core, parallelization wasn't conducted at frame level; instead, single frames were partitioned into parts, which were encoded in parallel using OpenMP.
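Schematically, such intra-frame partitioning could look like the following OpenMP sketch, where encode_slice is a hypothetical stand-in for the real slice encoder:

#include <omp.h>
#include <stdio.h>

#define NUM_SLICES 16

/* hypothetical stand-in for the real per-slice encoding routine */
static void encode_slice(int frame_id, int slice)
{
    printf("frame %d, slice %d encoded on thread %d\n",
           frame_id, slice, omp_get_thread_num());
}

int main(void)
{
    /* intra-frame parallelism: the slices of one frame are encoded
     * concurrently instead of encoding whole frames in parallel */
    #pragma omp parallel for
    for (int s = 0; s < NUM_SLICES; s++)
        encode_slice(0, s);
    return 0;
}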

The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].

4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moore's law can stay valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduler domains of Linux and Solaris add some urgently needed OS flexibility, because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can quite easily be integrated into the load-balancing hierarchy. However, it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves; scheduler domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers won't scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were "designed to run on 1, 2, maybe 4 processors" (see http://www.news.com/8301-10784_3-9722524-7.html), but wouldn't scale beyond that. He seems to be perfectly right when he says that future versions of Windows will have to be fundamentally redesigned.

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. The advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents the way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer/user with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance. [29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT; experimental assessments which could testify to its performance, although planned, haven't been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Achieving fairness and repeatability on today's available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 3.1 and 3.2. The first approach is justified by a number of experimental results which show that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement. Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

The algorithm mentioned in section 3.2 hasn't been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description one can infer that it takes a lot of computations and thread migrations to ensure the load balance, and it would be interesting to see the overhead that the computations and the cache misses imposed by the mechanism put on the system. Without any experimental data, those figures are hard to assess.

5 References

[1] G. E. Moore, "Cramming more components onto integrated circuits", Electronics, Volume 38, Number 8, 1965

[2] "Why Parallel Processing?", http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2+Why+Parallel+Processing.htm

[3] O. Wechsler, "Inside Intel® Core™ Microarchitecture", Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf

[4] "Key Architectural Features: AMD Athlon™ X2 Dual-Core Processors", http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html

[5] Li et al., "Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures". In: International Conference on High Performance Computing, Networking, Storage and Analysis, 2007

[6] Balakrishnan et al., "The Impact of Performance Asymmetry in Emerging Multicore Architectures". In: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506–517, June 2005

[7] M. Annavaram, E. Grochowski and J. Shen, "Mitigating Amdahl's law through EPI throttling". In: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298–309, June 2005

[8] V. Pallipadi, S. B. Siddha, "Processor Power Management features and Process Scheduler: Do we need to tie them together?" In: LinuxConf Europe 2007

[9] S. B. Siddha, "Multi-core and Linux Kernel", http://oss.intel.com/pdf/mclinux.pdf

[10] http://kernelnewbies.org/Linux_2_6_23

[11] http://lwn.net/Articles/230574

[12] J. Andrews, "Linux: The Completely Fair Scheduler", http://kerneltrap.org/node/8059

[13] A. Kumar, "Multiprocessing with the Completely Fair Scheduler", http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

[14] Kernel 2.6.7 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7

[15] Scheduling domains, http://lwn.net/Articles/80911

[16] Kernel 2.6.17 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17

[17] T. Kidd, "C-states, C-states and even more C-states", http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Scheduling, http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html

[19] Solaris manpage FSS(7), http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS

[20] Eric Saxe, "CMT and Solaris Performance", http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris

[21] MSDN section on Windows scheduling, http://msdn.microsoft.com/en-us/library/ms685096(VS.85).aspx

[22] A. Fedorova, M. Seltzer and M. D. Smith, "Cache-Fair Thread Scheduling for Multicore Processors", Technical Report TR-17-06, Harvard University, Oct 2006

[23] S. Kim, D. Chandra and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture". In: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004

[24] D. Menasce and V. Almeida, "Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures". In: Proceedings of the 4th International Conference on Supercomputing, June 1990

[25] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era". In: IEEE Computer, 2008

[26] B. Crepps, "Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing", Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

[27] A. Fedorova, D. Vengerov and D. Doucette, "Operating System Scheduling on Heterogeneous Core Systems", to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007

[28] B. Saha et al., "Enabling Scalability and Performance in a Large Scale CMP Environment", Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007

[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson, "Thread Scheduling for Multi-Core Platforms", in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 10: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

23Solaris scheduler

In the Solaris scheduler terminology processes are called kernel- or user-mode threads dependant on the space in which they run User threads donrsquot only exist in the user space ndash whenever a user thread is created a so-called lightweight process is set up that connects the user thread to a kernel thread These kernel threads are object to schedulingSolaris 10 offers a number of scheduling classes which may co-exist on one system The scheduling classes provide an adaptive model for the specific types of applications which are intended to run on top of the operating system ([18]) mentions Solaris 10 scheduling classes for

bull Standard (timeshare) threads whose priority may be adapted by the schedulerbull Interactive applications (the currently active window in the graphical user interface)bull Fair sharing of system resources (instead of priority-based sharing)bull Fixed-priority threads The priority of threads scheduled with this scheduler does not

vary over the scheduling timebull System (kernel) threadsbull Real-time threads with a fixed priority and time share Threads in this scheduling

class may preempt system threads

The scheduling classes for timeshare fixed-priority and fair sharing are not recommended for simultaneous use on a system while other combinations of scheduling classes on a set of CPUs are possible

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt of trying to identify IO bound processes and providing them with a priority boost Threads have a fixed time quantum they may use once they get the context and receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context Fair share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling Different processes (actually collection of processes or in Solaris 10 terminology projects) compete for quanta on a computing resource and their position in that competition depends on how large the value they have been assigned is in relation to the total quanta number on the computing resource

Solaris explicitly deals with the scenario that parts of the processorrsquos resources may be shared as it is likely with typical multi-core processors There is a kernel abstraction called ldquoprocessor grouprdquo (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket) These groupings can be investigated by the dispatcher eg in order to maintain logical CPU affinity for the purpose of cache-hotness where it is reasonable Quite similar to the concept of Linuxrsquos scheduling domains Solaris 10 tries to simultaneously achieve load balancing on multiple levels (for example if there are physical CPUs with multiple cores and SMT) ([20])

10

Seminar ldquoParallel Computingldquo Summer term 2008

3 Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics many of which are orthogonal (eg maximizing fairness and throughput) The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling Sections 31 and 32 summarize proposals to improve fairness and load-balancing on current multi-core architectures while sections 33 and 34 concentrate on approaches for scheduling on promising new computing architectures such as multi-core processors with asymmetric performance and many-core CPUs

31Cache-Fairness

Several studies (eg [22] [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level 2 cache and thus slow-up threads running on the other core that uses the same cacheThe situation is unsatisfactory due to several reasons First it can lead to unpredictable execution times and throughput and second scheduling priorities may loose their effectiveness because of threads running on cores with aggressive ldquoco-runnersrdquo (ie threads running on another core in the same package) Figure 3 shows such a scenario Thread B uses the larger part of the shared cache and thus maybe negatively influences the cycles per instruction that thread A achieves during its CPU time share

L2-cache-misses are more costly than L1-cache-misses because the latency to the memory is bigger than to the next cache-level However it is mostly the L2-cache that is shared among different coresThe authors of [22] try to mitigate the above mentioned effects by introducing a cache-fairness aware scheduler for the Solaris 10 operating system

Figure 3 Unfair cache utilization by thread B

In their scheduling algorithm the threads on a system are grouped into a best effort class and a cache-fair class Best effort threads are penalized for the sake of performance stability of cache-fair threads if necessary but not vide-versa However it is taken care that this does not result in inadequate discrimination of best effort threads Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners at the expense of these co-runners if they are best effort threads Figure 4 illustrates that process

11

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 4 Restoring fairness by adjusting timeshares

In order to compute the quantum that the thread is entitled to the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances namely the fair L2 cache miss rate the fair CPI rate and the fair number of instructions (during a quantum) All those values are based on the assumption of fair circumstances so the difference to the actual values is computed and a quantum extension is calculated which should serve as ldquocompensationrdquo

Those calculations are done once for new cache-fair class threads ndash their L2 cache miss rates and other data is measured with different co-runners Subsequently the dispatcher periodically selects best-effort threads from whom he takes parts of their quanta and assigns them as ldquocompensationrdquo to the cache-fair threads New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase but whenever new combinations of cache-fair threads with co-runners occur analytical input is mined from the CPUrsquos hardware counters

The authors of [22] state that according to their measurements the penalties on best-effort threads are low while the algorithm actually enforces priority better than standard scheduling and improves fairness These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package The execution times under these circumstances are compared with the times of threads running with co-runners with low-cache requirements This comparison shows differences of up to 37 in execution time between the two scenarios on a system with a standard scheduler while using the cache-fair scheduler the variability decreases to 7At the same time however measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8 for some threads (while some even experience a speed-up)

32Balancing core assignment

Fedorova et al ([27]) argue that during operating system scheduling several aspects have to be considered besides optimal performance It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (ie unpredictability of a tasks completion time) and as a consequence insufficient priority enforcement It could be part of the operating system scheduler to ensure that the jobs are evenly assigned to cores In the approach described in [27] this is done by using a self-tuning scheduling algorithm based on a per-core benefit function This benefit function is based on three input components

12

Seminar ldquoParallel Computingldquo Summer term 2008

1) The normalized core preference of a thread which is based on the instructions per cycle that a thread j can achieve on a certain core i ( ijIPC ) normalized by max(

kjIPC ) (where k is an arbitrary CPUcore)2) The cache-affinity a value which is 1 if the thread j was scheduled on core i within a

tuneable time period and 0 otherwise3) The average cache investment of a thread on a core which is determined by inspecting

the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j For each core a benefit function 0iB is computed that represents the case of no thread migrations taking place For each thread k on a core and updated benefit value for the hypothetical scenario of the migration of the thread onto another core minuskiB is computed Of course the benefit will increase since fewer threads are executed on the core But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core which would influence the benefit value of the target core Therefore also the updated benefit value of any other system core j to which the thread in question would be migrated to has to be computed and is called +kjB The hypothetical migration of thread k from core i to core j becomes reality if

RTFCABjikjki DbDaBBBB 00 +++gt+ +minus CABD represents a system-wide balance-constraint while RTFD ensures a per-job response-time-fairness (ie the slowdown that results for the thread in question from the thread-migration does not exceed some maximum value)

These two constants together with the criterions included in the benefit function itself (most notably cache-affinity) should help to ensure that the self-tuning fulfils the three goals of optimal performance core-assignment balance and response-time-fairness

However the authors have not yet actually implemented the scheduling modification in Linux and hence the statements on its effectiveness remain somehow theoretical

33Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU For example it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized while multiple slower cores come to play when parallel code parts are executed By keeping the cores for parallel execution slower than the core(s) for serial execution die-area and cost can be saved ([24]) while power consumption may be reduced[25] closely analyzes the impact of performance asymmetry on the average speedup of an application with increase of cores The paper concludes that ldquo[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)rdquo Hence performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications but donrsquot scale much worse with highly parallel programs (see Figure 5)

13

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 5 Comparison of speedup with SMP and AMP using highly parallel programs (left) moderately parallel programs (middle) and highly sequential programs (right)1

Apart from explicit design performance asymmetry can also occur in initially symmetric multi-core processors by either power-saving mechanisms (increasing the C-state) or failure of transistors that leads to disabling of parts of the corersquos components

Problems arise by the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly Therefore the operating system scheduler should support processor performance asymmetry which is currently not the case for most schedulers However it would be imaginable to see this as a Linux scheduling domain in the future

[5] describes AMPS an approach on how a performance-asymmetry aware scheduler for Linux could look like Basically the scheduler consists of three components An asymmetry-specific load balancing a faster-core first scheduler and a migration mechanism specifically for NUMA machines that will not be covered in detail here The scheduling mechanism tries to achieve a better performance fairness (with respect to the thread priorities) and repeatability of performance resultsIn order to conduct load-balancing the core performance is assessed in a first step AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1 Then the scaled load of a core is the run-queue length of a core divided by its performance quantifier If this scaled load is within a maximum and a minimum threshold then the system is considered to be load-balanced otherwise threads will be migrated By the scaling factor faster cores will receive more workload than slower ones

Besides the load-balancing cores with lower scaled load are preferred in thread placement Whenever new threads are scheduled the new scaled load that would apply if the thread was scheduled to one specific core is computed Then the thread is scheduled to the core with the least new scaled load in case of a tie faster cores are preferred Additionally the load-balancer may migrate threads even if the source core of migration can become idle by doing it AMPS only migrates threads if the new scaled load of the destination core does not exceed

1 Image source httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

14

Seminar ldquoParallel Computingldquo Summer term 2008

the scaled load on the source-core This way the benefits of available fast cores can be fully exploited while not overburdening themIt can be expected that frequent core migration results in performance loss by cache misses (eg in the level cache) However experimental results in [5] reveal no excessive performance loss by the fact that task migration among cores occurs more often than in standard schedulers

Figure 6 Speedup of the AMPS scheduler compared to Linux scheduler on AMP with two fast and six slow cores2

Instead performance on SMP systems measurably improves (see Figure 6 standard scheduler speed would be 1 median speedup is 116) while fairness and repeatability is preserved better than with standard Linux (there is less deviation in the execution times of different processes)

34Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moorersquos prediction Such CPU architectures are going to require novel approaches in thread scheduling since the available shared memory per core is potentially smaller while main memory access time increases and single-threaded performance decreases Even more so than with current CMP chips schedulers that treat such large scale CMP architectures just like SMP systems are going to fail with respect to performance

[28] identifies three major challenges in scheduling on many-core architectures namely the comparatively small cache sizes which render memory access costly the fact that non-specialized programmers will have to program code for them and the wide range of application scenarios that have to be considered for such chips The latter two challenges result from the projected general-purpose use of future many-core architectures

In order to deal with these challenges an OS-independent experimental runtime-environment called McRT is presented in [28] The runtime environment was built from scratch and is

2 Picture taken from [5]

15

Seminar ldquoParallel Computingldquo Summer term 2008

independent of operating system kernels ndash hence the performed operations occur at user-level The connection to the underlying operating system is established using so called host adapters while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors For programmability it provides high level transactions (instead of locks) and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour)

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability Basically the scheduler is configured using three parameters P Q and T which respectively denote the number of (logical) processors task queues and threading abstractions P and Q change the scheduling behaviour from strict affinity to work stealing T can be used to specify different behaviour for different threads This way the concept of scheduler domains can be realized

It is notable that the scheduling system does not use pre-emption instead threads are expected to yield at certain times This design choice has been motivated by the authorsrsquo belief that pre-emption stems from the need of time-sharing expensive and fast resources which will become obsolete with many-core architectures The runtime actively supports constructs such as barriers which are often needed in parallel programming Barriers are designed to avoid busy waiting ndash for example a thread yields once it has reached the barrier but wonrsquot be re-scheduled until all other threads have reached the barrierWith a pre-emptive scheduling mechanism the thread would receive the context from time to time just to check whether other threads have reached the barrier ndash with the integrated barrier support based on a co-operative scheduling approach used in McRT this wonrsquot happen

The client-side adaptor eg for OpenMP promises to directly translate many OpenMP constructs to the McRT API

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications such as the XviD MPEG4 encoder singular value decomposition (SVD) and neural networks (SOM) The results were gathered on a cycle-accurate many-core simulator with 32 Kbyte L1 cache shared among 4 simultaneous multithreaded cores which form one physical core 32 such physical cores share 2 Mbyte L2 cache and 4 Mbyte off-chip L3 cache The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and the resource it is waiting for Only if the resource becomes available the CPU will execute the thread Hence threads that are waiting barriers and locks donrsquot consume system resources It is important to consider that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT

16

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 7 Speedup of XviD encoding in the McRT framework (compared to single core performance)3

The experiments reveal that XviD encoding scales very well on McRT (Figure 7 1080p and 768p denote different video resolutions the curve labelled ldquolinearrdquo models the ideal speedup)

However the encoding process was explicitly tweaked for the many-core scenario ndash under the condition that only very little fast memory exists per logical core parallelization wasnrsquot conducted at frame level ndash instead single frames were partitioned into parts of frames which were encoded in parallel using OpenMP

The scalability of SVD and SOM is quite similar to that of XviD more figures and evaluations of different scheduling strategies can be found in [28]

3 Picture is from [28]

17

Seminar ldquoParallel Computingldquo Summer term 2008

4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moorersquos law in the future AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moorersquos law can stay valid Operating system schedulers are going to have to adapt to the changing underlying architectures

The scheduler domains of Linux and Solaris add some urgently needed OS-flexibility ndash because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be quite easily integrated into the load-balancing hierarchy however it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves But scheduler domains at least provide the required flexibility at the level of threads

Sadly it has to be said that current Windows schedulers wonrsquot scale with the number of cores or performance asymmetry at any rate Basically the Windows scheduler treats multi-core architecture like SMP systems and hence can not give proper care to the peculiarities of such CPUs like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures) Ty Carlson director of technical strategy at Microsoft even mentioned at a panel discussion that current Windows releases (including Vista) were ldquodesigned to run on 12 maybe 4 processorsrdquo4 but wouldnrsquot scale beyond He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned

Current research shows the road such a redesign could follow The approach described in section 33 seems to perform pretty well for multi-core processors with asymmetric performance The advantage is that the load balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adaption Experimental evaluation of the scheduling algorithm reveals promising results also with regard to fairness

Section 34 presents the way in which futuristic schedulers on upcoming many-core architectures could operate The runtime environment McRT makes use of interesting techniques and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures However their implementation is realized in user-space and burdens the programmeruser with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance[29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT experimental assessments which could testify on its performance although planned havenrsquot been conducted yet It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures since they might gain fundamentally in importance in the future

Achieving fairness and repeatability on todayrsquos available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 31 and 32 The first approach is justified by a number of experimental results that show that priorities are actually enforced much better than with conventional schedulers however it remains to be seen 4 See httpwwwnewscom8301-10784_3-9722524-7html

18

Seminar ldquoParallel Computingldquo Summer term 2008

whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level if possible

The algorithm mentioned in section 32 hasnrsquot been implemented yet so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice From the description one can figure that it takes a lot of computations and thread migrations in order to ensure the load-balance and it would be interesting to see the overhead from the computations and the cache-misses imposed by the mechanism on the system Without any experimental data those figures are hard to assess

19

Seminar ldquoParallel Computingldquo Summer term 2008

5 References

[1] G E Moore bdquoCramming more components onto integrated circuitsldquo Electronics Volume 38 Number 8 1965

[2] bdquoWhy Parallel Processingldquo httpwwwtccornelleduServicesEducationTopicsParallelConcepts2+Why+Parallel+Processinghtm

[3] O Wechsler bdquoInside Intelreg Coretrade Microarchitectureldquo Intel Technology Whitepaper

httpdownloadintelcomtechnologyarchitecturenew_architecture_06pd f


3 Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics, many of which are orthogonal (e.g. maximizing fairness and throughput). The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling. Sections 3.1 and 3.2 summarize proposals to improve fairness and load balancing on current multi-core architectures, while sections 3.3 and 3.4 concentrate on approaches for scheduling on promising new computing architectures, such as multi-core processors with asymmetric performance and many-core CPUs.

3.1 Cache-Fairness

Several studies (e.g. [22], [23]) suggest that operating system schedulers insufficiently deal with threads that allocate large parts of the shared level 2 cache and thus slow down threads running on the other core that uses the same cache. The situation is unsatisfactory for several reasons: first, it can lead to unpredictable execution times and throughput; second, scheduling priorities may lose their effectiveness because of threads running on cores with aggressive "co-runners" (i.e. threads running on another core in the same package). Figure 3 shows such a scenario: thread B uses the larger part of the shared cache and may thus negatively influence the cycles per instruction that thread A achieves during its CPU time share.

L2 cache misses are more costly than L1 cache misses because the latency to memory is higher than that to the next cache level; however, it is mostly the L2 cache that is shared among different cores. The authors of [22] try to mitigate the effects mentioned above by introducing a cache-fairness-aware scheduler for the Solaris 10 operating system.

Figure 3: Unfair cache utilization by thread B

In their scheduling algorithm, the threads on a system are grouped into a best-effort class and a cache-fair class. Best-effort threads are penalized for the sake of performance stability of cache-fair threads if necessary, but not vice versa. However, care is taken that this does not result in inadequate discrimination against best-effort threads. Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners, at the expense of these co-runners if they are best-effort threads. Figure 4 illustrates this process.

Figure 4: Restoring fairness by adjusting time shares

In order to compute the quantum that a thread is entitled to, the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances, namely the fair L2 cache miss rate, the fair CPI rate and the fair number of instructions (during a quantum). Since all those values are based on the assumption of fair circumstances, the difference to the actually measured values is computed, and from it a quantum extension is calculated which serves as "compensation".

Those calculations are done once for new threads in the cache-fair class – their L2 cache miss rates and other data are measured with different co-runners. Subsequently, the dispatcher periodically selects best-effort threads from which it takes parts of their quanta and assigns them as "compensation" to the cache-fair threads. New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase; instead, whenever new combinations of cache-fair threads and co-runners occur, analytical input is mined from the CPU's hardware counters.
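To make the mechanism concrete, the following sketch shows how such a compensation could be computed. All names, structures and the simple deficit heuristic are illustrative assumptions, not the actual Solaris 10 implementation from [22]:

    /* Illustrative sketch of the cache-fair quantum adjustment from [22].
     * All names are hypothetical; the real logic lives in the Solaris 10
     * dispatcher. The analytical model (not shown) estimates how many
     * instructions the thread would have retired per quantum under fair
     * cache sharing; the measured shortfall is turned into extra ticks. */

    #include <stdint.h>

    struct cachefair_state {
        uint64_t fair_instr;    /* model: instructions per quantum if fair */
        uint64_t actual_instr;  /* hardware counter: instructions retired  */
    };

    /* Extra ticks granted to a cache-fair thread; the same amount is later
     * taken from the quanta of selected best-effort co-runners. */
    static long quantum_compensation(const struct cachefair_state *s,
                                     long base_quantum_ticks)
    {
        if (s->actual_instr >= s->fair_instr)
            return 0;                  /* thread is not being slowed down */

        /* Scale the quantum by the fraction of "fair" work that was lost. */
        double deficit = (double)(s->fair_instr - s->actual_instr)
                         / (double)s->fair_instr;
        return (long)(deficit * base_quantum_ticks);
    }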

The authors of [22] state that according to their measurements the penalties on best-effort threads are low, while the algorithm enforces priorities better than standard scheduling and improves fairness. These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor. The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package. The execution times under these circumstances are compared with the times of threads running together with co-runners with low cache requirements. This comparison shows differences of up to 37% in execution time between the two scenarios on a system with a standard scheduler; using the cache-fair scheduler, the variability decreases to 7%. At the same time, however, measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8% for some threads (while some even experience a speed-up).

3.2 Balancing core assignment

Fedorova et al. ([27]) argue that several aspects besides optimal performance have to be considered in operating system scheduling. It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (i.e. unpredictability of a task's completion time) and, as a consequence, insufficient priority enforcement. It can be seen as a responsibility of the operating system scheduler to ensure that jobs are evenly assigned to cores. In the approach described in [27], this is done by using a self-tuning scheduling algorithm based on a per-core benefit function. This benefit function is based on three input components:

1) The normalized core preference of a thread, which is based on the instructions per cycle that a thread j can achieve on a certain core i (IPC_ij), normalized by max_k(IPC_kj) (where k is an arbitrary CPU core)
2) The cache affinity, a value which is 1 if thread j was scheduled on core i within a tuneable time period and 0 otherwise
3) The average cache investment of a thread on a core, which is determined by inspecting the hardware cache-miss counter from time to time

This benefit function can then be used to determine whether it would be beneficial to migrate threads from core i to core j. For each core i, a benefit value B_i,0 is computed that represents the case of no thread migration taking place. For each thread k on that core, an updated benefit value B_i,-k is computed for the hypothetical scenario that thread k is migrated away from the core. Of course the benefit will increase, since fewer threads are then executed on the core. But the thread that was taken away in the hypothetical scenario has to be migrated to another core, which influences the benefit value of the target core. Therefore, the updated benefit value B_j,+k of any other system core j to which the thread in question would be migrated has to be computed as well. The hypothetical migration of thread k from core i to core j becomes reality if

B_i,-k + B_j,+k > B_i,0 + B_j,0 + a*D_CAB + b*D_RTF

where D_CAB represents a system-wide balance constraint, while D_RTF ensures per-job response-time fairness (i.e. the slowdown that results for the thread in question from the thread migration does not exceed some maximum value).

These two constants, together with the criteria included in the benefit function itself (most notably cache affinity), should help to ensure that the self-tuning fulfils the three goals of optimal performance, core-assignment balance and response-time fairness.
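Expressed as code, the migration test could look like the following sketch; the data layout and names are invented for illustration and do not come from [27]:

    /* Hypothetical sketch of the migration test from [27]. The benefit
     * values would come from the per-core benefit function built from
     * core preference, cache affinity and cache investment; here they
     * are simply passed in. */

    struct migration_candidate {
        double b_i0;        /* benefit of source core i without migration */
        double b_j0;        /* benefit of target core j without migration */
        double b_i_minus_k; /* benefit of core i with thread k taken away */
        double b_j_plus_k;  /* benefit of core j with thread k added      */
    };

    /* a, b: tuning weights; d_cab: system-wide balance constraint;
     * d_rtf: per-job response-time fairness term. Returns 1 if migrating
     * thread k from core i to core j is predicted to pay off. */
    int should_migrate(const struct migration_candidate *c,
                       double a, double d_cab, double b, double d_rtf)
    {
        double lhs = c->b_i_minus_k + c->b_j_plus_k;
        double rhs = c->b_i0 + c->b_j0 + a * d_cab + b * d_rtf;
        return lhs > rhs;
    }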

However, the authors have not yet actually implemented the scheduling modification in Linux, and hence the statements on its effectiveness remain somewhat theoretical.

3.3 Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU. For example, it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized, while multiple slower cores come into play when parallel code parts are executed. By keeping the cores for parallel execution slower than the core(s) for serial execution, die area and cost can be saved ([24]) while power consumption may be reduced. [25] closely analyzes the impact of performance asymmetry on the average speedup of an application as the number of cores increases. The paper concludes that "[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)". Hence performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures. [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications and don't scale much worse with highly parallel programs (see Figure 5).
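Hill and Marty's model in [25] makes this claim concrete. Under their assumptions – a chip budget of n base core equivalents (BCEs), one large core built from r BCEs with sequential performance perf(r) (they use perf(r) = sqrt(r)), and the remaining n - r BCEs spent on simple cores of performance 1 – a program with parallel fraction f achieves

    Speedup_asym(f, n, r) = 1 / ( (1 - f)/perf(r) + f/(perf(r) + n - r) )

The first term of the denominator is the sequential part, executed on the fast core alone; in the second term, the fast core and all n - r small cores work on the parallel part together, which is why the asymmetric design never does worse than the symmetric one in this model. As a worked example with n = 64, r = 16 and f = 0.95: perf(16) = 4, so the asymmetric chip reaches 1/(0.05/4 + 0.95/52), a speedup of about 32.5, whereas a symmetric chip of 64 single-BCE cores only reaches 1/(0.05 + 0.95/64), about 15.4.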

Figure 5: Comparison of speedup with SMP and AMP using highly parallel programs (left), moderately parallel programs (middle) and highly sequential programs (right)1

1 Image source: http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

Apart from being an explicit design decision, performance asymmetry can also arise in initially symmetric multi-core processors, either through power-saving mechanisms (increasing the C-state) or through the failure of transistors that leads to parts of a core's components being disabled.

Problems arise from the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly. Therefore, the operating system scheduler should support processor performance asymmetry, which is currently not the case for most schedulers. However, it would be imaginable to see this realized as a Linux scheduling domain in the future.

[5] describes AMPS, an approach to what a performance-asymmetry-aware scheduler for Linux could look like. The scheduler consists of three components: asymmetry-specific load balancing, a faster-core-first scheduling policy, and a migration mechanism specifically for NUMA machines, which will not be covered in detail here. The scheduling mechanism tries to achieve better performance, fairness (with respect to the thread priorities) and repeatability of performance results. In order to conduct load balancing, the core performance is assessed in a first step: AMPS measures core performance at boot time using benchmarks and sets the performance quantifier of the slowest core to 1 and that of faster cores to a number higher than 1. The scaled load of a core is then the core's run-queue length divided by its performance quantifier. If this scaled load lies between a minimum and a maximum threshold, the system is considered to be load-balanced; otherwise threads will be migrated. Through the scaling factor, faster cores receive more workload than slower ones.

Besides the load balancing, cores with lower scaled load are preferred in thread placement. Whenever new threads are scheduled, the new scaled load that would apply if the thread were scheduled to one specific core is computed; the thread is then scheduled to the core with the least new scaled load, and in case of a tie faster cores are preferred. Additionally, the load balancer may migrate threads even if the source core of the migration becomes idle by doing so. AMPS only migrates threads if the new scaled load of the destination core does not exceed the scaled load on the source core. This way the benefits of available fast cores can be fully exploited without overburdening them. One might expect frequent core migration to result in performance loss through cache misses (e.g. in the shared L2 cache); however, experimental results in [5] reveal no excessive performance loss from the fact that task migration among cores occurs more often than in standard schedulers.
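Boiled down to code, the placement rule reads as follows; this is a paraphrase of the description above with invented data structures, not actual AMPS code from [5]:

    /* Illustrative sketch of AMPS-style thread placement [5]: pick the
     * core whose scaled load after accepting the thread would be
     * smallest, preferring faster cores on ties. */

    struct core {
        int    runq_len;   /* current run-queue length              */
        double perf;       /* performance quantifier, slowest = 1.0 */
    };

    static double scaled_load(const struct core *c, int extra_threads)
    {
        return (c->runq_len + extra_threads) / c->perf;
    }

    /* Returns the index of the core a new thread should be placed on. */
    int place_thread(const struct core *cores, int ncores)
    {
        int best = 0;
        double best_load = scaled_load(&cores[0], 1);

        for (int i = 1; i < ncores; i++) {
            double l = scaled_load(&cores[i], 1);
            /* Exact-equality tie-break is good enough for a sketch. */
            if (l < best_load ||
                (l == best_load && cores[i].perf > cores[best].perf)) {
                best = i;
                best_load = l;
            }
        }
        return best;
    }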

Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on an AMP system with two fast and six slow cores2

2 Picture taken from [5]

Instead, performance on AMP systems measurably improves (see Figure 6; the standard scheduler's speed corresponds to 1, and the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).

3.4 Scheduling on many-core architectures

Current research points to the potential of future CPU architectures that consist of a multitude of cores (tens to hundreds) in order to prolong the time the industry can keep up with Moore's prediction. Such CPU architectures are going to require novel approaches in thread scheduling, since the available shared memory per core is potentially smaller, while main memory access time increases and single-threaded performance decreases. Even more so than with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for them; and the wide range of application scenarios that have to be considered for such chips. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28]. The runtime environment was built from scratch and is independent of operating system kernels – hence the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors. For programmability it provides high-level transactions (instead of locks), and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour).

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters P, Q and T, which respectively denote the number of (logical) processors, task queues and threading abstractions. P and Q change the scheduling behaviour between strict affinity and work stealing; T can be used to specify different behaviour for different threads. This way, the concept of scheduler domains can be realized.
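To give a feeling for these knobs, here is a hypothetical configuration sketch. [28] does not publish McRT's concrete API, so the structure, the field names and the shown interpretation of the P/Q ratio are one plausible reading, purely for illustration:

    /* Hypothetical McRT-style scheduler configuration: P processors,
     * Q task queues, T threading abstractions. The ratio of P to Q
     * moves the behaviour between strict affinity and work stealing. */

    struct mcrt_config {
        int P;   /* logical processors managed by the runtime          */
        int Q;   /* task queues                                        */
        int T;   /* threading abstractions (per-thread policy classes) */
    };

    /* Q == P: one queue per processor, strict affinity, no stealing. */
    static const struct mcrt_config affinity_cfg = { .P = 32, .Q = 32, .T = 1 };

    /* Q == 1: a single shared queue; idle processors effectively steal
     * work from everyone, trading cache locality for utilization.     */
    static const struct mcrt_config stealing_cfg = { .P = 32, .Q = 1, .T = 1 };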

It is notable that the scheduling system does not use pre-emption; instead, threads are expected to yield at certain times. This design choice is motivated by the authors' belief that pre-emption stems from the need to time-share expensive and fast resources, which will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. The barriers are designed to avoid busy waiting – for example, a thread yields once it has reached the barrier but won't be re-scheduled until all other threads have reached the barrier. With a pre-emptive scheduling mechanism, the thread would receive the context from time to time just to check whether the other threads have reached the barrier – with the integrated barrier support based on the co-operative scheduling approach used in McRT, this won't happen.
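A minimal sketch of such a co-operative barrier could look as follows. All names are illustrative (not McRT's API), and for brevity the counter updates are not atomic, which a real many-core runtime would have to address:

    /* Sketch of a co-operative barrier in the spirit of McRT ([28]):
     * a thread that reaches the barrier is marked BLOCKED and yields;
     * the user-level scheduler never selects BLOCKED threads, so no
     * CPU time is burned on polling. */

    #include <stddef.h>

    enum state { RUNNABLE, BLOCKED };

    struct thread {
        enum state st;
        /* ... context, stack pointer, etc. ... */
    };

    struct barrier {
        size_t parties;            /* threads that must arrive      */
        size_t arrived;            /* how many have arrived so far  */
        struct thread *waiters[64];
    };

    void yield_to_scheduler(void);        /* provided by the runtime */
    struct thread *current_thread(void);  /* provided by the runtime */

    void barrier_wait(struct barrier *b)
    {
        struct thread *self = current_thread();

        if (++b->arrived < b->parties) {
            /* Not the last arriver: park ourselves, hand the core back. */
            b->waiters[b->arrived - 1] = self;
            self->st = BLOCKED;
            yield_to_scheduler();     /* resumes only after release */
        } else {
            /* Last arriver: wake everyone and reset the barrier. */
            for (size_t i = 0; i + 1 < b->parties; i++)
                b->waiters[i]->st = RUNNABLE;
            b->arrived = 0;
        }
    }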

The client-side adaptor, e.g. for OpenMP, promises to directly translate many OpenMP constructs to the McRT API.

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with 32 KByte of L1 cache shared among four simultaneously multithreaded cores, which together form one physical core; 32 such physical cores share 2 MByte of L2 cache and 4 MByte of off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and which resource it is waiting for; the CPU will only execute the thread again once the resource becomes available. Hence, threads that are waiting on barriers and locks don't consume system resources. It is important to keep in mind that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT.

Figure 7: Speedup of XviD encoding in the McRT framework (compared to single-core performance)3

3 Picture taken from [28]

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, and the curve labelled "linear" models the ideal speedup).

However, the encoding process was explicitly tweaked for the many-core scenario: under the constraint that only very little fast memory exists per logical core, parallelization wasn't conducted at frame level – instead, single frames were partitioned into parts, which were encoded in parallel using OpenMP.

The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].

4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help prolong the time horizon within which Moore's law can stay valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduler domains of Linux and Solaris add some urgently needed OS flexibility, because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be integrated into the load-balancing hierarchy quite easily. However, it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves; scheduler domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers won't scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were "designed to run on 1, 2, maybe 4 processors"4 but wouldn't scale beyond. He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned.

4 See http://www.news.com/8301-10784_3-9722524-7.html

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. The advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents the way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer/user with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance. [29] introduces an easier-to-use thread scheduling mechanism based on the McRT efforts; experimental assessments that could testify to its performance, although planned, haven't been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Achieving fairness and repeatability on today's available multi-core architectures is the major design goal of the scheduling techniques detailed in sections 3.1 and 3.2. The first approach is justified by a number of experimental results showing that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement. Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

The algorithm mentioned in section 3.2 hasn't been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description one can gather that it takes a lot of computation and thread migration to ensure the load balance, and it would be interesting to see the overhead that these computations and the cache misses imposed by the mechanism place on the system. Without any experimental data, those figures are hard to assess.

5 References

[1] G. E. Moore, "Cramming more components onto integrated circuits". Electronics, Volume 38, Number 8, 1965

[2] "Why Parallel Processing". http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2+Why+Parallel+Processing.htm

[3] O. Wechsler, "Inside Intel® Core™ Microarchitecture". Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf

[4] "Key Architectural Features: AMD Athlon™ X2 Dual-Core Processors". http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html

[5] Li et al., "Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures". In International Conference on High Performance Computing, Networking, Storage and Analysis, 2007

[6] Balakrishnan et al., "The Impact of Performance Asymmetry in Emerging Multicore Architectures". In Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506–517, June 2005

[7] M. Annavaram, E. Grochowski and J. Shen, "Mitigating Amdahl's law through EPI throttling". In Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298–309, June 2005

[8] V. Pallipadi, S.B. Siddha, "Processor Power Management features and Process Scheduler: Do we need to tie them together?" In LinuxConf Europe, 2007

[9] S.B. Siddha, "Multi-core and Linux Kernel". http://oss.intel.com/pdf/mclinux.pdf

[10] http://kernelnewbies.org/Linux_2_6_23

[11] http://lwn.net/Articles/230574/

[12] J. Andrews, "Linux: The Completely Fair Scheduler". http://kerneltrap.org/node/8059

[13] A. Kumar, "Multiprocessing with the Completely Fair Scheduler". http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

[14] Kernel 2.6.7 Changelog. http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7

[15] Scheduling domains. http://lwn.net/Articles/80911/

[16] Kernel 2.6.17 Changelog. http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17

[17] T. Kidd, "C-states, C-states and even more C-states". http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states/

[18] Solaris 10 Process Scheduling. http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html

[19] Solaris manpage FSS(7). http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS

[20] Eric Saxe, "CMT and Solaris Performance". http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris

[21] MSDN section on Windows scheduling. http://msdn.microsoft.com/en-us/library/ms685096(VS.85).aspx

[22] A. Fedorova, M. Seltzer and M. D. Smith, "Cache-Fair Thread Scheduling for Multicore Processors". Technical Report TR-17-06, Harvard University, Oct. 2006

[23] S. Kim, D. Chandra and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture". In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004

[24] D. Menasce and V. Almeida, "Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures". In Proceedings of the 4th International Conference on Supercomputing, June 1990

[25] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era". In IEEE Computer, 2008

[26] B. Crepps, "Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing". Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

[27] A. Fedorova, D. Vengerov and D. Doucette, "Operating System Scheduling on Heterogeneous Core Systems". To appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007

[28] B. Saha et al., "Enabling Scalability and Performance in a Large Scale CMP Environment". In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007

[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson, "Thread Scheduling for Multi-Core Platforms". In Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 12: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 4 Restoring fairness by adjusting timeshares

In order to compute the quantum that the thread is entitled to the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances namely the fair L2 cache miss rate the fair CPI rate and the fair number of instructions (during a quantum) All those values are based on the assumption of fair circumstances so the difference to the actual values is computed and a quantum extension is calculated which should serve as ldquocompensationrdquo

Those calculations are done once for new cache-fair class threads ndash their L2 cache miss rates and other data is measured with different co-runners Subsequently the dispatcher periodically selects best-effort threads from whom he takes parts of their quanta and assigns them as ldquocompensationrdquo to the cache-fair threads New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase but whenever new combinations of cache-fair threads with co-runners occur analytical input is mined from the CPUrsquos hardware counters

The authors of [22] state that according to their measurements the penalties on best-effort threads are low while the algorithm actually enforces priority better than standard scheduling and improves fairness These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package The execution times under these circumstances are compared with the times of threads running with co-runners with low-cache requirements This comparison shows differences of up to 37 in execution time between the two scenarios on a system with a standard scheduler while using the cache-fair scheduler the variability decreases to 7At the same time however measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8 for some threads (while some even experience a speed-up)

32Balancing core assignment

Fedorova et al ([27]) argue that during operating system scheduling several aspects have to be considered besides optimal performance It is shown that scheduling tasks on the cores in an imbalanced way results in jittery performance (ie unpredictability of a tasks completion time) and as a consequence insufficient priority enforcement It could be part of the operating system scheduler to ensure that the jobs are evenly assigned to cores In the approach described in [27] this is done by using a self-tuning scheduling algorithm based on a per-core benefit function This benefit function is based on three input components

12

Seminar ldquoParallel Computingldquo Summer term 2008

1) The normalized core preference of a thread which is based on the instructions per cycle that a thread j can achieve on a certain core i ( ijIPC ) normalized by max(

kjIPC ) (where k is an arbitrary CPUcore)2) The cache-affinity a value which is 1 if the thread j was scheduled on core i within a

tuneable time period and 0 otherwise3) The average cache investment of a thread on a core which is determined by inspecting

the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j For each core a benefit function 0iB is computed that represents the case of no thread migrations taking place For each thread k on a core and updated benefit value for the hypothetical scenario of the migration of the thread onto another core minuskiB is computed Of course the benefit will increase since fewer threads are executed on the core But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core which would influence the benefit value of the target core Therefore also the updated benefit value of any other system core j to which the thread in question would be migrated to has to be computed and is called +kjB The hypothetical migration of thread k from core i to core j becomes reality if

RTFCABjikjki DbDaBBBB 00 +++gt+ +minus CABD represents a system-wide balance-constraint while RTFD ensures a per-job response-time-fairness (ie the slowdown that results for the thread in question from the thread-migration does not exceed some maximum value)

These two constants together with the criterions included in the benefit function itself (most notably cache-affinity) should help to ensure that the self-tuning fulfils the three goals of optimal performance core-assignment balance and response-time-fairness

However the authors have not yet actually implemented the scheduling modification in Linux and hence the statements on its effectiveness remain somehow theoretical

33Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU For example it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized while multiple slower cores come to play when parallel code parts are executed By keeping the cores for parallel execution slower than the core(s) for serial execution die-area and cost can be saved ([24]) while power consumption may be reduced[25] closely analyzes the impact of performance asymmetry on the average speedup of an application with increase of cores The paper concludes that ldquo[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)rdquo Hence performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications but donrsquot scale much worse with highly parallel programs (see Figure 5)

13

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 5 Comparison of speedup with SMP and AMP using highly parallel programs (left) moderately parallel programs (middle) and highly sequential programs (right)1

Apart from explicit design performance asymmetry can also occur in initially symmetric multi-core processors by either power-saving mechanisms (increasing the C-state) or failure of transistors that leads to disabling of parts of the corersquos components

Problems arise by the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly Therefore the operating system scheduler should support processor performance asymmetry which is currently not the case for most schedulers However it would be imaginable to see this as a Linux scheduling domain in the future

[5] describes AMPS an approach on how a performance-asymmetry aware scheduler for Linux could look like Basically the scheduler consists of three components An asymmetry-specific load balancing a faster-core first scheduler and a migration mechanism specifically for NUMA machines that will not be covered in detail here The scheduling mechanism tries to achieve a better performance fairness (with respect to the thread priorities) and repeatability of performance resultsIn order to conduct load-balancing the core performance is assessed in a first step AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1 Then the scaled load of a core is the run-queue length of a core divided by its performance quantifier If this scaled load is within a maximum and a minimum threshold then the system is considered to be load-balanced otherwise threads will be migrated By the scaling factor faster cores will receive more workload than slower ones

Besides the load-balancing cores with lower scaled load are preferred in thread placement Whenever new threads are scheduled the new scaled load that would apply if the thread was scheduled to one specific core is computed Then the thread is scheduled to the core with the least new scaled load in case of a tie faster cores are preferred Additionally the load-balancer may migrate threads even if the source core of migration can become idle by doing it AMPS only migrates threads if the new scaled load of the destination core does not exceed

1 Image source httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

14

Seminar ldquoParallel Computingldquo Summer term 2008

the scaled load on the source-core This way the benefits of available fast cores can be fully exploited while not overburdening themIt can be expected that frequent core migration results in performance loss by cache misses (eg in the level cache) However experimental results in [5] reveal no excessive performance loss by the fact that task migration among cores occurs more often than in standard schedulers

Figure 6 Speedup of the AMPS scheduler compared to Linux scheduler on AMP with two fast and six slow cores2

Instead performance on SMP systems measurably improves (see Figure 6 standard scheduler speed would be 1 median speedup is 116) while fairness and repeatability is preserved better than with standard Linux (there is less deviation in the execution times of different processes)

34Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moorersquos prediction Such CPU architectures are going to require novel approaches in thread scheduling since the available shared memory per core is potentially smaller while main memory access time increases and single-threaded performance decreases Even more so than with current CMP chips schedulers that treat such large scale CMP architectures just like SMP systems are going to fail with respect to performance

[28] identifies three major challenges in scheduling on many-core architectures namely the comparatively small cache sizes which render memory access costly the fact that non-specialized programmers will have to program code for them and the wide range of application scenarios that have to be considered for such chips The latter two challenges result from the projected general-purpose use of future many-core architectures

In order to deal with these challenges an OS-independent experimental runtime-environment called McRT is presented in [28] The runtime environment was built from scratch and is

2 Picture taken from [5]

15

Seminar ldquoParallel Computingldquo Summer term 2008

independent of operating system kernels ndash hence the performed operations occur at user-level The connection to the underlying operating system is established using so called host adapters while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors For programmability it provides high level transactions (instead of locks) and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour)

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability Basically the scheduler is configured using three parameters P Q and T which respectively denote the number of (logical) processors task queues and threading abstractions P and Q change the scheduling behaviour from strict affinity to work stealing T can be used to specify different behaviour for different threads This way the concept of scheduler domains can be realized

It is notable that the scheduling system does not use pre-emption instead threads are expected to yield at certain times This design choice has been motivated by the authorsrsquo belief that pre-emption stems from the need of time-sharing expensive and fast resources which will become obsolete with many-core architectures The runtime actively supports constructs such as barriers which are often needed in parallel programming Barriers are designed to avoid busy waiting ndash for example a thread yields once it has reached the barrier but wonrsquot be re-scheduled until all other threads have reached the barrierWith a pre-emptive scheduling mechanism the thread would receive the context from time to time just to check whether other threads have reached the barrier ndash with the integrated barrier support based on a co-operative scheduling approach used in McRT this wonrsquot happen

The client-side adaptor eg for OpenMP promises to directly translate many OpenMP constructs to the McRT API

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications such as the XviD MPEG4 encoder singular value decomposition (SVD) and neural networks (SOM) The results were gathered on a cycle-accurate many-core simulator with 32 Kbyte L1 cache shared among 4 simultaneous multithreaded cores which form one physical core 32 such physical cores share 2 Mbyte L2 cache and 4 Mbyte off-chip L3 cache The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and the resource it is waiting for Only if the resource becomes available the CPU will execute the thread Hence threads that are waiting barriers and locks donrsquot consume system resources It is important to consider that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT

16

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 7 Speedup of XviD encoding in the McRT framework (compared to single core performance)3

The experiments reveal that XviD encoding scales very well on McRT (Figure 7 1080p and 768p denote different video resolutions the curve labelled ldquolinearrdquo models the ideal speedup)

However the encoding process was explicitly tweaked for the many-core scenario ndash under the condition that only very little fast memory exists per logical core parallelization wasnrsquot conducted at frame level ndash instead single frames were partitioned into parts of frames which were encoded in parallel using OpenMP

The scalability of SVD and SOM is quite similar to that of XviD more figures and evaluations of different scheduling strategies can be found in [28]

3 Picture is from [28]

17

Seminar ldquoParallel Computingldquo Summer term 2008

4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moorersquos law in the future AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moorersquos law can stay valid Operating system schedulers are going to have to adapt to the changing underlying architectures

The scheduler domains of Linux and Solaris add some urgently needed OS-flexibility ndash because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be quite easily integrated into the load-balancing hierarchy however it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves But scheduler domains at least provide the required flexibility at the level of threads

Sadly it has to be said that current Windows schedulers wonrsquot scale with the number of cores or performance asymmetry at any rate Basically the Windows scheduler treats multi-core architecture like SMP systems and hence can not give proper care to the peculiarities of such CPUs like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures) Ty Carlson director of technical strategy at Microsoft even mentioned at a panel discussion that current Windows releases (including Vista) were ldquodesigned to run on 12 maybe 4 processorsrdquo4 but wouldnrsquot scale beyond He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned

Current research shows the road such a redesign could follow The approach described in section 33 seems to perform pretty well for multi-core processors with asymmetric performance The advantage is that the load balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adaption Experimental evaluation of the scheduling algorithm reveals promising results also with regard to fairness

Section 34 presents the way in which futuristic schedulers on upcoming many-core architectures could operate The runtime environment McRT makes use of interesting techniques and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures However their implementation is realized in user-space and burdens the programmeruser with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance[29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT experimental assessments which could testify on its performance although planned havenrsquot been conducted yet It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures since they might gain fundamentally in importance in the future

Achieving fairness and repeatability on todayrsquos available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 31 and 32 The first approach is justified by a number of experimental results that show that priorities are actually enforced much better than with conventional schedulers however it remains to be seen 4 See httpwwwnewscom8301-10784_3-9722524-7html

18

Seminar ldquoParallel Computingldquo Summer term 2008

whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level if possible

The algorithm mentioned in section 32 hasnrsquot been implemented yet so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice From the description one can figure that it takes a lot of computations and thread migrations in order to ensure the load-balance and it would be interesting to see the overhead from the computations and the cache-misses imposed by the mechanism on the system Without any experimental data those figures are hard to assess

19

Seminar ldquoParallel Computingldquo Summer term 2008

5 References

[1] G E Moore bdquoCramming more components onto integrated circuitsldquo Electronics Volume 38 Number 8 1965

[2] bdquoWhy Parallel Processingldquo httpwwwtccornelleduServicesEducationTopicsParallelConcepts2+Why+Parallel+Processinghtm

[3] O Wechsler bdquoInside Intelreg Coretrade Microarchitectureldquo Intel Technology Whitepaper

httpdownloadintelcomtechnologyarchitecturenew_architecture_06pd f

[4] bdquoKey Architectural Features AMD Athlontrade X2 Dual-Core Processorsldquo httpwwwamdcomus-enProcessorsProductInformation030_118_9485_130415E1304300html

[5] Li et al bdquoEfficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architecturesldquo In International conference on high performance computing networking storage and analysis 2007

[6] Balakrishnan et al bdquoThe Impact of Performance Asymmetry in Emerging Multicore Architecturesrdquo In Proceedings of the 32nd Annual International Symposium on Com-puter Architecture pages 506ndash517 June 2005

[7] M Annavaram E Grochowski and J Shen ldquoMitigating Amdahlrsquos law through EPI throttlingrdquo In Proceedings of the 32nd Annual International Symposium on ComputerArchitecture pages 298ndash309 June 2005

[8] V Pallipadi SB Siddha ldquoProcessor Power Management features and ProcessScheduler Do we need to tie them togetherrdquo In LinuxConf Europe 2007

[9] SB Siddha ldquoMulti-core and Linux Kernelrdquo httpossintelcompdfmclinuxpdf

[10] httpkernelnewbiesorgLinux_2_6_23

[11] httplwnnetArticles230574

[12] J Andrews ldquoLinux The Completely Fair Schedulerldquo httpkerneltraporgnode8059

[13] A Kumar ldquoMultiprocessing with the Completely Fair Schedulerrdquohttpwwwibmcomdeveloperworkslinuxlibraryl-cfsindexhtml

[14] Kernel 267 Changelog httpwwwkernelorgpublinuxkernelv26ChangeLog-267

[15] Scheduling domainshttplwnnetArticles80911

[16] Kernel 2617 Changeloghttpwwwkernelorgpublinuxkernelv26ChangeLog-2617

20

Seminar ldquoParallel Computingldquo Summer term 2008

[17] T Kidd ldquoC-states C-states and even more C-statesrdquo httpsoftwareblogsintelcom20080327update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Schedulinghttpwwwprincetonedu~unixSolaristroubleshootschedulehtml

[19] Solaris manpage FSS(7)httpdocssuncomappdocsdoc816-5177fss-7l=deampa=viewampq=FSS

[20] Eric Saxe ldquoCMT and Solaris Performancerdquohttpblogssuncomesaxeentrycmt_performance_enhancements_in_solaris

[21] MSDN section on Windows schedulinghttpmsdnmicrosoftcomen-uslibraryms68509628VS8529aspx

[22] A Fedorova M Seltzer and M D Smith ldquoCache-Fair Thread Scheduling for Multicore Processorsrdquo Technical Report TR-17-06 Harvard University Oct 2006

[23] S Kim D Chandra and Y Solihin ldquoFair Cache Sharing and Partitioning in a Chip Multiprocessor Architecturerdquo In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques 2004

[24] D Menasce and V Almeida ldquoCost-Performance Analysis of Heterogeneity in Supercomputer Architecturesrdquo In Proceedings of the 4th International Conference on Supercomputing June 1990

[25] M D Hill and M R Marty ldquoAmdahlrsquos Law in the Multicore Erardquo In IEEE Computer 2008

[26] B Crepps ldquoImproving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessingrdquo Intel Technology Magazine httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

[27] A Fedorova D Vengerov and D Doucette ldquoOperating System Scheduling on Heterogeneous Core Systemsrdquo to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures 2007

[28] B Saha et al ldquoEnabling Scalability and Performance in a Large Scale CMP Environmentrdquo Proceedings of the 2nd ACM SIGOPSEuroSys European Conference on Computer Systems 2007

[29] M Rajagopalan B T Lewis and T A Anderson ldquoThread Scheduling for Multi-Core Platformsrdquo in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems 2007

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 13: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

1) The normalized core preference of a thread which is based on the instructions per cycle that a thread j can achieve on a certain core i ( ijIPC ) normalized by max(

kjIPC ) (where k is an arbitrary CPUcore)2) The cache-affinity a value which is 1 if the thread j was scheduled on core i within a

tuneable time period and 0 otherwise3) The average cache investment of a thread on a core which is determined by inspecting

the hardware cache miss counter from time to time

This benefit function can then be used to determine whether it would be feasible to migrate threads from core i to core j For each core a benefit function 0iB is computed that represents the case of no thread migrations taking place For each thread k on a core and updated benefit value for the hypothetical scenario of the migration of the thread onto another core minuskiB is computed Of course the benefit will increase since fewer threads are executed on the core But the thread that would have been taken away in the hypothetical scenario would have to be migrated to another core which would influence the benefit value of the target core Therefore also the updated benefit value of any other system core j to which the thread in question would be migrated to has to be computed and is called +kjB The hypothetical migration of thread k from core i to core j becomes reality if

RTFCABjikjki DbDaBBBB 00 +++gt+ +minus CABD represents a system-wide balance-constraint while RTFD ensures a per-job response-time-fairness (ie the slowdown that results for the thread in question from the thread-migration does not exceed some maximum value)

These two constants together with the criterions included in the benefit function itself (most notably cache-affinity) should help to ensure that the self-tuning fulfils the three goals of optimal performance core-assignment balance and response-time-fairness

However the authors have not yet actually implemented the scheduling modification in Linux and hence the statements on its effectiveness remain somehow theoretical

33Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU For example it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized while multiple slower cores come to play when parallel code parts are executed By keeping the cores for parallel execution slower than the core(s) for serial execution die-area and cost can be saved ([24]) while power consumption may be reduced[25] closely analyzes the impact of performance asymmetry on the average speedup of an application with increase of cores The paper concludes that ldquo[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)rdquo Hence performance asymmetry at the processor core level seems to be a promising approach for future multi-core architectures [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications but donrsquot scale much worse with highly parallel programs (see Figure 5)

13

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 5 Comparison of speedup with SMP and AMP using highly parallel programs (left) moderately parallel programs (middle) and highly sequential programs (right)1

Apart from explicit design performance asymmetry can also occur in initially symmetric multi-core processors by either power-saving mechanisms (increasing the C-state) or failure of transistors that leads to disabling of parts of the corersquos components

Problems arise by the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly Therefore the operating system scheduler should support processor performance asymmetry which is currently not the case for most schedulers However it would be imaginable to see this as a Linux scheduling domain in the future

[5] describes AMPS an approach on how a performance-asymmetry aware scheduler for Linux could look like Basically the scheduler consists of three components An asymmetry-specific load balancing a faster-core first scheduler and a migration mechanism specifically for NUMA machines that will not be covered in detail here The scheduling mechanism tries to achieve a better performance fairness (with respect to the thread priorities) and repeatability of performance resultsIn order to conduct load-balancing the core performance is assessed in a first step AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1 Then the scaled load of a core is the run-queue length of a core divided by its performance quantifier If this scaled load is within a maximum and a minimum threshold then the system is considered to be load-balanced otherwise threads will be migrated By the scaling factor faster cores will receive more workload than slower ones

Besides the load-balancing cores with lower scaled load are preferred in thread placement Whenever new threads are scheduled the new scaled load that would apply if the thread was scheduled to one specific core is computed Then the thread is scheduled to the core with the least new scaled load in case of a tie faster cores are preferred Additionally the load-balancer may migrate threads even if the source core of migration can become idle by doing it AMPS only migrates threads if the new scaled load of the destination core does not exceed

1 Image source httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

14

Seminar ldquoParallel Computingldquo Summer term 2008

the scaled load on the source-core This way the benefits of available fast cores can be fully exploited while not overburdening themIt can be expected that frequent core migration results in performance loss by cache misses (eg in the level cache) However experimental results in [5] reveal no excessive performance loss by the fact that task migration among cores occurs more often than in standard schedulers

Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on an AMP system with two fast and six slow cores (picture taken from [5])

Instead, performance on the AMP system measurably improves (see Figure 6; standard scheduler speed would be 1, the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).

3.4 Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of cores (tens to hundreds) in order to prolong the time the industry can keep up with Moore's prediction. Such CPU architectures are going to require novel approaches in thread scheduling, since the available shared memory per core is potentially smaller, while main memory access time increases and single-threaded performance decreases. Even more so than with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for them; and the wide range of application scenarios that have to be considered for such chips. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28]. The runtime environment was built from scratch and is independent of operating system kernels; hence the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adaptors, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors. For programmability it provides high-level transactions (instead of locks), and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour).

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters, P, Q and T, which respectively denote the number of (logical) processors, task queues and threading abstractions. P and Q change the scheduling behaviour from strict affinity to work stealing; T can be used to specify different behaviour for different threads. This way the concept of scheduler domains can be realized.
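A hypothetical rendering of this configuration idea follows; [28] does not publish the McRT API at this level of detail, so all names here are invented. The intuition is that Q == P gives strict per-processor affinity, while fewer queues than processors force sharing and stealing:

struct task;                                /* opaque work item       */
struct queue;                               /* opaque task queue      */
struct task *queue_pop(struct queue *q);    /* take from own queue    */
struct task *queue_steal(struct queue *q);  /* steal from another one */

struct mcrt_config {
    int p;  /* logical processors */
    int q;  /* task queues: q == p -> strict affinity,
               q == 1  -> one shared queue everybody draws from */
    int t;  /* threading abstractions, each with its own policy */
};

struct scheduler { struct mcrt_config cfg; struct queue *queues; };

/* Worker i serves its own queue first and falls back to stealing. */
static struct task *next_task(struct scheduler *s, int worker)
{
    struct task *t = queue_pop(&s->queues[worker % s->cfg.q]);
    for (int k = 1; t == NULL && k < s->cfg.q; k++)
        t = queue_steal(&s->queues[(worker + k) % s->cfg.q]);
    return t;
}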

It is notable that the scheduling system does not use pre-emption; instead, threads are expected to yield at certain times. This design choice is motivated by the authors' belief that pre-emption stems from the need for time-sharing expensive and fast resources, which will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. Barriers are designed to avoid busy waiting: for example, a thread yields once it has reached the barrier but won't be re-scheduled until all other threads have reached the barrier. With a pre-emptive scheduling mechanism, the thread would receive the context from time to time just to check whether other threads have reached the barrier; with the integrated barrier support based on the co-operative scheduling approach used in McRT, this won't happen.
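The described barrier behaviour might be sketched as follows (hypothetical primitives, not McRT's real interfaces; synchronization of the counter is elided for brevity):

struct thread;
struct barrier {
    int needed;   /* threads that must arrive */
    int arrived;  /* threads arrived so far   */
};

void block_on(struct barrier *b, struct thread *self); /* mark not runnable  */
void wake_all_waiters(struct barrier *b);              /* make runnable again */
void yield_to_scheduler(void);                         /* co-operative yield  */

/* A thread that reaches the barrier deschedules itself instead of
   busy-waiting; the last arrival wakes everybody up, so no thread is
   ever scheduled just to poll the barrier. */
void barrier_wait(struct barrier *b, struct thread *self)
{
    if (++b->arrived == b->needed) {
        b->arrived = 0;
        wake_all_waiters(b);
    } else {
        block_on(b, self);
        yield_to_scheduler();
    }
}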

The client-side adaptor, e.g. for OpenMP, promises to directly translate many OpenMP constructs to the McRT API.

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with 32 KByte L1 cache shared among 4 simultaneously multithreaded cores, which form one physical core; 32 such physical cores share 2 MByte L2 cache and 4 MByte off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and which resource it is waiting for. Only when the resource becomes available will the CPU execute the thread again; hence, threads that are waiting on barriers and locks don't consume system resources. It is important to keep in mind that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT.
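To illustrate how such a cost-free wait primitive changes blocking, consider this sketch; mwait_on is a stand-in for the simulator's instruction, not a real x86 MONITOR/MWAIT sequence:

/* Hypothetical: block the calling thread and tell the CPU which
   address it waits on; the thread is not executed again until the
   value at addr changes. */
void mwait_on(volatile int *addr);

/* A lock acquired this way consumes no cycles while contended. */
void lock_acquire(volatile int *lock_word)
{
    while (__sync_lock_test_and_set(lock_word, 1))
        mwait_on(lock_word);
}

void lock_release(volatile int *lock_word)
{
    __sync_lock_release(lock_word);
}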


Figure 7: Speedup of XviD encoding in the McRT framework, compared to single-core performance (picture taken from [28])

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, and the curve labelled "linear" models the ideal speedup).

However, the encoding process was explicitly tweaked for the many-core scenario: under the condition that only very little fast memory exists per logical core, parallelization wasn't conducted at frame level; instead, single frames were partitioned into parts which were encoded in parallel using OpenMP, as sketched below.
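The described partitioning might look like the following OpenMP sketch; frame_t and encode_slice are invented placeholders, since the actual XviD modifications in [28] are not published in this form:

#include <omp.h>

typedef struct frame frame_t;                           /* opaque frame  */
void encode_slice(frame_t *f, int slice, int nslices);  /* hypothetical  */

/* Instead of encoding whole frames in parallel, each frame is cut
   into slices small enough for the little per-core caches, and the
   slices are encoded in parallel. */
void encode_frame(frame_t *f, int nslices)
{
    #pragma omp parallel for
    for (int s = 0; s < nslices; s++)
        encode_slice(f, s, nslices);
}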

The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].


4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help prolong the time horizon within which Moore's law can stay valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduler domains of Linux and Solaris add some urgently needed OS flexibility, because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be integrated into the load-balancing hierarchy quite easily. However, it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves; scheduler domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers won't scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were "designed to run on 1, 2, maybe 4 processors" (see http://www.news.com/8301-10784_3-9722524-7.html) but wouldn't scale beyond that. He seems to be perfectly right when he says that future versions of Windows will have to be fundamentally redesigned.

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. The advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents the way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer/user with a multitude of configuration options and programming decisions that are required for the framework to guarantee optimal performance. [29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT; experimental assessments which could testify to its performance, although planned, haven't been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Achieving fairness and repeatability on today's available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 3.1 and 3.2. The first approach is justified by a number of experimental results which show that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen whether the complexity of the scheduling approach and the amount of overhead it potentially introduces justify that improvement. Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

The algorithm mentioned in section 3.2 hasn't been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description one can gather that it takes a lot of computations and thread migrations to ensure load balance, and it would be interesting to see the overhead that these computations and the cache misses imposed by the mechanism place on the system. Without any experimental data, those figures are hard to assess.


5 References

[1] G. E. Moore, "Cramming more components onto integrated circuits", Electronics, Volume 38, Number 8, 1965.

[2] "Why Parallel Processing?", http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2+Why+Parallel+Processing.htm

[3] O. Wechsler, "Inside Intel® Core™ Microarchitecture", Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf

[4] "Key Architectural Features: AMD Athlon™ X2 Dual-Core Processors", http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html

[5] T. Li et al., "Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures", in International Conference on High Performance Computing, Networking, Storage and Analysis, 2007.

[6] S. Balakrishnan et al., "The Impact of Performance Asymmetry in Emerging Multicore Architectures", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506–517, June 2005.

[7] M. Annavaram, E. Grochowski and J. Shen, "Mitigating Amdahl's law through EPI throttling", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298–309, June 2005.

[8] V. Pallipadi and S. B. Siddha, "Processor Power Management features and Process Scheduler: Do we need to tie them together?", in LinuxConf Europe, 2007.

[9] S. B. Siddha, "Multi-core and Linux Kernel", http://oss.intel.com/pdf/mclinux.pdf

[10] http://kernelnewbies.org/Linux_2_6_23

[11] http://lwn.net/Articles/230574

[12] J. Andrews, "Linux: The Completely Fair Scheduler", http://kerneltrap.org/node/8059

[13] A. Kumar, "Multiprocessing with the Completely Fair Scheduler", http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

[14] Kernel 2.6.7 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7

[15] Scheduling domains, http://lwn.net/Articles/80911

[16] Kernel 2.6.17 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17

[17] T. Kidd, "C-states, C-states and even more C-states", http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Scheduling, http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html

[19] Solaris manpage FSS(7), http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS

[20] E. Saxe, "CMT and Solaris Performance", http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris

[21] MSDN section on Windows scheduling, http://msdn.microsoft.com/en-us/library/ms685096(VS.85).aspx

[22] A. Fedorova, M. Seltzer and M. D. Smith, "Cache-Fair Thread Scheduling for Multicore Processors", Technical Report TR-17-06, Harvard University, October 2006.

[23] S. Kim, D. Chandra and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004.

[24] D. Menasce and V. Almeida, "Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures", in Proceedings of the 4th International Conference on Supercomputing, June 1990.

[25] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era", in IEEE Computer, 2008.

[26] B. Crepps, "Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing", Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

[27] A. Fedorova, D. Vengerov and D. Doucette, "Operating System Scheduling on Heterogeneous Core Systems", to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007.

[28] B. Saha et al., "Enabling Scalability and Performance in a Large Scale CMP Environment", in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007.

[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson, "Thread Scheduling for Multi-Core Platforms", in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007.

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 14: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 5 Comparison of speedup with SMP and AMP using highly parallel programs (left) moderately parallel programs (middle) and highly sequential programs (right)1

Apart from explicit design performance asymmetry can also occur in initially symmetric multi-core processors by either power-saving mechanisms (increasing the C-state) or failure of transistors that leads to disabling of parts of the corersquos components

Problems arise by the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly Therefore the operating system scheduler should support processor performance asymmetry which is currently not the case for most schedulers However it would be imaginable to see this as a Linux scheduling domain in the future

[5] describes AMPS an approach on how a performance-asymmetry aware scheduler for Linux could look like Basically the scheduler consists of three components An asymmetry-specific load balancing a faster-core first scheduler and a migration mechanism specifically for NUMA machines that will not be covered in detail here The scheduling mechanism tries to achieve a better performance fairness (with respect to the thread priorities) and repeatability of performance resultsIn order to conduct load-balancing the core performance is assessed in a first step AMPS measures core performance at boot time using benchmarks and sets the performance quantifier for the slowest core to 1 and for faster cores to a number higher than 1 Then the scaled load of a core is the run-queue length of a core divided by its performance quantifier If this scaled load is within a maximum and a minimum threshold then the system is considered to be load-balanced otherwise threads will be migrated By the scaling factor faster cores will receive more workload than slower ones

Besides the load-balancing cores with lower scaled load are preferred in thread placement Whenever new threads are scheduled the new scaled load that would apply if the thread was scheduled to one specific core is computed Then the thread is scheduled to the core with the least new scaled load in case of a tie faster cores are preferred Additionally the load-balancer may migrate threads even if the source core of migration can become idle by doing it AMPS only migrates threads if the new scaled load of the destination core does not exceed

1 Image source httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

14

Seminar ldquoParallel Computingldquo Summer term 2008

the scaled load on the source-core This way the benefits of available fast cores can be fully exploited while not overburdening themIt can be expected that frequent core migration results in performance loss by cache misses (eg in the level cache) However experimental results in [5] reveal no excessive performance loss by the fact that task migration among cores occurs more often than in standard schedulers

Figure 6 Speedup of the AMPS scheduler compared to Linux scheduler on AMP with two fast and six slow cores2

Instead performance on SMP systems measurably improves (see Figure 6 standard scheduler speed would be 1 median speedup is 116) while fairness and repeatability is preserved better than with standard Linux (there is less deviation in the execution times of different processes)

34Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moorersquos prediction Such CPU architectures are going to require novel approaches in thread scheduling since the available shared memory per core is potentially smaller while main memory access time increases and single-threaded performance decreases Even more so than with current CMP chips schedulers that treat such large scale CMP architectures just like SMP systems are going to fail with respect to performance

[28] identifies three major challenges in scheduling on many-core architectures namely the comparatively small cache sizes which render memory access costly the fact that non-specialized programmers will have to program code for them and the wide range of application scenarios that have to be considered for such chips The latter two challenges result from the projected general-purpose use of future many-core architectures

In order to deal with these challenges an OS-independent experimental runtime-environment called McRT is presented in [28] The runtime environment was built from scratch and is

2 Picture taken from [5]

15

Seminar ldquoParallel Computingldquo Summer term 2008

independent of operating system kernels ndash hence the performed operations occur at user-level The connection to the underlying operating system is established using so called host adapters while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors For programmability it provides high level transactions (instead of locks) and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour)

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability Basically the scheduler is configured using three parameters P Q and T which respectively denote the number of (logical) processors task queues and threading abstractions P and Q change the scheduling behaviour from strict affinity to work stealing T can be used to specify different behaviour for different threads This way the concept of scheduler domains can be realized

It is notable that the scheduling system does not use pre-emption instead threads are expected to yield at certain times This design choice has been motivated by the authorsrsquo belief that pre-emption stems from the need of time-sharing expensive and fast resources which will become obsolete with many-core architectures The runtime actively supports constructs such as barriers which are often needed in parallel programming Barriers are designed to avoid busy waiting ndash for example a thread yields once it has reached the barrier but wonrsquot be re-scheduled until all other threads have reached the barrierWith a pre-emptive scheduling mechanism the thread would receive the context from time to time just to check whether other threads have reached the barrier ndash with the integrated barrier support based on a co-operative scheduling approach used in McRT this wonrsquot happen

The client-side adaptor eg for OpenMP promises to directly translate many OpenMP constructs to the McRT API

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications such as the XviD MPEG4 encoder singular value decomposition (SVD) and neural networks (SOM) The results were gathered on a cycle-accurate many-core simulator with 32 Kbyte L1 cache shared among 4 simultaneous multithreaded cores which form one physical core 32 such physical cores share 2 Mbyte L2 cache and 4 Mbyte off-chip L3 cache The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and the resource it is waiting for Only if the resource becomes available the CPU will execute the thread Hence threads that are waiting barriers and locks donrsquot consume system resources It is important to consider that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT

16

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 7 Speedup of XviD encoding in the McRT framework (compared to single core performance)3

The experiments reveal that XviD encoding scales very well on McRT (Figure 7 1080p and 768p denote different video resolutions the curve labelled ldquolinearrdquo models the ideal speedup)

However the encoding process was explicitly tweaked for the many-core scenario ndash under the condition that only very little fast memory exists per logical core parallelization wasnrsquot conducted at frame level ndash instead single frames were partitioned into parts of frames which were encoded in parallel using OpenMP

The scalability of SVD and SOM is quite similar to that of XviD more figures and evaluations of different scheduling strategies can be found in [28]

3 Picture is from [28]

17

Seminar ldquoParallel Computingldquo Summer term 2008

4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moorersquos law in the future AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moorersquos law can stay valid Operating system schedulers are going to have to adapt to the changing underlying architectures

The scheduler domains of Linux and Solaris add some urgently needed OS-flexibility ndash because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be quite easily integrated into the load-balancing hierarchy however it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves But scheduler domains at least provide the required flexibility at the level of threads

Sadly it has to be said that current Windows schedulers wonrsquot scale with the number of cores or performance asymmetry at any rate Basically the Windows scheduler treats multi-core architecture like SMP systems and hence can not give proper care to the peculiarities of such CPUs like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures) Ty Carlson director of technical strategy at Microsoft even mentioned at a panel discussion that current Windows releases (including Vista) were ldquodesigned to run on 12 maybe 4 processorsrdquo4 but wouldnrsquot scale beyond He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned

Current research shows the road such a redesign could follow The approach described in section 33 seems to perform pretty well for multi-core processors with asymmetric performance The advantage is that the load balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adaption Experimental evaluation of the scheduling algorithm reveals promising results also with regard to fairness

Section 34 presents the way in which futuristic schedulers on upcoming many-core architectures could operate The runtime environment McRT makes use of interesting techniques and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures However their implementation is realized in user-space and burdens the programmeruser with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance[29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT experimental assessments which could testify on its performance although planned havenrsquot been conducted yet It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures since they might gain fundamentally in importance in the future

Achieving fairness and repeatability on todayrsquos available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 31 and 32 The first approach is justified by a number of experimental results that show that priorities are actually enforced much better than with conventional schedulers however it remains to be seen 4 See httpwwwnewscom8301-10784_3-9722524-7html

18

Seminar ldquoParallel Computingldquo Summer term 2008

whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level if possible

The algorithm mentioned in section 32 hasnrsquot been implemented yet so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice From the description one can figure that it takes a lot of computations and thread migrations in order to ensure the load-balance and it would be interesting to see the overhead from the computations and the cache-misses imposed by the mechanism on the system Without any experimental data those figures are hard to assess

19

Seminar ldquoParallel Computingldquo Summer term 2008

5 References

[1] G E Moore bdquoCramming more components onto integrated circuitsldquo Electronics Volume 38 Number 8 1965

[2] bdquoWhy Parallel Processingldquo httpwwwtccornelleduServicesEducationTopicsParallelConcepts2+Why+Parallel+Processinghtm

[3] O Wechsler bdquoInside Intelreg Coretrade Microarchitectureldquo Intel Technology Whitepaper

httpdownloadintelcomtechnologyarchitecturenew_architecture_06pd f

[4] bdquoKey Architectural Features AMD Athlontrade X2 Dual-Core Processorsldquo httpwwwamdcomus-enProcessorsProductInformation030_118_9485_130415E1304300html

[5] Li et al bdquoEfficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architecturesldquo In International conference on high performance computing networking storage and analysis 2007

[6] Balakrishnan et al bdquoThe Impact of Performance Asymmetry in Emerging Multicore Architecturesrdquo In Proceedings of the 32nd Annual International Symposium on Com-puter Architecture pages 506ndash517 June 2005

[7] M Annavaram E Grochowski and J Shen ldquoMitigating Amdahlrsquos law through EPI throttlingrdquo In Proceedings of the 32nd Annual International Symposium on ComputerArchitecture pages 298ndash309 June 2005

[8] V Pallipadi SB Siddha ldquoProcessor Power Management features and ProcessScheduler Do we need to tie them togetherrdquo In LinuxConf Europe 2007

[9] SB Siddha ldquoMulti-core and Linux Kernelrdquo httpossintelcompdfmclinuxpdf

[10] httpkernelnewbiesorgLinux_2_6_23

[11] httplwnnetArticles230574

[12] J Andrews ldquoLinux The Completely Fair Schedulerldquo httpkerneltraporgnode8059

[13] A Kumar ldquoMultiprocessing with the Completely Fair Schedulerrdquohttpwwwibmcomdeveloperworkslinuxlibraryl-cfsindexhtml

[14] Kernel 267 Changelog httpwwwkernelorgpublinuxkernelv26ChangeLog-267

[15] Scheduling domainshttplwnnetArticles80911

[16] Kernel 2617 Changeloghttpwwwkernelorgpublinuxkernelv26ChangeLog-2617

20

Seminar ldquoParallel Computingldquo Summer term 2008

[17] T Kidd ldquoC-states C-states and even more C-statesrdquo httpsoftwareblogsintelcom20080327update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Schedulinghttpwwwprincetonedu~unixSolaristroubleshootschedulehtml

[19] Solaris manpage FSS(7)httpdocssuncomappdocsdoc816-5177fss-7l=deampa=viewampq=FSS

[20] Eric Saxe ldquoCMT and Solaris Performancerdquohttpblogssuncomesaxeentrycmt_performance_enhancements_in_solaris

[21] MSDN section on Windows schedulinghttpmsdnmicrosoftcomen-uslibraryms68509628VS8529aspx

[22] A Fedorova M Seltzer and M D Smith ldquoCache-Fair Thread Scheduling for Multicore Processorsrdquo Technical Report TR-17-06 Harvard University Oct 2006

[23] S Kim D Chandra and Y Solihin ldquoFair Cache Sharing and Partitioning in a Chip Multiprocessor Architecturerdquo In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques 2004

[24] D Menasce and V Almeida ldquoCost-Performance Analysis of Heterogeneity in Supercomputer Architecturesrdquo In Proceedings of the 4th International Conference on Supercomputing June 1990

[25] M D Hill and M R Marty ldquoAmdahlrsquos Law in the Multicore Erardquo In IEEE Computer 2008

[26] B Crepps ldquoImproving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessingrdquo Intel Technology Magazine httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

[27] A Fedorova D Vengerov and D Doucette ldquoOperating System Scheduling on Heterogeneous Core Systemsrdquo to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures 2007

[28] B Saha et al ldquoEnabling Scalability and Performance in a Large Scale CMP Environmentrdquo Proceedings of the 2nd ACM SIGOPSEuroSys European Conference on Computer Systems 2007

[29] M Rajagopalan B T Lewis and T A Anderson ldquoThread Scheduling for Multi-Core Platformsrdquo in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems 2007

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 15: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

the scaled load on the source-core This way the benefits of available fast cores can be fully exploited while not overburdening themIt can be expected that frequent core migration results in performance loss by cache misses (eg in the level cache) However experimental results in [5] reveal no excessive performance loss by the fact that task migration among cores occurs more often than in standard schedulers

Figure 6 Speedup of the AMPS scheduler compared to Linux scheduler on AMP with two fast and six slow cores2

Instead performance on SMP systems measurably improves (see Figure 6 standard scheduler speed would be 1 median speedup is 116) while fairness and repeatability is preserved better than with standard Linux (there is less deviation in the execution times of different processes)

34Scheduling on many-core architectures

Current research points at the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moorersquos prediction Such CPU architectures are going to require novel approaches in thread scheduling since the available shared memory per core is potentially smaller while main memory access time increases and single-threaded performance decreases Even more so than with current CMP chips schedulers that treat such large scale CMP architectures just like SMP systems are going to fail with respect to performance

[28] identifies three major challenges in scheduling on many-core architectures namely the comparatively small cache sizes which render memory access costly the fact that non-specialized programmers will have to program code for them and the wide range of application scenarios that have to be considered for such chips The latter two challenges result from the projected general-purpose use of future many-core architectures

In order to deal with these challenges an OS-independent experimental runtime-environment called McRT is presented in [28] The runtime environment was built from scratch and is

2 Picture taken from [5]

15

Seminar ldquoParallel Computingldquo Summer term 2008

independent of operating system kernels ndash hence the performed operations occur at user-level The connection to the underlying operating system is established using so called host adapters while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors For programmability it provides high level transactions (instead of locks) and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour)

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability Basically the scheduler is configured using three parameters P Q and T which respectively denote the number of (logical) processors task queues and threading abstractions P and Q change the scheduling behaviour from strict affinity to work stealing T can be used to specify different behaviour for different threads This way the concept of scheduler domains can be realized

It is notable that the scheduling system does not use pre-emption instead threads are expected to yield at certain times This design choice has been motivated by the authorsrsquo belief that pre-emption stems from the need of time-sharing expensive and fast resources which will become obsolete with many-core architectures The runtime actively supports constructs such as barriers which are often needed in parallel programming Barriers are designed to avoid busy waiting ndash for example a thread yields once it has reached the barrier but wonrsquot be re-scheduled until all other threads have reached the barrierWith a pre-emptive scheduling mechanism the thread would receive the context from time to time just to check whether other threads have reached the barrier ndash with the integrated barrier support based on a co-operative scheduling approach used in McRT this wonrsquot happen

The client-side adaptor eg for OpenMP promises to directly translate many OpenMP constructs to the McRT API

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications such as the XviD MPEG4 encoder singular value decomposition (SVD) and neural networks (SOM) The results were gathered on a cycle-accurate many-core simulator with 32 Kbyte L1 cache shared among 4 simultaneous multithreaded cores which form one physical core 32 such physical cores share 2 Mbyte L2 cache and 4 Mbyte off-chip L3 cache The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and the resource it is waiting for Only if the resource becomes available the CPU will execute the thread Hence threads that are waiting barriers and locks donrsquot consume system resources It is important to consider that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT

16

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 7 Speedup of XviD encoding in the McRT framework (compared to single core performance)3

The experiments reveal that XviD encoding scales very well on McRT (Figure 7 1080p and 768p denote different video resolutions the curve labelled ldquolinearrdquo models the ideal speedup)

However the encoding process was explicitly tweaked for the many-core scenario ndash under the condition that only very little fast memory exists per logical core parallelization wasnrsquot conducted at frame level ndash instead single frames were partitioned into parts of frames which were encoded in parallel using OpenMP

The scalability of SVD and SOM is quite similar to that of XviD more figures and evaluations of different scheduling strategies can be found in [28]

3 Picture is from [28]

17

Seminar ldquoParallel Computingldquo Summer term 2008

4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moorersquos law in the future AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moorersquos law can stay valid Operating system schedulers are going to have to adapt to the changing underlying architectures

The scheduler domains of Linux and Solaris add some urgently needed OS-flexibility ndash because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be quite easily integrated into the load-balancing hierarchy however it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves But scheduler domains at least provide the required flexibility at the level of threads

Sadly it has to be said that current Windows schedulers wonrsquot scale with the number of cores or performance asymmetry at any rate Basically the Windows scheduler treats multi-core architecture like SMP systems and hence can not give proper care to the peculiarities of such CPUs like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures) Ty Carlson director of technical strategy at Microsoft even mentioned at a panel discussion that current Windows releases (including Vista) were ldquodesigned to run on 12 maybe 4 processorsrdquo4 but wouldnrsquot scale beyond He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned

Current research shows the road such a redesign could follow The approach described in section 33 seems to perform pretty well for multi-core processors with asymmetric performance The advantage is that the load balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adaption Experimental evaluation of the scheduling algorithm reveals promising results also with regard to fairness

Section 34 presents the way in which futuristic schedulers on upcoming many-core architectures could operate The runtime environment McRT makes use of interesting techniques and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures However their implementation is realized in user-space and burdens the programmeruser with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance[29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT experimental assessments which could testify on its performance although planned havenrsquot been conducted yet It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures since they might gain fundamentally in importance in the future

Achieving fairness and repeatability on todayrsquos available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 31 and 32 The first approach is justified by a number of experimental results that show that priorities are actually enforced much better than with conventional schedulers however it remains to be seen 4 See httpwwwnewscom8301-10784_3-9722524-7html

18

Seminar ldquoParallel Computingldquo Summer term 2008

whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level if possible

The algorithm mentioned in section 32 hasnrsquot been implemented yet so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice From the description one can figure that it takes a lot of computations and thread migrations in order to ensure the load-balance and it would be interesting to see the overhead from the computations and the cache-misses imposed by the mechanism on the system Without any experimental data those figures are hard to assess

19

Seminar ldquoParallel Computingldquo Summer term 2008

5 References

[1] G E Moore bdquoCramming more components onto integrated circuitsldquo Electronics Volume 38 Number 8 1965

[2] bdquoWhy Parallel Processingldquo httpwwwtccornelleduServicesEducationTopicsParallelConcepts2+Why+Parallel+Processinghtm

[3] O Wechsler bdquoInside Intelreg Coretrade Microarchitectureldquo Intel Technology Whitepaper

httpdownloadintelcomtechnologyarchitecturenew_architecture_06pd f

[4] bdquoKey Architectural Features AMD Athlontrade X2 Dual-Core Processorsldquo httpwwwamdcomus-enProcessorsProductInformation030_118_9485_130415E1304300html

[5] Li et al bdquoEfficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architecturesldquo In International conference on high performance computing networking storage and analysis 2007

[6] Balakrishnan et al bdquoThe Impact of Performance Asymmetry in Emerging Multicore Architecturesrdquo In Proceedings of the 32nd Annual International Symposium on Com-puter Architecture pages 506ndash517 June 2005

[7] M Annavaram E Grochowski and J Shen ldquoMitigating Amdahlrsquos law through EPI throttlingrdquo In Proceedings of the 32nd Annual International Symposium on ComputerArchitecture pages 298ndash309 June 2005

[8] V Pallipadi SB Siddha ldquoProcessor Power Management features and ProcessScheduler Do we need to tie them togetherrdquo In LinuxConf Europe 2007

[9] SB Siddha ldquoMulti-core and Linux Kernelrdquo httpossintelcompdfmclinuxpdf

[10] httpkernelnewbiesorgLinux_2_6_23

[11] httplwnnetArticles230574

[12] J Andrews ldquoLinux The Completely Fair Schedulerldquo httpkerneltraporgnode8059

[13] A Kumar ldquoMultiprocessing with the Completely Fair Schedulerrdquohttpwwwibmcomdeveloperworkslinuxlibraryl-cfsindexhtml

[14] Kernel 267 Changelog httpwwwkernelorgpublinuxkernelv26ChangeLog-267

[15] Scheduling domainshttplwnnetArticles80911

[16] Kernel 2617 Changeloghttpwwwkernelorgpublinuxkernelv26ChangeLog-2617

20

Seminar ldquoParallel Computingldquo Summer term 2008

[17] T Kidd ldquoC-states C-states and even more C-statesrdquo httpsoftwareblogsintelcom20080327update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Schedulinghttpwwwprincetonedu~unixSolaristroubleshootschedulehtml

[19] Solaris manpage FSS(7)httpdocssuncomappdocsdoc816-5177fss-7l=deampa=viewampq=FSS

[20] Eric Saxe ldquoCMT and Solaris Performancerdquohttpblogssuncomesaxeentrycmt_performance_enhancements_in_solaris

[21] MSDN section on Windows schedulinghttpmsdnmicrosoftcomen-uslibraryms68509628VS8529aspx

[22] A Fedorova M Seltzer and M D Smith ldquoCache-Fair Thread Scheduling for Multicore Processorsrdquo Technical Report TR-17-06 Harvard University Oct 2006

[23] S Kim D Chandra and Y Solihin ldquoFair Cache Sharing and Partitioning in a Chip Multiprocessor Architecturerdquo In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques 2004

[24] D Menasce and V Almeida ldquoCost-Performance Analysis of Heterogeneity in Supercomputer Architecturesrdquo In Proceedings of the 4th International Conference on Supercomputing June 1990

[25] M D Hill and M R Marty ldquoAmdahlrsquos Law in the Multicore Erardquo In IEEE Computer 2008

[26] B Crepps ldquoImproving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessingrdquo Intel Technology Magazine httpwwwintelcomtechnologymagazineresearchpower-efficiency-0206htm

[27] A Fedorova D Vengerov and D Doucette ldquoOperating System Scheduling on Heterogeneous Core Systemsrdquo to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures 2007

[28] B Saha et al ldquoEnabling Scalability and Performance in a Large Scale CMP Environmentrdquo Proceedings of the 2nd ACM SIGOPSEuroSys European Conference on Computer Systems 2007

[29] M Rajagopalan B T Lewis and T A Anderson ldquoThread Scheduling for Multi-Core Platformsrdquo in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems 2007

21

  • 1 Introduction
    • 11Why use multi-core processors at all
    • 12Whatrsquos so different about multi-core scheduling
      • 2 OS process scheduling state of the art
        • 21 Linux scheduler
          • 211The Completely Fair Scheduler
          • 212Scheduling domains
            • 22 Windows scheduler
            • 23Solaris scheduler
              • 3 Ongoing research on the topic
                • 31Cache-Fairness
                • 32Balancing core assignment
                • 33Performance asymmetry
                • 34Scheduling on many-core architectures
                  • 4 Conclusion
                  • 5 References
Page 16: Operating System Scheduling on multi-core architectures

Seminar ldquoParallel Computingldquo Summer term 2008

independent of operating system kernels ndash hence the performed operations occur at user-level The connection to the underlying operating system is established using so called host adapters while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors For programmability it provides high level transactions (instead of locks) and the heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour)

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability Basically the scheduler is configured using three parameters P Q and T which respectively denote the number of (logical) processors task queues and threading abstractions P and Q change the scheduling behaviour from strict affinity to work stealing T can be used to specify different behaviour for different threads This way the concept of scheduler domains can be realized

It is notable that the scheduling system does not use pre-emption instead threads are expected to yield at certain times This design choice has been motivated by the authorsrsquo belief that pre-emption stems from the need of time-sharing expensive and fast resources which will become obsolete with many-core architectures The runtime actively supports constructs such as barriers which are often needed in parallel programming Barriers are designed to avoid busy waiting ndash for example a thread yields once it has reached the barrier but wonrsquot be re-scheduled until all other threads have reached the barrierWith a pre-emptive scheduling mechanism the thread would receive the context from time to time just to check whether other threads have reached the barrier ndash with the integrated barrier support based on a co-operative scheduling approach used in McRT this wonrsquot happen

The client-side adaptor eg for OpenMP promises to directly translate many OpenMP constructs to the McRT API

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications such as the XviD MPEG4 encoder singular value decomposition (SVD) and neural networks (SOM) The results were gathered on a cycle-accurate many-core simulator with 32 Kbyte L1 cache shared among 4 simultaneous multithreaded cores which form one physical core 32 such physical cores share 2 Mbyte L2 cache and 4 Mbyte off-chip L3 cache The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and the resource it is waiting for Only if the resource becomes available the CPU will execute the thread Hence threads that are waiting barriers and locks donrsquot consume system resources It is important to consider that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT

16

Seminar ldquoParallel Computingldquo Summer term 2008

Figure 7 Speedup of XviD encoding in the McRT framework (compared to single core performance)3

The experiments reveal that XviD encoding scales very well on McRT (Figure 7 1080p and 768p denote different video resolutions the curve labelled ldquolinearrdquo models the ideal speedup)

However the encoding process was explicitly tweaked for the many-core scenario ndash under the condition that only very little fast memory exists per logical core parallelization wasnrsquot conducted at frame level ndash instead single frames were partitioned into parts of frames which were encoded in parallel using OpenMP

The scalability of SVD and SOM is quite similar to that of XviD more figures and evaluations of different scheduling strategies can be found in [28]

3 Picture is from [28]

17

Seminar ldquoParallel Computingldquo Summer term 2008

4 Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moorersquos law in the future AMPs and many-core CPUs are just two proposals for innovative new architectures that may help in prolonging the time horizon within which Moorersquos law can stay valid Operating system schedulers are going to have to adapt to the changing underlying architectures

The scheduler domains of Linux and Solaris add some urgently needed OS-flexibility ndash because computing architectures that exhibit different behaviours with regard to memory access or single-threaded performance can be quite easily integrated into the load-balancing hierarchy however it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves But scheduler domains at least provide the required flexibility at the level of threads

Sadly it has to be said that current Windows schedulers wonrsquot scale with the number of cores or performance asymmetry at any rate Basically the Windows scheduler treats multi-core architecture like SMP systems and hence can not give proper care to the peculiarities of such CPUs like the shared L2 cache (and possibly the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures) Ty Carlson director of technical strategy at Microsoft even mentioned at a panel discussion that current Windows releases (including Vista) were ldquodesigned to run on 12 maybe 4 processorsrdquo4 but wouldnrsquot scale beyond He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned

Current research shows the road such a redesign could follow The approach described in section 33 seems to perform pretty well for multi-core processors with asymmetric performance The advantage is that the load balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adaption Experimental evaluation of the scheduling algorithm reveals promising results also with regard to fairness

Section 34 presents the way in which futuristic schedulers on upcoming many-core architectures could operate The runtime environment McRT makes use of interesting techniques and the authors of the paper manage to intelligibly explain why pre-emptive scheduling is going to be obsolete on many-core architectures However their implementation is realized in user-space and burdens the programmeruser with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance[29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT experimental assessments which could testify on its performance although planned havenrsquot been conducted yet It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures since they might gain fundamentally in importance in the future

Achieving fairness and repeatability on todayrsquos available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 31 and 32 The first approach is justified by a number of experimental results that show that priorities are actually enforced much better than with conventional schedulers however it remains to be seen 4 See httpwwwnewscom8301-10784_3-9722524-7html

18

Seminar ldquoParallel Computingldquo Summer term 2008

whether the complexity of the scheduling approach and the amount of overhead potentially introduced by it justify that improvement Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level if possible

The algorithm mentioned in section 32 hasnrsquot been implemented yet so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice From the description one can figure that it takes a lot of computations and thread migrations in order to ensure the load-balance and it would be interesting to see the overhead from the computations and the cache-misses imposed by the mechanism on the system Without any experimental data those figures are hard to assess

19

Seminar ldquoParallel Computingldquo Summer term 2008

5 References

[1] G. E. Moore, "Cramming more components onto integrated circuits", Electronics, Volume 38, Number 8, 1965

[2] "Why Parallel Processing", http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2+Why+Parallel+Processing.htm

[3] O. Wechsler, "Inside Intel® Core™ Microarchitecture", Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf

[4] "Key Architectural Features: AMD Athlon™ X2 Dual-Core Processors", http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html

[5] Li et al., "Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures", in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2007

[6] Balakrishnan et al., "The Impact of Performance Asymmetry in Emerging Multicore Architectures", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506–517, June 2005

[7] M. Annavaram, E. Grochowski and J. Shen, "Mitigating Amdahl's law through EPI throttling", in Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298–309, June 2005

[8] V. Pallipadi, S. B. Siddha, "Processor Power Management features and Process Scheduler: Do we need to tie them together?", in LinuxConf Europe, 2007

[9] S. B. Siddha, "Multi-core and Linux Kernel", http://oss.intel.com/pdf/mclinux.pdf

[10] http://kernelnewbies.org/Linux_2_6_23

[11] http://lwn.net/Articles/230574/

[12] J. Andrews, "Linux: The Completely Fair Scheduler", http://kerneltrap.org/node/8059

[13] A. Kumar, "Multiprocessing with the Completely Fair Scheduler", http://www.ibm.com/developerworks/linux/library/l-cfs/index.html

[14] Kernel 2.6.7 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7

[15] Scheduling domains, http://lwn.net/Articles/80911/

[16] Kernel 2.6.17 Changelog, http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17

[17] T. Kidd, "C-states, C-states and even more C-states", http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states

[18] Solaris 10 Process Scheduling, http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html

[19] Solaris manpage FSS(7), http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS

[20] Eric Saxe, "CMT and Solaris Performance", http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris

[21] MSDN section on Windows scheduling, http://msdn.microsoft.com/en-us/library/ms685096(VS.85).aspx

[22] A. Fedorova, M. Seltzer and M. D. Smith, "Cache-Fair Thread Scheduling for Multicore Processors", Technical Report TR-17-06, Harvard University, Oct. 2006

[23] S. Kim, D. Chandra and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004

[24] D. Menasce and V. Almeida, "Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures", in Proceedings of the 4th International Conference on Supercomputing, June 1990

[25] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era", in IEEE Computer, 2008

[26] B. Crepps, "Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing", Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm

[27] A. Fedorova, D. Vengerov and D. Doucette, "Operating System Scheduling on Heterogeneous Core Systems", to appear in Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007

[28] B. Saha et al., "Enabling Scalability and Performance in a Large Scale CMP Environment", in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007

[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson, "Thread Scheduling for Multi-Core Platforms", in Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007
