Scheduler Implementation in Linux Kernel

Meher Chaitanya

NVIDIA Confidential

Agenda

Introduction

Linux schedulers
  O(1) Scheduler
  Completely Fair Scheduler (2.6.23)

Supporting data structures
  Task Structure
  Run Queue
  Schedule Class and Schedule Entity

Process State Transition Diagram

Invocation of scheduler

Scheduler function Implementation

Conclusion

Scheduler

Part of the kernel that controls allocation of the CPU to processes

Based on its policy algorithms, decides which process is allowed to run, when, and for how long

Provides the basis for a multitasking OS

Attempts to satisfy two major goals:
  Maximum system utilization (higher throughput)
  Fast response time (low latency)

Scheduling can be activated in two ways:
  When a task goes to sleep or yields the CPU voluntarily
  Periodically, via the timer interrupt

O(1) Scheduler

Introduced in the Linux 2.6 kernel

Two run queues (priority arrays) per CPU, one active and one expired. Each consists of one linked list per priority level

140 priority levels in total: the first 100 for real-time tasks, the last 40 for normal tasks

Only the highest-priority non-empty list needs to be examined to schedule the next task

Task insertion, deletion, and search all take O(1) time
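
The constant-time lookup can be sketched as follows. This is a simplified illustration, not the kernel's code: the `prio_array` layout, the helper names, and the per-level counter standing in for each linked list are assumptions, and `__builtin_ctzll` is a GCC/Clang builtin. A bitmap marks which priority levels are non-empty, so finding the next level to run costs a fixed number of word checks regardless of how many tasks are queued.

```c
#include <stdint.h>
#include <string.h>

#define MAX_PRIO 140                     /* 0..99 real-time, 100..139 normal */
#define BITMAP_WORDS ((MAX_PRIO + 63) / 64)

/* Hypothetical, simplified priority array: a count per level stands in
 * for the per-priority linked list of tasks. */
struct prio_array {
    uint64_t bitmap[BITMAP_WORDS];       /* bit p set => level p is non-empty */
    int      count[MAX_PRIO];
};

static void prio_array_init(struct prio_array *a)
{
    memset(a, 0, sizeof(*a));
}

static void prio_enqueue(struct prio_array *a, int prio)
{
    a->count[prio]++;
    a->bitmap[prio / 64] |= (uint64_t)1 << (prio % 64);
}

static void prio_dequeue(struct prio_array *a, int prio)
{
    if (a->count[prio] > 0 && --a->count[prio] == 0)
        a->bitmap[prio / 64] &= ~((uint64_t)1 << (prio % 64));
}

/* Returns the highest-priority (lowest-numbered) non-empty level, or -1.
 * The loop bound is a compile-time constant, so the cost does not depend
 * on the number of queued tasks. */
static int prio_first(const struct prio_array *a)
{
    for (int w = 0; w < BITMAP_WORDS; w++)
        if (a->bitmap[w])
            return w * 64 + __builtin_ctzll(a->bitmap[w]);
    return -1;
}
```

With tasks queued at levels 120, 5, and 130, `prio_first` reports level 5; once that level drains, it reports 120.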

O(1) Scheduler

The scheduler inserts each runnable task into the active run queue

When a task runs out of its time slice, it is preempted, removed from the active run queue, and inserted into the expired run queue

When the active run queue becomes empty, the active and expired run queues swap pointers, so the empty run queue becomes the expired run queue

Priorities and time slices of normal tasks are dynamically recalculated based on their characteristics (I/O-bound or CPU-bound) as each task expires, so no bulk recalculation is needed when the two run queues are swapped
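
The active/expired mechanism can be sketched as a pair of pointers into two arrays. The types and function names below are hypothetical and heavily reduced (a counter per level replaces the task lists); the point is only that the swap is a pointer exchange, independent of the number of tasks.

```c
#define MAX_PRIO 140

/* Hypothetical, simplified types: nr_active counts queued tasks. */
struct prio_array {
    int nr_active;
    int count[MAX_PRIO];
};

struct runqueue {
    struct prio_array *active;
    struct prio_array *expired;
    struct prio_array  arrays[2];
};

static void rq_init(struct runqueue *rq)
{
    for (int i = 0; i < 2; i++) {
        rq->arrays[i].nr_active = 0;
        for (int p = 0; p < MAX_PRIO; p++)
            rq->arrays[i].count[p] = 0;
    }
    rq->active  = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

static void enqueue(struct prio_array *a, int prio)
{
    a->count[prio]++;
    a->nr_active++;
}

/* A task used up its time slice: move it from active to expired. */
static void expire_task(struct runqueue *rq, int prio)
{
    rq->active->count[prio]--;
    rq->active->nr_active--;
    enqueue(rq->expired, prio);
}

/* Swap the two arrays once the active one has drained.
 * Returns 1 if a swap happened, 0 otherwise. */
static int swap_if_empty(struct runqueue *rq)
{
    if (rq->active->nr_active != 0)
        return 0;
    struct prio_array *tmp = rq->active;
    rq->active  = rq->expired;
    rq->expired = tmp;
    return 1;
}
```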

Completely Fair Scheduler

• Introduced in Linux kernel 2.6.23

• Maintains balance (fairness) in providing processor time to tasks

• Tracks the amount of time provided to each task in its virtual runtime

• The smaller a task's virtual runtime, the higher its need for the processor

Completely Fair Scheduler

Introduced the concept of a time-ordered red-black tree to maintain the run queue

Self-balancing

Insertion, deletion, and search take O(log n)

Tasks are sorted in increasing order of virtual runtime

Virtual runtime is computed by the following formula:

$$\text{VirtualRuntime}_T = \frac{W_0}{W_T} \times \text{ActualRuntime}_T$$

$W_0$: weight of nice value 0

$W_T$: weight of task T
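
As a worked illustration of the formula, the sketch below scales an actual runtime by W0 / WT. The helper names are made up for this example; the three weights are entries of the kernel's nice-to-weight table (nice -5, 0, and 5), and only those three levels are covered here.

```c
#include <stdint.h>

#define NICE_0_WEIGHT 1024u   /* W0: weight of a nice-0 task */

/* Three entries of the kernel's nice-to-weight table. The helper itself
 * and its restriction to three levels are illustrative only. */
static uint32_t weight_for_nice(int nice)
{
    switch (nice) {
    case -5: return 3121;
    case  0: return 1024;
    case  5: return 335;
    default: return 0;        /* not covered by this sketch */
    }
}

/* Virtual runtime accrued for delta_exec_ns of actual runtime:
 * (W0 / WT) * actual runtime, using integer arithmetic. */
static uint64_t vruntime_delta(uint64_t delta_exec_ns, uint32_t weight)
{
    return delta_exec_ns * NICE_0_WEIGHT / weight;
}
```

A nice-0 task accrues virtual runtime at the same rate as real time. A heavier task (nice -5) accrues it more slowly and therefore stays further left in the tree; a lighter task (nice 5) accrues it faster.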

CFS

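Time slice: the time interval for which a task is allowed to run without being preempted

Task T's time slice is proportional to its weight:

$$\text{TimeSlice}_T = \frac{W_T}{\sum_{i \in Q} W_i} \times W_L$$

$Q$: the set of runnable tasks

$W_L$: a constant for the given workload (the scheduling period)

$$W_L = \begin{cases} \text{sched\_latency} & \text{if } n \le \text{nr\_latency} \\ \text{min\_granularity} \times n & \text{otherwise} \end{cases}$$

$n$: the number of runnable tasks

In the current Linux implementation: sched_latency = 6 ms, nr_latency = 8, min_granularity = 0.75 ms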
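
A minimal sketch of the time-slice computation, using the default values quoted on this slide. The function and macro names are chosen for this example, and times are kept in microseconds so the arithmetic stays integral.

```c
#include <stdint.h>

/* Defaults quoted above, expressed in microseconds. */
#define SCHED_LATENCY_US    6000u
#define MIN_GRANULARITY_US   750u
#define NR_LATENCY             8u

/* Scheduling period W_L for n runnable tasks. */
static uint64_t sched_period_us(unsigned int n)
{
    if (n > NR_LATENCY)
        return (uint64_t)n * MIN_GRANULARITY_US;
    return SCHED_LATENCY_US;
}

/* Time slice of one task: its weight's share of the period. */
static uint64_t time_slice_us(uint32_t weight, uint64_t total_weight,
                              unsigned int n)
{
    return sched_period_us(n) * weight / total_weight;
}
```

With two nice-0 tasks (weight 1024 each) the period is 6 ms and each task gets 3 ms. With ten runnable tasks the period stretches to 7.5 ms so that no slice falls below the minimum granularity when weights are equal.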

Completely Fair Scheduler

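[Figure: red-black tree of tasks keyed by virtual runtime. Keys shown: 2, 22, 27, 31, 34, 37, 44, 45, 47, 51; all leaves are NIL. Nodes towards the left (smaller virtual runtime) have the most need of the CPU; nodes towards the right have less.]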

CFS- Algorithm

On each scheduling tick, CFS subtracts the tick period from the currently running task's time slice

When the time slice reaches 0, the NEED_RESCHED flag is set

CFS updates the virtual runtime of the currently running task

Once the virtual runtime is computed, CFS checks the NEED_RESCHED flag

If the flag is set, CFS schedules the task with the smallest virtual runtime in the run queue (the left-most node in the red-black tree)
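
The steps above can be sketched as a small simulation. Everything here is a simplification made for illustration: the `task` and `cfs_sim` types are hypothetical, and a linear scan over an array stands in for taking the left-most node of the red-black tree.

```c
#include <stdint.h>
#include <stdbool.h>

#define NICE_0_WEIGHT 1024u

/* Hypothetical, simplified task: not the kernel's task_struct. */
struct task {
    uint32_t weight;      /* load weight */
    uint64_t vruntime;    /* weighted runtime, ns */
    int64_t  slice_left;  /* remaining time slice, ns */
};

struct cfs_sim {
    struct task *tasks;
    int          nr;
    int          curr;          /* index of the running task */
    bool         need_resched;
};

/* Stand-in for "left-most node of the red-black tree": linear scan. */
static int pick_min_vruntime(const struct cfs_sim *s)
{
    int best = 0;
    for (int i = 1; i < s->nr; i++)
        if (s->tasks[i].vruntime < s->tasks[best].vruntime)
            best = i;
    return best;
}

/* One scheduler tick of length tick_ns; slice_ns is the slice handed
 * to whichever task is picked next. */
static void tick(struct cfs_sim *s, int64_t tick_ns, int64_t slice_ns)
{
    struct task *t = &s->tasks[s->curr];

    /* Charge the tick against the running task's time slice. */
    t->slice_left -= tick_ns;
    if (t->slice_left <= 0)
        s->need_resched = true;

    /* Update the running task's virtual runtime. */
    t->vruntime += (uint64_t)tick_ns * NICE_0_WEIGHT / t->weight;

    /* If a reschedule was requested, run the task with the smallest
     * virtual runtime next. */
    if (s->need_resched) {
        s->curr = pick_min_vruntime(s);
        s->tasks[s->curr].slice_left = slice_ns;
        s->need_resched = false;
    }
}
```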

Scheduler – Supporting Data Structures

Supporting data structures:
  Task Structure
  Run Queue
  Scheduler Entity
  Scheduler Class

Process Descriptor

The kernel creates a process descriptor for each task

Defined by the task_struct structure

When a process or thread is created, the kernel allocates a new task_struct for it

The kernel stores the list of processes in a circular doubly linked list called the task list

Each element of the task list is a process descriptor of type struct task_struct

task_struct is defined in linux/sched.h

struct task_struct {
	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
	int prio, static_prio, normal_prio;
	unsigned int flags;
	unsigned int rt_priority;
	const struct sched_class *sched_class;
	struct sched_entity se;
	struct sched_rt_entity rt;
	unsigned int policy;
	cpumask_t cpus_allowed;
	/* ... remaining fields omitted ... */
};

Task_struct - State

state: describes the current state of the process

TASK_RUNNING: The process is runnable; it is either currently running or on a run queue waiting to run.

TASK_INTERRUPTIBLE: The process is sleeping (that is, it is blocked), waiting for some condition to exist. It becomes runnable when the condition occurs or when it receives a signal.

TASK_UNINTERRUPTIBLE: The process is sleeping. It does not wake up and become runnable if it receives a signal.

TASK_ZOMBIE: The task has terminated, but its parent has not yet issued a wait() system call.

TASK_STOPPED: Process execution has stopped; the task is not running, nor is it eligible to run.

Task_struct – Priority Fields

Priority fields:
  prio and normal_prio: indicate the dynamic priority
  rt_priority: denotes the priority of a real-time process
  static_prio: the static priority

static_prio is the static priority of a process: the priority assigned to the process when it is started

The scheduler does not change this value while the process runs

It is derived from the nice value, which ranges from -20 to 19

It can be modified explicitly with the nice or sched_setscheduler system calls

Task_struct – Priority Fields

normal_prio: holds the expected priority of a process.

Computed from the static priority and the scheduling policy

In most cases, for non-real-time processes, the values of normal_prio and static_prio are the same.

rt_priority: used for real-time processes.

Competition among real-time tasks is based strictly on rt_priority.

Lowest value: 0; highest value: 99. A higher value corresponds to a higher priority

prio: holds the priority actually considered by the scheduler

Calculating Priority

The kernel uses a simple scale ranging from 0 to 139 to represent priorities internally

Lower values mean higher priority

The range 0 to 99 is reserved for real-time processes

Normal processes use the range 100 to 139

Nice values [-20, 19] are mapped to the range 100 to 139

[Figure: priority scale. Real-time processes occupy 0 to 99; normal processes occupy 100 to 139, corresponding to nice values -20 to 19.]

Calculating Priority

<linux/sched.h>

#define MAX_USER_RT_PRIO 100

#define MAX_RT_PRIO MAX_USER_RT_PRIO

#define MAX_PRIO (MAX_RT_PRIO + 40)

#define DEFAULT_PRIO (MAX_RT_PRIO + 20)

<kernel/sched.c>

#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)

#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)

#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio)
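
To make the mapping concrete, the macros above can be exercised in a standalone file. The macros are copied from the slide; `is_rt_prio` is a hypothetical helper added for this example.

```c
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO      MAX_USER_RT_PRIO
#define MAX_PRIO         (MAX_RT_PRIO + 40)
#define DEFAULT_PRIO     (MAX_RT_PRIO + 20)

#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)
#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)

/* Hypothetical helper: is this internal priority in the real-time range? */
static int is_rt_prio(int prio)
{
    return prio >= 0 && prio < MAX_RT_PRIO;
}
```

Nice -20 maps to internal priority 100, nice 0 to 120 (DEFAULT_PRIO), and nice 19 to 139 (MAX_PRIO - 1).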

Calculating Load

The load of each process is computed from its type and its priority

The function set_load_weight is called to calculate the load of an individual process

The load_weight structure keeps track of the process load:

struct load_weight {
	unsigned long weight, inv_weight;
};

It stores both the load itself (weight) and a second quantity (inv_weight) that is used to perform divisions by the weight
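
The role of inv_weight can be sketched as follows, assuming it holds approximately 2^32 / weight so that a division becomes a multiply and a shift. The function names and the three-entry table (the kernel's weights for nice -1, 0, and 1) are chosen for this example; this is not the kernel's set_load_weight.

```c
#include <stdint.h>

#define MAX_RT_PRIO 100

struct load_weight {
    unsigned long weight, inv_weight;
};

/* Three entries of the kernel's nice-to-weight table (nice -1, 0, 1);
 * the full table covers nice -20..19. */
static const unsigned long weight_table[] = { 1277, 1024, 820 };

/* Sketch of the idea behind set_load_weight for a normal task whose
 * nice value is -1, 0, or 1; returns -1 outside that range. */
static int set_load_weight_sketch(struct load_weight *lw, int static_prio)
{
    int nice = static_prio - MAX_RT_PRIO - 20;
    if (nice < -1 || nice > 1)
        return -1;
    lw->weight = weight_table[nice + 1];
    /* inv_weight ~ 2^32 / weight, so x / weight ~ (x * inv_weight) >> 32 */
    lw->inv_weight = (unsigned long)(((uint64_t)1 << 32) / lw->weight);
    return 0;
}

/* Division by weight using the precomputed inverse. */
static uint64_t div_by_weight(uint64_t x, const struct load_weight *lw)
{
    return (x * lw->inv_weight) >> 32;
}
```

For a nice-0 task (static_prio 120) this gives weight 1024 and inv_weight 4194304, and dividing 2048000 by the weight through the inverse yields 2000.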

Task_struct – Scheduling Fields

sched_class: a pointer to the scheduling class the task belongs to

se (struct sched_entity): the CFS scheduling entity, embedded in task_struct

rt (struct sched_rt_entity): the real-time scheduling entity, embedded in task_struct

policy: holds the scheduling policy of the task

Task Structure – Policy Field

CFS implements three scheduling policies:

SCHED_NORMAL: used for regular tasks
  Each task is assigned a nice value (default 0)
  PRIO = MAX_RT_PRIO + NICE + 20
  Assigned a time slice
  Tasks at the same priority are scheduled round-robin
  Ensures priority and fairness

SCHED_BATCH: well suited for batch jobs
  For compute-intensive tasks
  Time slices are long and processes are scheduled round-robin
  Lowest-priority tasks are batch-processed (nice +19)

Task Structure – Policy Field

SCHED_IDLE: the nice value has no influence
  Extremely low priority (lower than nice +19)
  Not a true idle-time scheduler, in order to avoid priority inversion problems

RT implements two scheduling policies for soft real-time processes: SCHED_FIFO and SCHED_RR

SCHED_FIFO:
  Uses a first-in, first-out mechanism
  Has no time slice: a task runs until it blocks, voluntarily relinquishes the CPU, or is preempted by a higher-priority real-time task
  Priority levels are maintained
  Not preempted by tasks of equal or lower priority

Task Structure – Policy Field

SCHED_RR:
  Uses a round-robin mechanism
  Each task is assigned a time slice and runs until the time slice is exhausted
  Once all RR tasks of a given priority level exhaust their time slices, their time slices are refilled and they continue running
  Priority levels are maintained
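
From user space, these policies are selected through the POSIX scheduling calls in <sched.h>. The sketch below is an illustration only: the two wrapper functions are hypothetical names, and switching to a real-time policy normally requires privileges, so `become_round_robin` may fail with EPERM for an ordinary user.

```c
#define _POSIX_C_SOURCE 200809L
#include <sched.h>

/* Returns 1 if prio is a valid priority for the given real-time
 * policy on this system, 0 otherwise. */
static int rt_prio_valid(int policy, int prio)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);
    if (lo == -1 || hi == -1)
        return 0;
    return prio >= lo && prio <= hi;
}

/* Attempts to switch the calling process to SCHED_RR at the given
 * priority; returns 0 on success, -1 on failure. */
static int become_round_robin(int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    return sched_setscheduler(0, SCHED_RR, &sp);
}
```

Querying the valid range first avoids hard-coding priority bounds, which differ between systems.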

Run Queue

Run queue: defined in kernel/sched.c

Created for each processor

Contains the list of runnable processes on a given processor

Fields:
  nr_running: number of runnable tasks
  nr_switches: number of context switches
  cfs: CFS run queue structure
  rt: real-time run queue structure
  next_balance: timestamp of the next load-balance check
  curr: pointer to the currently running task of this run queue
  idle: pointer to the idle task of this run queue
  lock: spin lock of the run queue

Supporting data structures (defined in kernel/sched.c)

struct rq {
	struct cfs_rq cfs;
	struct rt_rq rt;
	/* ... */
};

struct cfs_rq {
	unsigned long nr_running;
	u64 exec_clock;
	u64 min_vruntime;
	struct rb_root tasks_timeline;
	struct rb_node *rb_leftmost;
	struct sched_entity *curr, *next, *last;
	/* ... */
};

struct rt_rq {
	struct rt_prio_array active;
	unsigned long rt_nr_running;
	u64 rt_time;
	struct { int curr; int next; } highest_prio;
	u64 rt_runtime;
	/* ... */
};

Schedule Class

Extensible hierarchy of scheduler modules

Modules encapsulate scheduling policy details

Modules are called from the scheduler core without the core code assuming too much about them

Implemented through the sched_class structure

Each task belongs to a scheduling class, which determines how the task will be scheduled

sched_class defines a common set of functions that determine the behavior of the scheduler

Schedule Class

Each task refers to its scheduling class through task_struct.sched_class

Two scheduling classes:

Completely Fair Scheduler
  Defined in kernel/sched_fair.c
  Follows the CFS algorithm
  Policies: SCHED_NORMAL, SCHED_BATCH, and SCHED_IDLE

Real-Time Scheduler
  Defined in kernel/sched_rt.c
  Follows the real-time mechanism
  Policies: SCHED_FIFO, SCHED_RR

Schedule Class

sched_class           CFS                     RT
enqueue_task          enqueue_task_fair       enqueue_task_rt
dequeue_task          dequeue_task_fair       dequeue_task_rt
yield_task            yield_task_fair         yield_task_rt
check_preempt_curr    check_preempt_wakeup    check_preempt_curr_rt
pick_next_task        pick_next_task_fair     pick_next_task_rt
task_tick             task_tick_fair          task_tick_rt
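
The dispatch pattern behind this table is a structure of function pointers. The sketch below shows the idea with heavily reduced, hypothetical types and a toy class that holds a single task; the real sched_class has more hooks and different signatures.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced types for illustration only. */
struct task;
struct rq {
    int          nr_running;
    struct task *queued;
};

struct sched_class {
    void         (*enqueue_task)(struct rq *rq, struct task *p);
    void         (*dequeue_task)(struct rq *rq, struct task *p);
    struct task *(*pick_next_task)(struct rq *rq);
};

struct task {
    const struct sched_class *sched_class;
    int id;
};

/* A toy class that can hold a single task. */
static void toy_enqueue(struct rq *rq, struct task *p)
{
    rq->queued = p;
    rq->nr_running++;
}

static void toy_dequeue(struct rq *rq, struct task *p)
{
    (void)p;
    rq->queued = NULL;
    rq->nr_running--;
}

static struct task *toy_pick(struct rq *rq)
{
    return rq->queued;
}

static const struct sched_class toy_sched_class = {
    .enqueue_task   = toy_enqueue,
    .dequeue_task   = toy_dequeue,
    .pick_next_task = toy_pick,
};

/* Core code dispatches through the task's class without knowing
 * which class it is. */
static void activate(struct rq *rq, struct task *p)
{
    p->sched_class->enqueue_task(rq, p);
}

static void deactivate(struct rq *rq, struct task *p)
{
    p->sched_class->dequeue_task(rq, p);
}
```

Adding a new policy then means supplying another `struct sched_class` instance; `activate` and `deactivate` stay unchanged.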

Schedule Class Functions

enqueue_task(): called when a task enters the runnable state
  Puts the scheduling entity into the red-black tree (or list) and increments nr_running

dequeue_task(): called when a task is no longer runnable
  Removes the scheduling entity from the red-black tree (or list)
  Decrements nr_running

yield_task(): dequeue followed by enqueue
  Places the scheduling entity at the rightmost end of the red-black tree or at the end of the list

Schedule Class Functions

check_preempt_curr(): checks whether a task that has entered the runnable state should preempt the currently running task
  Usually called after try_to_wake_up()

pick_next_task(): chooses the most appropriate task eligible to run next
  The task is picked according to the scheduling policy (priority or fairness)

task_tick(): called from scheduler_tick()
  Might lead to a process switch

Scheduler Entity

CFS does not have the notion of a fixed time slice

The scheduler entity keeps track of a task's scheduling information

Includes the rb_node reference, the load weight, and a variety of statistics

sched_entity contains vruntime (a 64-bit field), which indicates the amount of time the task has run and serves as the index for the red-black tree.

Scheduler entities: sched_entity (CFS) and sched_rt_entity (RT)

struct task_struct {
	volatile long state;
	void *stack;
	unsigned int flags;
	int prio, static_prio, normal_prio;
	const struct sched_class *sched_class;
	struct sched_entity se;
};

struct sched_entity {
	struct load_weight load;
	struct rb_node run_node;
	u64 exec_start;
	u64 vruntime;
	u64 sum_exec_runtime;
};

struct rb_node {
	struct rb_node *rb_right;
	struct rb_node *rb_left;
	unsigned long color;
};

struct cfs_rq {
	struct rb_root tasks_timeline;
};
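
Because the rb_node is embedded in sched_entity, and sched_entity is embedded in task_struct, the scheduler can walk from a tree node back to the owning task with pointer arithmetic. The sketch below uses reduced stand-ins for the structures and a local `container_of` macro written in the same spirit as the kernel's; the `entity_of` and `leftmost` helpers are names chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-ins for the kernel structures shown above. */
struct rb_node {
    struct rb_node *rb_left, *rb_right;
};

struct sched_entity {
    struct rb_node run_node;
    uint64_t       vruntime;
};

struct task_struct {
    int                 pid;
    struct sched_entity se;   /* embedded, not a pointer */
};

/* Step back from a member to the structure that contains it. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct sched_entity *entity_of(struct rb_node *node)
{
    return container_of(node, struct sched_entity, run_node);
}

static struct task_struct *task_of(struct sched_entity *se)
{
    return container_of(se, struct task_struct, se);
}

/* Left-most node of a subtree: the entity with the smallest key. */
static struct rb_node *leftmost(struct rb_node *root)
{
    while (root && root->rb_left)
        root = root->rb_left;
    return root;
}
```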

Scheduler Initialization

init/main.c

start_kernel()

sched_init()

rest_init()

kernel_init()

sched_init_smp()

schedule()

Process State Transition Diagram

schedule() Function Invocation

schedule() is reached from:

do_fork()

do_wait()

do_exit()

try_to_wake_up()

Timer interrupt

Process is created

kernel/fork.c

do_fork()

wake_up_new_task()

activate_task()

check_preempt_curr()

resched_task()

Exit

Wait

Try_to_wake_up

kernel/sched.c

try_to_wake_up()

ttwu_queue()

ttwu_do_activate()

ttwu_do_wakeup()

check_preempt_curr()

resched_task()

Timer Interrupt

Schedule Function
