lorraine-cunningham
NVIDIA Confidential
Agenda
Introduction
Linux schedulers: O(1) Scheduler, Completely Fair Scheduler (2.6.23)
Supporting data structures: Task Structure, Run Queue, Schedule Class and Schedule Entity
Process State Transition Diagram
Invocation of scheduler
Scheduler function Implementation
Conclusion
Scheduler
Part of the kernel that controls the allocation of the CPU to processes
Based on its policy algorithms, decides which process is allowed to run, when, and for how long
Provides the basis for a multitasking OS
Attempts to satisfy two major goals:
Maximum system utilization (higher throughput)
Fast response time (low latency)
Scheduling can be activated in two ways:
When a task goes to sleep or yields the CPU voluntarily
Periodically via the timer interrupt
O(1) Scheduler
Introduced in 2.6 Linux Kernel
Two runqueues per CPU, one active, one expired. Each run queue consists of linked lists for priority levels
Total 140 levels, first 100 for real-time tasks, last 40 for normal tasks
Only needs to look at the highest priority list to schedule the next task
Task insertion, deletion, and search all take O(1) time
O(1) Scheduler
The scheduler inserts each runnable task into the active run queue
When a task runs out of its time slice, it is preempted, removed from the active run queue, and inserted into the expired run queue
When the active run queue becomes empty, the active and expired run queues swap pointers, so the empty run queue becomes the expired run queue
Priorities and time slices of normal tasks are dynamically recalculated based on their characteristics (I/O-bound or CPU-bound) when the two run queues are swapped
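The active/expired mechanism described on these slides can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: the names `prio_array`, `runqueue`, `rq_init`, `enqueue`, `expire_task`, and `highest_prio` are hypothetical, each priority level is reduced to a counter instead of a linked list, and the highest level is found with a bounded scan over 140 entries where the kernel uses a bitmap.

```c
#include <assert.h>

#define NUM_PRIO 140            /* 0..99 real-time, 100..139 normal */

/* Hypothetical, simplified priority array: each level only counts its tasks. */
struct prio_array {
    int nr_active;              /* total tasks queued in this array */
    int count[NUM_PRIO];        /* tasks queued at each priority level */
};

struct runqueue {
    struct prio_array *active;  /* tasks that still have time slice left */
    struct prio_array *expired; /* tasks that used up their time slice */
    struct prio_array arrays[2];
};

static void rq_init(struct runqueue *rq)
{
    for (int a = 0; a < 2; a++) {
        rq->arrays[a].nr_active = 0;
        for (int p = 0; p < NUM_PRIO; p++)
            rq->arrays[a].count[p] = 0;
    }
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

static void enqueue(struct prio_array *arr, int prio)
{
    assert(prio >= 0 && prio < NUM_PRIO);
    arr->count[prio]++;
    arr->nr_active++;
}

/* A task at level prio used up its slice: move it from active to expired. */
static void expire_task(struct runqueue *rq, int prio)
{
    assert(rq->active->count[prio] > 0);
    rq->active->count[prio]--;
    rq->active->nr_active--;
    enqueue(rq->expired, prio);
    if (rq->active->nr_active == 0) {   /* active drained: swap the pointers */
        struct prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
}

/* Highest-priority (lowest-numbered) non-empty level, or -1 if none. */
static int highest_prio(const struct runqueue *rq)
{
    for (int p = 0; p < NUM_PRIO; p++)
        if (rq->active->count[p] > 0)
            return p;
    return -1;
}
```

The swap itself touches only two pointers, so its cost does not depend on how many tasks are queued.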
Completely Fair Scheduler
• Introduced in Linux kernel 2.6.23
• Maintains balance (fairness) in providing processor time to tasks
• Tracks the amount of time provided to a given task as its virtual runtime
• The smaller a task's virtual runtime, the higher its need for the processor
Completely Fair Scheduler
Introduced the concept of a time-ordered red-black tree to maintain the run queue
Self-balancing
Insertion, deletion, and search take O(log n)
Tasks are sorted in increasing order of virtual runtime
Virtual runtime is computed by the following formula:
$\text{VirtualRuntime}_T = \frac{W_0}{W_T} \times \text{ActualRuntime}_T$
$W_0$: weight of nice value 0
$W_T$: weight of task T
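As a worked illustration of the virtual runtime formula, the sketch below scales a slice of real runtime by W0 / WT. The helper name `vruntime_delta` is hypothetical; 1024 is the weight the kernel assigns to nice 0, while the other weights used in the usage note are made-up values chosen for easy arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024ULL   /* W0: weight of a nice-0 task */

/*
 * Virtual runtime charged for delta_exec_ns of real runtime.
 * A heavier task (larger weight) is charged less virtual time,
 * so it stays further left in the tree and runs more often.
 */
static uint64_t vruntime_delta(uint64_t delta_exec_ns, uint64_t weight)
{
    assert(weight > 0);
    return delta_exec_ns * NICE_0_WEIGHT / weight;
}
```

With these definitions, 1 ms of real runtime costs a weight-1024 task 1 ms of virtual runtime, a weight-2048 task 0.5 ms, and a weight-512 task 2 ms.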
CFS
Time interval: the interval for which a task is allowed to run without being preempted
Task T's time slice is proportional to its weight:
$\text{TimeSlice}_T = \frac{W_T}{\sum_{i \in Q} W_i} \times W_L$
$Q$: the set of runnable tasks
$W_L$: a constant for the given workload
$W_L = \begin{cases} \text{sched\_latency} & \text{if } n \le \text{nr\_latency} \\ \text{min\_granularity} \times n & \text{otherwise} \end{cases}$
$n$: the number of tasks
In the current Linux implementation, sched_latency is 6 ms, nr_latency is 8, and min_granularity is 0.75 ms
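The time-slice computation can be tried out with the small sketch below. It assumes the period W_L equals sched_latency while the number of tasks is at most nr_latency and grows to min_granularity times n beyond that, using the 6 ms, 8, and 0.75 ms values quoted on the slide. The function names `sched_period_us` and `time_slice_us` are hypothetical, and times are kept in microseconds to avoid fractions.

```c
#include <assert.h>

#define SCHED_LATENCY_US    6000   /* 6 ms */
#define MIN_GRANULARITY_US   750   /* 0.75 ms */
#define NR_LATENCY             8

/* W_L: the period shared by all n runnable tasks. */
static unsigned long sched_period_us(unsigned int n)
{
    if (n > NR_LATENCY)
        return (unsigned long)n * MIN_GRANULARITY_US;
    return SCHED_LATENCY_US;
}

/* Slice of task T: its share of the period, proportional to its weight. */
static unsigned long time_slice_us(unsigned long weight,
                                   unsigned long total_weight,
                                   unsigned int n)
{
    assert(total_weight > 0 && weight <= total_weight);
    return sched_period_us(n) * weight / total_weight;
}
```

For example, four equal tasks of weight 1024 each get 1.5 ms of a 6 ms period, while sixteen tasks stretch the period to 12 ms so that no slice drops below the minimum granularity.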
Completely Fair Scheduler
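[Figure: red-black tree of tasks ordered by virtual runtime, with nodes 2, 22, 27, 31, 34, 37, 44, 45, 47, 51 and NIL leaves. The leftmost node has the most need of the CPU; nodes further to the right have less need.]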
CFS- Algorithm
On each scheduling tick, CFS subtracts the tick period from the currently running task's time slice
When the time slice reaches 0, the NEED_RESCHED flag is set
CFS updates the virtual runtime of the currently running task
Once the virtual runtime is computed, CFS checks the NEED_RESCHED flag
If the flag is set, CFS schedules the task with the smallest virtual runtime in the run queue (the leftmost node in the red-black tree)
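The per-tick steps listed above can be sketched as follows. This is a simplified model, not kernel code: the `task` structure, `task_tick`, and `pick_next` are hypothetical names, the nice-0 weight of 1024 is used for scaling, and a linear scan over an array stands in for reading the leftmost node of the red-black tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal task: just the fields the tick logic needs. */
struct task {
    uint64_t vruntime;      /* weighted runtime so far, in ns */
    int64_t  slice_ns;      /* remaining time slice, in ns */
    int      need_resched;  /* set when the slice is used up */
};

/* One scheduling tick for the currently running task. */
static void task_tick(struct task *curr, uint64_t tick_ns, uint64_t weight)
{
    assert(weight > 0);
    curr->slice_ns -= (int64_t)tick_ns;          /* consume the slice */
    if (curr->slice_ns <= 0)
        curr->need_resched = 1;
    curr->vruntime += tick_ns * 1024 / weight;   /* update virtual runtime */
}

/* Stand-in for "leftmost node": the task with the smallest vruntime. */
static struct task *pick_next(struct task *tasks, size_t n)
{
    struct task *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (best == NULL || tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}
```

A caller would run `task_tick` on the current task at every tick and call `pick_next` only once `need_resched` has been set.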
Scheduler – Supporting Data Structures
Supporting data structures:
Task Structure
Run Queue
Scheduler Entity
Scheduler Class
Process Descriptor
The kernel creates a process descriptor for each task
The descriptor is defined by the task_struct structure
When a process or thread is created, the kernel allocates a new task_struct for it
The kernel stores the list of processes in a circular doubly linked list called the task list
Each element of the task list is a process descriptor of type struct task_struct
task_struct is defined in linux/sched.h
struct task_struct {
    volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
    int prio, static_prio, normal_prio;
    unsigned int flags;
    unsigned int rt_priority;
    const struct sched_class *sched_class;
    struct sched_entity se;
    struct sched_rt_entity rt;
    unsigned int policy;
    cpumask_t cpus_allowed;
    …
};
task_struct - State
state: this field describes the current state of the process.
TASK_RUNNING: The process is runnable; it is either currently running or on a run queue waiting to run.
TASK_INTERRUPTIBLE: The process is sleeping (that is, it is blocked), waiting for some condition to exist.
TASK_UNINTERRUPTIBLE: The process is sleeping. It does not wake up and become runnable if it receives a signal.
TASK_ZOMBIE: The task has terminated, but its parent has not yet issued a wait() system call.
TASK_STOPPED: Process execution has stopped; the task is not running, nor is it eligible to run.
task_struct – Priority Fields
Priority fields:
prio and normal_prio: indicate the dynamic priority
rt_priority: denotes the priority of a real-time process
static_prio: the static priority of a process, assigned when the process is started
The kernel does not change static_prio on its own while the process runs
The static priority corresponds to the nice value, which ranges from -20 to 19
It can be modified only through the nice or sched_setscheduler system calls
task_struct – Priority Fields
normal_prio: holds the expected priority of a process.
Computed from the static priority and the scheduling policy
In most cases, for non-real-time processes, normal_prio and static_prio have the same value.
rt_priority: used for real-time processes.
Competition among real-time tasks is strictly based on rt_priority.
The lowest value is 0 and the highest is 99. The highest value corresponds to the highest priority
prio: holds the priority that the scheduler actually considers
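One way to picture how normal_prio follows from the policy and the other priority fields is the sketch below. It assumes that a real-time task's rt_priority (higher is better) is flipped onto the kernel's internal scale (lower is better) as MAX_RT_PRIO - 1 - rt_priority, and that a normal task simply uses its static_prio. The function name and the `is_rt_policy` parameter are hypothetical simplifications of the policy check.

```c
#include <assert.h>

#define MAX_RT_PRIO 100

/*
 * Expected priority on the internal 0..139 scale.
 * is_rt_policy is nonzero for SCHED_FIFO and SCHED_RR tasks.
 */
static int normal_prio_of(int is_rt_policy, int static_prio, int rt_priority)
{
    if (is_rt_policy) {
        assert(rt_priority >= 0 && rt_priority < MAX_RT_PRIO);
        return MAX_RT_PRIO - 1 - rt_priority;   /* 99 maps to 0, 0 maps to 99 */
    }
    return static_prio;                         /* 100..139 for normal tasks */
}
```

Under this mapping every real-time task lands below 100 and therefore always outranks every normal task.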
Calculating Priority
The kernel uses a simple scale ranging from 0 to 139 to represent priorities internally
Lower values mean higher priority
The range 0 to 99 is reserved for real-time processes
Normal processes use the range 100 to 139
Nice values [-20, 19] are mapped to the range 100 to 139
[Figure: priority scale. Real-time processes occupy 0 to 99; normal processes occupy 100 to 139, corresponding to nice values -20 to 19.]
Calculating Priority
<linux/sched.h>
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO MAX_USER_RT_PRIO
#define MAX_PRIO (MAX_RT_PRIO + 40)
#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
<kernel/sched.c>
#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)
#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)
#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio)
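The macros above can be exercised in user space to check the nice-to-priority mapping. The sketch below copies the definitions from the slide; the wrapper `nice_to_prio` and the predicate `is_rt_prio` are hypothetical helpers added only for this illustration.

```c
#include <assert.h>

#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO      MAX_USER_RT_PRIO
#define MAX_PRIO         (MAX_RT_PRIO + 40)
#define DEFAULT_PRIO     (MAX_RT_PRIO + 20)

#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)
#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)

/* Range-checked wrapper: nice values run from -20 to 19. */
static int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return NICE_TO_PRIO(nice);
}

/* Priorities below MAX_RT_PRIO belong to real-time tasks. */
static int is_rt_prio(int prio)
{
    return prio < MAX_RT_PRIO;
}
```

Nice -20 maps to 100, nice 0 to 120 (DEFAULT_PRIO), and nice 19 to 139 (MAX_PRIO - 1), so the whole nice range fits exactly into the 40 non-real-time levels.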
Calculating Load
The load of each process is computed from its process type and its priority
The function set_load_weight is called to calculate the load of an individual process
The load_weight structure keeps track of the process load:
struct load_weight {
    unsigned long weight, inv_weight;
};
It stores both the load and a second quantity, inv_weight, that is used to perform divisions by the weight
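Why a second field helps with divisions can be shown with a short sketch. It assumes inv_weight holds 2^32 / weight, so that dividing by the weight becomes a multiplication followed by a 32-bit shift. The helpers `set_weight` and `div_by_weight` are hypothetical names, and the sketch ignores the overflow handling a production version would need for large inputs.

```c
#include <assert.h>
#include <stdint.h>

struct load_weight {
    unsigned long weight;
    unsigned long inv_weight;   /* assumed here to hold 2^32 / weight */
};

static void set_weight(struct load_weight *lw, unsigned long weight)
{
    assert(weight >= 2);        /* keeps 2^32 / weight within 32 bits */
    lw->weight = weight;
    lw->inv_weight = (unsigned long)((1ULL << 32) / weight);
}

/*
 * Approximates x / lw->weight without a division instruction.
 * The product x * inv_weight must fit in 64 bits.
 */
static uint64_t div_by_weight(uint64_t x, const struct load_weight *lw)
{
    return (x * (uint64_t)lw->inv_weight) >> 32;
}
```

For a weight of 1024 the inverse is 4194304, and the multiply-and-shift gives exactly 2000 for an input of 2048000; for weights that are not powers of two the result can be off by a small rounding error.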
task_struct – Scheduling Fields
sched_class: a pointer to the scheduling class.
se (struct sched_entity): the CFS scheduling entity, embedded in task_struct
rt (struct sched_rt_entity): the RT scheduling entity, embedded in task_struct
policy: holds the scheduling policy of the task
Task Structure – Policy Field
CFS implements three scheduling policies:
SCHED_NORMAL: used for regular tasks
Each task is assigned a nice value (default 0)
PRIO = MAX_RT_PRIO + NICE + 20
Each task is assigned a time slice
Tasks at the same priority are scheduled round-robin
Ensures priority plus fairness
SCHED_BATCH: well suited for batch jobs
For compute-intensive tasks
Time slices are long and processes are scheduled round-robin
The lowest-priority tasks are batch-processed (nice +19).
Task Structure – Policy Field
SCHED_IDLE: the nice value has no influence
Extremely low priority (lower than nice +19)
Not a true idle-time scheduler, in order to avoid priority inversion problems
The RT class implements two scheduling policies, SCHED_FIFO and SCHED_RR, for soft real-time processes
SCHED_FIFO
Uses a first-in, first-out mechanism
Has no time slice: a task runs until it blocks, voluntarily relinquishes the CPU, or is preempted by a higher-priority real-time task
Priority levels are maintained
Not preempted by tasks of equal or lower priority
Task Structure – Policy Field
SCHED_RR:
Uses a round-robin mechanism
Each task is assigned a time slice and runs until the time slice is exhausted
Once all RR tasks of a given priority level exhaust their time slices, the time slices are refilled and the tasks continue running
Priority levels are maintained
Run Queue
Run queue:
Defined in kernel/sched.c
Created for each processor
Contains the list of runnable processes on a given processor
Fields:
nr_running: number of runnable tasks
nr_switches: number of context switches
cfs: CFS run queue structure
rt: real-time run queue structure
next_balance: timestamp of the next load-balance check
curr: pointer to the currently running task of this run queue
idle: pointer to the idle task of this run queue
lock: spinlock of the run queue
Supporting data structures (selected fields, defined in kernel/sched.c)
struct rq {
    struct cfs_rq cfs;
    struct rt_rq rt;
};
struct cfs_rq {
    unsigned long nr_running;
    u64 exec_clock;
    u64 min_vruntime;
    struct rb_root tasks_timeline;
    struct rb_node *rb_leftmost;
    struct sched_entity *curr, *next, *last;
};
struct rt_rq {
    struct rt_prio_array active;
    unsigned long rt_nr_running;
    u64 rt_time;
    struct { int curr; int next; } highest_prio;
    u64 rt_runtime;
};
Schedule Class
Extensible hierarchy of scheduler modules
Modules encapsulate scheduling policy details
The scheduler core calls the modules without assuming too much about them
Implemented through the sched_class structure
Each task belongs to a scheduling class, which determines how the task will be scheduled
sched_class defines a common set of functions that determine the behavior of the scheduler
Schedule Class
Tasks refer to their scheduling class through task_struct.sched_class
Two scheduling classes:
Completely Fair Scheduler
Defined in kernel/sched_fair.c
Follows the CFS algorithm
Handles SCHED_NORMAL, SCHED_BATCH, and SCHED_IDLE
Real-Time Scheduler
Defined in kernel/sched_rt.c
Follows the real-time mechanism
Handles SCHED_FIFO and SCHED_RR
Schedule Class
Schedule class        CFS                    RT
enqueue_task          enqueue_task_fair      enqueue_task_rt
dequeue_task          dequeue_task_fair      dequeue_task_rt
yield_task            yield_task_fair        yield_task_rt
check_preempt_curr    check_preempt_wakeup   check_preempt_curr_rt
pick_next_task        pick_next_task_fair    pick_next_task_rt
task_tick             task_tick_fair         task_tick_rt
Schedule Class Functions
enqueue_task(): called when a task enters the runnable state
Puts the scheduling entity into the red-black tree or list and increments nr_running
dequeue_task(): called when a task is no longer runnable
Removes the scheduling entity from the red-black tree or list
Decrements nr_running
yield_task(): essentially a dequeue followed by an enqueue
Places the scheduling entity at the rightmost end of the red-black tree or at the end of the list
Schedule Class Functions
check_preempt_curr(): checks whether a task that entered the runnable state should preempt the currently running task
Usually called after the try_to_wake_up() function
pick_next_task(): chooses the most appropriate task eligible to run next
The next task is picked according to the scheduling policy (priority or fairness)
task_tick(): called from the scheduler_tick() function
Might lead to a process switch
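The idea of a scheduling class as a table of function pointers can be sketched in plain C. The structures below are hypothetical, cut-down stand-ins (two hooks instead of the full set, counters instead of real queues); they only show how core code can call into a class through the task's sched_class pointer without knowing which class it is.

```c
#include <assert.h>

struct rq;
struct task;

/* Hypothetical, cut-down scheduling class with two hooks only. */
struct sched_class {
    void (*enqueue_task)(struct rq *rq, struct task *p);
    void (*dequeue_task)(struct rq *rq, struct task *p);
};

struct task {
    const struct sched_class *sched_class;
};

struct rq {
    unsigned long nr_running;       /* all runnable tasks */
    unsigned long cfs_nr_running;   /* runnable CFS tasks */
    unsigned long rt_nr_running;    /* runnable RT tasks */
};

static void enqueue_task_fair(struct rq *rq, struct task *p)
{
    (void)p;
    rq->cfs_nr_running++;
    rq->nr_running++;
}

static void dequeue_task_fair(struct rq *rq, struct task *p)
{
    (void)p;
    assert(rq->cfs_nr_running > 0);
    rq->cfs_nr_running--;
    rq->nr_running--;
}

static void enqueue_task_rt(struct rq *rq, struct task *p)
{
    (void)p;
    rq->rt_nr_running++;
    rq->nr_running++;
}

static void dequeue_task_rt(struct rq *rq, struct task *p)
{
    (void)p;
    assert(rq->rt_nr_running > 0);
    rq->rt_nr_running--;
    rq->nr_running--;
}

static const struct sched_class fair_sched_class = {
    .enqueue_task = enqueue_task_fair,
    .dequeue_task = dequeue_task_fair,
};

static const struct sched_class rt_sched_class = {
    .enqueue_task = enqueue_task_rt,
    .dequeue_task = dequeue_task_rt,
};

/* Core code: dispatches through the class, whichever it is. */
static void activate_task(struct rq *rq, struct task *p)
{
    p->sched_class->enqueue_task(rq, p);
}

static void deactivate_task(struct rq *rq, struct task *p)
{
    p->sched_class->dequeue_task(rq, p);
}
```

Adding a new policy in this model means supplying another table of hooks; `activate_task` and `deactivate_task` stay unchanged.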
Scheduler Entity
CFS does not have a notion of a time slice
The scheduler entity keeps track of a task's scheduling information
It includes the rb_node reference, the load weight, and a variety of statistics
sched_entity contains vruntime (a 64-bit field), which indicates the amount of time the task has run and serves as the index for the red-black tree.
There are two kinds of scheduler entity: sched_entity and sched_rt_entity
struct task_struct {
    volatile long state;
    void *stack;
    unsigned int flags;
    int prio, static_prio, normal_prio;
    const struct sched_class *sched_class;
    struct sched_entity se;
};
struct sched_entity {
    struct load_weight load;
    struct rb_node run_node;
    u64 exec_start;
    u64 vruntime;
    u64 sum_exec_runtime;
};
struct rb_node {
    struct rb_node *rb_right;
    struct rb_node *rb_left;
    unsigned long rb_parent_color; /* parent pointer and color */
};
struct cfs_rq {
    struct rb_root tasks_timeline;
};
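Because run_node is embedded inside sched_entity, code that walks the tree holds only an rb_node pointer and has to step back to the enclosing entity. The sketch below shows that step with offsetof, in the spirit of the kernel's container_of pattern. The reduced structures, the `entity_of` macro, and `leftmost_entity` are hypothetical, and the tree is linked by hand without any rebalancing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced node: only the child links matter for this sketch. */
struct rb_node {
    struct rb_node *rb_left;
    struct rb_node *rb_right;
};

struct sched_entity {
    uint64_t vruntime;          /* key the tree is ordered by */
    struct rb_node run_node;    /* embedded tree node */
};

/* Step back from an embedded run_node to its enclosing sched_entity. */
#define entity_of(node) \
    ((struct sched_entity *)((char *)(node) - offsetof(struct sched_entity, run_node)))

/* Entity with the smallest vruntime: follow left links from the root. */
static struct sched_entity *leftmost_entity(struct rb_node *root)
{
    assert(root != NULL);
    while (root->rb_left != NULL)
        root = root->rb_left;
    return entity_of(root);
}
```

Embedding the node this way means no separate allocation is needed per tree entry, at the cost of the pointer arithmetic hidden in `entity_of`.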
Scheduler Initialization
init/main.c
start_kernel()
sched_init()
rest_init()
kernel_init()
sched_init_smp()
schedule()
schedule() Function Invocation
schedule() is reached from:
do_fork()
do_wait()
do_exit()
try_to_wake_up()
Timer interrupt
Process is created
kernel/fork.c
do_fork()
wake_up_new_task()
activate_task()
check_preempt_curr()
resched_task()
try_to_wake_up
kernel/sched.c
try_to_wake_up()
ttwu_queue()
ttwu_do_activate()
ttwu_do_wakeup()
check_preempt_curr()
resched_task()