CS533 Concepts of Operating Systems

The Performance of Micro-Kernel Based Systems

Micro-kernels and Binary Compatibility

Emulation libraries
o Trampoline mechanism
Single server architecture
Multi-server architecture
o IPC overhead proportional to number of servers (independent protection domains)

Micro-kernels must optimize IPC

Liedtke argues Mach’s overhead is due to poor implementation!

Optimized IPC implementation in L3
o Architectural level
  • System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks.
o Algorithmic level
  • Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages.
o Interface level
  • Unnecessary Copies, Parameter passing.
o Coding level
  • Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.

L3 IPC Performance vs Mach IPC

But Is That Enough?

What is the impact on overall system performance?

Haertig et al. explore the performance and extensibility of an L4-based Linux vs. a Mach-based Linux and native Linux

L4Linux – a micro-kernel based Linux

Fully binary compliant with Linux/X86
No changes to the architecture-independent parts of Linux
No Linux-specific modifications to the L4 kernel

L4Linux – Design & Implementation

Linux implemented as a single Linux server in a μ-kernel task

μ-kernel tasks used for Linux user processes

A single L4 thread in the Linux server handles system calls and page faults. This thread is multiplexed (treated as a virtual CPU)

On booting, the Linux server requests memory from its pager, which maps physical memory into the server’s address space

The Linux server then acts as the pager for the user processes it creates

o L4 converts user-process page faults into an RPC to the Linux server, which maps pages from its address space to the user process

Interrupt Handling

Linux top halves are implemented as one server thread per interrupt source

o L4 converts a hardware interrupt to a message to the appropriate thread

Linux bottom halves all execute in a single high priority thread

Linux interrupt threads have a higher priority than the main thread

o avoids concurrent execution of Linux code on a uniprocessor

System Calls

System calls implemented as IPC between user process and the Linux server

Modified libc.so or libc.a avoid trap instructions and use L4 IPC instead to call the Linux server

User-level exception handler (trampoline) emulates the native system call ‘trap’ instruction for binary compatibility

o L4 redirects the trap to the emulation library, which then uses L4 IPC to call the Linux server

Signals

Each user process has a separate signal-handler thread

The Linux server delivers a signal by sending a message to the user process’s signal-handler thread

The signal-handler causes the user process’s main thread to save its state and enter Linux by manipulating the main thread’s SP and PC

Scheduling

All thread scheduling is done by the L4 kernel

The Linux server’s schedule() routine is only used for multiplexing the Linux server’s main thread across concurrent Linux system calls

Experiment

What is the penalty of using L4Linux?
o Compare L4Linux to native Linux
Does the performance of the underlying micro-kernel matter?
o Compare L4Linux to MkLinux
Does co-location improve performance?
o Compare L4Linux to an in-kernel version of MkLinux

Microbenchmarks

Measured system call overhead on the shortest system call, “getpid()”

System                  Cost
Linux                   223 cycles
L4Linux                 526 cycles
L4Linux (trampoline)    753 cycles
MkLinux (in kernel)     2050 cycles
MkLinux (user)          14710 cycles
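
To make the getpid() figures easier to compare, the short sketch below expresses each cost as a multiple of native Linux. The cycle counts are copied from the slide; the function and variable names are only illustrative.

```python
# Cycle counts for getpid(), copied from the slide.
GETPID_CYCLES = {
    "Linux": 223,
    "L4Linux": 526,
    "L4Linux (trampoline)": 753,
    "MkLinux (in kernel)": 2050,
    "MkLinux (user)": 14710,
}


def slowdown_vs_native(cycles, baseline="Linux"):
    """Return each system's getpid() cost as a multiple of the baseline."""
    base = cycles[baseline]
    return {name: count / base for name, count in cycles.items()}


if __name__ == "__main__":
    for name, factor in slowdown_vs_native(GETPID_CYCLES).items():
        print(f"{name:22s} {GETPID_CYCLES[name]:6d} cycles  {factor:5.1f}x")
```

On these numbers the L4Linux system call costs a little over twice the native one, while user-level MkLinux costs roughly 66 times as much.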

Microbenchmarks (cont.)

Measures specific system calls to determine basic performance.

Macrobenchmarks

measured time to recompile Linux server

Macrobenchmarks (cont.)

Next use a commercial test suite to simulate a system under full load.

Performance Analysis

L4Linux is, on average, 8.3% slower than native Linux, and only 6.8% slower at maximum load.

MkLinux: 49% slower on average, 60% at maximum.

Co-located MkLinux: 29% slower on average, 37% at maximum

Conclusion?

Can hardware-based protection be made to work efficiently enough?

Did these experiments explore the cost of “fine grained” protection?

Spare Slides

The IPC Dilemma

IPC is very important in μ-kernel design
o Increases modularity, flexibility, security and scalability.
Past implementations have been inefficient.
o Message transfer takes 50 - 500 μs.

The L3 (μ-kernel based) OS

A task consists of:
o Threads
  • Communicate via messages that consist of strings and/or memory objects.
o Dataspaces
  • Memory objects.
o Address space
  • Where dataspaces are mapped.

Redesign Principles

IPC performance is the Master.
All design decisions require a performance discussion.
If something performs poorly, look for new techniques.
Synergetic effects have to be taken into consideration.
The design has to cover all levels from architecture down to coding.
The design has to be made on a concrete basis.
The design has to aim at a concrete performance goal.

Achievable Performance

A simple scenario
o Thread A sends a null message to thread B
o Minimum of 172 cycles
Will aim at 350 cycles (7 μs)
o Will actually achieve 250 cycles (5 μs)
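
The cycle and microsecond figures on this slide can be cross-checked with a little arithmetic. The sketch below derives the clock rate implied by the two (cycles, μs) pairs and converts the 172-cycle minimum to time. The 50 MHz value is inferred from the slide's own numbers rather than stated on it, and the function names are illustrative.

```python
def implied_clock_mhz(cycles, microseconds):
    """Clock rate (MHz) implied by a cycle count and its duration in μs."""
    return cycles / microseconds


def cycles_to_us(cycles, clock_mhz):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_mhz


if __name__ == "__main__":
    goal_clock = implied_clock_mhz(350, 7)      # performance goal
    achieved_clock = implied_clock_mhz(250, 5)  # achieved result
    print(goal_clock, achieved_clock)           # both pairs imply the same clock
    print(cycles_to_us(172, goal_clock))        # minimum null-message cost in μs
```

Both pairs imply the same clock rate, so the 172-cycle lower bound corresponds to about 3.4 μs on that machine.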

Levels of the redesign

Architectural
o System Calls, Messages, Direct Transfer, Strict Process Orientation, Control Blocks.
Algorithmic
o Thread Identifier, Virtual Queues, Timeouts/Wakeups, Lazy Scheduling, Direct Process Switch, Short Messages.
Interface
o Unnecessary Copies, Parameter passing.
Coding
o Cache Misses, TLB Misses, Segment Registers, General Registers, Jumps and Checks, Process Switch.

Architectural Level

System Calls
o Expensive! So, require as few as possible.
o Implement two calls:
  • Call
  • Reply & Receive Next
o Combines sending an outgoing message with waiting for an incoming message.
  • Schedulers can handle replies the same as requests.
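
A simple counting model shows why combining send and receive matters. The sketch below lists the kernel entries for a client/server exchange, first with separate send and receive primitives and then with Call and Reply & Receive Next. It is only a model of the idea on this slide: the trace strings and function names are made up for illustration and are not L3 system call names.

```python
def syscalls_separate(requests):
    """Kernel entries when send and receive are separate system calls."""
    trace = []
    for _ in range(requests):
        trace += [
            "client: send",
            "server: receive",
            "server: send",
            "client: receive",
        ]
    return trace


def syscalls_combined(requests):
    """Kernel entries with Call and Reply & Receive Next."""
    trace = ["server: receive"]  # the server waits for its first request
    for _ in range(requests):
        trace += [
            "client: call",                  # send request + wait for reply
            "server: reply & receive next",  # send reply + wait for next request
        ]
    return trace


if __name__ == "__main__":
    n = 10
    print(len(syscalls_separate(n)), "kernel entries with separate primitives")
    print(len(syscalls_combined(n)), "kernel entries with combined primitives")
```

In this model the combined primitives roughly halve the number of kernel entries per round trip.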

Messages

Complex Messages:
o Direct String, Indirect Strings (optional)
o Memory Objects
Used to combine sends if no reply is needed.
Can transfer values directly from sender’s variables to receiver’s variables.

Figure: A Complex Message

Direct Transfer

Each address space has a fixed kernel accessible part.
o Messages transferred via the kernel part
o User A space -> Kernel -> User B space
o Requires 2 copies.
o Larger messages lead to higher costs

Figure: User A, Kernel, User B

Shared User Level memory (LRPC, SRC RPC)
o Security can be penetrated.
o Cannot check message’s legality.
o Long messages -> address space becoming a critical resource.
o Explicit opening of communication channels.
o Not application friendly.

Temporary Mapping

L3 uses a Communication Window
o Only kernel accessible, and exists per address space.
o Target region is temporarily mapped there.
o Then the message is copied to the communication window and ends up in the correct place in the target address space.

Figure: User A, Kernel, User B

Temporary Mapping

Must be fast!
2 level page table only requires one word to be copied.
o pdir A -> pdir B
TLB must be clean of entries relating to the use of the communication window by other operations.
o One thread
  • TLB is always “window clean”.
o Multiple threads
  • Interrupts – TLB is flushed
  • Thread switch – Invalidate Communication window entries.
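
The "one word" claim can be pictured with a toy model of a two-level page table. In the sketch below, page directories are plain Python lists and mapping the window means copying a single directory entry from the receiver's directory into a reserved slot of the sender's. The 4 MiB span per directory entry is the usual figure for 32-bit x86 with 4 KiB pages. The window slot index and all names are assumptions made for illustration, and TLB handling is not modelled.

```python
PDIR_ENTRIES = 1024
PDIR_ENTRY_SPAN = 4 * 1024 * 1024  # bytes covered by one page-directory entry
WINDOW_SLOT = 1023                 # hypothetical directory slot reserved for the window


def map_window(sender_pdir, receiver_pdir, target_addr):
    """Map the receiver's target region into the sender's communication window.

    Copies exactly one page-directory entry and returns the address at which
    the target now appears inside the window.
    """
    index = target_addr // PDIR_ENTRY_SPAN
    sender_pdir[WINDOW_SLOT] = receiver_pdir[index]  # the single word copied
    return WINDOW_SLOT * PDIR_ENTRY_SPAN + target_addr % PDIR_ENTRY_SPAN


if __name__ == "__main__":
    pdir_a = [None] * PDIR_ENTRIES
    pdir_b = [None] * PDIR_ENTRIES
    pdir_b[3] = "page table for B's region 3"
    window_addr = map_window(pdir_a, pdir_b, 3 * PDIR_ENTRY_SPAN + 0x1234)
    print(hex(window_addr), pdir_a[WINDOW_SLOT])
```

After the copy, a kernel write through the window address lands directly in the receiver's memory, which is how the second copy is avoided.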

Strict Process Orientation

Kernel mode handled in same way as User mode
One kernel stack per thread
May lead to a large number of stacks

o Minor problem if stacks are objects in virtual memory

Thread Control Blocks (tcb’s)

Hold kernel, hardware, and thread-specific data.
Stored in a virtual array in shared kernel space.

Figure: User area, Kernel area (tcb, Kernel stack)

Tcb Benefits

Fast tcb access
Saves 3 TLB misses per IPC
Threads can be locked by unmapping the tcb
Helps make threads persistent
IPC independent from memory management

Algorithmic Level

Thread IDs
o L3 uses a 64 bit unique identifier (uid) containing the thread number.
o Tcb address is easily obtained
  • ANDing the lower 32 bits with a bit mask and adding the tcb base address.
Virtual Queues
o Busy queue, present queue, polling-me queue.
o Unmapping the tcb includes removal from queues
  • Prevents page faults from parsing/adding/deleting from the queues.
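
The mask-and-add lookup for thread IDs can be sketched as follows. The slide only says that the lower 32 bits are masked and the tcb base is added. The concrete layout below (tcb size, field widths, base address, and what the upper 32 bits hold) is a hypothetical choice made so the example runs, not L3's actual layout.

```python
TCB_BASE = 0xE000_0000   # hypothetical start of the virtual tcb array
TCB_SIZE_BITS = 10       # hypothetical: 1 KiB per tcb
THREAD_NO_BITS = 16      # hypothetical: up to 65536 threads

# Selects the thread-number field, which is stored pre-shifted so that the
# masked value is already the byte offset of the tcb inside the array.
TCB_MASK = ((1 << THREAD_NO_BITS) - 1) << TCB_SIZE_BITS


def make_uid(thread_no, generation):
    """Build a 64-bit uid (hypothetical layout: generation | shifted thread number)."""
    low = (thread_no << TCB_SIZE_BITS) & 0xFFFF_FFFF
    return (generation << 32) | low


def tcb_address(uid):
    """AND the lower 32 bits with a mask, then add the tcb base address."""
    low32 = uid & 0xFFFF_FFFF
    return TCB_BASE + (low32 & TCB_MASK)


if __name__ == "__main__":
    uid = make_uid(thread_no=3, generation=1)
    print(hex(tcb_address(uid)))
```

The point of such a layout is that the lookup needs no table access: one AND and one ADD turn a thread ID into the address of its control block.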

Algorithmic Level

Timeouts and Wakeups
o Operation fails if message transfer has not started t ms after invoking it.
o Kept in n unordered wakeup lists.
  • A new thread’s tcb is linked into the list τ mod n.
o Threads with wakeups far away are kept in a long time wakeup list and reinserted into the normal lists when time approaches.
o Scheduler will only have to check k/n entries per clock interrupt.
o Usually costs less than 4% of ipc time.
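
The wakeup-list scheme can be sketched in a few lines. In this model a wakeup at tick τ goes into list τ mod n, wakeups at least n ticks away go into a long-term list, and each clock tick inspects only one of the n lists. The class and method names are illustrative. For brevity the long-term list is rescanned on every tick, which a real kernel would presumably do far less often.

```python
class WakeupLists:
    def __init__(self, n):
        self.n = n
        self.lists = [[] for _ in range(n)]  # n unordered wakeup lists
        self.long_term = []                  # wakeups at least n ticks away
        self.now = 0

    def add(self, thread, wakeup_tick):
        """Register a timeout for `thread` that expires at `wakeup_tick`."""
        if wakeup_tick <= self.now:
            raise ValueError("wakeup must be in the future")
        if wakeup_tick - self.now >= self.n:
            self.long_term.append((wakeup_tick, thread))
        else:
            self.lists[wakeup_tick % self.n].append((wakeup_tick, thread))

    def tick(self):
        """Advance the clock by one tick and return the threads that time out."""
        self.now += 1
        # Move long-term entries into the normal lists once they are close.
        still_far = []
        for wakeup_tick, thread in self.long_term:
            if wakeup_tick - self.now < self.n:
                self.lists[wakeup_tick % self.n].append((wakeup_tick, thread))
            else:
                still_far.append((wakeup_tick, thread))
        self.long_term = still_far
        # Only one of the n lists is inspected per tick.
        index = self.now % self.n
        bucket = self.lists[index]
        expired = [t for (w, t) in bucket if w <= self.now]
        self.lists[index] = [(w, t) for (w, t) in bucket if w > self.now]
        return expired


if __name__ == "__main__":
    wakeups = WakeupLists(n=4)
    wakeups.add("A", 3)
    wakeups.add("B", 10)
    for _ in range(10):
        fired = wakeups.tick()
        if fired:
            print(wakeups.now, fired)
```

With k pending timeouts spread over n lists, each tick touches about k/n entries, which matches the cost stated on the slide.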

Algorithmic Level

Lazy Scheduling
o Only a thread state variable is changed (ready/waiting).
o Deletion from queues happens when queues are parsed.
  • Reduces delete operations.
  • Reduces insert operations when a thread needs to be inserted that hasn’t been deleted yet.
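
Lazy scheduling can be illustrated with a toy ready queue. Blocking a thread only flips its state, and stale entries are dropped when the scheduler next walks the queue. A thread that becomes ready again before that happens needs neither a delete nor an insert. This is a sketch of the idea with illustrative names, not L3's data structures.

```python
from collections import deque


class LazyReadyQueue:
    def __init__(self):
        self.queue = deque()
        self.state = {}      # thread -> "ready" or "waiting"
        self.queued = set()  # threads physically present in the queue
        self.inserts = 0
        self.deletes = 0

    def make_ready(self, thread):
        self.state[thread] = "ready"
        if thread not in self.queued:  # a stale entry is reused, no insert needed
            self.queue.append(thread)
            self.queued.add(thread)
            self.inserts += 1

    def make_waiting(self, thread):
        self.state[thread] = "waiting"  # no queue manipulation on the IPC path

    def pick_next(self):
        """Walk the queue, dropping stale (waiting) entries as they are met."""
        while self.queue:
            thread = self.queue.popleft()
            if self.state.get(thread) == "ready":
                self.queue.append(thread)  # rotate to the tail (round robin)
                return thread
            self.queued.discard(thread)
            self.deletes += 1
        return None


if __name__ == "__main__":
    q = LazyReadyQueue()
    q.make_ready("A")
    q.make_ready("B")
    q.make_waiting("A")  # A blocks briefly in an IPC ...
    q.make_ready("A")    # ... and is ready again before the queue is parsed
    print(q.inserts, q.deletes, q.pick_next())
```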

Algorithmic Level

Short messages via registers
o Register transfers are fast
o 50-80% of messages ≤ 8 bytes
o Up to 8 byte messages can be transferred by registers with a decent performance gain.
o May not pay off for other processors.

Interface Level

Unnecessary Copies
o Message objects grouped by types
o Send/receive buffers structured in the same way
o Use same variable for sending and receiving
  • Avoid unnecessary copies
Parameter Passing
o Use registers whenever possible.
  • Far more efficient
  • Give compilers better opportunities to optimize code.

Coding Level

Cache Misses
o Cache line fill sequence should match the usual data access sequence.
TLB Misses
o Try and pack in one page:
  • Ipc related kernel code
  • Processor internal tables
  • Start/end of larger tables
  • Most heavily used entries

Coding Level

Registers
o Segment register loading is expensive.
o One flat segment covering the complete address space.
  • On entry, kernel checks if registers contain the flat descriptor.
  • Guarantees they contain it when returning to user level.
Jumps and Checks
o Basic code blocks should be arranged so that as few jumps are taken as possible.
Process switch
o Save/restore of stack pointer and address space only invoked when really necessary.

L4 Slides

Introduction

μ-kernels have a reputation for being too slow and inflexible

Can a 2nd generation μ-kernel (L4) overcome these limitations?

Experiment:
o Port Linux to run on L4
o Compared to native Linux, MkLinux (Linux on a 1st gen Mach 3.0 derived μ-kernel)

Introduction (cont.)

Test speed of standard OS personality on top of fast μ-kernel: Linux implemented on L4

Test extensibility of system:
o pipe-based communication implemented directly on μ-kernel
o mapping-related OS extensions implemented as user tasks
o user-level real-time memory management implemented

Test if L4 abstractions independent of platform

L4 Essentials

Based on threads and address spaces
Recursive construction of address spaces by user-level servers
o Initial address space σ0 represents physical memory
o Basic operations: granting, mapping, and unmapping.

Owner of address space can grant or map page to another address space

All address spaces maintained by user-level servers (pagers)

L4Linux – Design & Implementation

Fully binary compliant with Linux/X86
Restricted modifications to architecture-dependent part of Linux
No Linux-specific modifications to L4 kernel

L4Linux – Design & Implementation

Address Spaces
o Initial address space σ0 represents physical memory
o Basic operations: granting, mapping, and unmapping.
o L4 uses “flexpages”: logical memory ranging from one physical page up to a complete address space.
o An invoker can only map and unmap pages that have been mapped into its own address space
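
The recursive address-space model can be sketched with a small mapping database. The toy below models only map and unmap (grant is left out for brevity): a space can map a page it holds into another space, and unmapping revokes the page from every space that received it, directly or indirectly, from the unmapper. The class and method names are illustrative and are not the L4 API.

```python
class AddressSpace:
    def __init__(self, name):
        self.name = name
        self.pages = {}     # virtual page -> physical frame
        self.children = {}  # virtual page -> [(space, vpage)] mapped from here

    def map_to(self, vpage, target, target_vpage):
        """Share one of this space's pages with `target`."""
        if vpage not in self.pages:
            raise PermissionError(f"{self.name} does not hold page {vpage}")
        target.pages[target_vpage] = self.pages[vpage]
        self.children.setdefault(vpage, []).append((target, target_vpage))

    def unmap(self, vpage):
        """Revoke the page from all spaces that got it from here, recursively.

        The unmapping space keeps its own mapping.
        """
        for space, child_vpage in self.children.pop(vpage, []):
            space.unmap(child_vpage)
            space.pages.pop(child_vpage, None)


if __name__ == "__main__":
    sigma0 = AddressSpace("sigma0")
    sigma0.pages[0] = 0                # σ0 holds physical frame 0
    server = AddressSpace("linux-server")
    user = AddressSpace("user-process")

    sigma0.map_to(0, server, 5)        # pager hands memory to the Linux server
    server.map_to(5, user, 9)          # server acts as pager for a user process
    print(user.pages)
    sigma0.unmap(0)                    # revocation reaches both levels
    print(server.pages, user.pages)
```

The same pattern is what lets the Linux server act as pager for its user processes while itself being paged by a lower-level pager.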

L4Linux – Design & Implementation

Address Spaces (cont.)
o I/O ports are parts of address spaces.
o Hardware interrupts are handled by user-level processes. The L4 kernel will send a message via IPC.

L4Linux – Design & Implementation

The Linux server
o L4Linux will use a single-server approach.
o A single Linux server will run on top of L4, multiplexing a single thread for system calls and page faults.
o The Linux server maps physical memory into its address space, and acts as the pager for any user processes it creates.
o The Server cannot directly access the hardware page tables, and must maintain logical pages in its own address space.

L4Linux – Design & Implementation

Interrupt Handling
o All interrupt handlers are mapped to messages.
o The Linux server contains threads that do nothing but wait for interrupt messages.
o Interrupt threads have a higher priority than the main thread.

L4Linux – Design & Implementation

User Processes
o Each different user process is implemented as a different L4 task: Has its own address space and threads.

o The Linux Server is the pager for these processes. Any fault by the user-level processes is sent by RPC from the L4 kernel to the Server.

L4Linux – Design & Implementation

System Calls
o Three system call interfaces:
  • A modified version of libc.so that uses L4 primitives.
  • A modified version of libc.a
  • A user-level exception handler (trampoline) calls the corresponding routine in the modified shared library.
o The first two options are the fastest. The third is maintained for compatibility.

L4Linux – Design & Implementation

Signalling
o Each user-level process has an additional thread for signal handling.
o Main server thread sends a message to the signal handling thread, telling the user thread to save its state and enter Linux

L4Linux – Design & Implementation

Scheduling
o All thread scheduling is done by the L4 kernel
o The Linux server’s schedule() routine is only used for multiplexing its single thread.
o After each system call, if no other system call is pending, it simply resumes the user process thread and sleeps.

L4Linux – Design & Implementation

Tagged TLB & Small Space.
o In order to reduce TLB conflicts, L4Linux has a special library to customize code and data for communicating with the Linux Server

o The emulation library and signal thread are mapped close to the application, instead of default high-memory area.

Performance

What is the penalty of using L4Linux?
o Compare L4Linux to native Linux
Does the performance of the underlying micro-kernel matter?
o Compare L4Linux to MkLinux
Does co-location improve performance?
o Compare L4Linux to an in-kernel version of MkLinux

Microbenchmarks

measured system call overhead on shortest system call “getpid()”

Microbenchmarks (cont.)

Measures specific system calls to determine basic performance.

Macrobenchmarks

measured time to recompile Linux server

Macrobenchmarks (cont.)

Next use a commercial test suite to simulate a system under full load.

Performance Analysis

L4Linux is, on average, 8.3% slower than native Linux, and only 6.8% slower at maximum load.

MkLinux: 49% slower on average, 60% at maximum.

Co-located MkLinux: 29% slower on average, 37% at maximum.

Extensibility Performance

A micro-kernel must provide more than just the features of the OS running on top of it.

Specialization – improved implementation of OS functionality

Extensibility – permits implementation of new services that cannot be easily added to a conventional OS.

Pipes and RPC

(1) The first five use the standard pipe mechanism of the Linux kernel.
(2) Is asynchronous and uses only L4 IPC primitives. Emulates POSIX standard pipes, without signalling. Added thread for buffering and cross-address-space communication.
(3) Is synchronous and uses blocking IPC without buffering data.
(4) Maps pages into the receiver’s address space.

Virtual Memory Operations

The “Fault” operation is an example of extensibility – measures the time to resolve a page fault by a user-defined pager in a separate address space.

“Trap” – Latency between a write operation to a protected page, and the invocation of related exception handler.

“Appel1” – Time to access a random protected page. The fault handler unprotects the page, protects some other page, and resumes.

“Appel2” – Time to access a random protected page where the fault handler only unprotects the page and resumes.

Conclusion

Using the L4 micro-kernel imposes a 5-10% slowdown relative to native Linux. This is much faster than previous micro-kernels.

Further optimizations, such as co-locating the Linux Server and providing extensibility, could improve L4Linux even further.