The Evolving Solaris Kernel
Past, Present & Future

Jim Mauro
Senior Staff Engineer - Performance & Availability Engineering
Sun Microsystems, Inc.
400 Atrium Drive, Somerset, NJ 08812
[email protected]

Richard McDougall
Senior Staff Engineer - Performance & Availability Engineering
Sun Microsystems, Inc.
[email protected]

Copyright (c) 2002 Jim Mauro and Richard McDougall, Nov 2002


Agenda
• Introduction
  • Solaris Overview
  • Distribution
  • Releases
  • System Overview & Kernel Features
  • 64-bits
• The Evolution
  • Things added, things changed
  • Tips and tidbits along the way...
• Major Features Review
  • Solaris 7
  • Solaris 8
  • Solaris 9


Introduction
• What is Solaris?
  • A complete operating environment, built on a modular, dynamic kernel
• The Solaris Operating Environment (SOE)
  • SunOS - the kernel (the 5.X thing)
  • Windowing - desktop environment. CDE default, OpenWindows still included
    • GNOME 2 Beta Available
    • GNOME is the strategic direction
  • Open Network Computing (ONC+). NFS (V2 & V3), NIS/NIS+, RPC/XDR, LDAP


Solaris Distribution
• Many CDs in the distribution
  - WEB start CD (Installation)
  - OS bits, disks 1 and 2
  - Software Supplement (more optional bits)
  - Flash PROM Update
  - Maintenance Update
  - Sun Management Center
  - Forte’ Workshop (try n’ buy)
• Bonus Software
  - Software Companion (GNU, etc)
  - StarOffice 6
  - SunONE Advantage Software (2 CDs)
  - Oracle Enterprise Server


Releases
• Base release, followed by quarterly update releases
  • Solaris 8 - released 2/00
  • Solaris 8, 6/00 (update 1)
  • Solaris 8, 10/00 (update 2)
  • Solaris 8, 1/01 (update 3)
  • Solaris 8, 4/01 (update 4)
  • Solaris 8, 7/01 (update 5)
  • Solaris 8, 10/01 (update 6)
  • Solaris 8, 2/02 (update 7)
• Solaris 9 - base release, May, 2002
• The model is designed to
  • Provide predictability for planning
  • Provide a vehicle for getting new features, functionality and patches out in a regular and timely fashion


Releases (cont)

• So, which release am I running?

sunsys> cat /etc/release
                 Solaris 8 6/00 s28s_u1wos_08 SPARC
     Copyright 2000 Sun Microsystems, Inc.  All Rights Reserved.
                      Assembled 26 April 2000
sunsys>

• Check out http://docs.sun.com, “What’s New” document for a specific release


Kernel Features


System Overview

[Figure: block diagram of the kernel. Labels, top to bottom: System Call Interface; Scheduling and Process Management (Thread; scheduling classes TS/IA, RT, FX, FSS); Virtual File System Framework (UFS, NFS, SPECFS); Networking (Sockets, TCP, IP); Virtual Memory System; Kernel Services (Clocks & Timers, Callouts); Hardware Address Translation (HAT); Bus and Device Drivers (SD, SSD); HARDWARE.]


Solaris Kernel Features
• Dynamic Kernel
  • Small core unix modules
  • Major subsystems implemented as dynamically loadable modules (file systems, scheduling classes, STREAMS modules, system calls)
• Dynamic resource sizing & allocation (processes, files, locks, memory, etc)
  • Dynamic sizing based on system size
  • Goal is to minimize/eliminate the need to use /etc/system tunable parameters


Solaris Kernel Features

• Preemptive kernel
  • Does NOT require interrupt disable/blocking via PIL for synchronization
  • Most kernel code paths are preemptable
  • A few non-preemption points in critical code paths
  • SCALABILITY & LOW LATENCY INTERRUPTS
• Well-defined, layered interfaces
  • Module support, synchronization primitives, etc


Solaris Kernel Features

• Multithreaded kernel
  • Kernel threads perform core system services
  • Fine grained locking for concurrency
  • Threaded subsystems
• Multithreaded process model
  • User level threads and synchronization primitives
  • Solaris (UI) & POSIX threads
  • Two-level (M x N) model, evolved to one-level model
    • Alternate thread library in Solaris 8
    • Default thread library in Solaris 9


Solaris Kernel Features
• Table-driven dispatcher with multiple scheduling class support
  • Dynamically loadable/modifiable table values
  • Relatively “easy” to add new scheduling classes
    • FSS and FX in Solaris 9
• Realtime support with preemptive kernel
  • Additional kernel support for realtime applications (memory page locking, asynchronous I/O, processor sets, interrupt control, high-res clock)
• Kernel tuning via text file ( /etc/system, driver.conf )
  • Some things can be done “on the fly”
    • mdb(1)


Solaris Kernel Features
• Tightly integrated virtual memory and file system support
  • Dynamic page cache memory implementation
• Virtual File System (VFS) Implementation
  • Object-like abstractions for files and file systems
  • Facilitates new features/functionality
    • Kernel sockets via sockfs
    • procfs (/proc) enhancements
    • Doors (doorfs)
    • fdfs, swapfs, tmpfs
  • Disk-based, distributed & pseudo file systems


Solaris Kernel Features
• 32-bit and 64-bit kernel
  • 64-bit kernel required for UltraSPARC-III based systems (SunBlade, SunFire)
  • 32-bit apps run just fine...
• Solaris DDI/DKI Implementation
  • Device driver interfaces
  • Includes interfaces for dynamic attach/detach/pwr
• Rich set of standards-compliant interfaces
  • POSIX, UNIX International


Solaris Kernel Features
• Integrated networking facilities
  • TCP/IP
    • IPv4, IPSec, IPv6
  • Name services - DNS, NIS, NIS+, LDAP
  • NFS - de facto standard distributed file system, NFS-V2 & NFS-V3
  • Remote Procedure Call/External Data Representation (RPC/XDR) facilities
  • Sockets, TLI, Federated Naming APIs


64-Bits


64-bit Solaris
• Since Solaris 7, full 32-bit binary compatibility
• A simple directory namespace rule provides for the support and co-existence of 32-bit binaries on a 64-bit Solaris 8 system:
  • For every directory on the system that contains binary object files (executables, shared object libraries, etc), there is a sparcv9 subdirectory containing the 64-bit versions
• All kernel modules must be of the same data model: ILP32 (32-bit data model) or LP64 (64-bit data model)
• 64-bit kernel required to run 64-bit apps


32 bit limits
• Solaris 2.5
  • Heap is limited to 2GB, malloc will fail beyond 2GB
• Solaris 2.5.1
  • Heap limited to 2GB by default
  • Can go beyond 2GB with kernel patch 103640-08+
  • Can raise limit to 3.75GB by using ulimit or rlimit() if uid=root
  • Do not need to be root with 103640-23+
• Solaris 2.6
  • Heap limited to 2GB by default
  • Can raise limit to 3.75GB by using ulimit or rlimit()
• Solaris 7 & 8
  • Limits are raised by default
  • 32 bit program can malloc 3.99GB


Solaris/SPARC V8/V9 Data Model
• Defines the width of integral data types
  • 32-bit Solaris - ILP32
  • 64-bit Solaris - LP64

Widths in bits:

’C’ data type    ILP32    LP64
char                 8       8
short               16      16
int                 32      32
long                32      64
long long           64      64
pointer             32      64
enum                32      32
float               32      32
double              64      64
quad               128     128
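
The table can be checked against a running program. The following is a minimal sketch in Python (not part of the original slides), assuming the standard ctypes module reports the type widths of the C compiler the interpreter was built with. The function name data_model is hypothetical.

```python
import ctypes


def data_model():
    """Classify the C data model of the running interpreter from type widths (in bits)."""
    bits = {
        "int": ctypes.sizeof(ctypes.c_int) * 8,
        "long": ctypes.sizeof(ctypes.c_long) * 8,
        "long long": ctypes.sizeof(ctypes.c_longlong) * 8,
        "pointer": ctypes.sizeof(ctypes.c_void_p) * 8,
    }
    widths = (bits["int"], bits["long"], bits["pointer"])
    if widths == (32, 32, 32):
        name = "ILP32"   # int, long and pointer are all 32 bits
    elif widths == (32, 64, 64):
        name = "LP64"    # long and pointer grow to 64 bits, int stays 32
    else:
        name = "other"   # for example LLP64 on 64-bit Windows
    return name, bits


if __name__ == "__main__":
    model, widths = data_model()
    print(model, widths)
```

Only long and pointer differ between the two models, which is why code that stores a pointer in an int is the classic 64-bit porting bug.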


64-bit Performance
• 64 Bit Virtual Address Space
  • (+) Free from the 3.9GB barrier
  • (+) Memory map large files
• 64 Bit data types
  • (+) 64 Bit Arithmetic, 64 Bit Registers
  • (-) Pointers/Longs require moving 8 bytes
    • Typically ~5% delta
    • Larger cache footprint
  • (-) Larger Stack


Which Data Model Is Booted?
• Use isainfo (1)

sunsys> isainfo
sparcv9 sparc
sunsys> isainfo -b
64
sunsys> isainfo -v
64-bit sparcv9 applications
32-bit sparc applications

• Or isalist (1)

sunsys> isalist
sparcv9+vis sparcv9 sparcv8plus+vis sparcv8plus sparcv8 sparcv8-fsmuld sparcv7 sparc
sunsys>

• man isaexec (3C)
  • Invoke isa-specific executable
  • To create wrappers for shipping both 32-bit and 64-bit binaries, and automatically launching the correct one


Evolving Features & Technical Tidbits


The Evolution

[Figure: release timeline. Axis years shown: 1992, 1993, 1994, 1995, 1996, 1998, 2000, 2002.]
• Solaris 2.0: VFS/Vnode, ISM, UP only
• Solaris 2.1: 4-way SMP
• Solaris 2.2: sun4d SMP
• Solaris 2.3: 8-way SMP, New DNLC
• Solaris 2.4: 20-way SMP, New KMA, Slab Allocator, Cachefs, CDE
• Solaris 2.5: Large pages (kernel), Doors, NFS V3, sun4u
• Solaris 2.5.1: sun4u MP
• Solaris 2.6: Large files, Processor Sets, Kernel Sockets, lockstat, UFS directio, DR
• Solaris 7: 64-bit kernel, 64-bit procs, UFS logging, Priority Paging
• Solaris 8: New KMA, Cyclics, T2, US-III, SunFire, StarCat, Freeware
• Solaris 9: UFS++, SVM, MPSS, MPO, Resource Pools, FSS, FX
• Other label on the figure: Large UFS


General Priorities
• Reliability, scalability, performance

• on-going

• Standards compliance

• SunOS 4.X binary compatibility

• Threads / SMP scalability

• Big systems performance

• VM & I/O

• Lessons learned on threads

• Resource management

• Consolidation, ROI, TCO

• Resource Pools, Service Containers, Resource Virtualization


Virtual Memory & The Dynamic Page Cache

• Creating a dynamic page cache allows for all of physical memory to be used as disk buffer cache (read(2), write(2))
• The evolution of systems hardware, RAID and general I/O tuning can create environments where the buffer cache throttles the VM system
• The VM roller coaster (keeping the freelist sane)
• Priority paging (2.6 & 7) provided a band-aid
• Using directio bypasses the page cache for UFS reads/writes
• Solaris 8 implements a new cyclic page cache


Global Memory Management

• Demand Paged
  • Not recently used (NRU) algorithm
• Dynamic file system cache
  • Where has all my memory gone?
• Page scanner
  • Operates bottom up from physical pages
  • Default mode treats all memory equally


The Old Page Cache

[Figure: the old page cache. Labels: kernel memory; free list; process memory (heap, data, stack); segmap; page scanner; pages pushed out of segmap; reclaim.]


The Cyclic Page Cache

[Figure: the cyclic page cache. Labels: kernel memory; free list; cache list; process memory (heap, data, stack); segmap; pages pushed out of segmap; reclaim.]


Global Paging Dynamics

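[Figure: scan rate versus free memory (1GB example). Y axis: Scan Rate, rising from slowscan (100) to fastscan (8192). X axis: Free Memory, with thresholds throttlefree (4MB), minfree (4MB), desfree (8MB), lotsfree (16MB) and cachefree (32MB). Also marked: cachefree+deficit, pages_before_pager.]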
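
The shape of the curve in the figure can be sketched in a few lines. This is an illustrative approximation, not the kernel's code: it assumes the scanner is idle while free memory is at or above lotsfree and that the rate ramps linearly from slowscan to fastscan as free memory falls to zero. It ignores deficit and the cachefree threshold used with priority paging. The function name scan_rate is hypothetical, and the default values are the 1GB example figures.

```python
def scan_rate(free_mb, lotsfree_mb=16.0, slowscan=100, fastscan=8192):
    """Approximate pages scanned per second for a given amount of free memory.

    Assumption: linear ramp from slowscan (just below lotsfree) to fastscan
    (no free memory at all); zero when free memory is at or above lotsfree.
    """
    if free_mb >= lotsfree_mb:
        return 0
    shortfall = (lotsfree_mb - max(free_mb, 0.0)) / lotsfree_mb
    return round(slowscan + (fastscan - slowscan) * shortfall)


if __name__ == "__main__":
    for free in (32, 16, 12, 8, 4, 0):
        print(f"{free:>3} MB free -> {scan_rate(free)} pages/sec")
```

Under these assumptions a system with 8MB free scans at roughly half the maximum rate, and the rate only reaches fastscan when the free list is empty.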


Priority Paging

• Solaris 7 FCS or Solaris 2.6 with T-105181-09
  • http://www.sun.com/sun-on-net/performance/priority_paging.html
  • Set priority_paging=1 or cachefree in /etc/system
• Solaris 7 Extended vmstat
  • ftp://playground.sun.com/pub/rmc/memstat
• Solaris 8
  • New VM system, priority paging implemented at the core (make sure it’s disabled in Sol 8!)
  • New vmstat flag, “-p”
• Solaris 9
  • Multiple page size support (MPSS)
  • Memory Placement Optimizations (MPO)


Memory Monitoring
• Use vmstat or the memstat command on Solaris 7
  • ftp://playground.sun.com/pub/rmc/memstat

# vmstat 3
 procs     memory                page               disk          faults        cpu
 r b w   swap  free  re  mf  pi  po  fr de  sr f0 s0 s4 s6   in   sy   cs us sy  id
 0 0 0 269776 21160   0   0   0   0   0  0   0  0  0  0  2  154  200   92  0  0 100
 0 0 0 269776 21152   0   0   0   0   0  0   0  0  0  0  2  155  203  113  0  0  99
 0 0 0 269720  3896   5  17  80   0 109  0  59  0  0  0  2  221  773  134  0  2  98
 0 0 0 269616  3792   0   0 160   0 160  0  76  0  0  0  2  279  242  130  0  1  99
 0 0 0 269616  3792   0   0 192   0 192  0 105  0  0  0  2  294  225  138  0  1  99
 0 0 0 269616  3800   1  90 234   5 232  0  99  0  0  0  2  323  964  305  5  3  92
 0 0 0 269656  3832   0   0 106   0 106  0  51  0  0  0  2  237  212  121  0  1  99

# memstat 3 (Solaris 7 Only)
or
# vmstat -p 3 (Solaris 8+)

memory ---------- paging ----------- - executable - - anonymous - -- filesys -- --- cpu ---
  free  re  mf  pi  po  fr  de  sr epi epo epf api apo apf fpi fpo fpf us sy wt  id
 21160   0  22   0   5   5   0   0   0   0   0   0   0   0   0   5   5  0  1  0  99
 21152   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  0  0  0 100
 21152   0  18  34   2   2   0   0   0   0   0   0   0   0  34   2   2  0  1  0  99
 11920   0   0 277 106 272   0 153   0   0  32   0  98 149 277   8  90  0  3  0  97
 11888   0   0 256  69 224   0 106   0   0  16   0  69 178 256   0  29  0  3  1  96
 11896   0   0 213 106 261   0 124   0   0  26   0 106 232 213   0   2  0  3 13  84
 11904   0   0 245  66 242   0 122   0   0  16   0  64 221 245   2   5  0  2  0  98
 11896   0   0 245  64 224   0 132   0   0  21   0  64 189 245   0  13  0  2  0  98


Simple Memory Rule:
• Identifying a memory shortage without PP:
  • Scanner not scanning -> no memory shortage
  • Scanner running, page ins and page outs, swap device activity -> potential memory shortage
  • (use separate swap disk or 2.6 iostat -p to measure swap partition activity)
• Identifying a memory shortage with PP on Sol 7:
  • api and apo should be zero in memstat; non-zero is a clear sign of memory shortage
• Identifying a memory shortage on Sol 8:
  • scan rate != 0
  • freemem is real
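
The Solaris 8 rule (a non-zero scan rate means a shortage) is easy to automate. The following is a minimal sketch, assuming plain vmstat output in the column layout shown on the Memory Monitoring slide. The helper names scan_rates and memory_shortage are hypothetical, and the sample rows are copied from that slide.

```python
SAMPLE = """\
 procs     memory                page               disk          faults        cpu
 r b w   swap  free  re  mf  pi  po  fr de  sr f0 s0 s4 s6   in   sy   cs us sy  id
 0 0 0 269776 21160   0   0   0   0   0  0   0  0  0  0  2  154  200   92  0  0 100
 0 0 0 269720  3896   5  17  80   0 109  0  59  0  0  0  2  221  773  134  0  2  98
 0 0 0 269616  3792   0   0 160   0 160  0  76  0  0  0  2  279  242  130  0  1  99
"""


def scan_rates(vmstat_text):
    """Return the values of the sr (scan rate) column from vmstat output."""
    rates = []
    columns = None
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "sr" in fields:
            columns = fields  # the header line that names each column
            continue
        # Data rows have one numeric field per named column.
        if columns and len(fields) == len(columns) and fields[0].isdigit():
            rates.append(int(fields[columns.index("sr")]))
    return rates


def memory_shortage(vmstat_text):
    """Apply the slide's Solaris 8 rule: any non-zero scan rate signals a shortage."""
    return any(rate > 0 for rate in scan_rates(vmstat_text))


if __name__ == "__main__":
    print(scan_rates(SAMPLE), memory_shortage(SAMPLE))
```

On releases before Solaris 8 the same check only indicates a potential shortage, because file system I/O alone can keep the scanner running.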


Memory Summary
• Solaris 9

# mdb -k
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      21146               165    9%
Anon                        16891               131    7%
Exec and libs                8389                65    3%
Page cache                   8248                64    3%
Free (cachelist)             2490                19    1%
Free (freelist)            190309              1486   77%

Total                      247473              1933

• Solaris 8 and earlier

# prtmem

Total memory:            1933 Megabytes
Kernel Memory:            164 Megabytes
Application:              128 Megabytes
Executable & libs:         65 Megabytes
File Cache:                64 Megabytes
Free, file cache:          19 Megabytes
Free, free:              1491 Megabytes


The Threads Model

• Original 2-level, MxN model design goals
  • Scalability
  • Lightweight threads
  • Pools of Virtual Processors (LWPs)
  • Bound threads available
• Lessons learned...
  • User level thread scheduling is complex
  • Signal delivery is, at times, a nightmare
  • Kernel threads are not as expensive as they used to be
• What we have today
  • Alternate thread library in Solaris 8 (/usr/lib/lwp/libthread.so)
  • 1-level is the default in Solaris 9 (/usr/lib/libthread.so)


2-Level MxN Model

• The 1 level model is effectively all bound threads (proc 4)

[Figure: the two-level threads model for four processes (proc 1 to proc 4). User Layer: processes and their user threads. Kernel Layer: LWPs, kernel threads (including an unattached kernel thread) and the dispatcher. Hardware Layer: processors (CPUs).]


Resource Management

• Effective management of hardware resources to applications
  • Large application performance
  • Multiple apps per Solaris instance (consolidation)
  • Provide boundaries on resource consumption by applications
• Resource categories
  • Processors (CPUs)
  • Memory (physical memory)
  • Disk IO bandwidth/latency/IOPS
  • Network bandwidth/latency
• This is an on-going effort, with significant improvements in subsequent Solaris 9 quarterly releases


Processor Control Commands
• CPU related commands
  • psrinfo (1M) - provides information about the processors on the system. Use “-v” for verbose
  • psradm (1M) - online/offline processors, interrupt enable/disable
  • psrset (1M) - creation and management of processor sets
  • pbind (1M) - original processor bind command. Does not provide exclusive binding
  • processor_bind (2), processor_info (2), pset_bind (2), pset_info (2), pset_create (2), p_online (2): system calls to do things programmatically


Solaris 9 Resource Management

• Tasks, Projects & Extended Accounting
  • Task - A collection of processes
  • Project - A collection of tasks

[Figure: hierarchy diagram. Projects contain Tasks; each Task contains processes (proc).]


Solaris 9 Resource Management

• Tasks & Projects provide abstractions for binding together related processes, for the purpose of
  • Resource management. Tasks and Projects can be bound to processor sets, have scheduler changes applied to them, etc.
  • Resource controls. Resource limits can be applied at the Project or Task level.
  • Resource monitoring. Tools have been enhanced to monitor utilization at the Project or Task level.
    • “prstat -J” - Display statistics for processes and projects
    • “prstat -T” - Display statistics for processes and tasks
  • Extended accounting. The accounting facility has been enhanced to provide project and task level accounting data.


Solaris 9 Resource Controls

• The following resource controls are available

project.cpu-shares: Number of CPU shares (FSS) available to this project

task.max-cpu-time: Maximum CPU time available to the processes in this task (milliseconds)

task.max-lwps: Maximum number of LWPs available to the processes in this task

process.max-cpu-time: Max CPU time available to this process

process.max-file-descriptor: Max number of open files for this process

process.max-file-size: Max file size

process.max-core-size: Max core file size

process.max-data-size: Max size of the process’s data segment

process.max-stack-size: Max size of the process’s stack


Solaris 9 Fair Share Scheduler

• Share based (versus priority based) process scheduling
• Designed to provide a guaranteed minimum amount of CPU resources to a specific application (project/task)
  • Defining a maximum, or ceiling, is not currently available
  • Shares are allocated to projects
• Shares are not percentages
  • Shares allocated are relative to shares allocated to other projects
  • The total number of shares allocated also matters
• FSS can be used in conjunction with processor sets
  • Finer grained management and control


FSS & Processor Sets

[Figure: CPU entitlement per project within each processor set.]
• Processor Set 1 (2 CPUs, 25% of System): Project A 16.66% (1/6), Project B 33.33% (2/6), Project C 50% (3/6)
• Processor Set 2 (4 CPUs, 50% of System): Project B 40% (2/5), Project C 60% (3/5)
• Processor Set 3 (2 CPUs, 25% of System): Project C 100% (3/3)
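
The fractions in the figure follow from dividing each project's shares by the total shares of the projects active in the same processor set. The sketch below reproduces that arithmetic only. The share values (A=1, B=2, C=3) are inferred from the fractions in the figure, and the names SHARES, PSETS and entitlements are hypothetical.

```python
from fractions import Fraction

# Shares inferred from the figure's fractions (1/6, 2/6, 3/6).
SHARES = {"A": 1, "B": 2, "C": 3}

# Which projects run in which processor set, as drawn in the figure.
PSETS = {
    "pset1": ["A", "B", "C"],
    "pset2": ["B", "C"],
    "pset3": ["C"],
}


def entitlements(shares, active_projects):
    """Each project's CPU entitlement: its shares over the set's total shares."""
    total = sum(shares[name] for name in active_projects)
    return {name: Fraction(shares[name], total) for name in active_projects}


if __name__ == "__main__":
    for pset, projects in PSETS.items():
        split = entitlements(SHARES, projects)
        print(pset, {name: f"{float(frac):.2%}" for name, frac in split.items()})
```

Project C holds 3 shares everywhere, yet its entitlement is 1/2, 3/5 or all of a set depending on who it competes with, which is the sense in which shares are not percentages.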


Resource Pools

• Provides a facility for stateful (persistent) processor sets and project binding, as well as scheduling class assignment
• Resource pool management is done via pooladm (1M), poolbind (1M), and poolcfg (1M).
• /etc/pooladm.conf provides persistence across reboots (managed via poolcfg (1M))
• poolbind (1M) provides for binding of projects or tasks to a resource pool
• /etc/project can define a resource pool for a project or task


Solaris Release Features Summary


Solaris 7 - New Features
• 64-Bits
  • Kernel
  • 64-bit binary support
  • Full binary compatibility for 32-bit executables
• UFS logging
  • mount -o logging
  • Logs to spare blocks in cylinder group
  • No fsck
• UFS noatime
  • Disable access time update to inodes
• pgrep & pkill
  • Ends ps -ef | grep proc_name | awk '{ print $2 }'


Solaris 7 - New Features
• traceroute bundled
• dumpadm(1M)
  • Configure a separate raw partition for dumps
  • Dump running systems
• LDAP Client Library
• TCP with SACK
  • Selective Acknowledgement - RFC 2018
• libdevinfo (3)
  • Device configuration information APIs
• truss (1) Enhanced
  • User level function tracing. “-u”, “-U”


Solaris 8 - New Features
• Cyclic Page Cache
  • Enhanced VM page management functionality
  • Priority for page allocation given to process segments
  • freemem is real!
• System Message IDs
  • Numeric ID generated for syslog messages
• devfsadm(1M)
  • One tool for device configuration/management
  • DR events managed through devfsadmd
• mmap MAP_ANON
  • a = mmap(addr, len, prot, flags | MAP_ANON, -1, off);
• POSIX High Resolution Timers
  • CLOCK_HIGHRES via new Cyclics kernel subsystem
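
The MAP_ANON bullet above describes mapping memory with no backing file by passing -1 as the file descriptor. The slide shows the C call. As a rough analogue only, Python's standard mmap module exposes the same idea: a file descriptor of -1 requests an anonymous mapping. The helper name anonymous_buffer is hypothetical.

```python
import mmap


def anonymous_buffer(length=mmap.PAGESIZE):
    """Map `length` bytes of anonymous memory (no backing file)."""
    # fileno -1 asks for an anonymous mapping, as with MAP_ANON and fd -1 in C.
    return mmap.mmap(-1, length)


if __name__ == "__main__":
    buf = anonymous_buffer()
    buf[:5] = b"hello"      # the mapping behaves like a writable byte array
    print(bytes(buf[:5]))   # bytes not written yet read back as zeros
    buf.close()
```

Before MAP_ANON was available, the usual way to get the same effect was to open /dev/zero and map that descriptor instead.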


Solaris 8 - New Features
• prstat (1)
  • Top-like curses based process monitor utility
• apptrace (1)
  • truss-like utility for tracing user-level library calls
• /proc tools enhanced to work on core files
  • pstack(1), pcred(1), pfiles(1)
• coreadm(1M)
  • System-wide core file management
• mdb(1)
  • New kernel debugger - replaces adb & crash
  • Supports use of adb macros and crash utilities
  • Evolved to manage user code debugging (Sol 9)


Solaris 8 - New Features
• User Level Priority Inheritance
  • User defined mutex locks attribute
• Forced unmount
  • umount -f
• Alternate threads library
  • /usr/lib/lwp/libthread.so - provides all bound threads
  • Does not require re-compilation
• Freeware CD
  • apache, bash, bzip2, tcsh, gcc, mkisofs, less, zsh, Glib, GTK+, etc, etc,...


Solaris 9 - New Features
• Many, many (but not all) Solaris 9 features have been backported to Solaris 8
  • Available in various Solaris 8 update releases
• Resource Management
  • Resource pools - configure boundaries on resources consumed by processes and tasks
  • Processors today, memory coming
  • Resource pools persist across reboots (unlike processor sets and bindings)
  • See prctl (1), pooladm (1M), poolcfg (1M), poolbind (1M), rctladm (1M), project (4)
• Fixed-Priority Scheduling Class (FX)
  • TS class priority range, but priorities remain fixed
• Fair Share Scheduling Class (FSS)
  • Share-based (versus priority-based) CPU allocation


Solaris 9 - New Features
• Command line process facilities
  • pargs (1) - dump args and env associated with a live process, or core file
  • preap (1) - remove zombies (Harry Cooper & Ben could have used this in 1968!)
• du(1), df(1M) and ls(1) - New “-h” option
  • “-h” - provide human-readable output format.
  • Lists sizes in Kbytes, Mbytes, Gbytes, etc...
• Multiple Page Size Support (MPSS)
  • Support of pages larger than 8k for process stack, heap and mmap’d anonymous memory
  • Actual supported page sizes hardware dependent
  • UltraSPARC-III supports 8k, 64k, 512k, 4MB...
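
The benefit of larger pages is a matter of simple arithmetic: the larger the page, the fewer pages (and so the fewer address translations) are needed to cover the same memory. This sketch uses the four UltraSPARC-III page sizes listed above. The names PAGE_SIZES and pages_needed are hypothetical.

```python
# UltraSPARC-III page sizes from the slide: 8k, 64k, 512k, 4MB (in bytes).
PAGE_SIZES = [8 * 1024, 64 * 1024, 512 * 1024, 4 * 1024 * 1024]


def pages_needed(region_bytes, page_size):
    """Number of pages of `page_size` required to cover `region_bytes` (rounded up)."""
    return -(-region_bytes // page_size)


if __name__ == "__main__":
    region = 4 * 1024 * 1024  # a 4MB heap as an example
    for size in PAGE_SIZES:
        print(f"{size:>8} byte pages: {pages_needed(region, size)}")
```

A 4MB region that needs 512 translations with 8k pages needs a single one with a 4MB page.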


Solaris 9 - New Features
• MPSS (cont)

jurassic> pagesize -a
8192
65536
524288
4194304
jurassic>

• New Threads Library/Model
  • 1 Level threads model - all bound threads
  • What was the alternate threads library in Solaris 8 is the default (in /usr/lib) in Solaris 9.
• Dynamic Intimate Shared Memory (DISM)
  • Allows database to dynamically shrink/grow the shared segment
  • Original ISM implementation was a big performance win (shared translation information, large pages), but was fixed in size
  • DISM gives the best of both worlds


Solaris 9 - New Features
• Security
  • Internet Key Exchange (IKE) Protocol
  • Secure Shell (ssh) - SSHv1 & SSHv2
  • Kerberos Key Distribution Center (KDC) & Admin Tools
  • Secure LDAP
  • 128-bit Encryption
  • Role-Based Access Controls (RBAC) Enhanced
  • tcp-wrappers 7.6 in freeware CD
  • Xserver encrypted connections supported


Solaris 9 - New Features
• iPlanet Directory Server

• LDAP Server bundled/integrated

• LDAP Name Service Support

• NIS+ - to - LDAP Migration Tool

• FTP Server

• Based on WU-ftp server

• PPP 4.0

• Includes PPPoE (Solaris 8 7/01)

• IP Network Multipathing (Solaris 8 10/00)

• Solaris Volume Manager

• Formerly Solaris DiskSuite
• Soft partitions and Device ID support


Summary

• Steady, sustained progress on key areas - scalability, reliability, performance, features
• Going forward
  • Resource management - memory, service containers
  • Observability - More & better tools
  • Resilience - fault detection, isolation, containment
  • Management - Zero downtime admin
    • patches, upgrades

• Reliability, performance, always at the top


Supplemental Slides


Kernel Statistics

• Solaris uses a central mechanism for kernel statistics
  • “kstat”
  • Kernel providers
    • raw statistics (C structure)
    • typed data
    • classed statistics
  • Perl and C API
  • kstat(1M) command

# kstat -n system_misc
module: unix                            instance: 0
name:   system_misc                     class:    misc
        avenrun_15min                   90
        avenrun_1min                    86
        avenrun_5min                    87
        boot_time                       1020713737
        clk_intr                        2999968
        crtime                          64.1117776
        deficit                         0
        lbolt                           2999968
        ncpus                           2


Memory Accounting
• The ps command
  • SZ = Virtual Size
  • RSS = Resident Set Size (including shared)

# ps -ale
USER       PID %CPU %MEM   SZ  RSS TT       S    START    TIME COMMAND
root     22998 12.0  0.8 4584 1992 ?        S 10:05:30    3:22 /usr/sbin/nsr/nsrc
root     23672  1.0  0.7 1736 1592 pts/16   O 10:22:54    0:00 /usr/ucb/ps -aux
root         3  0.4  0.0    0    0 ?        S   Sep 28  166:38 fsflush
root       733  0.4  1.0 6352 2496 ?        S   Sep 28  174:29 /opt/SUNWsymon/jre
root       345  0.3  0.7 2968 1736 ?        S   Sep 28   55:39 /usr/sbin/nsr/nsrd
root     23100  0.2  0.5 3880 1104 ?        S   Oct 15    0:25 rpc.rstatd
root       732  0.2  2.5 9920 6304 ?        S   Sep 28   94:43 esd - init topolog


The pmap command
• Verbose Process mappings
  • Solaris 8 private/shared
  • Solaris 9 private=Anon, shared=RSS-Anon

# pmap -x 123
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
00010000       8       8       -       - r-x--  mmap
00020000       8       8       8       - rwx--  mmap
01000000    1024    1024       -       - rw-s-  dev:0,2 ino:5304657
02000000    1024    1024     512       - rw---  dev:0,2 ino:5304657
03000000    1024    1024     512       - rw--R  dev:0,2 ino:5304657
04000000    1024    1024    1024       - rw---    [ anon ]
05000000     512     512     512       - rw--R    [ anon ]
FF280000     680     680       -       - r-x--  libc.so.1
FF33A000      32      32      32       - rwx--  libc.so.1
FF380000      16      16       -       - r-x--  libc_psr.so.1
FF3A0000       8       8       -       - r-x--  libdl.so.1
FF3B0000       8       8       8       - rwx--    [ anon ]
FF3C0000     152     152       -       - r-x--  ld.so.1
FF3F6000       8       8       8       - rwx--  ld.so.1
FFBFA000      24      24      24       - rwx--    [ stack ]
-------- ------- ------- ------- -------
total Kb    5552    5552    2640       -


SWAP Space ctd...

# swap -s
total: 101456k bytes allocated + 12552k reserved = 114008k used, 597736k available

should read:

total: 101456k bytes unallocated + 12552k allocated = 114008k reserved, 597736k available
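
Whatever the labels are called, the numbers on the line are related by simple addition: the first two sum to the third, and the third plus the fourth gives the total virtual swap. A minimal sketch that parses the line and checks this, assuming the four-number format shown above (the function name parse_swap_s and the dictionary keys are hypothetical and deliberately positional, since the slide disputes the printed labels):

```python
import re

LINE = ("total: 101456k bytes allocated + 12552k reserved = "
        "114008k used, 597736k available")


def parse_swap_s(line):
    """Extract the four kilobyte figures from a `swap -s` summary line."""
    first, second, combined, available = (int(n) for n in re.findall(r"(\d+)k", line))
    return {
        "first_kb": first,          # printed as "allocated"
        "second_kb": second,        # printed as "reserved"
        "combined_kb": combined,    # printed as "used"
        "available_kb": available,
        "total_kb": combined + available,  # total virtual swap
    }


if __name__ == "__main__":
    swap = parse_swap_s(LINE)
    assert swap["first_kb"] + swap["second_kb"] == swap["combined_kb"]
    print(swap)
```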


Swap:

# ./prtswap -l
Swap Reservations:
--------------------------------------------------------------------------
Total Virtual Swap Configured:                             767MB =
RAM Swap Configured:                                         255MB
Physical Swap Configured:                                  + 512MB

Total Virtual Swap Reserved Against:                       513MB =
RAM Swap Reserved Against:                                     1MB
Physical Swap Reserved Against:                            + 512MB

Total Virtual Swap Unresv. & Avail. for Reservation:       253MB =
Physical Swap Unresv. & Avail. for Reservations:               0MB
RAM Swap Unresv. & Avail. for Reservations:                + 253MB

Swap Allocations: (Reserved and Phys pages allocated)
--------------------------------------------------------------------------
Total Virtual Swap Configured:                             767MB
Total Virtual Swap Allocated Against:                      467MB

Physical Swap Utilization: (pages swapped out)
--------------------------------------------------------------------------
Physical Swap Free (should not be zero!):                  232MB =
Physical Swap Configured:                                    512MB
Physical Swap Used (pages swapped out):                    - 279MB


The pmap command
• Swap reservations (Solaris 9):

# pmap -S 123
 Address  Kbytes    Swap Mode   Mapped File
00010000       8       - r-x--  mmap
00020000       8       8 rwx--  mmap
01000000    1024       - rw-s-  dev:0,2 ino:5304657
02000000    1024    1024 rw---  dev:0,2 ino:5304657
03000000    1024     512 rw--R  dev:0,2 ino:5304657
04000000    1024    1024 rw---    [ anon ]
05000000     512     512 rw--R    [ anon ]
FF280000     680       - r-x--  libc.so.1
FF33A000      32      32 rwx--  libc.so.1
FF380000      16       - r-x--  libc_psr.so.1
FF3A0000       8       - r-x--  libdl.so.1
FF3B0000       8       8 rwx--    [ anon ]
FF3C0000     152       - r-x--  ld.so.1
FF3F6000       8       8 rwx--  ld.so.1
FFBFA000      24      24 rwx--    [ stack ]
-------- ------- -------
total Kb    5552    3152

Shared Memory

• System V Intimate Shared Memory (ISM)

• Shared translation data structures
• 4MB TLB Page Size
• Locked pages
• Invoke with an additional flag to shmat(): SHM_SHARE_MMU
• Default shared memory mode for Oracle RDBMS

• System V Dynamic Intimate Shared Memory (DISM)

• Solaris 8 U3
• Pageable variant of ISM
• Integrated with Oracle 9i (dynamic SGA)
• 8k TLB Page Size for Solaris 8
• 4MB TLB Page Size for Solaris 9 U1
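To see why the TLB page size matters for a large shared memory segment, it helps to count how many TLB entries are needed to map it. The sketch below is plain arithmetic; the 1 GB segment size and the function name tlb_entries_needed are assumptions chosen for illustration, while the 8k and 4MB page sizes come from the list above.

```python
KB = 1024
MB = 1024 * KB

def tlb_entries_needed(segment_bytes, page_bytes):
    """Number of page-sized translations needed to cover a segment."""
    return -(-segment_bytes // page_bytes)  # ceiling division

segment = 1024 * MB  # hypothetical 1 GB shared memory segment

entries_8k = tlb_entries_needed(segment, 8 * KB)
entries_4m = tlb_entries_needed(segment, 4 * MB)

print(entries_8k, entries_4m)
```

Under these assumptions the same segment needs 512 times fewer translations with 4MB pages, which is the motivation for mapping ISM segments with large pages.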

The pmap command

# pmap -x 15492
15492: ./maps
 Address  Kbytes     RSS    Anon  Locked Mode  Mapped File
00010000       8       8       -       - r-x-- maps
00020000       8       8       8       - rwx-- maps
00022000   20344   16248   16248       - rwx-- [ heap ]
03000000    1024    1024       -       - rw-s- dev:0,2 ino:4628487
04000000    1024    1024     512       - rw--- dev:0,2 ino:4628487
05000000    1024    1024     512       - rw--R dev:0,2 ino:4628487
06000000    1024    1024    1024       - rw--- [ anon ]
07000000     512     512     512       - rw--R [ anon ]
08000000    8192    8192       -    8192 rwxs- [ dism shmid=0x5 ]
09000000    8192    4096       -       - rwxs- [ dism shmid=0x4 ]
0A000000    8192    8192       -    8192 rwxsR [ ism shmid=0x2 ]
0B000000    8192    8192       -    8192 rwxsR [ ism shmid=0x3 ]
FF280000     680     672       -       - r-x-- libc.so.1
FF33A000      32      32      32       - rwx-- libc.so.1
FF390000       8       8       -       - r-x-- libc_psr.so.1
FF3A0000       8       8       -       - r-x-- libdl.so.1
FF3B0000       8       8       8       - rwx-- [ anon ]
FF3C0000     152     152       -       - r-x-- ld.so.1
FF3F6000       8       8       8       - rwx-- ld.so.1
FFBFA000      24      24      24       - rwx-- [ stack ]
-------- ------- ------- ------- -------
total Kb   50464   42264   18888   16384

Multiple Page Size Support

• Solaris 8

• Large (4MB) pages with ISM/DISM for shared memory

• Solaris 9

• "Multiple Page Size Support"
• Optional large pages for heap/stack
• A wrapper for unchanged programs (ppgsz)
• Programmatically via memcntl(3C)
• Shared library for existing binaries (LD_PRELOAD) (/usr/lib/libmpss.so)
• pmap enhancements to observe page sizes (pmap -sx)
• Tool to observe potential gains (trapstat -T)

TLB Trap CPU Accounting

# trapstat -t 3
cpu  | itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim | %tim
-----+-------------------------------+-------------------------------+-----
  0 k|        25  0.0         0  0.0 |     29558  0.5         6  0.0 |  0.6
  0 u|      9728  0.1         1  0.0 |     17943  0.3         3  0.0 |  0.5
-----+-------------------------------+-------------------------------+-----
  1 k|         0  0.0         0  0.0 |     19001  1.2         3  0.0 |  1.2
  1 u|      7872  0.2         0  0.0 |     16300  0.5         0  0.0 |  0.8
=====+===============================+===============================+=====
 ttl |     17625  0.2         1  0.0 |     82802  1.3        12  0.0 |  1.5

The pmap command

# pmap -xs 15492
 Address  Kbytes     RSS    Anon  Locked Pgsz Mode  Mapped File
00010000       8       8       -       -   8K r-x-- maps
00020000       8       8       8       -   8K rwx-- maps
00022000    3960    3960    3960       -   8K rwx-- [ heap ]
00400000    8192    8192    8192       -   4M rwx-- [ heap ]
00C00000    4096       -       -       -    - rwx-- [ heap ]
01000000    4096    4096    4096       -   4M rwx-- [ heap ]
03000000    1024    1024       -       -   8K rw-s- dev:0,2 ino:4628487
08000000    8192    8192       -    8192    - rwxs- [ dism shmid=0x5 ]
09000000    4096    4096       -       -   8K rwxs- [ dism shmid=0x4 ]
0A000000    4096       -       -       -    - rwxs- [ dism shmid=0x2 ]
0B000000    8192    8192       -    8192   4M rwxsR [ ism shmid=0x3 ]
FF280000     136     136       -       -   8K r-x-- libc.so.1
...
FF390000       8       8       -       -   8K r-x-- libc_psr.so.1
FF3A0000       8       8       -       -   8K r-x-- libdl.so.1
FF3B0000       8       8       8       -   8K rwx-- [ anon ]
FF3C0000     152     152       -       -   8K r-x-- ld.so.1
FF3F6000       8       8       8       -   8K rwx-- ld.so.1
FFBFA000      24      24      24       -   8K rwx-- [ stack ]
-------- ------- ------- ------- -------
total Kb   50464   42264   18888   16384

Memory Placement Optimization

• Memory locality optimization for non-uniform memory architectures

• Solaris 9 Update 1
• Ex800 machines are slightly non-uniform
• F15k systems are slightly more non-uniform

• Machine described as groups of latency (lgroups)

• Unit is typically a system board (processors+memory)
• Lgroups are an artifact of the hardware architecture (not user configurable)
• Threads are assigned a “home” lgroup

• Memory allocated “close” to the threads accessing it

• Program heap and stack are allocated on the same lgroup
• Shared memory allocated round robin across boards in the system or processor set. Different programmatic policies also provided.

Lock Statistics - mpstat

# mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  8    0   0 6611   456  300 1637    7   26 1110    0   135   33  45   2  21
  9    1   0 1294   250  100 2156    3   29 1659    0    68    9  63   0  28
 10    0   0 3232   308  100 2357    2   36 1893    0   104    2  66   2  30
 11    0   0  647   385  100 1952    1   19 1418    0    21    4  83   0  13
 12    0   0  190   225  100  307    0    1  589    0     0    0  98   0   2
 13    0   0  624   373  100 1689    2   14 1175    0    87    7  80   2  12
 14    0   0  392   312  100 1810    1   12 1302    0    49    2  80   2  15
 15    0   0  146   341  100 2586    2   13 1676    0     8    0  82   1  17
 16    0   0  382   355  100 1968    2    7 1628    0     4    0  88   0  12
 17    0   0   88   283  100  689    0    4  474    0    95    1  94   2   3
 18    0   0 3571   152  104  568    0    7 2007    0    15    0  93   1   6
 19    0   0 3133   278  100 2043    2   24 1307    0   113    7  69   1  22
 20    0   0  385   242  127 2127    2   22 1296    0    36    0  73   0  26
 21    0   0  152   369  100 2259    0   10 1400    0   140    2  84   2  12
 22    0   0 3964   241  120 1754    3   25 1085    0    91   11  62   1  26
 23    0   2  555   193  100 1827    2   23 1148    0   288    7  64   7  22
 24    0   0  811   245  113 1327    2   23 1228    0   110    3  76   4  17
 25    0   0  105   500  100 2369    0   11 1736    0     6    0  88   0  11
 26    0   0  163   395  131 2383    2   16 1487    0    64    2  79   1  18
 27    0   1  718  1278 1051 2073    4   23 1311    0   237    9  67   6  19
 28    0   0  868   271  100 2287    4   27 1309    0   139    9  55   0  36
 29    0   0  931   302  103 2480    3   29 1569    0   165    9  66   2  23
 30    0   0 2800   303  100 2146    2   13 1266    0   152   11  70   3  16
 31    0   1 1778   320  100 2368    2   24 1381    0   261   11  56   5  28

Lock Statistics - lockstat

# lockstat sleep 10

Adaptive mutex spin: 293311 events in 10.015 seconds (29288 events/sec)

Count  indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
218549  75%  75% 1.00     3337 0x71ca3f50             entersq+0x314
 26297   9%  83% 1.00     2533 0x71ca3f50             putnext+0x104
 19875   7%  90% 1.00     4074 0x71ca3f50             strlock+0x534
 14112   5%  95% 1.00     3577 0x71ca3f50             qcallbwrapper+0x274
  2696   1%  96% 1.00     3298 0x71ca51d4             putnext+0x50
  1821   1%  97% 1.00       59 0x71c9dc40             putnext+0xa0
  1693   1%  97% 1.00     2973 0x71ca3f50             qdrain_syncq+0x160
   683   0%  97% 1.00       66 0x71c9dc00             putnext+0xa0
   678   0%  98% 1.00       55 0x71c9dc80             putnext+0xa0
   586   0%  98% 1.00       25 0x71c9ddc0             putnext+0xa0
   513   0%  98% 1.00       42 0x71c9dd00             putnext+0xa0
   507   0%  98% 1.00       28 0x71c9dd80             putnext+0xa0
   407   0%  98% 1.00       42 0x71c9dd40             putnext+0xa0
   349   0%  98% 1.00     4085 0x8bfd7e1c             putnext+0x50
   264   0%  99% 1.00       44 0x71c9dcc0             putnext+0xa0
   187   0%  99% 1.00       12 0x908a3d90             putnext+0x454
   183   0%  99% 1.00     2975 0x71ca3f50             putnext+0x45c
   170   0%  99% 1.00     4571 0x8b77e504             strwsrv+0x10
   168   0%  99% 1.00     4501 0x8dea766c             strwsrv+0x10
   154   0%  99% 1.00     3773 0x924df554             strwsrv+0x10

Lock Statistics - lockstat

Adaptive mutex block: 2818 events in 10.015 seconds (281 events/sec)

Count  indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
  2134  76%  76% 1.00  1423591 0x71ca3f50             entersq+0x314
   272  10%  85% 1.00   893097 0x71ca3f50             strlock+0x534
   152   5%  91% 1.00   753279 0x71ca3f50             putnext+0x104
   134   5%  96% 1.00   654330 0x71ca3f50             qcallbwrapper+0x274
    65   2%  98% 1.00   872630 0x71ca51d4             putnext+0x50
     9   0%  98% 1.00   260444 0x71ca3f50             qdrain_syncq+0x160
     7   0%  98% 1.00  1390807 0x8dea766c             strwsrv+0x10
     6   0%  99% 1.00   906048 0x88876094             strwsrv+0x10
     5   0%  99% 1.00  2266267 0x8bfd7e1c             putnext+0x50
     4   0%  99% 1.00   468550 0x924df554             strwsrv+0x10
     3   0%  99% 1.00   834125 0x8dea766c             cv_wait_sig+0x198
     2   0%  99% 1.00   759290 0x71ca3f50             drain_syncq+0x380
     2   0%  99% 1.00  1906397 0x8b77e504             cv_wait_sig+0x198
     2   0%  99% 1.00   645358 0x71dd69e4             qdrain_syncq+0xa0

Lock Statistics - lockstat

Spin lock spin: 52335 events in 10.015 seconds (5226 events/sec)

Count  indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
 23531  45%  45% 1.00     4352 turnstile_table+0x79c  turnstile_lookup+0x48
  1864   4%  49% 1.00       71 cpu[19]+0x40           disp+0x90
  1420   3%  51% 1.00       74 cpu[18]+0x40           disp+0x90
  1228   2%  54% 1.00       23 cpu[10]+0x40           disp+0x90
  1159   2%  56% 1.00       60 cpu[16]+0x40           disp+0x90
  1138   2%  58% 1.00       22 cpu[24]+0x40           disp+0x90
  1108   2%  60% 1.00       57 cpu[17]+0x40           disp+0x90
  1082   2%  62% 1.00       24 cpu[11]+0x40           disp+0x90
  1039   2%  64% 1.00       25 cpu[29]+0x40           disp+0x90
  1009   2%  66% 1.00       17 cpu[23]+0x40           disp+0x90
  1007   2%  68% 1.00       21 cpu[31]+0x40           disp+0x90
   882   2%  70% 1.00       29 cpu[13]+0x40           disp+0x90
   846   2%  71% 1.00       25 cpu[28]+0x40           disp+0x90
   833   2%  73% 1.00       27 cpu[30]+0x40           disp+0x90

Lock Statistics - lockstat

Thread lock spin: 1232 events in 10.015 seconds (123 events/sec)

Count  indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
   468  38%  38% 1.00     1018 turnstile_table+0x79c  ts_tick+0x8
   251  20%  58% 1.00      683 turnstile_table+0x79c  turnstile_block+0x1f4
   180  15%  73% 1.00      152 sleepq_head+0x7f4      ts_tick+0x8
    68   6%  78% 1.00       35 sleepq_head+0x7f4      turnstile_block+0x1f4
    31   3%  81% 1.00      650 sleepq_head+0x7f4      ts_update_list+0x60
    17   1%  82% 1.00       34 cpu[27]+0x64           cv_wait+0x18
     7   1%  83% 1.00       64 cpu[13]+0x64           cv_wait+0x18
     7   1%  84% 1.00      146 cpu[30]+0x64           ts_tick+0x8
     6   0%  84% 1.00       56 cpu[29]+0x64           ts_tick+0x8
     6   0%  84% 1.00       37 cpu[8]+0x64            turnstile_block+0x1f4
     6   0%  85% 1.00       96 cpu[9]+0x64            ts_tick+0x8

Lock Statistics - lockstat

R/W writer blocked by writer: 1 events in 10.015 seconds (0 events/sec)

Count  indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
     1 100% 100% 1.00   169634 0x9d42d620             segvn_pagelock+0x150
-------------------------------------------------------------------------------

R/W reader blocked by writer: 3 events in 10.015 seconds (0 events/sec)

Count  indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
     3 100% 100% 1.00  1841415 0x75b7abec             mir_wsrv+0x18
-------------------------------------------------------------------------------

lockstat - kernel profiling

# lockstat -kIi997 sleep 10

Profiling interrupt: 10596 events in 5.314 seconds (1994 events/sec)

Count  indv cuml rcnt     nsec CPU+PIL                Caller
-------------------------------------------------------------------------------
  5122  48%  48% 1.00     1419 cpu[0]                 default_copyout
  1292  12%  61% 1.00     1177 cpu[1]                 splx
  1288  12%  73% 1.00     1118 cpu[1]                 idle
   911   9%  81% 1.00     1169 cpu[1]                 disp_getwork
   695   7%  88% 1.00     1170 cpu[1]                 i_ddi_splhigh
   440   4%  92% 1.00     1163 cpu[1]+11              splx
   414   4%  96% 1.00     1163 cpu[1]+11              i_ddi_splhigh
   254   2%  98% 1.00     1176 cpu[1]+11              disp_getwork
    27   0%  99% 1.00     1349 cpu[0]                 uiomove
    27   0%  99% 1.00     1624 cpu[0]                 bzero
    24   0%  99% 1.00     1205 cpu[0]                 mmrw
    21   0%  99% 1.00     1870 cpu[0]                 (usermode)
     9   0%  99% 1.00     1174 cpu[0]                 xcopyout
     8   0%  99% 1.00      650 cpu[0]                 ktl0
     6   0%  99% 1.00     1220 cpu[0]                 mutex_enter
     5   0%  99% 1.00     1236 cpu[0]                 default_xcopyout
     3   0% 100% 1.00     1383 cpu[0]                 write
     3   0% 100% 1.00     1330 cpu[0]                 getminor
     3   0% 100% 1.00      333 cpu[0]                 utl0
     2   0% 100% 1.00      961 cpu[0]                 mmread
     2   0% 100% 1.00     2000 cpu[0]+10              read_rtc

Kernel Process Model

• Processes

• All processes begin life as a program
• All processes begin life as a disk file (ELF object)
• All processes have “state” or context that defines their execution environment - hardware & software context
• Context can be further divided into “hardware” context and “software” context

• Hardware context

• The processor state, which is CPU architecture dependent.
• In general, the state of the hardware registers (general registers, privileged registers)
• Maintained in the LWP

• Software context

• Address space, credentials, open files, resource limits, etc - stuff shared by all the threads in a process

Dispatcher Views

[Figure: each process is shown with its software context (open files, credentials, address space, process group, session control, ...), its user threads, and its LWPs. Each LWP carries the machine state and is paired with a kthread; the kernel dispatcher view places kthreads on a CPU.]

Unbound user threads are scheduled within the threads library, where the selected user thread is linked to an available LWP. This does not apply to bound threads.

Dispatcher & Scheduling Classes

• Solaris supports multiple scheduling classes

• Allows for the co-existence of different priority schemes and scheduling algorithms (policies) within the kernel

• Each scheduling class provides class-specific functions to manage thread priorities, administration, creation, termination, etc.

• The class-specific functions are called using a macro scheme, similar to what is used at the VFS layer

...CL_PREEMPT(thread) -> ts_preempt()...

• Each scheduling class is assigned a range of priorities
• For each loaded scheduling class, the priority range falls within the system's total range of global priorities

• The dispatcher is the kernel subsystem that manages the dispatch queues (run queues), handles thread selection, context switching, preemption, etc.
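The macro scheme above amounts to an indirect call through a per-class table of functions. The following is a rough sketch of that idea in Python rather than kernel C; the table layout, the Thread type, and the strings returned are all assumptions made for illustration, and only the names CL_PREEMPT and ts_preempt come from the slide.

```python
from dataclasses import dataclass

# Class-specific implementations (illustrative stand-ins, not kernel code).
def ts_preempt(thread):
    return f"ts_preempt({thread.name})"

def rt_preempt(thread):
    return f"rt_preempt({thread.name})"

# One operations table per scheduling class (hypothetical layout).
CLASS_OPS = {
    "TS": {"preempt": ts_preempt},
    "RT": {"preempt": rt_preempt},
}

@dataclass
class Thread:
    name: str
    cid: str  # scheduling class of this thread

def CL_PREEMPT(thread):
    """Class-independent entry point: look up the thread's class, call its op."""
    return CLASS_OPS[thread.cid]["preempt"](thread)

print(CL_PREEMPT(Thread("t1", "TS")))
print(CL_PREEMPT(Thread("t2", "RT")))
```

The caller never names a scheduling class; adding a class only means adding another table, which is the same property the VFS layer gets from its per-file-system operation vectors.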

Scheduling Classes

• SunOS currently implements the following scheduling classes

• Timeshare (TS)
• Fixed Priority (FX)
• Fair Share (FSS)
• Interactive (IA)
• System (SYS)
• Realtime (RT)

Global priority ranges (0 = lowest (worst) priority, 169 = highest (best) priority):

Global priority   Class
  0 -  59         timesharing and interactive
 60 -  99         system
100 - 159         realtime
160 - 169         interrupt

Interrupt thread priorities sit above system; if the realtime class is not loaded, they occupy priorities 100-109.
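The global priority layout on this slide can be written down as a small lookup. This is a sketch under the ranges shown on the slide only; the function name and the rt_loaded parameter are hypothetical.

```python
def classify_global_priority(pri, rt_loaded=True):
    """Map a global priority (0-169) to the range it falls in on the slide."""
    if not 0 <= pri <= 169:
        raise ValueError("global priorities run from 0 to 169")
    if pri <= 59:
        return "timesharing/interactive"
    if pri <= 99:
        return "system"
    if rt_loaded:
        return "realtime" if pri <= 159 else "interrupt"
    # Without the realtime class, interrupts sit directly above system.
    if pri <= 109:
        return "interrupt"
    raise ValueError("priority unused when the realtime class is not loaded")

print(classify_global_priority(59))
print(classify_global_priority(100))
print(classify_global_priority(100, rt_loaded=False))
```

The same number, 100, lands in a different range depending on whether the realtime class is loaded, which is the point of the note about priorities 100-109.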

Scheduling Classes - Priorities

[Figure: the global priority range (0 to 169) mapped to per-class user priority ranges: timeshare -60 to +60, interactive -60 to +60, realtime 0 to 59, with system above them and interrupts (levels 1 to 10) at the top.]

Quick Tidbit

• Use dispadmin(1M) or mdb(1) for scheduling class info

# dispadmin -l
CONFIGURED CLASSES
==================

SYS     (System Class)
TS      (Time Sharing)
FX      (Fixed Priority)
IA      (Interactive)

# mdb -k
> ::class
SLOT NAME       INIT FCN        CLASS FCN
   0 SYS        sys_init        sys_classfuncs
   1 TS         ts_init         ts_classfuncs
   2 FX         fx_init         fx_classfuncs
   3 IA         ia_init         ia_classfuncs
   4            0               0
   5            0               0

• Note the RT class is not loaded

Thread Priorities & Scheduling

• Every thread has 2 priorities: a global priority, derived from its scheduling class, and (potentially) an inherited priority

• Priority inherited from parent, alterable via priocntl(1) command or system call

• Typically, threads run as either TS or IA threads

• IA threads created when thread is associated with a windowing system

• RT threads are explicitly created

• SYS class used by kernel threads, and for TS/IA threads when a higher priority is warranted

• A temporary boost when an important resource is being held

• Interrupts run at interrupt priority

File System Types

Filesystem  Type     Device          Description
ufs         Regular  Disk            Unix Fast Filesystem, default in Solaris
pcfs        Regular  Disk            MSDOS filesystem
hsfs        Regular  Disk            High Sierra File System (CDROM)
tmpfs       Regular  Memory          Uses memory and swap
nfs         Pseudo   Network         Network filesystem
cachefs     Pseudo   Filesystem      Uses a local disk as cache for another NFS file system
autofs      Pseudo   Filesystem      Uses a dynamic layout to mount other file systems
specfs      Pseudo   Device Drivers  Filesystem for the /dev devices
procfs      Pseudo   Kernel          /proc filesystem representing processes
sockfs      Pseudo   Network         Filesystem of socket connections
fifofs      Pseudo   Files           FIFO File System

The virtual file system framework

[Figure: layered view inside the kernel. The System Call Interface sits on top of the VFS file system independent layer (VFS & VNODE interfaces), which sits on top of the file system implementations: UFS, PCFS, HSFS, VxFS, NFS, PROCFS.]

VNODE OPERATIONS: read(), write(), open(), close(), mkdir(), rmdir(), rename(), link(), unlink(), seek(), fsync(), ioctl(), creat()

VFS OPERATIONS: mount(), umount(), statfs(), sync()

The VFS Interface

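[Figure: the VFS structure. *rootvfs heads the list of mounted file systems (/, /usr, /var, /opt). Each VFS holds:]

• VFS Type: index into vfssw[] (entries such as ufs, nfs)
• VFSOP_xxx: mount(), unmount(), root(), statvfs(), sync(), vget(), mountroot(), swapvp(); for UFS these resolve to ufs_mount(), ufs_unmount(), ufs_root(), ufs_statvfs(), ufs_sync(), ufs_vget(), ufs_mountroot(), ufs_swapvp()
• Mount Point: pointer to a vnode
• blocksize, flags, device, synclist, hashlist, etc...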

The vnode interface

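[Figure: the VNODE structure. Each VNODE holds:]

• VNODE Type: Regular File, Directory, Block Device, Character Device, Link, FIFO, Process, Socket
• VNODE Ops: close(), read(), write(), ioctl(), create(), link(), ...; for UFS these resolve to ufs_close(), ufs_read(), ufs_write(), ufs_ioctl(), ufs_create(), ufs_link(), ...
• Memory Pages
• Filesystem Pointer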

File system Caching

• Solaris file systems use the VM system to cache and move data

• Regular reads are page ins, delayed writes are page outs

• VM parameters and load dramatically affect file system performance

• Solaris 8 gives executable, stack and heap pages priority over file system pages

File System Caching

[Figure: the caching layers between a process and the storage devices. Process address space: Binary (Text), Binary (Data), Stack, Heap, mmap(). fread()/fwrite() go through the STDIO buffers; read()/write() go through the Level 1 Page Cache (segmap page cache, 256MB on Ultra), which is backed by the Level 2 Page Cache (Dynamic Page Cache). File name lookups use the Directory Name Cache (ncsize) and the Inode Cache (ufsninode); the Buffer Cache (BUFHWM) holds direct blocks. All layers sit above the Storage Devices.]

• The cache hit ratio of the segmap cache can be measured with netstat -k segmap
• Files mapped with mmap() bypass the segmap cache
• The DNLC cache hit ratio can be observed with vmstat -s
• The buffer cache hit ratio can be observed with sar -b

UFS

• Block based allocation

• 2TB Max file system size
• A file can grow to the max file system size

• triple indirect is implemented

• Prior to 2.6, max file size is 2GB
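The reason triple indirect blocks matter can be seen by adding up how much file data each level of block pointers can address. The sketch below is arithmetic only: the 8192-byte block size matches the filestat output elsewhere in this deck, while the 12 direct pointers and 4-byte block addresses are assumptions about the on-disk inode layout, and the function name is hypothetical.

```python
def ufs_addressable_bytes(block_size=8192, ptr_size=4, ndirect=12):
    """Bytes of file data reachable at each level of block pointers.

    ndirect and ptr_size are assumed values, labeled as such above.
    """
    ptrs_per_block = block_size // ptr_size
    return {
        "direct": ndirect * block_size,
        "single indirect": ptrs_per_block * block_size,
        "double indirect": ptrs_per_block ** 2 * block_size,
        "triple indirect": ptrs_per_block ** 3 * block_size,
    }

levels = ufs_addressable_bytes()
without_triple = (levels["direct"] + levels["single indirect"]
                  + levels["double indirect"])
TB = 2 ** 40

for name, size in levels.items():
    print(f"{name}: {size} bytes")
print(without_triple < 2 * TB <= without_triple + levels["triple indirect"])
```

Under these assumptions, direct, single, and double indirect blocks together reach a little over 32GB, so a file can only approach the 2TB file system limit once the triple indirect level is usable.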

UFS Block Allocation

# filestat /home/bigfile

Inodes per cyl group:     64
Inodes per block:         64
Cylinder Group no:        0
Cylinder Group blk:       64
File System Block Size:   8192
Device block size:        512
Number of device blocks:  204928

Start Block    End Block    Length (Device Blocks)
-----------    -----------  ----------------------
      66272 ->       66463     192
      66480 ->       99247   32768
    1155904 ->     1188671   32768
    1277392 ->     1310159   32768
    1387552 ->     1420319   32768
    1497712 ->     1530479   32768
    1607872 ->     1640639   32768
    1718016 ->     1725999    7984
    1155872 ->     1155887      16

Number of extents:    9
Average extent size:  22769 Blocks

Note: The filestat command is shown for demonstration purposes, and is not as yet included with the Solaris operating system

UFS Logging

• Beginning in Solaris 7, UFS logging became a mount option

• Log to spare blocks in the file system (no metadevice)

• Fast reboots - no fsck required

UFS Direct I/O

• File systems cause a lot of paging activity

• Solaris 2.6 introduces a mechanism to bypass the VM system

• Forces completely unbuffered I/Os
• Very slow writes (synchronous)
• Useful for copying large files or when application does caching e.g. Oracle
• mount -o forcedirectio /dev/xyz /mountpt
• directio (fd, DIRECTIO_ON | DIRECTIO_OFF)

Direct I/O Checklist

• Must be aligned

• sector aligned (512 byte boundary)

• Must not be mapped

• Buffer must be word aligned
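The checklist can be expressed as a small pre-flight check. This is only a sketch of the three conditions as listed: applying the 512-byte alignment to the file offset and using a 4-byte word are assumptions, and the function name is hypothetical rather than any Solaris interface.

```python
SECTOR_SIZE = 512  # bytes, from the checklist

def directio_problems(offset, buf_addr, file_is_mapped, word_size=4):
    """Return the checklist items a request fails (empty list means it passes).

    word_size=4 is an assumed word size, and the sector check is applied
    to the file offset; both are illustrative assumptions.
    """
    problems = []
    if offset % SECTOR_SIZE != 0:
        problems.append("offset not sector aligned")
    if file_is_mapped:
        problems.append("file is mapped")
    if buf_addr % word_size != 0:
        problems.append("buffer not word aligned")
    return problems

print(directio_problems(offset=8192, buf_addr=0x20000, file_is_mapped=False))
print(directio_problems(offset=100, buf_addr=0x20001, file_is_mapped=True))
```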

UFS Write Throttle

• A throttle exists in UFS to limit the amount of memory UFS can saturate, per file
• Controlled by three parameters
• ufs_WRITES (1 = enabled)
• ufs_HW = 393216 bytes (high water mark to suspend IO)
• ufs_LW = 262144 bytes (low water mark to start IO)

• Almost always need to set this higher to get maximum sequential write performance

• set ufs_LW=4194304
• set ufs_HW=67108864
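The high and low water marks form a simple hysteresis: writers are held once outstanding bytes pass the high mark and released once they drain below the low mark. The toy model below illustrates that behavior only; it is not the kernel implementation, and the class, its methods, and the exact comparison operators are assumptions. The default values are the ones listed above.

```python
class WriteThrottle:
    """Toy per-file model of a high/low water mark write throttle."""

    def __init__(self, hw=393216, lw=262144):
        self.hw = hw            # high water mark: suspend writers above this
        self.lw = lw            # low water mark: resume writers below this
        self.outstanding = 0    # bytes queued but not yet on disk
        self.suspended = False

    def queue_write(self, nbytes):
        self.outstanding += nbytes
        if self.outstanding > self.hw:
            self.suspended = True

    def complete_write(self, nbytes):
        self.outstanding = max(0, self.outstanding - nbytes)
        if self.suspended and self.outstanding < self.lw:
            self.suspended = False

throttle = WriteThrottle()
throttle.queue_write(400000)     # passes the high water mark
print(throttle.suspended)
throttle.complete_write(100000)  # 300000 left, still above the low mark
print(throttle.suspended)
throttle.complete_write(50000)   # 250000 left, below the low mark
print(throttle.suspended)
```

With the default marks only a few hundred kilobytes can be in flight per file before the writer stalls, which is why the larger settings above help sequential write throughput.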

UFS Performance

• Adjacent blocks are grouped and written together or read ahead
• Controlled by the maxcontig parameter
• Defaults to 128k on most platforms, 1MB on SPARCstorage Array 100, 200
• Must be set higher to achieve adequate write performance
• maxphys must be raised beyond 128k also