55
Cgroups, namespaces and beyond: What are containers made from? Jérôme Petazzoni Tinkerer Extraordinaire

Cgroups, namespaces and beyond: what are containers made from?

Embed Size (px)

Citation preview

Page 1: Cgroups, namespaces and beyond: what are containers made from?

Cgroups, namespaces and beyond:What are containers made from?

Jérôme🐳🚢📦PetazzoniTinkerer Extraordinaire

Page 2: Cgroups, namespaces and beyond: what are containers made from?

Who am I?Jérôme Petazzoni (@jpetazzo)French software engineer living in CaliforniaI have built and scaled the dotCloud PaaS 

almost 5 years ago, time flies!I talk (too much) about containers 

first time was maybe in February 2013, at SCALE in L.A.Proud member of the House of Bash 

when annoyed, we replace things with tiny shell scripts

Page 3: Cgroups, namespaces and beyond: what are containers made from?

OutlineWhat is a container? 

High level and low level overviewsThe building blocks 

Namespaces, cgroups, copy-on-write storageContainer runtimes

Docker, LXC, rkt, runC, systemd-nspawnOpenVZ, Jails, Zones

Hand-crafted, artisan-pick, house-blend containers 

Demos!

Page 4: Cgroups, namespaces and beyond: what are containers made from?

What is a container?📦

Page 5: Cgroups, namespaces and beyond: what are containers made from?

High level approach: it's a lightweight VMI can get a shell on it 

through SSH or otherwiseIt "feels" like a VM

own process spaceown network interfacecan run stuff as rootcan install packagescan run servicescan mess up routing, iptables …

Page 6: Cgroups, namespaces and beyond: what are containers made from?

Low level approach: it's chroot on steroidsIt's not quite like a VM:

uses the host kernelcan't boot a different OScan't have its own modulesdoesn't need init as PID 1doesn't need syslogd, cron...

It's just a bunch of processes visible on the host machine 

contrast with VMs which are opaque

Page 7: Cgroups, namespaces and beyond: what are containers made from?

How are they implemented?Let's look in the kernel source!

Go to LXRLook for "LXC" → zero resultLook for "container" → 1000+ resultsAlmost all of them are about data structures 

or other unrelated concepts like “ACPI containers”There are some references to “our” containers

but only in the documentation

Page 8: Cgroups, namespaces and beyond: what are containers made from?

How are they implemented?Let's look in the kernel source!

Go to LXRLook for "LXC" → zero resultLook for "container" → 1000+ resultsAlmost all of them are about data structures 

or other unrelated concepts like “ACPI containers”There are some references to “our” containers

but only in the documentation??? Are containers even in the kernel ???

Page 9: Cgroups, namespaces and beyond: what are containers made from?

How are they implemented?Let's look in the kernel source!

Go to LXRLook for "LXC" → zero resultLook for "container" → 1000+ resultsAlmost all of them are about data structures 

or other unrelated concepts like “ACPI containers”There are some references to “our” containers

but only in the documentation??? Are containers even in the kernel ?????? Do containers even actually EXIST ???

Page 10: Cgroups, namespaces and beyond: what are containers made from?

The building blocks:control groups

Page 11: Cgroups, namespaces and beyond: what are containers made from?

Control groupsResource metering and limiting

memoryCPUblock I/Onetwork*

Device node (/dev/*) access controlCrowd control

*With cooperation from iptables/tc; more on that later

Page 12: Cgroups, namespaces and beyond: what are containers made from?

GeneralitiesEach subsystem (memory, CPU...) has

a hierarchy (tree)Hierarchies are independent 

the trees for e.g. memory and CPU can be differentEach process belongs to exactly 1 node in each

hierarchy think of each hierarchy as a different dimension or axis

Each hierarchy starts with 1 node (the root)All processes initially belong to the root of each

hierarchy*Each node = group of processes 

sharing the same resources

*It's a bit more subtle; more on that later

Page 13: Cgroups, namespaces and beyond: what are containers made from?

Example

Page 14: Cgroups, namespaces and beyond: what are containers made from?

Memory cgroup: accountingKeeps track of pages used by each group:

file (read/write/mmap from block devices)anonymous (stack, heap, anonymous mmap)active (recently accessed)inactive (candidate for eviction)

Each page is “charged” to a groupPages can be shared across multiple groups 

e.g. multiple processes reading from the same filesWhen pages are shared, only one group “pays”

for a page

Page 15: Cgroups, namespaces and beyond: what are containers made from?

Memory cgroup: limitsEach group can have (optional) hard and soft

limitsSoft limits are not enforced 

they influence reclaim under memory pressureHard limits will trigger a per-group OOM killerThe OOM killer can be customized (oom-

notifier);  when the hard limit is exceeded:

freeze all processes in the groupnotify user space (instead of going rampage)we can kill processes, raise limits, migrate containers ...when we're in the clear again, unfreeze the group

Limits can be set for physical, kernel, total memory

Page 16: Cgroups, namespaces and beyond: what are containers made from?

Memory cgroup: tricky detailsEach time the kernel gives a page to a

process,  or takes it away, it updates the counters

This adds some overheadUnfortunately, this cannot be enabled/disabled

per processit has to be done at boot time

When multiple groups use the same page, only the first one is “charged”

but if it stops using it, the charge is moved to another group using it

Page 17: Cgroups, namespaces and beyond: what are containers made from?

HugeTLB cgroupControls the amount of huge pages usable by a

processsafe to ignore if you don’t know what huge pages are

Page 18: Cgroups, namespaces and beyond: what are containers made from?

Cpu cgroupKeeps track of user/system CPU timeKeeps track of usage per CPUAllows to set weightsCan't set CPU limits

OK, let's say you give N%then the CPU throttles to a lower clock speednow what?same if you give a time slotinstructions? their exec speed varies wildly¯\_ツ _/¯

Page 19: Cgroups, namespaces and beyond: what are containers made from?

Cpuset cgroupPin groups to specific CPU(s)Reserve CPUs for specific appsAvoid processes bouncing between CPUsAlso relevant for NUMA systemsProvides extra dials and knobs 

per zone memory pressure, process migration costs…

Page 20: Cgroups, namespaces and beyond: what are containers made from?

Blkio cgroupKeeps track of I/Os for each group

per block deviceread vs writesync vs async

Set throttle (limits) for each groupper block deviceread vs writeops vs bytes

Set relative weights for each groupNote: most writes will go through the page

cache anywayso classic writes will appear to be unthrottled at first

Page 21: Cgroups, namespaces and beyond: what are containers made from?

Net_cls and net_prio cgroupAutomatically set traffic class or priority, 

for traffic generated by processes in the group

Only works for egress trafficNet_cls will assign traffic to a class 

class then has to be matched with tc/iptables,otherwise traffic just flows normally

Net_prio will assign traffic to a priority priorities are used by queuing disciplines

Page 22: Cgroups, namespaces and beyond: what are containers made from?

Devices cgroupControls what the group can do on device

nodesPermissions include read/write/mknodTypical use:

allow /dev/{tty,zero,random,null} ...deny everything else

A few interesting nodes:/dev/net/tun (network interface manipulation)/dev/fuse (filesystems in user space)/dev/kvm (VMs in containers, yay inception!)/dev/dri (GPU)

Page 23: Cgroups, namespaces and beyond: what are containers made from?

Freezer cgroupAllows to freeze/thaw a group of processesSimilar in functionality to mass

SIGSTOP/SIGCONTCannot be detected by the processes (unlike

STOP/CONT)Doesn’t impede ptrace/debuggingSpecific usecases

cluster batch schedulingprocess migration

Page 24: Cgroups, namespaces and beyond: what are containers made from?

SubtletiesPID 1 is placed at the root of each hierarchyNew processes are placed in their parent’s

groupsGroups are materialized by one (or multiple)

pseudo-fs typically mounted in /sys/fs/cgroup

Groups are created by mkdir in the pseudo-fsTo move a process:

echo $PID > /sys/fs/cgroup/.../tasksThe cgroup wars: systemd vs cgmanager vs ...

Page 25: Cgroups, namespaces and beyond: what are containers made from?

The building blocks:namespaces

Page 26: Cgroups, namespaces and beyond: what are containers made from?

NamespacesProvide processes with their own view of the

systemCgroups = limits how much you can use; 

namespaces = limits what you can see (and therefore use)

Multiple namespaces:pidnetmntutsipcuser

Each process is in one namespace of each type

Page 27: Cgroups, namespaces and beyond: what are containers made from?

Pid namespaceProcesses within a PID namespace only see

processes in the same PID namespaceEach PID namespace has its own numbering 

starting at 1When PID 1 goes away, the whole namespace

is killedThose namespaces can be nestedA process ends up having multiple PIDs 

one per namespace in which its nested

Page 28: Cgroups, namespaces and beyond: what are containers made from?

Net namespace: in theoryProcesses within a given network namespace 

get their own private network stack, including:

network interfaces (including lo)routing tablesiptables rulessockets (ss, netstat)

You can move a network interface from a netns to another

ip link set dev eth0 netns PID

Page 29: Cgroups, namespaces and beyond: what are containers made from?

Net namespace: in practiceTypical use-case:

use veth pairs (two virtual interfaces acting as a cross-over cable)

eth0 in container network namespacepaired with vethXXX in host network namespaceall the vethXXX are bridged together  (Docker calls the

bridge docker0)But also: the magic of --net container

shared localhost (and more!)

Page 30: Cgroups, namespaces and beyond: what are containers made from?

Mnt namespaceProcesses can have their own root fs (à la

chroot)Processes can also have “private” mounts

/tmp (scoped per user, per service...)masking of /proc, /sysNFS auto-mounts (why not?)

Mounts can be totally private, or sharedNo easy way to pass along a mount

from a namespace to another ☹

Page 31: Cgroups, namespaces and beyond: what are containers made from?

Uts namespacegethostname / sethostname

'nuff said!

Page 32: Cgroups, namespaces and beyond: what are containers made from?

Ipc namespace

Page 33: Cgroups, namespaces and beyond: what are containers made from?

Ipc namespaceDoes anybody know about IPC?

Page 34: Cgroups, namespaces and beyond: what are containers made from?

Ipc namespaceDoes anybody know about IPC?Does anybody care about IPC?

Page 35: Cgroups, namespaces and beyond: what are containers made from?

Ipc namespaceDoes anybody know about IPC?Does anybody care about IPC?Allows a process (or group of processes) to

have own:IPC semaphoresIPC message queuesIPC shared memory

... without risk of conflict with other instances

Page 36: Cgroups, namespaces and beyond: what are containers made from?

User namespaceAllows to map UID/GID; e.g.:

UID 0→1999 in container C1 is mapped to UID 10000→11999 on host;UID 0→1999 in container C2 is mapped to UID 12000→13999 on host;etc.

Avoids extra configuration in containersUID 0 (root) can be squashed to a non-

privileged userSecurity improvementBut: devil is in the details

Page 37: Cgroups, namespaces and beyond: what are containers made from?

Namespace manipulationNamespaces are created with

the clone() system call i.e. with extra flags when creating a new process

Namespaces are materialized by pseudo-files/proc/<pid>/ns

When the last process of a namespace exits, it is destroyed 

but can be preserved by bind-mounting the pseudo-fileIt is possible to "enter" a namespace

with setns() exposed by the nsenter wrapper in util-linux

Page 38: Cgroups, namespaces and beyond: what are containers made from?

The building blocks:copy-on-write

Page 39: Cgroups, namespaces and beyond: what are containers made from?

Copy-on-write storageCreate a new container instantly 

instead of copying its whole filesystemStorage keeps track of what has changedMany options available

AUFS, overlay (file level)device mapper thinp (block level)BTRFS, ZFS (FS level)

Considerably reduces footprint and “boot” times

See also: Deep dive into Docker storage drivers

Page 40: Cgroups, namespaces and beyond: what are containers made from?

Other details

Page 41: Cgroups, namespaces and beyond: what are containers made from?

OrthogonalityAll those things can be used independentlyUse a few cgroups if you just need resource

isolationSimulate a network of routers with network

namespacesPut the debugger in a container's

namespaces,  but not its cgroups (to not use its resource

quotas)Setup a network interface in an isolated

environment,  then move it to another

etc.

Page 42: Cgroups, namespaces and beyond: what are containers made from?

Some missing bitsCapabilities

break down "root / non-root" into fine-grained rightsallow to keep root, but without the dangerous bitshowever: CAP_SYS_ADMIN remains a big catchall

SELinux / AppArmor ...containers that actually containdeserve a whole talk on their owncheck https://github.com/jfrazelle/bane (AppArmor profile

generator)

Page 43: Cgroups, namespaces and beyond: what are containers made from?

Container runtimes

Page 44: Cgroups, namespaces and beyond: what are containers made from?

Container runtimesusing cgroups+namespaces

Page 45: Cgroups, namespaces and beyond: what are containers made from?

LXCSet of userland toolsA container is a directory in /var/lib/lxcSmall config file + root filesystemEarly versions had no support for CoWEarly versions had no support to move images

aroundRequires significant amount of elbow grease 

easy for sysadmins/ops, hard for devs

Page 46: Cgroups, namespaces and beyond: what are containers made from?

systemd-nspawnFrom its manpage:

“For debugging, testing and building”“Similar to chroot, but more powerful”“Implements the Container Interface”

Seems to position itself as plumbingRecently added support for ɹǝʞɔop images

#define INDEX_HOST "index.do" /* the URL we get the data from */ "cker.io”#define HEADER_TOKEN "X-Do" /* the HTTP header for the auth token */ "cker-Token:” #define HEADER_REGISTRY "X-Do" /*the HTTP header for the registry */ "cker-Endpoints:”

Page 47: Cgroups, namespaces and beyond: what are containers made from?

Docker EngineDaemon controlled by REST(ish) APIFirst versions shelled out to LXC

now uses its own libcontainer runtimeManages containers, images, builds, and moreSome people think it does too many things

Page 48: Cgroups, namespaces and beyond: what are containers made from?

rkt, runCBack to the basics!Focus on container execution 

no API, no image management, no build, etc.They implement different specifications:

rkt implements appc (App Container)runC implements OCP (Open Container Project) runC leverages Docker's libcontainer

Page 49: Cgroups, namespaces and beyond: what are containers made from?

Which one is best?They all use the same kernel featuresPerformance will be exactly the sameLook at:

featuresdesignecosystem

Page 50: Cgroups, namespaces and beyond: what are containers made from?

Container runtimesusing other mechanisms

Page 51: Cgroups, namespaces and beyond: what are containers made from?

OpenVZAlso LinuxOlder, but battle-tested 

e.g. Travis CI gives you root in OpenVZTons of neat features too

ploop (efficient block device for containers)checkpoint/restore, live migrationvenet (~more efficient veth)

Still developed and maintained

Page 52: Cgroups, namespaces and beyond: what are containers made from?

Jails / ZonesFreeBSD / SolarisCoarser granularity than namespaces and

cgroupsStrong emphasis on securityGreat for hosting providersNot so much for developers 

where's the equivalent of docker run -ti ubuntu?Note: Illumos branded zones can run Linux

binarieswith Joyent Triton, you can run Linux containers on Illumos

Note: this is different from the Oracle Solaris port

the latter will run Solaris binaries, not Linux binaries

Page 53: Cgroups, namespaces and beyond: what are containers made from?

Building our own containers

Page 54: Cgroups, namespaces and beyond: what are containers made from?

For educational purposes only!

54

Page 55: Cgroups, namespaces and beyond: what are containers made from?

Thank you!Jérôme Petazzoni🐳🚢📦@[email protected]