56
Cgroups, namespaces and beyond: What are containers made from? Jérôme Petazzoni Tinkerer Extraordinaire

Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Embed Size (px)

Citation preview

Page 1: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Cgroups, namespaces and beyond:What are containers made from?

Jérôme🐳🚢📦PetazzoniTinkerer Extraordinaire

Page 2: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Who am I?Jérôme Petazzoni (@jpetazzo)French software engineer living in CaliforniaI have built and scaled the dotCloud PaaSalmost 5 years ago, time flies!

I talk (too much) about containersfirst time was maybe in February 2013, at SCALE in L.A.

Proud member of the House of Bashwhen annoyed, we replace things with tiny shell scripts

Page 3: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

OutlineWhat is a container?High level and low level overviews

The building blocksNamespaces, cgroups, copy-on-write storage

Container runtimesDocker, LXC, rkt, runC, systemd-nspawnOpenVZ, Jails, Zones

Hand-crafted, house-blend, artisan containersDemos!

Page 4: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

What is a container?📦

Page 5: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

High level approach: lightweight VMI can get a shell on itthrough SSH or otherwise

It "feels" like a VMown process spaceown network interfacecan run stuff as rootcan install packagescan run servicescan mess up routing, iptables …

Page 6: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Low level approach: chroot on steroidsIt's not quite like a VM:uses the host kernelcan't boot a different OScan't have its own modulesdoesn't need init as PID 1doesn't need syslogd, cron...

It's just normal processes on the host machinecontrast with VMs which are opaque

Page 7: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

How are they implemented?Let's look in the kernel source!

Go to LXRLook for "LXC" → zero resultLook for "container" → 1000+ resultsAlmost all of them are about data structuresor other unrelated concepts like “ACPI containers”

There are some references to “our” containersbut only in the documentation

Page 8: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

How are they implemented?Let's look in the kernel source!

Go to LXRLook for "LXC" → zero resultLook for "container" → 1000+ resultsAlmost all of them are about data structuresor other unrelated concepts like “ACPI containers”

There are some references to “our” containersbut only in the documentation

??? Are containers even in the kernel ???

Page 9: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

How are they implemented?Let's look in the kernel source!

Go to LXRLook for "LXC" → zero resultLook for "container" → 1000+ resultsAlmost all of them are about data structuresor other unrelated concepts like “ACPI containers”

There are some references to “our” containersbut only in the documentation

??? Are containers even in the kernel ?????? Do containers even actually EXIST ???

Page 10: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

The building blocks:control groups

Page 11: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Control groupsResource metering and limitingmemoryCPUblock I/Onetwork*

Device node (/dev/*) access controlCrowd control

*With cooperation from iptables/tc; more on that later

Page 12: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

GeneralitiesEach subsystem has a hierarchy (tree)separate hierarchies for CPU, memory, block I/O…

Hierarchies are independentthe trees for e.g. memory and CPU can be different

Each process is in a node in each hierarchythink of each hierarchy as a different dimension or axis

Each hierarchy starts with 1 node (the root)Initially, all processes start at the root node*Each node = group of processessharing the same resources

*It's a bit more subtle; more on that later

Page 13: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Example

Page 14: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Memory cgroup: accountingKeeps track of pages used by each group:file (read/write/mmap from block devices)anonymous (stack, heap, anonymous mmap)active (recently accessed)inactive (candidate for eviction)

Each page is “charged” to a groupPages can be shared across multiple groupse.g. multiple processes reading from the same fileswhen pages are shared, only one group “pays” for a page

Page 15: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Memory cgroup: limitsEach group can have its own limitslimits are optionaltwo kinds of limits: soft and hard limits

Soft limits are not enforcedthey influence reclaim under memory pressure

Hard limits will trigger a per-group OOM killerLimits can be set for different kinds of memoryphysical memorykernel memorytotal memory

Page 16: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Memory cgroup: avoiding OOM killerDo you like when the kernel terminates randomprocesses because something unrelated ate allthe memory in the system?Setup oom-notifierThen, when the hard limit is exceeded:freeze all processes in the groupnotify user space (instead of going rampage)we can kill processes, raise limits, migrate containers ...when we're in the clear again, unfreeze the group

Page 17: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Memory cgroup: tricky detailsEach time the kernel gives a page to a process,or takes it away, it updates the countersThis adds some overheadThis cannot be enabled/disabled per processit has to be done at boot time

When multiple groups use the same page,only the first one is “charged”but if it stops using it, the charge is moved to another group

Page 18: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

HugeTLB cgroupControls the amount of huge pagesusable by a processsafe to ignore if you don’t know what huge pages are

Page 19: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Cpu cgroupKeeps track of user/system CPU timeKeeps track of usage per CPUAllows to set weightsCan't set CPU limitsOK, let's say you give N%then the CPU throttles to a lower clock speednow what?same if you give a time slotinstructions? their exec speed varies wildly¯\_ツ_/¯

Page 20: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Cpuset cgroupPin groups to specific CPU(s)Reserve CPUs for specific appsAvoid processes bouncing between CPUsAlso relevant for NUMA systemsProvides extra dials and knobsper zone memory pressure, process migration costs…

Page 21: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Blkio cgroupKeeps track of I/Os for each groupper block deviceread vs writesync vs async

Set throttle (limits) for each groupper block deviceread vs writeops vs bytes

Set relative weights for each groupNote: most writes go through the page cacheso classic writes will appear to be unthrottled at first

Page 22: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Net_cls and net_prio cgroupAutomatically set traffic class or priority,for traffic generated by processes in the groupOnly works for egress trafficNet_cls will assign traffic to a classclass then has to be matched with tc/iptables,otherwise traffic just flows normally

Net_prio will assign traffic to a prioritypriorities are used by queuing disciplines

Page 23: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Devices cgroupControls what the group can do on device nodesPermissions include read/write/mknodTypical use:allow /dev/{tty,zero,random,null} ...deny everything else

A few interesting nodes:/dev/net/tun (network interface manipulation)/dev/fuse (filesystems in user space)/dev/kvm (VMs in containers, yay inception!)/dev/dri (GPU)

Page 24: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Freezer cgroupAllows to freeze/thaw a group of processesSimilar in functionality to SIGSTOP/SIGCONTCannot be detected by the processesthat’s how it differs from SIGSTOP/SIGCONT!

Doesn’t impede ptrace/debuggingSpecific usecasescluster batch schedulingprocess migration

Page 25: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

SubtletiesPID 1 is placed at the root of each hierarchyNew processes start in their parent’s groupsGroups are materialized by pseudo-fstypically mounted in /sys/fs/cgroup

Groups are created by mkdir in the pseudo-fsmkdir /sys/fs/cgroup/memory/somegroup/subcgroup

To move a process:echo $PID > /sys/fs/cgroup/.../tasks

The cgroup wars: systemd vs cgmanager vs ...

Page 26: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

The building blocks:namespaces

Page 27: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

NamespacesProvide processes with their own system viewCgroups = limits how much you can use;namespaces = limits what you can seeyou can’t use/affect what you can’t see

Multiple namespaces:pidnetmntutsipcuser

Each process is in one namespace of each type

Page 28: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Pid namespaceProcesses within a PID namespace only seeprocesses in the same PID namespaceEach PID namespace has its own numberingstarting at 1

If PID 1 goes away, whole namespace is killedThose namespaces can be nestedA process ends up having multiple PIDsone per namespace in which its nested

Page 29: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Net namespace: in theoryProcesses within a given network namespaceget their own private network stack, including:network interfaces (including lo)routing tablesiptables rulessockets (ss, netstat)

You can move a network interface across netnsip link set dev eth0 netns PID

Page 30: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Net namespace: in practiceTypical use-case:use veth pairs(two virtual interfaces acting as a cross-over cable)

eth0 in container network namespacepaired with vethXXX in host network namespaceall the vethXXX are bridged together(Docker calls the bridge docker0)

But also: the magic of --net containershared localhost (and more!)

Page 31: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Mnt namespaceProcesses can have their own root fsconceptually close to chroot

Processes can also have “private” mounts/tmp (scoped per user, per service...)masking of /proc, /sysNFS auto-mounts (why not?)

Mounts can be totally private, or sharedNo easy way to pass along a mountfrom a namespace to another ☹

Page 32: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Uts namespacegethostname / sethostname'nuff said!

Page 33: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Ipc namespace

Page 34: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Ipc namespaceDoes anybody know about IPC?

Page 35: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Ipc namespaceDoes anybody know about IPC?Does anybody care about IPC?

Page 36: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Ipc namespaceDoes anybody know about IPC?Does anybody care about IPC?Allows a process (or group of processes)to have own:IPC semaphoresIPC message queuesIPC shared memory

... without risk of conflict with other instances

Page 37: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

User namespaceAllows to map UID/GID; e.g.:UID 0→1999 in container C1 is mapped toUID 10000→11999 on host;UID 0→1999 in container C2 is mapped toUID 12000→13999 on host;etc.

UID in containers becomes irrelevantjust use UID 0 in the containerit gets squashed to a non-privileged user outside

Security improvementBut: devil is in the detailsJust landed in Docker experimental branch!

Page 38: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Namespace manipulationNamespaces are created with the clone() syscalli.e. with extra flags when creating a new process

Namespaces are materialized by pseudo-files/proc/<pid>/ns

When the last process of a namespace exits,it is destroyedbut can be preserved by bind-mounting the pseudo-file

It’s possible to “enter” a namespace with setns()exposed by the nsenter wrapper in util-linux

Page 39: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

The building blocks:copy-on-write

Page 40: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Copy-on-write storageCreate a new container instantlyinstead of copying its whole filesystem

Storage keeps track of what has changedMany options availableAUFS, overlay (file level)device mapper thinp (block level)BTRFS, ZFS (FS level)

Considerably reduces footprint and “boot” timesSee also: Deep dive into Docker storage drivers

Page 41: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Other details

Page 42: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

OrthogonalityAll those things can be used independentlyIf you only need resource isolation,use a few selected cgroupsIf you want to simulate a network of routers,use network namespacesPut the debugger in a container's namespaces,but not its cgroupsso that the debugger doesn’t steal resources

Setup a network interface in a container, then move it to anotheretc.

Page 43: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Some missing bitsCapabilitiesbreak down "root / non-root" into fine-grained rightsallow to keep root, but without the dangerous bitshowever: CAP_SYS_ADMIN remains a big catchall

SELinux / AppArmor ...containers that actually containdeserve a whole talk on their owncheck https://github.com/jfrazelle/bane(AppArmor profile generator)

Page 44: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Container runtimes

Page 45: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Container runtimesusing cgroups+namespaces

Page 46: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

LXCSet of userland toolsA container is a directory in /var/lib/lxcSmall config file + root filesystemEarly versions had no support for CoWEarly versions had no support to move imagesRequires significant amount of elbow greaseeasy for sysadmins/ops, hard for devs

Page 47: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

systemd-nspawnFrom its manpage:“For debugging, testing and building”“Similar to chroot, but more powerful”“Implements the Container Interface”

Seems to position itself as plumbingRecently added support for ɹǝʞɔop images

#define INDEX_HOST "index.do" /* the URL we get the data from */ "cker.io”#define HEADER_TOKEN "X-Do" /* the HTTP header for the auth token */ "cker-Token:” #define HEADER_REGISTRY "X-Do" /*the HTTP header for the registry */ "cker-Endpoints:”

Page 48: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Docker EngineDaemon controlled by REST(ish) APIFirst versions shelled out to LXCnow uses its own libcontainer runtime

Manages containers, images, builds, and moreSome people think it does too many things

Page 49: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

rkt, runCBack to the basics!Focus on container executionno API, no image management, no build, etc.

They implement different specifications:rkt implements appc (App Container)runC implements OCP (Open Container Project)runC leverages Docker's libcontainer

Page 50: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Which one is best?They all use the same kernel featuresPerformance will be exactly the sameLook at:featuresdesignecosystem

Page 51: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Container runtimesusing other mechanisms

Page 52: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

OpenVZAlso LinuxOlder, but battle-testede.g. Travis CI gives you root in OpenVZ

Tons of neat features tooploop (efficient block device for containers)checkpoint/restore, live migrationvenet (~more efficient veth)

Still developed and maintained

Page 53: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Jails / ZonesFreeBSD / SolarisCoarser granularity of featuresStrong emphasis on securityGreat for hosting providersNot so much for developerswhere's the equivalent of docker run -ti ubuntu?

Illumos branded zones can run Linux binarieswith Joyent Triton, you can run Linux containers on Illumos

This is different from the Oracle Solaris portthe latter will run Solaris binaries, not Linux binaries

Page 54: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Building our own containers

Page 55: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Not for production use!

(For educational purposes only)

55

Page 56: Cgroups, namespaces, and beyond: what are containers made from? (DockerCon Europe 2015)

Thank you!Jérôme🐳🚢📦Petazzoni@[email protected]