Docker

An old interview question

• What happens when you open a website?

• https://github.com/alex/what-happens-when

What happens when you start a container with docker?

A simple docker example

    root@boot2docker:/home/docker# ip ad show eth1
    4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
        link/ether 08:00:27:91:99:33 brd ff:ff:ff:ff:ff:ff
        inet 192.168.59.103/24 brd 192.168.59.255 scope global eth1
           valid_lft forever preferred_lft forever
        inet6 fe80::a00:27ff:fe91:9933/64 scope link
           valid_lft forever preferred_lft forever
    root@boot2docker:/home/docker# docker run -d -P redis
    6f858e1563a56574031a61e65fb8ab356752d03440b24d65739eed64f2ef84df
    root@boot2docker:/home/docker# docker ps
    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    6f858e1563a5 redis:latest "/entrypoint.sh redi 3 seconds ago Up 2 seconds 0.0.0.0:49154->6379/tcp kickass_colden
    root@boot2docker:/home/docker# docker run -it --entrypoint /bin/bash redis
    root@63d30ea140b2:/data# redis-cli -h 192.168.59.103 -p 49154
    192.168.59.103:49154> set k 123
    OK
    192.168.59.103:49154> get k
    "123"

What happened here

• We created a container with its own filesystem, network stack, process space, and resource limits.

• We started a redis-server in the container.

• We created another container and ran redis-cli in it to connect to the previous redis-server through the host IP and the mapped port.

How this happened

• What is a redis image? How is it made?

• What is a container? How does it get its own filesystem, network stack, process space, and resource limits?

• How does a container start?

What is a redis image

    FROM dockerfile/ubuntu

    # Install Redis.
    RUN \
      cd /tmp && \
      wget http://download.redis.io/redis-stable.tar.gz && \
      tar xvzf redis-stable.tar.gz && \
      cd redis-stable && \
      make && \
      make install && \
      cp -f src/redis-sentinel /usr/local/bin && \
      mkdir -p /etc/redis && \
      cp -f *.conf /etc/redis && \
      rm -rf /tmp/redis-stable* && \
      sed -i 's/^\(bind .*\)$/# \1/' /etc/redis/redis.conf && \
      sed -i 's/^\(daemonize .*\)$/# \1/' /etc/redis/redis.conf && \
      sed -i 's/^\(dir .*\)$/# \1\ndir \/data/' /etc/redis/redis.conf && \
      sed -i 's/^\(logfile .*\)$/# \1/' /etc/redis/redis.conf

    # Define mountable directories.
    VOLUME ["/data"]

    # Define working directory.
    WORKDIR /data

    # Define default command.
    CMD ["redis-server", "/etc/redis/redis.conf"]

    # Expose ports.
    EXPOSE 6379

Image

• A read-only layer is called an image. An image never changes.

• Each image may depend on one other image, which forms the layer beneath it. We say that the lower image is the parent of the upper image.
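The parent relationship can be pictured as a linked chain of read-only layers. The sketch below is only an illustration: the `Image` type, the `layerChain` helper, and the layer names are hypothetical and are not Docker's own data structures.

```go
package main

import "fmt"

// Image is a hypothetical model of a read-only layer.
// Parent is the layer beneath it; a root image has no parent.
type Image struct {
	ID     string
	Parent *Image
}

// layerChain returns the layer IDs from the given image down to its root image.
func layerChain(img *Image) []string {
	var ids []string
	for cur := img; cur != nil; cur = cur.Parent {
		ids = append(ids, cur.ID)
	}
	return ids
}

func main() {
	// Made-up layer names, used only to show the traversal.
	root := &Image{ID: "base"}
	mid := &Image{ID: "ubuntu", Parent: root}
	top := &Image{ID: "redis", Parent: mid}
	fmt.Println(layerChain(top)) // top layer first, root image last
}
```

Because every layer is immutable, several upper images can share the same parent chain without copying it.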

How to make an image

• Use a Dockerfile

• Use docker commit manually (discouraged)

Create a root image

• https://github.com/docker/docker/blob/master/contrib/mkimage-busybox.sh

• https://github.com/docker/docker/blob/master/docs/articles/baseimages.md

What is a container?

• A Linux container is a copy of a Linux environment stored in a file system. It resembles a jail environment but is built on Linux namespaces: it runs its own init process and has a separate process space, a separate filesystem, and a separate network stack, all virtualized by the root OS running on the hardware.

Concept of image and container

• A Docker image is a layer in the file system

• A container consists of two layers

- Layer one is the init layer, based on the image

- Layer two is the actual container content

[Figure: layer stack. Image (RO): 511136ea3c5a, df7546f9f060, ea13149945cb, 4986bf8c1536. Container (RW): 142b6a3eae40-init, 142b6a3eae40.]

Files created in the init layer: /dev, /dev/console, /dev/shm, /etc, /etc/hostname, /etc/hosts, /etc/mtab -> /proc/mounts

Linux kernel Namespace

• UTS (hostname), Mount (mount points), IPC (System V IPC), User (UIDs), Pid (processes), Net (network stack)

• The kernel namespace API: clone, setns, unshare

• /proc/[pid]/ns/ directory

    $ ls -l /proc/$$/ns
    lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 ipc -> ipc:[4026531839]
    lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 mnt -> mnt:[4026531840]
    lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 net -> net:[4026531956]
    lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 pid -> pid:[4026531836]
    lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 user -> user:[4026531837]
    lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 uts -> uts:[4026531838]
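Each link target has the form type:[inode], and two processes share a namespace when both the type and the inode number match. A minimal sketch of reading such a target follows; `parseNsLink` and `sameNamespace` are hypothetical helper names, and the input strings are taken from the listing above rather than read from /proc.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNsLink splits a link target such as "net:[4026531956]" into the
// namespace type and its inode number.
func parseNsLink(target string) (string, uint64, error) {
	i := strings.Index(target, ":[")
	if i < 0 || !strings.HasSuffix(target, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(target[i+2:len(target)-1], 10, 64)
	if err != nil {
		return "", 0, err
	}
	return target[:i], inode, nil
}

// sameNamespace reports whether two link targets refer to the same namespace.
func sameNamespace(a, b string) bool {
	typeA, inodeA, errA := parseNsLink(a)
	typeB, inodeB, errB := parseNsLink(b)
	return errA == nil && errB == nil && typeA == typeB && inodeA == inodeB
}

func main() {
	kind, inode, err := parseNsLink("net:[4026531956]")
	fmt.Println(kind, inode, err)
	fmt.Println(sameNamespace("pid:[4026531836]", "pid:[4026531836]"))
}
```

On a Linux host the same strings could be obtained with os.Readlink on the entries of /proc/[pid]/ns/.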

setns

• Reassociates a process with a namespace

• int setns(int fd, int nstype);

• nstype: CLONE_NEWIPC / CLONE_NEWNET / CLONE_NEWNS / CLONE_NEWPID / CLONE_NEWUSER / CLONE_NEWUTS

• Each process has a /proc/[pid]/ns/ subdirectory containing one entry for each namespace that supports being manipulated by setns(2)

Join pid namespace

    // joinNS enters every namespace that has a path configured.
    func joinNS(namespaces []configs.Namespace) error {
        for _, ns := range namespaces {
            if ns.Path != "" {
                // Open the /proc/[pid]/ns/* entry and pass its fd to setns.
                f, err := os.OpenFile(ns.Path, os.O_RDONLY, 0)
                if err != nil {
                    return err
                }
                err = system.Setns(f.Fd(), uintptr(ns.Syscall()))
                f.Close()
                if err != nil {
                    return err
                }
            }
        }
        return nil
    }
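The nstype argument passed to setns is one of the CLONE_NEW* flags, chosen according to the namespace being joined. The sketch below shows one way such a lookup could look. The `nstypeFor` helper is hypothetical and is not the libcontainer implementation of `ns.Syscall()`; the numeric values are the ones defined in the Linux headers.

```go
package main

import "fmt"

// Flag values as defined in the Linux <sched.h> header.
const (
	cloneNewNS   = 0x00020000 // mount namespace
	cloneNewUTS  = 0x04000000
	cloneNewIPC  = 0x08000000
	cloneNewUser = 0x10000000
	cloneNewPID  = 0x20000000
	cloneNewNet  = 0x40000000
)

// nstypeFor is a hypothetical helper that maps an entry name under
// /proc/[pid]/ns/ to the nstype flag expected by setns(2).
func nstypeFor(entry string) (uintptr, bool) {
	flags := map[string]uintptr{
		"mnt":  cloneNewNS,
		"uts":  cloneNewUTS,
		"ipc":  cloneNewIPC,
		"user": cloneNewUser,
		"pid":  cloneNewPID,
		"net":  cloneNewNet,
	}
	flag, ok := flags[entry]
	return flag, ok
}

func main() {
	for _, entry := range []string{"ipc", "mnt", "net", "pid", "user", "uts"} {
		flag, _ := nstypeFor(entry)
		fmt.Printf("%-4s -> %#x\n", entry, flag)
	}
}
```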

Storage Driver

• Docker currently implements the vfs, aufs, device mapper, btrfs, overlayfs, and zfs storage drivers.

• A storage driver should have the following features

- Copy on write

- Shared memory cache

• Performance http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/

Aufs

• Works at the file level

• Combines multiple branches in a specific order

• Each branch is just a normal directory

• Opening a file

- Look it up in each branch, starting from the top, and open the first match

- On an attempt to write to it, copy it to the read-write (top) branch first, then open the copy

- That "copy-up" operation can take a while if the file is big!

• Deleting a file

- A whiteout file is created
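The lookup, copy-up, and whiteout rules above can be modelled in a few lines. This is a simplified in-memory sketch, not aufs code: the `branch` and `union` types are hypothetical, and maps stand in for directories.

```go
package main

import "fmt"

// branch is a hypothetical in-memory stand-in for one aufs branch directory.
type branch struct {
	files     map[string]string // path -> content
	whiteouts map[string]bool   // paths hidden from the branches below
}

func newBranch() *branch {
	return &branch{files: map[string]string{}, whiteouts: map[string]bool{}}
}

// union holds branches ordered top first; branches[0] is the read-write branch.
type union struct {
	branches []*branch
}

// read looks the path up in each branch from the top and returns the first hit.
// A whiteout stops the search so that lower copies stay hidden.
func (u *union) read(path string) (string, bool) {
	for _, b := range u.branches {
		if content, ok := b.files[path]; ok {
			return content, true
		}
		if b.whiteouts[path] {
			return "", false
		}
	}
	return "", false
}

// appendTo modifies a file. If the file only exists in a lower branch it is
// first copied up to the top branch, so lower branches never change.
func (u *union) appendTo(path, data string) bool {
	top := u.branches[0]
	if _, ok := top.files[path]; !ok {
		content, found := u.read(path)
		if !found {
			return false
		}
		top.files[path] = content // copy-up
	}
	top.files[path] += data
	return true
}

// remove hides the path by recording a whiteout in the top branch.
func (u *union) remove(path string) {
	top := u.branches[0]
	delete(top.files, path)
	top.whiteouts[path] = true
}

func main() {
	lower := newBranch()
	lower.files["/etc/motd"] = "hello"
	u := &union{branches: []*branch{newBranch(), lower}}

	u.appendTo("/etc/motd", " world")
	fmt.Println(u.read("/etc/motd"))      // content now served from the top branch
	fmt.Println(lower.files["/etc/motd"]) // lower branch is unchanged

	u.remove("/etc/motd")
	fmt.Println(u.read("/etc/motd")) // hidden by the whiteout
}
```

The cost noted on the slide shows up in `appendTo`: the whole file is copied before the first byte is changed.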

Device Mapper

• Works at the block level

• Each container and each image gets its own block device

• At any given time, it is possible to take a snapshot of a container or an image

• The data and metadata files are sparse files

• Putting the data on a real disk is recommended

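[Figure: the data and metadata files are attached through loop devices (loop0) and back the pool /dev/mapper/docker-{major}:{minor}-{inode}-pool, from which volume1 and volume2 are allocated.]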

How to make its own filesystem

1. mount every parent layer and rw layer diff/$cid-init on mnt/$cid-init

2. make extra files, dir, links in mnt/$cid-init

3. mount every parent layer and rw layer diff/$cid and ro layer diff/$cid-init on mnt/$cid

4. setns to join existing mount namespace

5. mount proc/sysfs/tmpfs/cgroup…

6. create devices, setup dev symlinks, init filesystem

7. chdir diff/$cid && chroot .

note: steps 4 to 7 (from setns onward) are done by the init process, the others by the docker daemon.

more in rootfs_linux.go
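For the aufs driver, steps 1 and 3 come down to mounting a stack of directories with a "br:" option that lists the read-write branch first and the read-only branches after it. The sketch below only builds that option string. `aufsBranchOption` is a hypothetical helper, the "=rw" and "=ro+wh" branch flags follow aufs conventions, and the order of the parent layers is an assumption.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// aufsBranchOption builds an aufs "br:" mount option: the read-write layer
// first, followed by the read-only layers from top to bottom.
func aufsBranchOption(diffRoot, rwLayer string, roLayers []string) string {
	var b strings.Builder
	b.WriteString("br:" + path.Join(diffRoot, rwLayer) + "=rw")
	for _, id := range roLayers {
		b.WriteString(":" + path.Join(diffRoot, id) + "=ro+wh")
	}
	return b.String()
}

func main() {
	diff := "/var/lib/docker/aufs/diff"
	// Step 3: the container layer is read-write, its init layer and the
	// image layers beneath it are read-only (layer order assumed here).
	fmt.Println(aufsBranchOption(diff, "142b6a3eae40",
		[]string{"142b6a3eae40-init", "4986bf8c1536"}))
}
```

The resulting string would be passed as the data argument of a mount call whose target is mnt/$cid.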

[Figure: /var/lib/docker/aufs/diff holds the layers 511136ea3c5a, df7546f9f060, ea13149945cb, 4986bf8c1536, 142b6a3eae40-init, and 142b6a3eae40; /var/lib/docker/aufs/mnt holds the mount point 142b6a3eae40.]

Network mode

• Docker supports bridge/none/container/host mode

• How does bridge mode work?

Bridge mode

1. create the docker0 bridge, add eth1 to docker0, set up the docker0 iptables rules

2. create a veth pair, attach one end to docker0, put the other end into the container's network namespace

3. allocate a free ip

4. set up iptables rules and the userland proxy

5. setns to join the existing network namespace

6. change the name of the veth device to eth1 in the container

7. set the mac address, ip, and mtu of the veth device

8. set up the default gateway and route

note: steps 5 to 8 (from setns onward) are done by the init process, the others by the docker daemon.

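[Figure: the host has the physical device eth1 (10.27.149.90) and the bridge docker0 (172.17.42.1). container0 has eth1 (172.17.0.4), connected through the veth device vethdb6e696; container1 has eth1 (172.17.0.5), connected through the veth device veth8df64b7. Legend: veth device, bridge, physical device.]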

Consistent MAC address

• Docker generates the MAC address of a veth device deterministically from its IP address, so a given IP address always gets the same MAC address.

• This avoids ARP cache issues

    func generateMacAddr(ip net.IP) net.HardwareAddr {
        hw := make(net.HardwareAddr, 6)
        // The first byte of the MAC address has to comply with these rules:
        // 1. Unicast: Set the least-significant bit to 0.
        // 2. Address is locally administered: Set the second-least-significant bit (U/L) to 1.
        // 3. As "small" as possible: The veth address has to be "smaller" than the bridge address.
        hw[0] = 0x02
        // The first 24 bits of the MAC represent the Organizationally Unique Identifier (OUI).
        // Since this address is locally administered, we can do whatever we want as long as
        // it doesn't conflict with other addresses.
        hw[1] = 0x42
        // Insert the IP address into the last 32 bits of the MAC address.
        // This is a simple way to guarantee the address will be consistent and unique.
        copy(hw[2:], ip.To4())
        return hw
    }
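As a worked example, the function can be run on the two container addresses from the bridge diagram. The body of `generateMacAddr` is repeated from the slide without its comments; `macForIP` is a hypothetical wrapper added only so the example works with strings.

```go
package main

import (
	"fmt"
	"net"
)

// generateMacAddr is the function from the slide: a fixed 02:42 prefix
// followed by the four bytes of the IPv4 address.
func generateMacAddr(ip net.IP) net.HardwareAddr {
	hw := make(net.HardwareAddr, 6)
	hw[0] = 0x02
	hw[1] = 0x42
	copy(hw[2:], ip.To4())
	return hw
}

// macForIP is a hypothetical convenience wrapper around generateMacAddr.
func macForIP(ip string) string {
	return generateMacAddr(net.ParseIP(ip)).String()
}

func main() {
	fmt.Println(macForIP("172.17.0.4")) // 02:42:ac:11:00:04
	fmt.Println(macForIP("172.17.0.5")) // 02:42:ac:11:00:05
}
```

The last four bytes are simply the IP address in hexadecimal (172 = 0xac, 17 = 0x11), so a container that is restarted with the same IP address keeps the same MAC address.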

Port Mapping

• The docker daemon uses a map to record port and ip mappings

• Connections from the local subnet

- userland proxy: docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 49153 -container-ip 172.17.0.2 -container-port 6379

- Hairpin NAT (new docker versions)

- enable /sys/class/net/$vethname/brport/hairpin_mode

• Connections to and from other hosts

- iptables -I POSTROUTING -t nat -s 172.17.42.1/16 ! -o docker0 -j MASQUERADE

- iptables -t nat -A DOCKER -p tcp -d 0/0 --dport 49153 ! -i docker0 -j DNAT --to-destination 172.17.0.2:6379
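A minimal sketch of such a mapping table follows, together with a helper that renders the DNAT rule arguments in the form shown above. The `mapping` and `portMapper` types and the `dnatArgs` helper are hypothetical names for illustration and do not reflect the daemon's actual code.

```go
package main

import "fmt"

// mapping describes one published port (hypothetical type).
type mapping struct {
	proto         string
	hostPort      int
	containerIP   string
	containerPort int
}

// portMapper records which host ports are taken, keyed by "proto/port".
type portMapper struct {
	byHostPort map[string]mapping
}

func newPortMapper() *portMapper {
	return &portMapper{byHostPort: map[string]mapping{}}
}

// add registers a mapping and rejects a host port that is already in use.
func (pm *portMapper) add(m mapping) error {
	key := fmt.Sprintf("%s/%d", m.proto, m.hostPort)
	if _, used := pm.byHostPort[key]; used {
		return fmt.Errorf("host port %s is already mapped", key)
	}
	pm.byHostPort[key] = m
	return nil
}

// dnatArgs renders the iptables arguments of the DNAT rule for a mapping.
func dnatArgs(m mapping, bridge string) string {
	return fmt.Sprintf("-t nat -A DOCKER -p %s -d 0/0 --dport %d ! -i %s -j DNAT --to-destination %s:%d",
		m.proto, m.hostPort, bridge, m.containerIP, m.containerPort)
}

func main() {
	pm := newPortMapper()
	m := mapping{proto: "tcp", hostPort: 49153, containerIP: "172.17.0.2", containerPort: 6379}
	fmt.Println(pm.add(m)) // first registration succeeds
	fmt.Println(pm.add(m)) // second registration is rejected
	fmt.Println("iptables", dnatArgs(m, "docker0"))
}
```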

Cgroups support by docker

• cgroup components: cpuset, cpu, cpuacct, memory, devices, freezer, net_cls, blkio

• docker run options: --memory, --cpuset, --cpu-shares, --device

• docker pause/unpause

• After starting the background “docker native” process, the docker daemon adds its pid to the cgroup directories and writes the limits to files such as /cgroup/memory/docker/$cid/memory.limit_in_bytes
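The cgroup interface is a plain file system, so applying a memory limit amounts to creating a directory and writing two small files. The sketch below shows that under stated assumptions: `joinMemoryCgroup` and `runDemo` are hypothetical names, the pid is assumed to go into the cgroup.procs file, and the demo writes under a temporary directory instead of a real cgroup mount.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// joinMemoryCgroup creates <root>/memory/docker/<cid>, writes the memory
// limit, and adds the pid to the group. Returns the cgroup directory.
func joinMemoryCgroup(root, cid string, pid int, limitBytes int64) (string, error) {
	dir := filepath.Join(root, "memory", "docker", cid)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	limit := strconv.FormatInt(limitBytes, 10)
	if err := os.WriteFile(filepath.Join(dir, "memory.limit_in_bytes"), []byte(limit), 0o644); err != nil {
		return "", err
	}
	// Assumption: membership is recorded by writing the pid to cgroup.procs.
	if err := os.WriteFile(filepath.Join(dir, "cgroup.procs"), []byte(strconv.Itoa(pid)), 0o644); err != nil {
		return "", err
	}
	return dir, nil
}

// runDemo applies the settings under a temporary directory (standing in for
// the cgroup mount point) and reads both files back.
func runDemo(pid int, limitBytes int64) (string, string, error) {
	root, err := os.MkdirTemp("", "cgroup-sketch")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)
	dir, err := joinMemoryCgroup(root, "6f858e1563a5", pid, limitBytes)
	if err != nil {
		return "", "", err
	}
	limit, err := os.ReadFile(filepath.Join(dir, "memory.limit_in_bytes"))
	if err != nil {
		return "", "", err
	}
	procs, err := os.ReadFile(filepath.Join(dir, "cgroup.procs"))
	if err != nil {
		return "", "", err
	}
	return string(limit), string(procs), nil
}

func main() {
	limit, procs, err := runDemo(os.Getpid(), 64<<20) // 64 MiB limit
	fmt.Println(limit, procs, err)
}
```

Against a real cgroup hierarchy the same function would be called with the mount point as root, for example /cgroup as in the path on the slide.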

How a container starts

1. create a socketpair and start a background child process “docker native”

2. create network devices and apply cgroup settings

3. send the configuration to “docker native”

4. receive any error message, wait for “docker native” to exit

5. “docker native” receives config and env from the socketpair

6. “docker native” joins the existing namespaces with the fds in /proc/$pid/ns/*

7. init file system…

8. exec entrypoint

“docker native” is the init process in the container
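The exchange between the daemon and “docker native” can be sketched as a small request and reply over a connected pair. This is only an illustration of the message flow: the `initConfig` type and its JSON encoding are assumptions, an in-memory net.Pipe stands in for the socketpair, and a goroutine stands in for the child process.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// initConfig is a hypothetical, much reduced container configuration.
type initConfig struct {
	Entrypoint []string `json:"entrypoint"`
	Env        []string `json:"env"`
}

// runInit plays the role of "docker native": it reads the config from its end
// of the pair and sends back an error message, which is empty on success.
func runInit(conn net.Conn) {
	defer conn.Close()
	var cfg initConfig
	msg := ""
	if err := json.NewDecoder(conn).Decode(&cfg); err != nil {
		msg = err.Error()
	} else if len(cfg.Entrypoint) == 0 {
		msg = "no entrypoint configured"
	}
	// A real init process would now join the namespaces, set up the
	// filesystem, and exec the entrypoint.
	json.NewEncoder(conn).Encode(msg)
}

// startContainer plays the role of the daemon: it starts the init side,
// sends the configuration, and waits for the error report.
func startContainer(cfg initConfig) (string, error) {
	parent, child := net.Pipe()
	defer parent.Close()
	go runInit(child)
	if err := json.NewEncoder(parent).Encode(cfg); err != nil {
		return "", err
	}
	var msg string
	if err := json.NewDecoder(parent).Decode(&msg); err != nil {
		return "", err
	}
	return msg, nil
}

func main() {
	msg, err := startContainer(initConfig{
		Entrypoint: []string{"redis-server", "/etc/redis/redis.conf"},
		Env:        []string{"HOME=/"},
	})
	fmt.Printf("error report: %q, transport error: %v\n", msg, err)
}
```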

[Figure: sequence diagram with client, daemon, “docker native”, and entrypoint; arrows labeled create, start, start, config, errors, exec.]