1. parallels.com || openvz.org || criu.org. Seven Problems of
Linux Containers. Kir Kolyshkin, 28 April 2013, LinuxFest
Northwest.
2. Seventy Seven Problems of Linux Containers (of which I am going
to cover six). Kir Kolyshkin, 28 April 2013, LinuxFest Northwest.
3. Problem 1: Effective virtualization. Virtualization is
partitioning. Historical way: million-dollar ($M) mainframes.
Modern way: virtual machines. Problem: performance overhead.
Partial solution: hardware support (Intel VT, AMD-V).
4. Solution: isolation. Run many isolated userspace instances on
top of a single (Linux) kernel. All processes see each other's
files, process information, network, shared memory, users, etc.
Make them unsee it!
6. One historical way to unsee: chroot()
7. Namespaces. Implemented in the Linux kernel: PID, net, IPC, UTS,
mnt, user. clone() with CLONE_NEW* flags.
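A minimal sketch of how the six namespace types map onto CLONE_NEW* flags. The flag values are the ones from <linux/sched.h>; the helper names are made up for illustration. Calling clone() directly from Python is awkward, so this sketch swaps in unshare(2), which accepts the same flags, and it assumes Linux with glibc.

```python
import ctypes
import os

# Flag values from <linux/sched.h>; one flag per namespace type on the slide.
CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS
    "UTS": 0x04000000,   # CLONE_NEWUTS
    "IPC": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "PID": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def namespace_mask(names):
    """OR together the CLONE_NEW* flags for the requested namespaces."""
    mask = 0
    for name in names:
        mask |= CLONE_FLAGS[name]
    return mask


def enter_new_namespaces(names):
    """Detach the calling process into fresh namespaces via unshare(2).

    Usually needs root (CAP_SYS_ADMIN); raises OSError otherwise.
    For "PID", only children created afterwards land in the new namespace.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(namespace_mask(names)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

For example, enter_new_namespaces(["UTS"]) run as root gives the process its own hostname, which it can then change without the rest of the system noticing.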
8. Problem 2: Shared resources. All containers share the same set
of resources (CPU, RAM, disk, various kernel things ...). Need fair
distribution of goods so everyone gets their share. Need DoS
prevention. Need prioritization. "All animals are equal, but some
animals are more equal than others" -- George Orwell
10. Solution: OpenVZ resource controls. OpenVZ: user beancounters
(control 20 parameters), hierarchical CPU scheduler, per-container
disk quota, per-container I/O priorities. Dynamic control: limits
can be resized at runtime.
11. Solution: cgroups. Cgroups is a mechanism to control resources
per hierarchical group of processes. Cgroups is nothing without
controllers: blkio, cpu, cpuacct, cpuset, devices, freezer, memory,
net_cls, net_prio. Cgroups are orthogonal to namespaces. Still a
work in progress (kernel memory).
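As a rough illustration of the memory controller, the sketch below creates a group and caps its RAM through the cgroup (v1) filesystem interface. The control-file names memory.limit_in_bytes and tasks are the v1 ones; the function name and the idea of passing the mount point as a parameter are assumptions made for this example, and the mount point itself differs between systems.

```python
import os


def create_memory_group(root, name, limit_bytes, pids=()):
    """Create a group under a cgroup-v1 memory hierarchy and cap its RAM.

    root is the mount point of the memory controller, commonly
    /sys/fs/cgroup/memory (an assumption; it varies between systems).
    """
    group = os.path.join(root, name)
    os.makedirs(group, exist_ok=True)   # on cgroupfs, mkdir creates the group
    with open(os.path.join(group, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    for pid in pids:                    # one PID per write to "tasks"
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(str(pid))
    return group
```

On a real system this needs root, e.g. create_memory_group("/sys/fs/cgroup/memory", "ct101", 512 * 1024 * 1024, [os.getpid()]), where "ct101" is just a hypothetical group name.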
12. Problem 3: easy resources. User Beancounters are complicated
(see http://wiki.openvz.org/UBC_consistency_check). The user has to
set all these parameters, some of which are interdependent. We
created a collection of valid configs, ... wrote a whole book about
UBC ... and a set of tools to help.
14. Solution: VSwap. Only two primary parameters: RAM and swap
(others still exist, but no longer have to be set). Swap is
virtual: no actual I/O is performed, and the container is slowed
down to emulate real swap. Only when an actual global RAM shortage
occurs does virtual swap go into the real swap. Currently only
available in the OpenVZ kernel.
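The two-parameter idea can be pictured with a toy accounting model. This is not how the OpenVZ kernel implements VSwap; the class, its fields and the per-page penalty are all hypothetical, and the global-shortage case (virtual swap going to real swap) is left out.

```python
class VSwapContainer:
    """Toy model: pages beyond the RAM limit are charged to virtual swap,
    which costs a delay instead of disk I/O."""

    def __init__(self, ram_pages, swap_pages, penalty_per_page=0.001):
        self.ram_limit = ram_pages
        self.swap_limit = swap_pages
        self.penalty = penalty_per_page  # hypothetical slowdown, in seconds
        self.ram_used = 0
        self.vswap_used = 0
        self.delay = 0.0                 # total emulated swap latency

    def allocate(self, pages):
        """Charge pages to RAM first, then to virtual swap.

        Returns how many of the pages ended up in virtual swap.
        """
        total = self.ram_used + self.vswap_used + pages
        if total > self.ram_limit + self.swap_limit:
            raise MemoryError("container is over RAM + swap")
        to_ram = min(pages, self.ram_limit - self.ram_used)
        to_swap = pages - to_ram
        self.ram_used += to_ram
        self.vswap_used += to_swap
        self.delay += to_swap * self.penalty   # slow down, no real I/O
        return to_swap
```

The point of the model is only that the administrator picks two numbers, and everything past the RAM limit is paid for in time rather than in disk traffic.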
15. Problem 4: fast live migration. We can migrate an OpenVZ
container from one physical server to another without a shutdown.
We want to do it fast even for huge containers. Huge disk: use
shared storage. Huge RAM: ???
16. Normal migration process (assuming shared storage):
   1) Freeze the container
   2) Dump its complete state to a dump file
   3) Copy dump file to destination server
   4) Undump
   5) Unfreeze
   Problem: huge dump file
17. Solution 1: network swap
   1) Dump the minimal memory, lock the rest
   2) Restore the minimal memory, mark the rest as swapped out
   3) Set up network swap from the source
   4) Unfreeze. Missing RAM will be swapped in
   5) Migrate the rest of RAM and kill it on source
19. Solution 1: network swap
   1) Dump the minimal memory, lock the rest
   2) Copy, undump what we have, mark the rest as swapped out
   3) Set up network swap served from the source
   4) Unfreeze. Missing RAM will be swapped in
   5) Migrate the rest of RAM and kill it on source
   PROBLEM? Reliability, no way to rollback
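The network swap steps, and why there is no rollback, can be sketched with a toy in-memory model. The Source and Destination classes and their methods are hypothetical; real pages travel over the network, here they are just dictionary entries.

```python
class Source:
    """Frozen container on the source host; serves pages on demand."""

    def __init__(self, pages):
        self.pages = dict(pages)   # page number -> contents
        self.alive = True

    def fetch(self, n):
        if not self.alive:
            raise ConnectionError("source is gone, page %d is lost" % n)
        return self.pages.pop(n)


class Destination:
    def __init__(self, source, minimal):
        self.source = source
        # Steps 1-2: only the minimal set is dumped, copied and restored.
        self.resident = {n: source.pages.pop(n) for n in minimal}
        self.faults = 0

    def read(self, n):
        # Step 4: a missing page is "swapped in" from the source.
        if n not in self.resident:
            self.faults += 1
            self.resident[n] = self.source.fetch(n)
        return self.resident[n]

    def finish(self):
        # Step 5: pull everything that is left, then kill the source copy.
        for n in list(self.source.pages):
            self.resident[n] = self.source.fetch(n)
        self.source.alive = False
```

Once the destination is unfrozen, memory is split between two hosts: if the source or the link dies before finish() completes, the remaining pages are unreachable and neither side holds a complete container.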
20. Solution 2: Iterative RAM migration
   1) Ask kernel to track modified pages
   2) Copy all memory to destination system
   3) Ask kernel for list of modified pages
   4) Copy those pages
   5) GOTO 3 until satisfied
   6) Freeze and do migration as usual
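The loop above can be sketched as a small simulation. Everything here is illustrative: memory is a dict, dirty_rounds stands in for the kernel's modified-page tracking, and max_left is a made-up threshold for "until satisfied".

```python
def iterative_migrate(memory, dirty_rounds, max_left=2):
    """Toy pre-copy loop over a dict of page number -> contents.

    dirty_rounds is a hypothetical stand-in for the kernel's modified-page
    tracking: item i holds the writes (page -> new contents) that the
    still-running container makes while round i is being copied.
    Returns (destination copy, pages copied live, pages copied frozen).
    """
    dest = {}
    to_copy = set(memory)             # step 2: first pass copies everything
    copied_live = 0
    for writes in dirty_rounds:
        for n in to_copy:             # steps 2/4: copy while still running
            dest[n] = memory[n]
        copied_live += len(to_copy)
        memory.update(writes)         # container keeps modifying pages
        to_copy = set(writes)         # step 3: list of modified pages
        if len(to_copy) <= max_left:  # step 5: "until satisfied"
            break
    # Step 6: freeze, then copy only what is still dirty.
    for n in to_copy:
        dest[n] = memory[n]
    return dest, copied_live, len(to_copy)
```

With an empty dirty_rounds the function degenerates into the normal migration: nothing is copied live and every page is copied while frozen. The third return value is what the freeze time depends on, and it shrinks as long as the container dirties pages more slowly than they are copied.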
21. Problem 5: upstreaming. OpenVZ was developed separately. Then
we wanted to merge it upstream (i.e. into the vanilla Linux
kernel). Problem?
23. Problem 5: upstreaming. OpenVZ was developed separately. Then
we wanted to merge it upstream (i.e. into the vanilla Linux
kernel). Problem: upstream devs are not accepting our work.
24. Solution 1: rewrite from scratch. User Beancounters -> CGroups.
Did 2 rewrites of the PID namespace until it finally got accepted.
Network namespace redone. It works! About 1500 patches landed in
the vanilla kernel. Parallels made it into the top 10 contributors.
25. Solution 2: CRIU. We tried hard to merge checkpoint/restore.
Other people tried hard too, no luck. Can't make it into the
kernel, so let's go userspace, with minimal kernel intervention
when required. The kernel exports most of the information already,
so let's just add the missing bits and pieces.
26. CRIU: Checkpoint / Restore (mostly) In Userspace. Tools
currently at version 0.4; will do a 1.0 release this year. Kernel
3.8 has about 120 patches from us; 95% of the needed features are
there. Memory snapshot recently made it to the -mm tree.
28. Problem 6: common file system. A container is just a directory
on the host; all CTs reside on the same FS. File system journal is
a bottleneck. Lots of small-file I/O on CT backup. No sub-tree disk
quota support in upstream. No per-container snapshots. Live
migration with rsync: changed inodes. File system type and
properties are fixed.
29. Solution 1: LVM. Works only on top of a block device. Hard to
manage (e.g. how to migrate a huge volume?). No dynamic allocation.
Complicated management.
30. Solution 2: loop device. VFS operations lead to double
page-caching (already fixed in recent kernels). No dynamic
allocation, max space is used. Limited feature set.
31. Solution 3: ploop. Basic idea: same as loop, just better.
Modular design: various image formats (qcow2 in TODO), various I/O
backends. More features: live resize, instant live snapshots, write
tracker to help in live migration.