Yifan Gu
github.com/yifan-gu
@yifan7702
With containers, what does a "Linux Distro" mean?
[diagram: the traditional stack, where the "distro" spans everything from KERNEL / SYSTEMD / SSH up through PYTHON / JAVA / NGINX / MYSQL / OPENSSL]
[diagram: the container stack, where the host "distro" shrinks to KERNEL / SYSTEMD / SSH plus LXC / DOCKER / RKT, while each APP carries its own PYTHON / JAVA / NGINX / MYSQL / OPENSSL]
The Bad
$ python --version
Python 2.7.6
$ python app-requiring-python3.py

$ python --version
Python 3.4.3
$ python app-requiring-python2.py
package collisions
The Bad
$ cat /etc/os-release | grep ^NAME=
NAME=Fedora
$ rpm -i package-from-suse.rpm
file /foo from install of package-from-suse.rpm conflicts with file from package-from-fedora
dependency namespacing
The Good
$ gpg --list-only --import \
    /etc/apt/trusted.gpg.d/*
gpg: key 2B90D010: public key "Debian Archive Automatic Signing Key (8/jessie) <[email protected]>" imported
gpg: Total number processed: 1
gpg: imported: 1 (RSA: 1)
gpg: no ultimately trusted keys found
users control trust
The Good
$ rsync ftp.us.debian.org::debian \
    /srv/mirrors/debian
$ dpkg -i \
    /srv/mirrors/debian/kernel-image-3.16.0-4-amd64-di_3.16.7-ckt9-2_amd64.udeb
trivial mirroring and hosting
Linux Packages 2.0
.deb and .rpm for containers
Container vs VM?
● Lightweight (100s vs 10s)
● Easy to deploy
● Less isolation?
What is a container?
● Packaging your apps with their deps
● Running in isolation (using namespaces, cgroups)

Why do I want to use it?
● Deploy faster
● Run faster, run everywhere
● Run in isolation
App Container (appc)
github.com/appc
appc != rkt
Application Containers
self-contained, portable (decoupled from the operating system)
isolated (memory, network, …)
appc principles
Why are we doing this?
Open
Independent GitHub organisation
Contributions from Cloud Foundry, Mesosphere, Google, Red Hat (and many others!)
Simple but efficient
Simple to understand and implement, but with an eye to optimisation (e.g. content-based caching)
Secure
Cryptographic image addressing
Image signing and encryption
Container identity
Standards-based
Well-known tools (tar, gzip, gpg, http), extensible with modern technologies (bittorrent, xz)
Composable
Integrate with existing systems
Non-prescriptive about build workflows
OS/architecture agnostic
appc components
Image Format
Application Container Image (ACI)
tarball of rootfs + manifest
uniquely identified by an ImageID (hash)
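Content addressing means the ImageID is just a hash over the image archive. A minimal sketch, assuming the sha512 convention the appc spec uses:

```python
import hashlib

def image_id(aci_bytes: bytes) -> str:
    # An ImageID is the hash algorithm name plus the hex digest of
    # the image archive, e.g. "sha512-abc123...".
    return "sha512-" + hashlib.sha512(aci_bytes).hexdigest()

# Identical bytes always yield the same ImageID, which is what
# makes content-based caching possible.
```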
Image Discovery
App name → artifact
example.com/http-server
coreos.com/etcd
HTTPS + HTML
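Simple discovery works by fetching an HTML page over HTTPS and reading ac-discovery meta tags, which carry a URL template. A rough sketch of the template-expansion step only (the template URL and default field values here are made up; real templates come from the discovered meta tag):

```python
def expand_discovery_template(template, name, version="latest",
                              os="linux", arch="amd64", ext="aci"):
    # Fill in the placeholder fields an ac-discovery template exposes.
    for key, value in (("name", name), ("version", version),
                       ("os", os), ("arch", arch), ("ext", ext)):
        template = template.replace("{" + key + "}", value)
    return template

url = expand_discovery_template(
    "https://example.com/images/{name}-{version}-{os}-{arch}.{ext}",
    "http-server")
# url == "https://example.com/images/http-server-latest-linux-amd64.aci"
```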
Executor (Runtime)
grouped applications
runtime environment
isolators
networking
Metadata Service
http://$AC_METADATA_URL/acMetadata
container metadata
container identity (HMAC verification)
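The identity endpoint lets an app prove which pod it is: the metadata service holds a per-pod secret and returns an HMAC over caller-supplied content. A toy sketch of that signing scheme (the secret value and hash choice are illustrative, not the service's actual parameters):

```python
import hashlib
import hmac

# Hypothetical per-pod secret: the real metadata service keeps this
# internal and never hands it to the pod.
POD_SECRET = b"example-pod-secret"

def sign(content: bytes) -> str:
    # Return an HMAC over the content, proving it was signed on
    # behalf of this particular pod.
    return hmac.new(POD_SECRET, content, hashlib.sha512).hexdigest()

def verify(content: bytes, signature: str) -> bool:
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(sign(content), signature)
```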
ACE validator
Is this executor compliant with the spec?
$EXECUTOR run ace_validator.aci
appc community
github.com/cdaylward/libappc
C++ library for working with app containers
github.com/cdaylward/nosecone
C++ executor for running app containers
mesos (wip)
https://issues.apache.org/jira/browse/MESOS-2162
github.com/3ofcoins/jetpack
FreeBSD Jails/ZFS-based executor (by @mpasternacki)
github.com/sgotti/acido
ACI toolkit (build ACIs from ACIs)
github.com/appc/docker2aci
docker2aci busybox/latest
docker2aci quay.io/coreos/etcd
github.com/appc/goaci
goaci github.com/coreos/etcd
appc spec in a nutshell
- Image Format (ACI): what does an application consist of?
- Image Discovery: how can an image be located?
- Pods: how can applications be grouped and run?
- Executor (runtime): what does the execution environment look like?
appc status
Stabilising towards first backwards-compatible release
github.com/coreos/rkt
rkt
an implementation of appc
Open standards. Composability.
rkt
rkt
a modern, secure container runtime

rkt
simple CLI tool
golang + Linux
self-contained
init system/distro agnostic

simple CLI tool
no daemon, no API*
apps run directly under the spawning process
[diagram: rkt and its application(s) run directly under whatever process invokes them: bash, runit, or systemd]
rkt internals
modular architecture
execution divided into stages: stage0 → stage1 → stage2
[diagram: invoking process (bash/runit/systemd/...) → rkt (stage0) → pod (stage1) → app1, app2 (stage2)]
stage0 (rkt binary)
discover, fetch, manage application images
set up pod filesystems
commands to manage pod lifecycle
stage0 (rkt binary)
- rkt run
- rkt prepare
- rkt run-prepared
- rkt list
- rkt status
- ...
- rkt fetch
- rkt trust
- rkt image list
- rkt image export
- rkt image gc
- ...
stage0 (rkt binary)
file-based locking for concurrent operation (e.g. rkt gc, rkt list for pods)
database + reference counting for images
stage1
execution environment for pods
app process lifecycle management
isolators

stage1 (swappable)
binary ABI with stage0
stage0 calls execve(stage1)
stage1 (swappable)
● default implementation
  ○ based on systemd-nspawn + systemd
  ○ Linux namespaces + cgroups for isolation
● kvm implementation
  ○ based on lkvm + systemd
  ○ hardware virtualisation for isolation
● others?
stage2
actual app execution
independent filesystems (chroot)
shared namespaces, volumes, IPC, ...
rkt + systemd
The different ways rkt integrates with systemd
[diagram: rkt running as a unit under the host systemd (systemctl)]

systemd (on host)
optional
"systemctl stop" just works
socket activation
pod-level isolators: CPUShares, MemoryLimit
[diagram: host systemd (systemctl) → rkt → systemd-nspawn]

systemd-nspawn
default stage1, besides lkvm
takes care of most of the low-level things
[diagram: host systemd (systemctl) → rkt → systemd-nspawn → systemd inside the container]

systemd
pid 1 inside the container
service files
socket activation
[diagram: host systemd (systemctl) → rkt → systemd-nspawn → systemd → application inside the container]

application
app-level isolators: CPUShares, MemoryLimit
chrooted
[diagram: systemd-journald inside the container collects the application's logs; journalctl works from the host]

systemd-journald
no changes in apps required
logs live in the container
available from the host with journalctl -m / -M
[diagram: the container's systemd registers with systemd-machined on the host (machinectl)]

systemd-machined
registers the container on distros using systemd
machinectl {show,status,poweroff…}
cgroups
What's a control group (cgroup)?
● group processes together
● organised in trees
● apply limits to them as a group
cgroup API
/sys/fs/cgroup/*
/proc/cgroups
/proc/$PID/cgroup
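These files are plain text and easy to inspect programmatically. A small sketch of parsing the /proc/$PID/cgroup format, where each line reads "hierarchy-id:controller-list:cgroup-path" (the sample string below is made up):

```python
def parse_proc_cgroup(text: str) -> dict:
    # Map each controller name to the cgroup path the process is in.
    entries = {}
    for line in text.strip().splitlines():
        _hier_id, controllers, path = line.split(":", 2)
        for ctrl in controllers.split(","):
            entries[ctrl] = path
    return entries

# Made-up sample in the real file's format.
SAMPLE = """\
4:memory:/machine.slice
2:cpu,cpuacct:/machine.slice
1:name=systemd:/machine.slice/machine-rkt-test.scope"""
```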
List of cgroup controllers
/sys/fs/cgroup/
├─ cpu
├─ devices
├─ freezer
├─ memory
├─ ...
└─ systemd
How systemd units use cgroups
/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  │  ├─ NetworkManager.service
│  │  │  └─ cgroup.procs
│  │  ...
│  └─ machine.slice
How systemd units use cgroups w/ containers
/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ cpu
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ memory
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
...
cgroups mounted in the container
[diagram: the pod's own cgroup directories are mounted RW, the rest of the hierarchy RO]
Example: memory isolator

Application Image Manifest:
"limit": "500M"

systemd service file:
[Service]
ExecStart=
MemoryLimit=500M

systemd action:
write to memory.limit_in_bytes
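The translation from the manifest's human-readable limit to the value written into memory.limit_in_bytes can be sketched like this (assuming binary units, i.e. "M" = 2^20 bytes; the helper name is ours):

```python
# Suffix multipliers, assuming binary (IEC-style) units.
_SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_mem_limit(limit: str) -> int:
    # "500M" -> 500 * 2^20 bytes, the kind of value that ends up
    # in memory.limit_in_bytes.
    if limit and limit[-1] in _SUFFIXES:
        return int(limit[:-1]) * _SUFFIXES[limit[-1]]
    return int(limit)  # bare numbers are already bytes

parse_mem_limit("500M")  # -> 524288000
```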
Example: CPU isolator

Application Image Manifest:
"limit": "500m"

systemd service file:
[Service]
ExecStart=
CPUShares=512

systemd action:
write to cpu.shares
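Here the unit conversion is less obvious: the manifest expresses CPU in milli-cores, while cgroups use shares, with one full core conventionally equal to 1024 shares. A sketch of that mapping (the helper name is ours):

```python
def to_cpu_shares(limit: str) -> int:
    # "500m" = 500 milli-cores; a full core maps to the cgroup
    # default of 1024 shares, so 500m comes out as 512.
    milli = int(limit.rstrip("m"))
    return milli * 1024 // 1000

to_cpu_shares("500m")  # -> 512
```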
Unified cgroup hierarchy
● Multiple hierarchies:
  ○ one cgroup mount point for each controller (memory, cpu, etc.)
  ○ flexible but complex
  ○ cannot remount with a different set of controllers
  ○ difficult to give to containers in a safe way
● Unified hierarchy:
  ○ cgroup filesystem mounted only once
  ○ still in development in Linux: mount with option "__DEVEL__sane_behavior"
  ○ initial implementation in systemd v226 (September 2015)
  ○ no support in rkt yet
rkt: a few other things
- rkt and security
- rkt API service (new!)
- rkt networking
- rkt and user namespaces
- rkt and production
rkt and security
"secure by default"
rkt security
- image signature verification
- privilege separation
  - e.g. fetch images as a non-root user
- SELinux integration
- kernel keyring integration (soon)
- lkvm stage1 for true hardware isolation
rkt API service (new!)
optional, gRPC-based API daemon
exposes information on pods and images
runs as an unprivileged user
easier integration with other projects
rkt networking
plugin-based
Container Networking Interface (CNI)
[diagram: the container runtime (e.g. rkt) talks to the Container Networking Interface (CNI), which drives plugins: veth, macvlan, ipvlan, OVS]
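CNI plugins are configured with small JSON files; a minimal example of the kind of configuration the stock bridge plugin consumes (the network name, bridge name, and subnet here are placeholders):

```json
{
    "name": "mynet",
    "type": "bridge",
    "bridge": "cni0",
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "subnet": "10.22.0.0/16"
    }
}
```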
Networking, the rkt way
Network tooling
● Linux can create pairs of virtual net interfaces
● These can be linked into a bridge
[diagram: each container's eth0 is one end of a veth pair; veth1 and veth2 join a bridge on the host, with IP masquerading via iptables out of the host's eth0]
rkt and user namespaces
History of Linux namespaces
✓ 1991: Linux
✓ 2002: namespaces in Linux 2.4.19
✓ 2008: LXC
✓ 2011: systemd-nspawn
✓ 2013: user namespaces in Linux 3.8
✓ 2013: Docker
✓ 2014: rkt
… development still active
Why user namespaces?
● Better isolation
● Run applications which would otherwise need more capabilities
● Per-user limits
● Future?
  ○ unprivileged containers: the possibility to run containers without root
User ID ranges
[diagram: the host owns the full 32-bit UID range 0–4,294,967,295; container 1 and container 2 each see UIDs 0–65535, mapped onto disjoint host ranges, with the rest unmapped]
User ID mapping
/proc/$PID/uid_map: "0 1048576 65536"
[diagram: container UIDs 0–65535 map to host UIDs 1048576–1114111; all other UIDs are unmapped on both sides]
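The uid_map format shown above ("inside-start outside-start count") is simple to resolve by hand. A small sketch of translating a container UID to its host UID under such a map:

```python
def map_uid(uid_map: str, container_uid: int):
    # Each uid_map line is "inside-start outside-start count".
    for line in uid_map.strip().splitlines():
        inside, outside, count = map(int, line.split())
        if inside <= container_uid < inside + count:
            return outside + (container_uid - inside)
    return None  # UID is unmapped in this namespace

# The mapping from the slide: container 0-65535 -> host 1048576-1114111.
map_uid("0 1048576 65536", 0)  # -> 1048576
```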
Problems with container images
[diagram: container 1 and container 2 each layer an overlayfs "upper" directory on top of a shared Application Container Image (ACI) downloaded from a web server]
● file UIDs / GIDs
● rkt currently only supports user namespaces without overlayfs
  ○ performance loss: no COW from overlayfs
  ○ "chown -R" for every file in each container
Problems with volumes
[diagram: a host directory /data is bind-mounted (rw / ro) into /my-app in several containers]
● mounted in several containers
● no UID translation
● dynamic UID maps
User namespace and filesystem problem
● Possible solution: add options to mount() to apply a UID mapping
● rkt would use it when mounting:
  ○ the overlay rootfs
  ○ volumes
● Idea suggested on kernel mailing lists
rkt and production
- still pre-1.0
- unstable (but stabilising) CLI and API
- explicitly not recommended for production
  - although some early adopters exist
rkt v1.0.0
EOY (fingers crossed)
stable APIstable CLI
ready to use!
Questions?
github.com/coreos/rkt
coreos.com/careers (soon in Berlin!)
Join us!