Sharif University of Technology, Data and Network Security Lab. Lightweight Virtualization in Linux. Sadegh Dorri N., PhD Candidate. Data and Network Security Lab. Seminar, 4 Aban 1393

Lightweight Virtualization in Linux


Page 1: Lightweight Virtualization in Linux

In the name of God, the Most Gracious, the Most Merciful

Sharif University of Technology
Data and Network Security Lab.

Lightweight Virtualization in Linux

Sadegh Dorri N.
PhD Candidate

Data and Network Security Lab. Seminar, 4 Aban 1393

Page 2: Lightweight Virtualization in Linux

The Need for Virtualization

Hypervisors are the living proof of operating systems' incompetence!

Scheduling a multi-process “application”
- Nice values, priorities, etc. are hard to manage dynamically

Kernel memory management
- Fork bombs
- $ while true; do mkdir x; cd x; done

Abuse should be the application's problem, rather than being everyone's!

The failure of operating systems and how we can fix it: http://lwn.net/Articles/524952/

Page 3: Lightweight Virtualization in Linux

Agenda

Motivation
- Virtualization architectures
- OS-level virtualization in Linux

A demo

Under the hood
- LXC components
- Related kernel features: cgroups and namespaces

Security considerations

Conclusion

Page 4: Lightweight Virtualization in Linux

Various Virtualization Architectures

Page 5: Lightweight Virtualization in Linux

Hardware Virtualization

VMware, Parallels, QEMU, Bochs, Xen, KVM

Resources cannot be shared between VMs.

Page 6: Lightweight Virtualization in Linux

OS-Level Virtualization

Linux Containers (LXC), Linux-VServer, OpenVZ, Parallels Virtuozzo Containers

FreeBSD jails

Solaris Containers/Zones

IBM AIX6 WPARs (Workload Partitions)

Page 7: Lightweight Virtualization in Linux

OS-Level Virtualization in Linux

Linux Containers
- Allow a kernel to support more resource-isolation use-cases
- Without the overhead and complexity of running multiple kernel and driver instances

Benefits
- Isolation
- Small footprint
- Speed

Page 8: Lightweight Virtualization in Linux

3) Speed

Page 9: Lightweight Virtualization in Linux

2) Footprint

On a typical physical server, with average compute resources, you can easily run:
- 10-100 virtual machines
- 100-1000 containers

On disk, containers can be very light.
- A few MB, even without fancy storage.

Page 10: Lightweight Virtualization in Linux

1) Isolation

Each container has:

Its own network interface (and IP address)
- Can be bridged, routed... just like VMs

Its own filesystem
- A Debian host can run a Fedora container (and vice versa)

Isolation (security)
- Containers A and B can't harm (or even see) each other

Isolation (resource usage)
- Soft and hard quotas for RAM, CPU, I/O...

Possibility of process checkpoint/freeze and migration
- Isolation prevents resource name conflicts

Page 11: Lightweight Virtualization in Linux

Use-Cases: Developers

Continuous Integration
- After each commit, run 100 tests in 100 environments

Continuous Packaging
- Example: Project Builder

Escape dependency hell
- Build (and/or run) in a controlled environment

Put everything in a container
- Even the tiny things

Page 12: Lightweight Virtualization in Linux

Use-Cases: Hosting Providers

Cheaper hosting (VPS providers)

Give away more free stuff
- "Pay for your production, get your staging for free!"
- Spin up/down on demand, in seconds
- Example: dotCloud

“Google has built their entire datacenter infrastructure around Linux containers, launching more than 2 billion containers per week.”

(Kubernetes: open source Google cloud platform)

Page 13: Lightweight Virtualization in Linux

Use-Cases: Everyone

Look inside your VMs
- You can see (and kill) individual processes
- You can browse (and change) the filesystem

Do (almost) whatever you did with VMs
- ... But faster

Migration
- Checkpoint then unfreeze: experimental (CRIU)

Page 14: Lightweight Virtualization in Linux

Solutions in Linux

Page 15: Lightweight Virtualization in Linux

OpenVZ

Modified Linux kernel
- Also works with unpatched Linux 3.x (reduced feature set)

Each container is a separate entity with its own:
- Files: system libraries, applications, virtualized /proc and /sys, virtualized locks, etc.
- Users and groups: its own root user, as well as other users and groups.
- Process tree: only sees its own processes (incl. init)
- Network: virtual network device with its own IP addresses, iptables, and routing rules.
- Devices: can be granted access to real devices.
- IPC objects: shared memory, semaphores, messages.

Page 16: Lightweight Virtualization in Linux

LXC (LinuX Containers)

Container:
- Provides an environment like a standard Linux installation, but without the need for a separate kernel.
- Single kernel and drivers, multiple different user spaces

A group of processes in Linux in an isolated environment.
- From inside: looks like a VM
- From outside: looks like normal processes
- Something (conceptually) in the middle between a chroot on steroids and a full-fledged VM

LXC vs. OpenVZ
- OpenVZ: production ready and stable; its changes are being pushed upstream
- LXC: a work in progress; uses standard kernel features

Page 17: Lightweight Virtualization in Linux

LXC Lifecycle

lxc-create
- Set up a container (root filesystem and config)

lxc-start
- Boot the container (by default, you get a console)

lxc-console
- Attach a console (if you started in the background)

lxc-stop
- Shut down the container

lxc-destroy
- Destroy the filesystem created with lxc-create

See also: LXC Web Panel - http://lxc-webpanel.github.io/
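
The lifecycle above can be sketched as a short shell script. This is only an illustration under stated assumptions: the container name `demo` and the `ubuntu` template are placeholders (available templates vary by LXC version and distribution), and the `run` helper with its `DRY_RUN` switch is a hypothetical wrapper added here so the sequence can be printed without touching the system.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper (not part of LXC): with DRY_RUN=1 it only
# prints each command; set DRY_RUN=0 and run as root to execute for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# lxc_lifecycle NAME TEMPLATE: create, start, stop and destroy one container.
lxc_lifecycle() {
    name=$1
    template=$2
    run lxc-create -n "$name" -t "$template"   # root filesystem and config
    run lxc-start -n "$name" -d                # -d: start in the background
    # lxc-console -n "$name" would attach a console here (interactive)
    run lxc-stop -n "$name"                    # shut the container down
    run lxc-destroy -n "$name"                 # remove what lxc-create made
}

lxc_lifecycle demo ubuntu
```

Printing the commands first makes it easy to check names and templates before anything is created on the host.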

Page 18: Lightweight Virtualization in Linux

Demo...

Page 19: Lightweight Virtualization in Linux

Under the Hood

Page 20: Lightweight Virtualization in Linux

LXC Components

Components:
- The liblxc library
- Several language bindings for the API: Python, Lua, Go, Ruby, Haskell
- A set of standard tools to control the containers
- Container templates

Open source!

https://linuxcontainers.org/

Page 21: Lightweight Virtualization in Linux

Features Making up LXC

Kernel features used in LXC:
- Isolation:
  ● Kernel namespaces (ipc, uts, mount, pid, network and user)
  ● Chroots (using pivot_root)
- Resource management:
  ● Control groups (cgroups)
- Security:
  ● AppArmor and SELinux profiles
  ● Seccomp policies
  ● Kernel capabilities

Page 22: Lightweight Virtualization in Linux

Pivot_root and Chroot

Change the root directory to a new path
- pivot_root: switches the complete system to the new root and removes dependencies on the old root directory.
- chroot: applied to a single process

Page 23: Lightweight Virtualization in Linux

Seccomp

seccomp (SECure COMPuting mode)
- A simple sandboxing mechanism (Linux 2.6.12+, 2005)
- Allows a process to make a one-way transition into a "secure" state
  ● Syscalls limited to exit(), sigreturn(), and read() and write() on already-open file descriptors.
- Any attempt to make other system calls results in SIGKILL.

seccomp-bpf
- An extension to seccomp that allows filtering of system calls using a configurable policy
- Used by OpenSSH and vsftpd, as well as Google Chrome/Chromium on Chrome OS and Linux to sandbox Flash player and renderers.

Page 24: Lightweight Virtualization in Linux

Capabilities

In traditional UNIX, processes are:
- Privileged (EUID is 0): bypass all kernel permission checks.
- Unprivileged: full permission checking (EUID, EGID, and supplementary group list).

Since Linux kernel 2.2:
- The superuser privileges are divided into distinct units (a.k.a. capabilities)
- Capabilities can be independently enabled and disabled (per-thread)

Examples:
- CAP_CHOWN: Make arbitrary changes to file UIDs and GIDs.
- CAP_KILL: Bypass permission checks for sending signals.
- CAP_NET_ADMIN: Perform various network-related operations.
- CAP_SYS_ADMIN
- CAP_SYS_BOOT: Use reboot and kexec_load
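
Capability sets are exposed as hexadecimal bit masks in the Cap* lines of /proc/<pid>/status, with one bit per capability. The sketch below tests a single bit in such a mask. The helper name `cap_is_set` is hypothetical; the capability numbers are the ones defined in `<linux/capability.h>`.

```shell
#!/bin/sh
# cap_is_set MASK BIT: print "yes" if capability number BIT is set in the
# hexadecimal MASK (as shown in the Cap* lines of /proc/<pid>/status).
cap_is_set() {
    if [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

# Capability numbers from <linux/capability.h>.
CAP_CHOWN=0
CAP_KILL=5
CAP_NET_ADMIN=12
CAP_SYS_ADMIN=21
CAP_SYS_BOOT=22

# Demo (Linux only): awk inherits its capability sets from this shell,
# so its CapEff line reflects the shell's effective set.
if [ -r /proc/self/status ]; then
    eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status)
    echo "CAP_SYS_ADMIN effective: $(cap_is_set "$eff" "$CAP_SYS_ADMIN")"
    echo "CAP_NET_ADMIN effective: $(cap_is_set "$eff" "$CAP_NET_ADMIN")"
fi
```

For an unprivileged user the effective mask is normally all zeros, so both lines report "no"; for root outside a container they normally report "yes".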

Page 25: Lightweight Virtualization in Linux

Linux Security Modules (LSM)

A Linux kernel framework to support different security models
- Avoids favoritism toward any single implementation.
- Examples: AppArmor, SELinux, Smack and TOMOYO Linux

Used to implement different MACs (Mandatory Access Control)

Page 26: Lightweight Virtualization in Linux

Control Groups

Page 27: Lightweight Virtualization in Linux

Introduction to CGroups

Cgroups (control groups):
- Allocate resources (CPU, memory, network, or their combinations) among user-defined groups of tasks (processes)
- Think ulimit, but for groups of processes ... and with fine-grained accounting.
- Initiated at Google (2006)
- Available in the Fedora 18 kernel and the Ubuntu 12.10 kernel (also some previous releases).

Commands:
- cgcreate: creates a new cgroup
- cgset: sets parameters for the given cgroup(s)
- cgexec: runs a task in the specified control groups.

Page 28: Lightweight Virtualization in Linux

CGroups: Implementation

Implemented as a special cgroup file system
- libcgroup is a library that abstracts the control group file system in Linux.
- CGroup services: allow persistence across reboots and ease of use.

A few simple hooks inserted into the kernel (not performance-critical):
- In the boot phase, process creation and destruction paths, task_struct

procfs entries:
- For each process: /proc/<pid>/cgroup
- System-wide: /proc/cgroups
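
As a rough illustration of the per-process procfs entry: each line of /proc/<pid>/cgroup has the form `hierarchy-ID:controller-list:cgroup-path`. The sketch below looks up the path recorded for one controller. The function name `cgroup_of` is hypothetical, and paths that themselves contain a colon are not handled.

```shell
#!/bin/sh
# cgroup_of CONTROLLER [FILE]: print the cgroup path that FILE (default:
# /proc/self/cgroup) records for CONTROLLER. Each line of that file has
# the form  hierarchy-ID:controller-list:cgroup-path .
cgroup_of() {
    awk -F: -v want="$1" '
        {
            # One hierarchy may carry several controllers, e.g. "cpu,cpuacct".
            n = split($2, ctl, ",")
            for (i = 1; i <= n; i++)
                if (ctl[i] == want) print $3
        }
    ' "${2:-/proc/self/cgroup}"
}

# Usage (Linux):
#   cgroup_of memory                    # cgroup of the calling process
#   cgroup_of cpu /proc/1234/cgroup     # cgroup of PID 1234
```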

Page 29: Lightweight Virtualization in Linux

CGroup Subsystems

cpu
- Controls the CPU scheduler

cpuacct
- Generates automatic reports on CPU resources

cpuset
- Assigns individual CPUs (cores) and memory nodes

memory
- Limits memory use and generates automatic reports on memory resources

freezer
- Suspends or resumes tasks in a cgroup.

Page 30: Lightweight Virtualization in Linux

CGroup Subsystems (cont'd)

blkio
- Limits I/O on block devices (disk, solid state, USB, etc.).

devices
- Allows/denies access to devices

net_cls
- Differentiates between packets of different cgroups.

net_prio
- Dynamically sets the priority of network traffic per network interface.

Page 31: Lightweight Virtualization in Linux

Cgroups: Basics

Everything exposed through a virtual filesystem
- /cgroup, /sys/fs/cgroup... YourMountpointMayVary

Create a cgroup:
- mkdir /cgroup/aloha
- Automatically creates these files: tasks, cgroup.procs, etc.

Move process with PID 1234 to the cgroup:
- echo 1234 > /cgroup/aloha/tasks

Limit memory usage:
- echo 10000000 > /cgroup/aloha/memory.limit_in_bytes
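
The three steps above can be wrapped in small helper functions. This is a minimal sketch, assuming a cgroup v1 hierarchy with the memory controller mounted at `CGROUP_ROOT`; that variable and the `cg_*` function names are hypothetical, and the default `/cgroup` is simply the mount point used on the slide.

```shell
#!/bin/sh
# Mount point of the cgroup (v1) hierarchy; an assumption, adjust to
# your system (e.g. /sys/fs/cgroup/memory).
CGROUP_ROOT="${CGROUP_ROOT:-/cgroup}"

# cg_create NAME: create a cgroup; the kernel populates its control files.
cg_create() {
    mkdir -p "$CGROUP_ROOT/$1"
}

# cg_add_pid NAME PID: move a process into the cgroup.
cg_add_pid() {
    echo "$2" > "$CGROUP_ROOT/$1/tasks"
}

# cg_limit_memory NAME BYTES: set the hard memory limit of the cgroup.
cg_limit_memory() {
    echo "$2" > "$CGROUP_ROOT/$1/memory.limit_in_bytes"
}

# Usage (as root, on a host where CGROUP_ROOT is a real cgroup mount):
#   cg_create aloha
#   cg_add_pid aloha 1234
#   cg_limit_memory aloha 10000000
```

Keeping the mount point in one variable reflects the "YourMountpointMayVary" caveat: only `CGROUP_ROOT` changes between distributions.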

Page 32: Lightweight Virtualization in Linux

CPUset Subsystem

Each subsystem adds specific control files for its own needs
- Prefixed by its name

- cpuset.cpus
- cpuset.mems
- cpuset.cpu_exclusive
- cpuset.mem_exclusive
- cpuset.mem_hardwall
- cpuset.sched_load_balance
- cpuset.sched_relax_domain_level
- cpuset.memory_migrate
- cpuset.memory_pressure
- cpuset.memory_spread_page
- cpuset.memory_spread_slab
- cpuset.memory_pressure_enabled

Page 33: Lightweight Virtualization in Linux

CGroup: CPU (and Friends)

Limiting
- Set cpu.shares (defines relative weights)

Accounting
- Check cpuacct.stat for the user/system breakdown

Isolate
- Use cpuset.cpus (also for NUMA systems)

Can't really throttle a group of processes.
- But that's OK: context-switching << 1/HZ
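
Because cpu.shares values are relative weights, a group's CPU fraction under contention is its weight divided by the sum of the weights of all busy sibling groups. The sketch below does that arithmetic; the function name `cpu_share_percent` is hypothetical, and 1024 is the default value of cpu.shares.

```shell
#!/bin/sh
# cpu_share_percent MINE OTHER...: integer percentage of CPU time the group
# with weight MINE gets when every listed group is busy. The first
# argument is the group of interest; the rest are its competing siblings.
cpu_share_percent() {
    mine=$1
    total=0
    for s in "$@"; do
        total=$((total + s))
    done
    echo $((100 * mine / total))
}

# Two groups at the default weight of 1024 split the CPU evenly.
cpu_share_percent 1024 1024
# A group at 512 competing with groups at 1024 and 512 gets a quarter.
cpu_share_percent 512 1024 512
```

The weights only matter when groups compete: an idle sibling's share is available to the others, which is why this is a weighting mechanism rather than a hard cap.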

Page 34: Lightweight Virtualization in Linux

CGroup: Memory

Up to 25 control files

Limiting
- Memory usage, swap usage
- Soft limits and hard limits
- Can be nested

Accounting
- Cache vs. RSS
- Active vs. inactive
- File-backed pages vs. anonymous pages
- Page-in/page-out

Isolation
- Reserve memory thanks to hard limits

Page 35: Lightweight Virtualization in Linux

CGroup: Block I/O

Limiting & Isolation
- blkio.throttle.{read,write}_{bps,iops}_device
- Drawback: only for sync I/O (i.e. "classical" reads; not writes; not mapped files)

Accounting
- Number of I/Os, bytes, service time...
- Drawback: same as above

CGroups aren't perfect for limiting I/O
- Limiting the amount of dirty memory helps a bit.

Page 36: Lightweight Virtualization in Linux

Namespaces

Page 37: Lightweight Virtualization in Linux

Linux Namespaces

Namespaces: lightweight process virtualization
- Isolation: enable a process (or several processes) to have different views of the system than other processes.
- Idea dates back to 1992 (Plan 9)

Introduced in Linux 2.4.19 (2002)
- The user namespace was the last one: a number of Linux filesystems are not user-namespace aware yet!

User space modification
- No modification is needed (in general)
- Some utilities are made namespace-aware (iproute, util-linux)

Page 38: Lightweight Virtualization in Linux

Different Kinds of Namespaces

There are currently 6 namespaces in Linux:
- pid (processes)
- net (network interfaces, routing...)
- ipc (System V IPC)
- mnt (mount points, filesystems)
- uts (hostname)
- user (UIDs)

4 other namespaces are not implemented (yet):
- Security, security keys, device, and time

All require CAP_SYS_ADMIN
- Except user namespaces (not privileged)
- All the other ones can be created in conjunction with a new user namespace.

Page 39: Lightweight Virtualization in Linux

Namespaces Implementation

Namespaces do not have names
- Each namespace has a unique inode number (Linux 3.8+)
- The inode number of each namespace is assigned when the namespace is created.
- There is an initial, default namespace for each namespace type.

ls -al /proc/<pid>/ns
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]
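
Since the symlink targets under /proc/<pid>/ns identify each namespace, two processes share a namespace exactly when the targets match. A minimal sketch of that comparison follows; the function names `ns_id` and `same_ns` are hypothetical, and reading another user's /proc/<pid>/ns entries may require privileges.

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identifier, e.g. net:[4026531956].
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID1 PID2 TYPE: succeed if both processes are in the same
# namespace of the given type (ipc, mnt, net, pid, user or uts).
same_ns() {
    a=$(ns_id "$1" "$3") && b=$(ns_id "$2" "$3") && [ "$a" = "$b" ]
}

# Demo (Linux only): list the namespaces of the current shell.
if [ -d "/proc/$$/ns" ]; then
    for t in ipc mnt net pid user uts; do
        echo "$t -> $(ns_id "$$" "$t")"
    done
fi
```

Comparing, say, `same_ns $$ 1 net` from inside a container would typically fail, because the container's shell and the host's init live in different network namespaces.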

Page 40: Lightweight Virtualization in Linux

Trivial Namespaces

UTS (hostname)
- gethostname(), sethostname()
- struct system_utsname per container

Sys V IPC: shmem, semaphores, msg queues
- Keys must be mutually agreed upon by both client and server processes
- ipc namespace: uniqueness context of keys

Page 41: Lightweight Virtualization in Linux

Namespaces: pid

Usually a PID is an arbitrary number

Special cases:
- Init (i.e. child reaper) has a PID of 1
- Can't change PID (process migration)

Page 42: Lightweight Virtualization in Linux

Namespaces: pid (cont'd)

PID is no longer unique in the kernel
- A process has (can have) a different PID in each namespace
- /proc/$PID/* is virtualized

PID namespaces are nested
- Processes in a PID namespace can't see/affect processes of the parent namespace
- But all PIDs in the namespace are visible to the parent namespace.

Page 43: Lightweight Virtualization in Linux

PID 1

Each PID namespace has a PID #1
- Its first process

Behavior like the “init” process:
- When a process dies, all its orphaned children will now have the process with PID 1 as their parent.
- Sending SIGKILL from inside a PID namespace does not kill that namespace's process 1 (a SIGKILL sent from an ancestor PID namespace is still delivered).

An important feature for containers

Page 44: Lightweight Virtualization in Linux

Namespaces: net

Logically another copy of the network stack, with its own separate...
- Network interfaces (and its own lo/127.0.0.1)
- IP address(es) and sockets
- Routing table(s), iptables rules

Communication between containers:
- UNIX domain sockets (i.e. on the filesystem)
- By creating a pair of network devices (veth) and moving one to another namespace (like a pipe)

Page 45: Lightweight Virtualization in Linux

Namespaces: mnt

In a new mount namespace:
- All previous mounts will be visible
- But mounts/unmounts in that mount namespace are invisible to the rest of the system.
- Mounts/unmounts in the global namespace are visible in that namespace.

A mnt namespace can have its own rootfs

Special filesystems must be remounted, e.g.:
- procfs (to see the processes)
- devpts (to see pseudo-terminals)

Page 46: Lightweight Virtualization in Linux

Namespaces: user

A process will have a distinct set of UIDs, GIDs and capabilities.
- UID 42 in container X isn't UID 42 in container Y

Page 47: Lightweight Virtualization in Linux

UID Namespace (example)

Running from some user account
- id -u → 1000 (effective user ID)
- id -g → 1000 (effective group ID)

Capabilities: cat /proc/self/status | grep Cap
- CapInh: 0000000000000000
- CapPrm: 0000000000000000
- CapEff: 0000000000000000
- CapBnd: 0000001fffffffff

In order to create a user namespace and start a shell, we will run from that non-root account:
- unshare -U /bin/bash

Page 48: Lightweight Virtualization in Linux

Example (cont'd)

Now from the new shell run
- id -u → 65534
- id -g → 65534
- These are the default values for the eUID and eGID in the new namespace.
- No difference if unshare is run by the root user

Capabilities: cat /proc/self/status | grep Cap
- CapInh: 0000000000000000
- CapPrm: 0000000000000000
- CapEff: 0000000000000000
- CapBnd: 0000001fffffffff

In fact:
- The namespace had full capabilities, but unshare removed them.
- User mapping can be specified in gid_map and uid_map of the created process.
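
Each line of /proc/<pid>/uid_map (and gid_map) has three fields: the first ID inside the namespace, the first ID outside it, and the length of the mapped range. IDs with no mapping are reported as the overflow ID, which is where the 65534 above comes from. The sketch below translates an inside UID to the outside UID using such a file; the function name `map_uid` is hypothetical.

```shell
#!/bin/sh
# map_uid UID MAPFILE: translate a UID as seen inside a user namespace
# into the UID outside it, using uid_map-style lines of the form
#   first-ID-inside  first-ID-outside  length
# Prints "unmapped" if no range covers the UID.
map_uid() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { if (!found) print "unmapped" }
    ' "$2"
}

# Demo (Linux only): how UID 0 of the current namespace maps outward.
if [ -r /proc/self/uid_map ]; then
    echo "uid 0 maps to: $(map_uid 0 /proc/self/uid_map)"
fi
```

For example, a map containing the single line `0 1000 1` makes root inside the namespace correspond to UID 1000 outside, while every other inside UID stays unmapped.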

Page 49: Lightweight Virtualization in Linux

unshare util

Runs a program with some namespace(s) unshared from the parent

Example
- ./unshare --net bash
- A new network namespace is created and the bash process is started inside that namespace.
- Now ifconfig -a will show only the loopback device

After process termination,
- The namespace(s) will be freed.

Page 50: Lightweight Virtualization in Linux

Namespace problems / todos

Missing namespaces:
- tty, fuse, binfmt_misc

Identifying a namespace
- No namespace ID, just process(es)
- Partly solved by inode numbers

Entering existing namespaces
- fd=nsfd(NS, PID); setns(fd);
- Was not possible in older kernels

Page 51: Lightweight Virtualization in Linux

Security of Containers

Page 52: Lightweight Virtualization in Linux

Uncertainties, Fears and Doubts

“LXC is not secure. If I want real security I'll use KVM.”
- Dan Berrange, famous LXC hacker, 2011

Still quoted today
- Still true in some cases
- Things have changed a little since 2011

Responses
- Kernel exploits
- Default LXC settings
- Containers needing to do funky stuff

Page 53: Lightweight Virtualization in Linux

Kernel Exploits

Kernel exploits: containers share the kernel
- Buggy kernel and syscalls → Game over!
- Unless the container is forbidden from using those syscalls
- seccomp-bpf

Page 54: Lightweight Virtualization in Linux

Default LXC Settings

Default LXC settings
- AppArmor and SELinux are used to restrict some actions by default (intention: stop accidental harm)
- Capabilities must be restricted, too.

Full capabilities and permissions inside = sudoers access for the guest user!

Page 55: Lightweight Virtualization in Linux

Containers Needing Extra Priv.

Network interfaces for VPN or other

Multicast, broadcast, packet sniffing

Raw access to devices (disks, GPU, …)

Mounting stuff (even with FUSE)

More privileges = greater attack surface

Page 56: Lightweight Virtualization in Linux

Some Solutions

Use LXC for packaging
- Containers are not used on the target host.

Use LXC for development and testing
- Insider code
- Prevents accidental harm to the systems

LXC for Web apps and databases
- Shouldn't require extra privileges

Use capabilities

Defense in depth!
- Use multiple security mechanisms
- Both in the container and in the host (not different from the usual!)

Page 57: Lightweight Virtualization in Linux

Other Solutions

One container per machine
- Containers for fast deployment: Docker

One VM per container
- Run untrusted code within a VM within a container

Page 58: Lightweight Virtualization in Linux

Conclusion

Containers seem to be the future of virtualization
- Already used in production settings
- E.g. at Google

A stack of open source solutions is there!
- Linux
- LXC
- Docker, ProjectBuilder, Puppet, etc.
- PaaS

Every technology has its own drawbacks
- Security is a strong concern!

Page 59: Lightweight Virtualization in Linux

Thank You!

Page 60: Lightweight Virtualization in Linux

Useful References