Container: is it safe enough to run your application?

Container Security

Aleksey Zalesov

Agenda

1. Who am I?

2. Technology: namespaces, resources and security policies

3. How can an app escape a container, and how do you prevent it?

Who am I?

Cloud Foundry engineer @ Altoros

PhysTech biophysics background

Mountaineering

Technology

namespaces, resources and security policies

Namespaces

Filesystem

PIDs

UIDs

network

hostname

PID namespaces

In the kernel since 2.6.24 (2008)

http://lwn.net/Articles/259217/

CLONE_NEWPID

http://linux.die.net/man/2/clone

Before PID namespaces

struct pid {
    atomic_t count;                        /* reference counter */
    int nr;                                /* the pid value */
    struct hlist_node pid_chain;           /* hash chain */
    struct hlist_head tasks[PIDTYPE_MAX];  /* lists of tasks */
    struct rcu_head rcu;                   /* RCU helper */
};

After PID namespaces

struct upid {
    int nr;                       /* moved from struct pid */
    struct pid_namespace *ns;     /* the namespace this value is visible in */
    struct hlist_node pid_chain;  /* moved from struct pid */
};

struct pid {
    atomic_t count;
    struct hlist_head tasks[PIDTYPE_MAX];
    struct rcu_head rcu;
    int level;                    /* the number of upids */
    struct upid numbers[0];
};

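Playing with PID namespaces

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1048576];

/* Runs as the first process of the new PID namespace. */
static int child_fn(void *arg) {
    (void)arg;
    printf("PID: %ld\n", (long)getpid());
    return 0;
}

int main(void) {
    /* The stack grows down, so pass the top of the buffer. */
    pid_t child_pid = clone(child_fn, child_stack + sizeof(child_stack),
                            CLONE_NEWPID | SIGCHLD, NULL);
    printf("clone() = %ld\n", (long)child_pid);
    waitpid(child_pid, NULL, 0);
    return 0;
}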

$ sudo ./pid

clone() = 13175

PID: 1

Resources

CPU

I/O

memory

network

https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt

Demo: limiting I/O

# yum install libcgroup

# service cgconfig start

# ls /cgroup/

blkio cpu cpuacct cpuset devices freezer memory net_cls

# cat /etc/cgconfig.conf

group limitio {
    blkio {
        blkio.throttle.read_bps_device = "8:0 2097152";
    }
}

# vim /etc/cgrules.conf

*:hdparm blkio limitio/

# service cgred start

CGROUP_DAEMON="blkio:limitio"

# hdparm --direct -t /dev/sda

Without the limit:

/dev/sda:
 Timing O_DIRECT disk reads: 1332 MB in 3.01 seconds = 443.04 MB/sec

With the limit (2097152 bytes/s = 2 MB/s):

/dev/sda:
 Timing O_DIRECT disk reads: 6 MB in 3.00 seconds = 2.00 MB/sec

Security policies

dropping capabilities

seccomp

SELinux

Writing policies is hard!

Who should write the policy for a container?

Capabilities

a partitioning of the all-powerful root privilege into a set of distinct privileges

per-thread attribute

http://man7.org/linux/man-pages/man7/capabilities.7.html

https://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.2/capfaq-0.2.txt

Capabilities example

$ ls -l /bin/ping

-rwsr-xr-x. 1 root root 38200 Jul 23 17:55 /bin/ping

$ cp /bin/ping .    # the copy loses the setuid bit

$ ./ping ya.ru

ping: icmp open socket: Operation not permitted

$ sudo /usr/sbin/setcap cap_net_raw=ep ./ping

$ ./ping ya.ru

PING ya.ru (213.180.193.3) 56(84) bytes of data.

64 bytes from www.yandex.ru (213.180.193.3): icmp_seq=1 ttl=63 time=3.56

64 bytes from www.yandex.ru (213.180.193.3): icmp_seq=2 ttl=63 time=6.05

$ /usr/sbin/getcap ./ping

./ping = cap_net_raw+ep

How to escape from container?

setuid executables in a container may compromise security on the host

https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1244635

/var/lib/lxc/NAME/rootfs/usr/bin/sudo -i

requires root access to the container

and unprivileged access to the host

How to deal with it

# chmod 0700 /var/lib/lxc

# docker run -it --rm --cap-drop SETUID --cap-drop SETGID ...

mount the container filesystem with the nosuid option

# find / -perm -4000 -exec ls -l {} \; 2>/dev/null

# chmod u-s filename

Same UIDs

leak from one container running as UID 1000 to another: both are the same user on the host

Solution

user namespaces

container 1: UID 1000 => UID 12437

container 2: UID 1000 => UID 13337

Privileged containers

https://www.exploit-db.com/exploits/33808/ for Docker 0.11

- a container can access ANY file on your host

Separate PID and network namespaces, stripped capabilities and read-only bind mounts

BUT the container still has CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH

open_by_handle_at()

The inode of / is 2 on ext2/3/4 filesystems

Privileged containers

Run trusted code only

Map groups of mutually-trusted containers to separate machines

Use SELinux or AppArmor

Drop capabilities

Buggy kernel calls

all containers share the same kernel, so if a buggy kernel call allows gaining extra access or leaks data, every container can exploit it

exploit example: vmsplice

prevention:

limit the attack surface through security policies

use separate kernels under a hypervisor

http://lwn.net/Articles/268783/

https://www.win.tue.nl/~aeb/linux/hh/hh-12.html

Sign your images!

Conclusion

containers are fast and provide a fine level of isolation for trusted apps

effort is required to run untrusted apps safely in non-privileged mode

it is not safe to run untrusted apps in privileged mode

Thank you

Contact info:

Aleksey Zalesov

aleksey.zalesov@altoros.com

skype: aleksey_zalesov

twitter: @azalesov