LVC21F-204 Secure Container Runtime on Arm

Howard Zhang, Jianyong Wu


Agenda

● How containers contain

● Why containers do not contain

● gVisor introduction

● Why gVisor is secure

● gVisor on arm64

● Future plans for gVisor

How containers contain

○ Run as unprivileged users

■ Limited device access, including network devices

■ Limited file access

■ Permissions checked before execution

■ Limited ability to send signals

○ Drop Capabilities

■ SYS_ADMIN, SYS_NICE, NET_ADMIN

■ SYS_RAWIO, SYS_MODULE, etc…

○ Cgroups for resource isolation

■ Cgroups limit, account for, and isolate the resource usage of process groups

● CPU, CPU set, Devices, memory

○ Namespaces for name isolation

■ Namespaces give each container its own isolated view of system resources

● Network, PID, mount, user, IPC, UTS

[Figure: the process "user 123 /bin/bash" confined by user/group, namespace, cgroups, and capabilities]
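As a rough illustration of the "Drop Capabilities" point above, the sketch below decodes a Linux capability bitmask (the hex value on the CapEff line of /proc/<pid>/status) and reports which of the capabilities named on the slide are still held. The bit numbers are those defined in linux/capability.h. The helper name `held_capabilities` is hypothetical, and the sample mask is only an illustrative value of the kind often reported inside a default Docker container.

```python
# Bit positions from <linux/capability.h> for the capabilities named above.
CAP_BITS = {
    "NET_ADMIN": 12,
    "SYS_MODULE": 16,
    "SYS_RAWIO": 17,
    "SYS_ADMIN": 21,
    "SYS_NICE": 23,
}


def held_capabilities(cap_hex):
    """Return the names in CAP_BITS whose bit is set in a hex capability mask."""
    mask = int(cap_hex, 16)
    return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}


if __name__ == "__main__":
    # Illustrative mask only (assumption): a value often seen for CapEff
    # inside a default Docker container.
    sample = "00000000a80425fb"
    print(sorted(held_capabilities(sample)))
```

An empty result means none of the listed capabilities survived the drop. A mask with every bit set would report all five.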

Why containers do not contain

● A single vulnerability could lead to privilege escalations or information leaks

○ CVE-2016-5195 DirtyCOW

○ CVE-2017-5753/5715/5754 Spectre/Meltdown

○ CVE-2021-22555 Linux Netfilter

● Network stack is shared across containers

○ CVE-2013-4348 A single malformed packet can crash host kernel

The root cause is that the host and its containers share the same host kernel and device drivers. A single kernel bug can compromise every container.

➢ gVisor – userspace kernel

➢ Kata – lightweight Virtual Machine

One approach: gVisor

1. Binary that packages gVisor, like runc

2. Untrusted application

3. Platform syscall switcher

▪ Interception of syscalls

▪ Basic context switching

▪ Memory mapping

4. The application kernel in userspace

5. A process that handles I/O

6. Seccomp filter

Why gVisor is secure

1. The sandbox provides the first layer of isolation

2. The platform syscall switcher provides strong isolation, especially of memory

3. Sentry is written in pure Go, and most system calls are emulated in Sentry

4. Seccomp filters restrict the syscalls passed to the host kernel

5. Gofer isolates Sentry from I/O

6. Network stack (three host OS syscalls)

Taken together, these measures:

1. Add isolation layers

2. Reduce the attack surface
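The layering above can be pictured with a toy model. This is not gVisor code: the class, the syscall tables, and the return strings are all hypothetical, chosen only to show the two ideas on the slide. Application syscalls are emulated in userspace rather than forwarded, and the few host syscalls the emulation itself needs must pass an allowlist filter.

```python
# Toy model only; this is not gVisor code. All names below are hypothetical.

# Application syscalls the toy "Sentry" emulates, mapped to the host syscalls
# each emulation is assumed to need (empty list = handled fully in userspace).
EMULATED = {
    "getpid": [],
    "nanosleep": ["futex"],
}

# Host syscalls the toy seccomp filter lets the Sentry make.
HOST_ALLOWLIST = {"futex"}


class ToySentry:
    def __init__(self):
        self.host_calls = []  # host syscalls that passed the filter

    def host_syscall(self, name):
        """Second line of defense: refuse host syscalls off the allowlist."""
        if name not in HOST_ALLOWLIST:
            raise PermissionError(f"seccomp: {name} blocked")
        self.host_calls.append(name)

    def handle(self, name):
        """Handle a syscall intercepted from the untrusted application."""
        if name not in EMULATED:
            return "ENOSYS"  # never forwarded to the host kernel
        for host_name in EMULATED[name]:
            self.host_syscall(host_name)
        return "ok"


if __name__ == "__main__":
    sentry = ToySentry()
    print(sentry.handle("getpid"), sentry.host_calls)
    print(sentry.handle("nanosleep"), sentry.host_calls)
    print(sentry.handle("init_module"), sentry.host_calls)
```

In this model an application request for `init_module` never reaches the host, and even a compromised `ToySentry` calling `host_syscall("init_module")` directly would be refused by the filter.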

gVisor on arm64

● All tests have passed

○ 213 syscall tests for KVM platform

○ 696 syscall tests for ptrace platform

● General usage is fine

○ Compiling a kernel inside gVisor

○ Various application benchmarks

■ Sysbench, redis, fio, tensorflow, iperf, nginx, ffmpeg, …

● Tunings

○ Apply a bitmap to fd_table

■ Improves speed by 25% when creating a large number of fds, such as sockets

○ Fix the mmap cache error

■ Improves mmap speed by 100%
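The slides do not show how the bitmap is applied to fd_table, so the following is only a minimal sketch of the general technique: tracking used descriptors in a bitmap so that the lowest free descriptor is found with bit operations instead of a scan over table entries. The class `FdBitmap` is hypothetical and is not gVisor's Go implementation.

```python
class FdBitmap:
    """Track used file descriptors in an integer bitmap (bit i set = fd i in use)."""

    def __init__(self):
        self.bits = 0

    def alloc(self):
        """Allocate and return the lowest free descriptor."""
        # Adding 1 flips the lowest zero bit to one; masking with ~bits
        # isolates that single bit.
        lowest_free = ~self.bits & (self.bits + 1)
        fd = lowest_free.bit_length() - 1
        self.bits |= lowest_free
        return fd

    def free(self, fd):
        """Mark a descriptor as free again."""
        self.bits &= ~(1 << fd)


if __name__ == "__main__":
    table = FdBitmap()
    fds = [table.alloc() for _ in range(4)]
    table.free(1)
    print(fds, table.alloc(), table.alloc())
```

A freed descriptor is reused before any higher number is handed out, which matches the usual "lowest available fd" rule for descriptor allocation.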

Future plans for gVisor

1. Performance tuning of gVisor

○ arm64 vs x86_64 (fio)

○ runsc vs runc

○ Userspace network stack

2. Apply new features

○ Virtio

○ io_uring

Agenda

● What’s Kata Containers

● Kata Containers status on arm64

○ UEFI + ACPI

○ Memory hotplug

○ Virtio-fs

○ VMM (QEMU, Cloud Hypervisor…)

○ KVM PTP clock

● In the future

○ CPU hotplug

○ Performance test

○ Confidential compute

What’s Kata Containers

● Kata Containers is a secure container runtime based on lightweight virtual machines

○ Isolated by virtual machine

○ Compatible with OCI; offers a full Linux environment

○ Good performance

○ Simplicity

[Figure: Kata Containers architecture. Components shown: CRI client (k8s), CRI (containerd), containerd-shim-kata-v2, VMM/KVM, and the Kata Container with its guest kernel, agent, and containers]

Kata containers status on arm64

● Enable UEFI + ACPI in Kata

○ Set the path of the UEFI image in the “pflash” entry of the Kata configuration

○ Memory hotplug is available

○ There is extra latency

● Start Kata without UEFI

○ Comment out the pflash entry in the configuration

○ Faster startup

○ No ACPI or memory hotplug support

[Figure: boot flow. QEMU loads the UEFI image → boot UEFI → install ACPI in UEFI → initialize ACPI in the kernel]

Kata containers status on arm64

● Memory hotplug is enabled on arm64

○ Based on ACPI

○ Upgrade the guest kernel to 5.10

■ Memory hotplug is supported on arm64 in Linux kernel 5.7+

○ Thanks to @yuanzhe-liu0 for tweaking the memory-add strategy in QEMU

■ Originally, QEMU and the kernel used different memory hotplug block sizes, which caused address alignment problems when adding memory.

○ Try memory hotplug using ctr in Kata

■ $ sudo ctr image pull docker.io/library/ubuntu:latest

■ $ sudo ctr run --runtime io.containerd.run.kata.v2 -t --rm docker.io/library/ubuntu:latest hello sh -c "free -h"

■ $ sudo ctr run --runtime io.containerd.run.kata.v2 -t --memory-limit 536870912 --rm docker.io/library/ubuntu:latest hello sh -c "free -h"
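The `--memory-limit` value above is given in bytes, and the alignment issue mentioned earlier arises when a requested size or address is not a multiple of the hotplug block size. The sketch below works through that arithmetic. The two block sizes are hypothetical values picked for illustration; the slides do not state the sizes QEMU or the kernel actually use.

```python
MIB = 1024 * 1024

# Value passed to --memory-limit in the command above, in bytes.
LIMIT_BYTES = 536870912


def is_aligned(value, block):
    """True if value is a whole multiple of the block size."""
    return value % block == 0


def align_up(value, block):
    """Round value up to the next multiple of the block size."""
    return -(-value // block) * block


if __name__ == "__main__":
    print(LIMIT_BYTES // MIB, "MiB requested")
    # Hypothetical block sizes, for illustration only.
    for block in (128 * MIB, 1024 * MIB):
        rounded = align_up(LIMIT_BYTES, block)
        print(block // MIB, "MiB blocks:",
              "aligned" if is_aligned(LIMIT_BYTES, block) else "not aligned",
              "->", rounded // MIB, "MiB")
```

With the assumed 128 MiB block the 512 MiB request is already aligned, while with the assumed 1024 MiB block it would have to be rounded up. This is the kind of mismatch that appears when two components disagree on block size.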

Kata containers status on arm64

● VMM support status in Kata on arm64

VMM               Support  CI  Firmware boot  Memory hotplug  CPU hotplug  Virtio-fs
QEMU              Y        Y   Y              Y               N            Y
Cloud Hypervisor  Y        Y   N              N               N            Y
Firecracker       Y        N   N              N               N            N

● KVM PTP clock

○ High-precision time synchronization between host and VM

○ New feature in kernel 5.13, backported to Linux 5.10, 5.4 and 4.19 for Kata on Arm, thanks to @Damon

In the future

● CPU hotplug

○ Related patches for kernel are under review

● Performance test

○ Planned for the next phase: startup time, memory footprint, IO…

● Confidential compute

○ Armv9 CCA is released and can be used in Kata to enhance its security

○ Waiting for hardware and software support to be ready

Comparison

Let’s compare runc, kata and gvisor in security, startup time, memory footprint, IO performance and compatibility.

runtime  security  startup time  memory footprint  IO     compatibility
runc     **        *****         *****             *****  *****
kata     ******    **            **                ***    *****
gvisor   ******    ***           ***               **     ****

*) The more ‘*’ the better

Thank you

Accelerating deployment in the Arm Ecosystem