Exploiting the Linux Kernel via Intel's SYSRET Implementation
Niko@FluxFingers
Outline
●Syscalls and Context Switches
●Canonical Addresses
●SYSRET #GP Triggering
●Step by Step Exploitation and Rooting
Linux x86_64 Syscalls
●On old x86 processors: int $0x80 with the syscall number in %eax and parameters in %ebx, %ecx, etc.
○However it's super slow and got replaced with Intel's SYSENTER mechanism
●x86_64 uses AMD's SYSCALL with parameters in %rdi, %rsi, %rdx, %r10, ... (%r10 replaces %rcx, which SYSCALL clobbers)
○Faster to handle than the whole interrupt path
○Intel CPUs adopted SYSCALL according to AMD's specs since it became the standard syscall mechanism
SYSCALL/SYSRET
●Whenever a syscall is invoked via SYSCALL, a context switch to kernel mode takes place
○When leaving the syscall the kernel needs to restore specific userland registers
○And transfer back to ring3 with SYSRET
●SYSRET is fast since it "only" needs to:
○Load the saved %rip from %rcx
○Swap %cs back to ring3 mode
●The kernel itself has to make sure to restore all other userland registers before executing SYSRET
SYSCALL/SYSRET

[Figure: x86_64 virtual address-space layout]
0x0000000000000000
0x0000000000400000   Process (/bin/cat): .text, .data, .bss, Heap
0x00000000006XXXXX
0x00007fXXXXXXXXXX   Shared Libraries
0x00007ffffXXXXXXX   Stack
0xffffffff80000000   Kernel Memory
0xffffffffff600000   VSYSCALL
SYSCALL: userland -> Kernel Memory (ring0)
SYSCALL/SYSRET

[Figure: same address-space layout; SYSRET returns from Kernel Memory (ring0) to userland]
How Linux handles SYSRET
●arch/x86/kernel/entry_64.S:
ret_from_sys_call:
    movl $_TIF_ALLWORK_MASK,%edi
    ...
sysret_check:
    ...
    movq RIP-ARGOFFSET(%rsp),%rcx
    CFI_REGISTER rip,rcx
    RESTORE_ARGS 1,-ARG_SKIP,0
    movq PER_CPU_VAR(old_rsp), %rsp
    USERGS_SYSRET64
●The kernel makes sure to restore %rsp and %gs etc and calls SYSRET in the end
Canonical Addresses
●On x86_64, registers are 64 bit wide
●The instruction pointer (%rip) can only use 48 bits
○48 bits == balanced value for page-tables/accessible memory
●Leftover bits are used for CPU specific tricks
○like the NX bit at position 63 of page-table entries
●Meaning the value of %rip has to be "canonical", i.e. between
○0x0000000000000000 -> 0x00007FFFFFFFFFFF
○0xFFFF800000000000 -> 0xFFFFFFFFFFFFFFFF
●(Bits 48 .. 63 have to be copies of bit 47)
●Non-canonical values in %rip are not allowed and will trigger exceptions in certain cases
Non-canonical addresses and SYSRET
●Whenever a SYSRET is executed and the CPU sees a non-canonical value in %rcx it triggers a #GP
●AMD specs however never defined when the #GP will actually happen
●Clever researchers at Xen found out that AMD CPUs trigger the #GP only once back in usermode
●Not so on Intel ...
Intel’s Version of SYSRET
●AMD's specs left unspecified when the canonical check on %rcx / %rip happens
● Intel decided to check for non-canonical values before the privilege level is changed
Intel’s Version of SYSRET
●Triggering a #GP from kernel mode has consequences on Linux
●Recall that prior to executing SYSRET Linux restores the userland %rsp and swaps %gs
● Intel’s SYSRET will #GP on the userland stack while still being in ring0
#GP on userland %rsp
●#GP is an exception reached via an IDT entry:
arch/x86/kernel/traps.c:
set_intr_gate(X86_TRAP_GP, general_protection);
●Where general_protection resolves to the errorentry macro in arch/x86/kernel/entry_64.S:
.macro errorentry sym do_sym
ENTRY(\sym)
    XCPT_FRAME
    ASM_CLAC
    PARAVIRT_ADJUST_EXCEPTION_FRAME
    subq $ORIG_RAX-R15, %rsp
    CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
    call error_entry
...
#GP on userland %rsp
●error_entry sets up an exception stack and backs up all registers:
ENTRY(error_entry)
    XCPT_FRAME
    CFI_ADJUST_CFA_OFFSET 15*8
    cld
    movq_cfi rdi, RDI+8
    movq_cfi rsi, RSI+8
    movq_cfi rdx, RDX+8
    ...
●where movq_cfi is defined as:
.macro movq_cfi reg offset=0
    movq %\reg, \offset(%rsp)
    CFI_REL_OFFSET \reg, \offset
.endm
#GP on userland %rsp
●When setting up the stack frame in error_entry, all general purpose registers are saved to x(%rsp) / [rsp+x]
●But the kernel already restored the userland %rsp and registers before SYSRET
●=> Arbitrary memory write while in ring0
●A classic primitive for privilege escalation
Linux’ Protection against n/c %rip
●This behaviour already bit Linux in 2006 (CVE-2006-0744)
●To make sure no code ends up in non-canonical address space (or right before it) a guard page was introduced
●mmap(0x7ffffffff000, 4096, PROT_READ … will fail with ENOMEM
●This way SYSRET "shouldn't" return to any n/c address
Linux’ Protection against n/c %rip
●Another possibility is using a "safe" IRET path for returning back to ring3
○IRET requires a ring3 backup frame on the stack to return to user code
○Is slower than SYSRET
●The ptrace interface sets an IRET path most of the time
●However some syscalls use a SYSRET path albeit being ptraced
●One example is fork(), since it signals with ptrace_event(), which does not force IRET
Crash PoC
●fork() a child
●Child sets PTRACE_TRACEME
●Raises SIGSTOP
●Parent sets PTRACE_O_TRACEFORK
●Child fork()s again
●Parent catches this fork
●And uses PTRACE_SETREGS to set %rip to a n/c value
●Pivots %rsp to an arbitrary place
●And PTRACE_CONTs
●fork() will return via SYSRET with a n/c %rcx
●CPU will #GP, Pagefault, Doublefault and Panic
How to get root
The plan
●We need to get Kernel Code Execution between the #GP and Panic
●Then restore the damage we have done
●Set credentials of the current process to 0
●Return back to userland
●And open a shell
The target
●Since the #GP will always trigger a Pagefault and a Doublefault, we can pivot %rsp back into the IDT
●And set 2 specific registers to craft a fake IDT gate
●That will be placed instead of the original Page- or Doublefault handler
IDT Layout
●We can read the IDTR with the sidt instruction
IDT Gate Entry
●And set up a new gate with modified "Offsets"
The target
●Before we trigger the #GP we can allocate a landing area in userland
●Where we copy the code that will be executed
●Craft a fake IDT gate that points to this area
●Triggering the #GP will then overwrite e.g. the Doublefault entry with the fake gate
●And the kernel will jump to userland and execute our code with kernel privileges
Kernel Shellcode
● Inside this code we will have to swapgs in order to access kernel structures
●Then we carefully rebuild all IDT entries that were trashed in the overwrite process
●Then we can raise process credentials
Process structures
●Each process in userland has an associated kernel structure (thread_union) that builds the kernel stack:
[Figure: thread_union = the kernel stack, with thread_info at its bottom]
Process structures
●thread_info itself has an element that points to task_struct
[Figure: thread_info begins with the *task_struct and *exec_domain pointers, ...]
Process structures < 2.6.29
●task_struct contains lots of info about the running task
●and its credentials
[Figure: task_struct: state, stack, usage, ..., uid, gid, caps, ...]
Process structures < 2.6.29
[Figure: thread_info (at the bottom of thread_union) -> *task_struct -> uid, gid, caps]
Kernel Shellcode
●On < 2.6.29, raising process credentials is a matter of finding uid, gid and caps in task_struct
●And patching them to 0
●Luckily %gs in kernel mode contains the offset to x8664_pda (/include/asm-x86/pda.h):

/* Per processor datastructure. %gs points to it while the kernel runs */
struct x8664_pda {
    struct task_struct *pcurrent;   /* 0  Current process */
    unsigned long data_offset;      /* 8  Per cpu data offset from linker address */
    unsigned long kernelstack;      /* 16 top of kernel stack for current */
    unsigned long oldrsp;           /* 24 user rsp for system call */
    int irqcount;                   /* 32 Irq nesting counter. Starts with -1 */
    int cpunumber;                  /* 36 Logical CPU number */
#ifdef CONFIG_CC_STACKPROTECTOR
    unsigned long stack_canary;
...
Kernel Shellcode
●%gs:0 will point to the current task_struct
●So we can simply:

asm("movq %%gs:0, %0" : "=r"(ptr));
cred = (uint32_t *)ptr;
for (i = 0; i < 1000; i++, cred++) {
    if (cred[0] == uid && cred[1] == uid &&
        cred[2] == uid && cred[3] == uid &&
        cred[4] == gid && cred[5] == gid &&
        cred[6] == gid && cred[7] == gid) {
        cred[0] = cred[1] = cred[2] = cred[3] = 0;
        cred[4] = cred[5] = cred[6] = cred[7] = 0;
●Where uid/gid come from getuid() and getgid()
●And our process will be root
Kernel Shellcode
●On >= 2.6.29 the x8664_pda is removed
●And task_struct contains a new member called cred (credential records)
●If %rsp wasn't modified we could walk back to the top of the stack to find thread_info
●And do heuristic scanning to find thread_info->task_struct->cred->uid/gid
●However with credential records come two new functions:
●prepare_kernel_cred / commit_creds
Kernel Shellcode
●prepare_kernel_cred creates a new clean credentials structure
●commit_creds installs the new cred to the current task
●Both symbols are exported through /proc/kallsyms or /boot/System.map
●Kernel shellcode just needs to call:
commit_creds(prepare_kernel_cred(0));
●And we’re root again
Kernel Shellcode
●Next we will have to cleanly return back to userland
●Easiest method is to use IRET:

__asm__ __volatile__(
    "movq %0, 0x20(%%rsp);"
    "movq %1, 0x18(%%rsp);"
    "movq %2, 0x10(%%rsp);"
    "movq %3, 0x08(%%rsp);"
    "movq %4, 0x00(%%rsp);"
    "swapgs;"
    "iretq;"
    :: "i"(USER_SS), "i"(user_stack), "i"(USER_FL),
       "i"(USER_CS), "i"(user_code)
);
●Where user_code points to memory in userland that should be executed when kernel exits
Popping uid=0(root)
●user_code can do anything now since it runs as root
●So we can simply execve(/bin/sh) from there
●However that happens inside the child, so we have to bring the rootshell back to the parent
●Or we just chmod() or setxattr() a binary to drop a rootshell
Demo Time
Limitations
●These techniques work well with 2.6.18 - 3.9.X
●3.10 mitigates the IDT attack by remapping the IDT read-only (arch/x86/kernel/traps.c):
__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
idt_descr.address = fix_to_virt(FIX_RO_IDT);
●CPUs with SMEP/SMAP will fault when executing (SMEP) or accessing (SMAP) userland memory while still being in ring0
●Grsecurity provides a handful of protections that make this bug a pain to exploit
○GRKERNSEC_RANDSTRUCT
○PAX_MEMORY_UDEREF
○GRKERNSEC_HIDESYM
○...
Further thoughts
●The Linux fix is weird ("only" forces ptrace_stop() to use IRET)
●Syscalls can still return via SYSRET
●Also the bug within SYSRET is still present
●Since it's a hardware issue it might be present in other OSes in different variations (OHAI 2006)
●Anyone wanna check FreeBSD ...?
Questions?