Umbra: Efficient and Scalable Memory Shadowing

Qin Zhao (MIT), Derek Bruening (VMware), Saman Amarasinghe (MIT)


CGO 2010, Toronto, Canada, April 26, 2010

Shadow Memory

• Meta-data
  – Track properties of application memory
• Synchronized Update
  – Application data and meta-data

[Figure: application memory regions (a.out, heap, libc, stack), each paired with a matching region in shadow memory.]

Examples

• Memory Error Detection
  – MemCheck [VEE’07]
  – Purify [USENIX’92]
  – Dr. Memory
  – MemTracker [HPCA’07]
• Dynamic Information Flow Tracking
  – LIFT [MICRO’39]
  – TaintTrace [ISCC’06]
• Multi-threaded Debugging
  – Eraser [TCS’97]
  – Helgrind
• Others
  – Redux [TCS’03]
  – Software Watchpoint [CC’08]


Issues

• Performance
  – Runtime overhead
    • Example: MemCheck 25x [VEE’07]
• Scalability
  – 64-bit architecture
• Dependence
  – OS
  – Hardware
• Development
  – Implemented with specific analysis
  – Lack of a general framework


Memory Shadowing System

• Dynamic Instrumentation
  – Context switch (application ↔ shadow)
  – Address calculation
  – Updating meta-data
• Memory Management
  – Memory allocation / free
    • Monitor application memory management
    • Manage shadow memory
  – Mapping translation scheme (addrA → addrS)
    • DMS: Direct Mapping Scheme
    • SMS: Segmented Mapping Scheme


Direct Mapping Scheme (DMS)

• Single memory region for entire address space.
• Translation: $addr_S = addr_A + disp$
• Issue: address conflict between memA and memS

    lea [addr] → %r1
    add %r1, disp → %r1
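The two-instruction translation above can be sketched in C. This is a minimal illustration only: the constant `DMS_DISP` and the function name `dms_translate` are made up for the example, and a real tool would choose the displacement from the actual process layout.

```c
#include <stdint.h>

/* Hypothetical displacement between application and shadow regions. */
#define DMS_DISP UINT64_C(0x20000000)

/* Direct mapping: a single displacement covers the whole address space,
 * so translation is one addition (the "add %r1, disp" step). */
static inline uint64_t dms_translate(uint64_t addr_a)
{
    return addr_a + DMS_DISP;
}
```

The address conflict named on the slide shows up here directly: any application address at or above `DMS_DISP` would land on memory that is already in use as shadow memory.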

[Figure: slowdown relative to native execution for DMS-32, SMS-32, DMS-64, and SMS-64; values shown: 1.80, 2.40, 4.67. Diagram labels: Application, Shadow.]

Segmented Mapping Scheme (SMS)

• Shadow segment per application segment
• Translation: $addr_S = addr_A + disp_{seg}$
  – Segment lookup (address indexing)
  – Address translation

    lea [addr] → %r1
    mov %r1 → %r2
    shr %r2, 16 → %r2
    add %r1, disp[%r2] → %r1
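A minimal C sketch of the segment lookup for a 32-bit address space follows. It assumes 64 KB units, matching the `shr %r2, 16` step; the names `seg_disp`, `sms_map`, and `sms_translate` are hypothetical and not taken from any existing tool.

```c
#include <stdint.h>

#define SEG_SHIFT 16                        /* 64 KB units, as in "shr %r2, 16" */
#define NUM_SEGS  (1u << (32 - SEG_SHIFT))  /* 65536 entries for a 32-bit space */

/* One displacement per 64 KB unit of application memory. */
static int64_t seg_disp[NUM_SEGS];

/* Record that [app_base, app_base + size) is shadowed at shadow_base. */
static void sms_map(uint32_t app_base, uint32_t size, uint32_t shadow_base)
{
    int64_t disp = (int64_t)shadow_base - (int64_t)app_base;
    uint64_t last = ((uint64_t)app_base + size - 1) >> SEG_SHIFT;
    for (uint64_t i = app_base >> SEG_SHIFT; i <= last; i++)
        seg_disp[i] = disp;
}

/* Segment lookup by address indexing, then address translation. */
static uint32_t sms_translate(uint32_t addr_a)
{
    return (uint32_t)((int64_t)addr_a + seg_disp[addr_a >> SEG_SHIFT]);
}
```

With a 32-bit address space the table has 65536 entries, which is what keeps the single-level table practical in that setting.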

[Figure: addrA indexes the segment table, which maps App 1 and App 2 to Shd 1 and Shd 2 to produce addrS.]

Umbra

• Mapping Scheme
  – Segmented mapping
  – Scale with actual memory usage
• Implementation
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Results
  – Performance evaluation
  – Statistics collection


Shadow Memory Mapping

• Scaling to 64-bit Architecture
  – DMS
    • Infeasible due to memory layout
  – Single-Level SMS
    • Too big (~4 billion entries)
  – Multi-Level SMS
    • Even more expensive
    • Fast path on lower 32G (MemCheck)

[Figure: 64-bit address space layout up to 2^64: user space (a.out through stack) below 2^47, then unusable space, then kernel space with vsyscall.]
• Scaling to 64-bit Architecture
  – DMS is infeasible
  – Single-Level SMS is too sparse
  – Multi-Level SMS is too expensive
• Umbra Solution
  – Eliminate empty entries
  – Compact table
  – Walk the table to find the entry
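One way to picture the compact table is an array that holds only the regions that are actually mapped, searched linearly. The sketch below is an illustration under that assumption; `map_entry_t` and `table_walk` are hypothetical names, and the real table layout is not given on the slides.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per mapped application region; no entries for empty space. */
typedef struct {
    uint64_t app_base;  /* first application address of the region */
    uint64_t app_end;   /* one past the last application address */
    int64_t  disp;      /* shadow_base - app_base */
} map_entry_t;

/* Walk the table; returns 1 and stores the shadow address on a match. */
static int table_walk(const map_entry_t *tbl, size_t n,
                      uint64_t addr_a, uint64_t *addr_s)
{
    for (size_t i = 0; i < n; i++) {
        if (addr_a >= tbl[i].app_base && addr_a < tbl[i].app_end) {
            *addr_s = addr_a + (uint64_t)tbl[i].disp;
            return 1;
        }
    }
    return 0;  /* address is not application memory known to the table */
}
```

The table size tracks the number of mapped regions rather than the size of the address space, at the cost of a walk on each lookup.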





Implementation

• Memory Manager
  – Monitor and control application memory allocation
    • brk, mmap, munmap, mremap
  – Allocate shadow memory
  – Maintain translation table
• Instrumenter
  – Instrument every memory reference
    • Context save
    • Address calculation
    • Address translation
    • Shadow memory update
    • Context restore


[Figure: application segments (App 1, App 2) and their shadow segments (Shd 1, Shd 2).]


Unoptimized System

• Small overhead from DynamoRIO
• Slower than SMS-64
  – Need to walk the global translation table
• Why so slow?
  – 41.79% of instructions are memory references
  – For each of these instructions
    • Full context switch
    • Table lookup
    • Call-out instrumentation

[Figure: slowdown relative to native execution by configuration. SMS-64: 4.7; DynamoRIO: 1.1; Unoptimized: ~100; Local translation cache: 15.8; Hashtable: 15.2; Memoization mini-cache: 12.0; Reference uni-cache: 8.3; Context switch reduction: 3.1; Reference grouping: 2.5. Diagram label: Global translation table.]

Optimization

• Translation Optimization
  – Thread-local translation cache
  – Hashtable lookup
  – Memoization mini-cache
  – Reference uni-cache
• Instrumentation Optimization
  – Context switch reduction
  – Reference grouping
  – 3-stage code layout


1. Thread-Local Translation Cache

• Local translation table per thread
  – Synchronize with global translation table when necessary
  – Avoid lock contention
  – Walk table to find the matching entry
    • Walk global table if not found in thread-local cache
    • Inlined instrumentation
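The fast-path and slow-path split can be sketched as below. This is an assumption-laden illustration: `thread_ctx_t`, `global_add`, `translate`, and the spin lock are all hypothetical, and the sketch leaves out how a local copy is invalidated when the application unmaps memory.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 64

typedef struct { uint64_t base, end; int64_t disp; } map_entry_t;
typedef struct { map_entry_t entries[MAX_ENTRIES]; size_t count; } map_table_t;

/* Hypothetical per-thread context; a real tool would keep this in TLS. */
typedef struct { map_table_t local; } thread_ctx_t;

static map_table_t global_table;                    /* shared by all threads */
static atomic_flag global_lock = ATOMIC_FLAG_INIT;  /* simple spin lock */
static unsigned long global_walks;                  /* slow-path counter, for illustration */

static void lock_global(void)   { while (atomic_flag_test_and_set(&global_lock)) { } }
static void unlock_global(void) { atomic_flag_clear(&global_lock); }

static const map_entry_t *walk(const map_table_t *t, uint64_t a)
{
    for (size_t i = 0; i < t->count; i++)
        if (a >= t->entries[i].base && a < t->entries[i].end)
            return &t->entries[i];
    return NULL;
}

/* Called by the memory manager when a new region is shadowed. */
static int global_add(uint64_t base, uint64_t end, int64_t disp)
{
    int ok = 0;
    lock_global();
    if (global_table.count < MAX_ENTRIES) {
        map_entry_t e = { base, end, disp };
        global_table.entries[global_table.count++] = e;
        ok = 1;
    }
    unlock_global();
    return ok;
}

static int translate(thread_ctx_t *tc, uint64_t addr_a, uint64_t *addr_s)
{
    const map_entry_t *e = walk(&tc->local, addr_a);  /* lock-free fast path */
    map_entry_t copy;
    if (e == NULL) {
        int found;
        lock_global();                                /* slow path only */
        global_walks++;
        e = walk(&global_table, addr_a);
        found = (e != NULL);
        if (found)
            copy = *e;
        unlock_global();
        if (!found)
            return 0;
        if (tc->local.count < MAX_ENTRIES)            /* remember it locally */
            tc->local.entries[tc->local.count++] = copy;
        e = &copy;
    }
    *addr_s = addr_a + (uint64_t)e->disp;
    return 1;
}
```

After the first miss, later references to the same region are resolved from the thread's own table without touching the lock.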

[Figure: Thread 1 and Thread 2 each hold a thread-local translation cache, backed by the global translation table.]

2. Hashtable Lookup

• Hashtable per thread
• Fixed number of slots
• Hash(addrA) → entry in thread-local cache
  – If match, found
  – If no match, walk the local cache
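A small sketch of the per-thread hashtable follows. The hash function (64 KB unit number modulo the slot count), the slot count of 256, and the names `thread_map_t` and `lookup` are assumptions for illustration; the slides only say that the number of slots is fixed.

```c
#include <stddef.h>
#include <stdint.h>

#define UNIT_SHIFT 16   /* assumed 64 KB units */
#define HASH_SLOTS 256  /* assumed fixed number of slots */

typedef struct { uint64_t base, end; int64_t disp; } map_entry_t;

typedef struct {
    const map_entry_t *cache;              /* thread-local translation cache */
    size_t             n;                  /* entries in the cache */
    const map_entry_t *slots[HASH_SLOTS];  /* each slot points into the cache */
    unsigned long      walks;              /* cache walks, for illustration */
} thread_map_t;

static size_t hash_addr(uint64_t a)
{
    return (size_t)((a >> UNIT_SHIFT) % HASH_SLOTS);
}

static const map_entry_t *lookup(thread_map_t *t, uint64_t a)
{
    size_t h = hash_addr(a);
    const map_entry_t *e = t->slots[h];
    if (e != NULL && a >= e->base && a < e->end)
        return e;                          /* match: found */
    t->walks++;                            /* no match: walk the local cache */
    for (size_t i = 0; i < t->n; i++) {
        e = &t->cache[i];
        if (a >= e->base && a < e->end) {
            t->slots[h] = e;               /* fill the slot for next time */
            return e;
        }
    }
    return NULL;
}
```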

[Figure: per thread, a hashtable in front of the thread-local translation cache, backed by the global translation table.]

3. Memoization Mini-Cache

• Four-entry table per thread
  – Stack
  – Heap
  – Application (a.out)
  – Units found in last table lookup
• If no match, hashtable lookup
  – 68.93% hit ratio
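The four-entry table can be sketched as a fixed array with three pinned slots and one slot that remembers the most recent lookup result. The names `mini_cache_t`, `mini_check`, and `mini_remember` are hypothetical, and the fallback to the hashtable is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t base, end; int64_t disp; } unit_t;

/* Slot roles follow the slide: stack, heap, application, last lookup. */
enum { MC_STACK, MC_HEAP, MC_APP, MC_LAST, MC_SIZE };

typedef struct { unit_t slot[MC_SIZE]; } mini_cache_t;

/* Returns the matching slot, or NULL so the caller can try the hashtable. */
static const unit_t *mini_check(const mini_cache_t *mc, uint64_t a)
{
    for (int i = 0; i < MC_SIZE; i++)
        if (a >= mc->slot[i].base && a < mc->slot[i].end)
            return &mc->slot[i];
    return NULL;
}

/* After a slower lookup succeeds, remember the unit it found. */
static void mini_remember(mini_cache_t *mc, const unit_t *u)
{
    mc->slot[MC_LAST] = *u;
}
```

An all-zero slot never matches, because no address satisfies `a >= 0 && a < 0`, so an unused slot needs no separate valid bit in this sketch.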

[Figure: per thread, a memoization mini-cache in front of the hashtable and thread-local translation cache, backed by the global translation table.]

4. Reference Uni-Cache

• Software uni-cache per instruction per thread
  – Last reference unit tag
  – Last translation displacement
• If no match, memoization mini-cache check
  – 99.93% hit ratio
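A uni-cache holds exactly one tag and one displacement, so the check is a shift, a compare, and an add. The sketch below assumes the tag is the 64 KB unit number of the last referenced address; `uni_cache_t`, `uni_check`, and `uni_update` are hypothetical names.

```c
#include <stdint.h>

#define UNIT_SHIFT  16          /* assumed 64 KB units */
#define TAG_INVALID UINT64_MAX  /* no real unit number can equal this */

/* One of these exists per instrumented instruction, per thread. */
typedef struct {
    uint64_t tag;   /* unit of the last address this instruction referenced */
    int64_t  disp;  /* displacement that translated that address */
} uni_cache_t;

/* Fast check: on a tag match, reuse the cached displacement. */
static int uni_check(const uni_cache_t *uc, uint64_t addr_a, uint64_t *addr_s)
{
    if ((addr_a >> UNIT_SHIFT) != uc->tag)
        return 0;               /* miss: fall back to the mini-cache check */
    *addr_s = addr_a + (uint64_t)uc->disp;
    return 1;
}

/* After a slower lookup succeeds, remember its result. */
static void uni_update(uni_cache_t *uc, uint64_t addr_a, int64_t disp)
{
    uc->tag  = addr_a >> UNIT_SHIFT;
    uc->disp = disp;
}
```

Because a given instruction tends to touch the same region repeatedly, a single entry is enough to reach the very high hit ratio reported on the slide.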

[Figure: each application instruction (ADD $1, (%RAX); MOV %RBX, 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX) has its own reference uni-cache in each thread, in front of the memoization mini-cache, hashtable, thread-local translation cache, and global translation table.]

5. Context Switch Reduction

• Register liveness analysis
  – Use dead register
  – Avoid flags save/restore
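The idea of finding a dead register can be illustrated with a backward scan over one basic block. This is a simplified sketch, not the analysis used by the tool: instructions are modeled only as read and write bitmasks over 16 registers (`instr_rw_t` is a made-up type), and every register is conservatively treated as live at the end of the block.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit r set means the instruction reads (or writes) register r. */
typedef struct { uint16_t reads; uint16_t writes; } instr_rw_t;

/* Returns a register that is dead just before bb[at], or -1 if none.
 * A register is dead there if it is overwritten before it is next read. */
static int find_dead_reg(const instr_rw_t *bb, size_t n, size_t at)
{
    uint16_t live = 0xFFFF;               /* assume all live at block exit */
    for (size_t i = n; i-- > at; ) {      /* scan backward from the end */
        live &= (uint16_t)~bb[i].writes;  /* a write kills the old value */
        live |= bb[i].reads;              /* a read makes the value live */
    }
    for (int r = 0; r < 16; r++)
        if (!(live & (1u << r)))
            return r;                     /* free to use without a save */
    return -1;                            /* must spill a register instead */
}
```

The same scan applies to the flags if they are modeled as one more bit: when a later instruction overwrites the flags before any instruction reads them, the instrumentation can skip the flags save and restore.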

Fraction of instructions    SPEC2006
Memory Reference            41.79%
Eflag Steal                 2.55%
Register Steal              8.20%

6. Reference Grouping

• One reference cache for multiple references
  – Stack local variables
  – Different members of the same object
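A rough sketch of grouping follows: references in one basic block that use the same base register with nearby displacements are given the same cache. The grouping rule, the `GROUP_SPAN` bound, and the names `ref_desc_t` and `assign_groups` are assumptions for illustration; the sketch also assumes the base register is not modified between the grouped references.

```c
#include <stddef.h>
#include <stdint.h>

#define GROUP_SPAN 4096  /* assumed bound on displacement distance */

/* Simplified memory operand: base register number plus displacement. */
typedef struct { int base_reg; int32_t disp; } ref_desc_t;

/* Writes a group id per reference and returns the number of groups.
 * References in one group can share a single reference cache. */
static size_t assign_groups(const ref_desc_t *refs, size_t n, int *group)
{
    size_t ngroups = 0;
    for (size_t i = 0; i < n; i++) {
        group[i] = -1;
        for (size_t j = 0; j < i; j++) {
            int64_t delta = (int64_t)refs[i].disp - (int64_t)refs[j].disp;
            if (refs[j].base_reg == refs[i].base_reg &&
                delta > -GROUP_SPAN && delta < GROUP_SPAN) {
                group[i] = group[j];  /* same object or stack frame */
                break;
            }
        }
        if (group[i] < 0)
            group[i] = (int)ngroups++;
    }
    return ngroups;
}
```

For the four instructions shown on the uni-cache slides, the three references through %RAX (displacements 0, 48, and 40) fall into one group and the stack reference made by PUSH falls into another, so two caches serve four references.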

Fraction of instructions    SPEC2006
Memory Reference            41.79%
Ref Uni-Cache Checks        22.76%

3-stage Code Layout

• Inline stub (<10 instructions)
  – Quick inline check code with minimal context switch
• Lean procedure (~50 instructions)
  – Simple assembly procedure with partial context switch
• Callout (C function)
  – C function with complete context switch


[Figure: lookup chain executed before each app instruction: uni-cache check, memoization check, hashtable lookup, local cache lookup, then a C function (global table lookup) wrapped in a full context switch. Stage labels: Inline stub, Lean procedure, Callout.]





Client API

Event Hooks             Description
client_init             Process initialization
client_exit             Process exit
client_thread_init      Thread initialization
client_thread_exit      Thread exit
shadow_memory_create    Shadow memory creation
shadow_memory_delete    Shadow memory deletion
instrument_update       Insert meta-data update code


Umbra Client: Shared Memory Detection

    static void
    instrument_update(void *drcontext, umbra_info_t *umbra_info,
                      mem_ref_t *ref, instrlist_t *ilist, instr_t *where)
    {
        …
        /* lock or [%r1], tid_map -> [%r1] */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0, OPSZ_4);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, label, instr);
    }

• Meta-data maintains a bit map to store which threads access the associated memory
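The effect of the inserted `lock or` can be mimicked in plain C with an atomic OR on a per-location bitmap. This is only a sketch of the idea: `record_access` and `is_shared` are hypothetical helpers, and mapping thread ids onto 32 bits with a modulo is an assumption made to keep the example small.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Set the calling thread's bit in the shadow word of the accessed memory. */
static void record_access(_Atomic uint32_t *shadow, unsigned tid)
{
    atomic_fetch_or(shadow, UINT32_C(1) << (tid % 32));
}

/* Memory is shared once more than one thread bit is set. */
static int is_shared(_Atomic uint32_t *shadow)
{
    uint32_t m = atomic_load(shadow);
    return (m & (m - 1)) != 0;  /* true when at least two bits are set */
}
```

The atomic OR plays the role of the LOCK prefix in the client code: two threads updating the same shadow word at the same time cannot lose each other's bit.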




Performance Evaluation



[Figure: slowdown relative to native execution. DMS-32: 1.80; SMS-32: 2.40; SMS-64: 4.67; Umbra-64: 2.49.]

EMS64: Efficient Memory Shadowing for 64-bit

• Translation
  – $addr_S = addr_A + disp_{rc}$
  – Reference uni-cache hit rate: 99.93%
  – Still need a costly check to catch the 0.07%
    • Reg steal; save flags; compare & jump; restore
• EMS64 (ISMM’10)
  – Speculatively use a disp without check
  – Notified by memory access violation fault for incorrect disp


EMS64 Preliminary Result

[Figure: slowdown relative to native execution. DMS-32: 1.80; SMS-32: 2.40; SMS-64: 4.67; Umbra-64: 2.49; EMS-64: 1.81.]

Thanks

• Download
  – http://people.csail.mit.edu/qin_zhao/umbra/

• Q & A

