Umbra: Efficient and Scalable Memory Shadowing

Qin Zhao (MIT), Derek Bruening (VMware), Saman Amarasinghe (MIT)


CGO 2010, Toronto, Canada
April 26, 2010

Shadow Memory
• Meta-data
  – Track properties of application memory
• Synchronized Update
  – Application data and meta-data

[Figure: application memory regions (a.out, heap, libc, stack), each paired with a corresponding region in shadow memory]

Examples
• Memory Error Detection
  – MemCheck [VEE’07]
  – Purify [USENIX’92]
  – Dr. Memory
  – MemTracker [HPCA’07]
• Dynamic Information Flow Tracking
  – LIFT [MICRO’39]
  – TaintTrace [ISCC’06]
• Multi-threaded Debugging
  – Eraser [TCS’97]
  – Helgrind
• Others
  – Redux [TCS’03]
  – Software Watchpoint [CC’08]


Issues
• Performance
  – Runtime overhead
    • Example: MemCheck 25x [VEE’07]
• Scalability
  – 64-bit architecture
• Dependence
  – OS
  – Hardware
• Development
  – Implemented with specific analysis
  – Lack of a general framework


Memory Shadowing System
• Dynamic Instrumentation
  – Context switch (application ↔ shadow)
  – Address calculation
  – Updating meta-data
• Memory Management
  – Memory allocation / free
    • Monitor application memory management
    • Manage shadow memory
  – Mapping translation scheme (addrA → addrS)
    • DMS: Direct Mapping Scheme
    • SMS: Segmented Mapping Scheme


Direct Mapping Scheme (DMS)
• Single memory region for entire address space.
• Translation: addrS = addrA + disp

    lea [addr] -> %r1
    add %r1, disp -> %r1

• Issue: address conflict between memA and memS
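The direct mapping can be sketched in C as a single addition. This is a minimal illustration, not Umbra's code: `DMS_DISP`, `dms_translate`, and `dms_conflicts` are hypothetical names, and the displacement value is arbitrary. The second function shows the address-conflict issue as an overlap test between the shadow of one region and another application region.

```c
#include <stdint.h>

/* Hypothetical constant offset between application and shadow memory. */
#define DMS_DISP UINT64_C(0x20000000)

/* Direct mapping: a single add, as in the lea/add pair on the slide. */
static uint64_t dms_translate(uint64_t addr_app)
{
    return addr_app + DMS_DISP;
}

/* Returns 1 if the shadow of application region [lo, hi) overlaps another
 * application region [other_lo, other_hi), i.e. memA and memS conflict. */
static int dms_conflicts(uint64_t lo, uint64_t hi,
                         uint64_t other_lo, uint64_t other_hi)
{
    uint64_t shd_lo = dms_translate(lo);
    uint64_t shd_hi = dms_translate(hi);
    return shd_lo < other_hi && other_lo < shd_hi;
}
```

With a single fixed displacement, any application mapping that lands inside the shadow range produces such a conflict, which is why the scheme depends on the memory layout.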

[Figure: slowdown relative to native execution. DMS-32: 1.80, SMS-32: 2.40, SMS-64: 4.67; no value is shown for DMS-64. The accompanying diagram labels application and shadow memory.]

Segmented Mapping Scheme (SMS)

• Shadow segment per application segment
• Translation: addrS = addrA + disp_seg
  – Segment lookup (address indexing)
  – Address translation

    lea [addr] -> %r1
    mov %r1 -> %r2
    shr %r2, 16 -> %r2
    add %r1, disp[%r2] -> %r1

[Figure: a segment table maps application segments App 1 and App 2 to shadow segments Shd 1 and Shd 2, translating addrA to addrS]
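The segmented translation can be sketched in C as a table indexed by the upper address bits. This is an illustration under assumptions: a 32-bit address space, 64 KB segments (taken from the 16-bit shift in the slide's code), and hypothetical names `disp_table` and `sms_translate`.

```c
#include <stdint.h>

#define SEG_SHIFT 16              /* 64 KB segments, as in "shr %r2, 16" */
#define NUM_SEGS  (1u << 16)      /* 2^32 / 2^16 segments in 32-bit mode */

/* disp_table[seg] holds the displacement for that application segment. */
static uint32_t disp_table[NUM_SEGS];

static uint32_t sms_translate(uint32_t addr_app)
{
    uint32_t seg = addr_app >> SEG_SHIFT;  /* segment lookup (address indexing) */
    return addr_app + disp_table[seg];     /* address translation */
}
```

Compared with the direct mapping, each translation costs one extra table load, which is consistent with SMS-32 being slower than DMS-32 in the slide's chart.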

Umbra
• Mapping Scheme
  – Segmented mapping
  – Scale with actual memory usage
• Implementation
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Results
  – Performance evaluation
  – Statistics collection


Shadow Memory Mapping
• Scaling to 64-bit Architecture
  – DMS
    • Infeasible due to memory layout

[Figure: 64-bit address space layout. User space (a.out through stack) lies below 2^47; the range above it is unusable; kernel space and vsyscall sit at the top, below 2^64.]



  – Single-Level SMS
    • Too big (~4 billion entries)
  – Multi-Level SMS
    • Even more expensive
    • Fast path on lower 32G (MemCheck)
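As a rough check on the single-level table size, assume 64 KB segments (the 16-bit shift used in the SMS translation code) and a 48-bit virtual address space. The 48-bit figure is an assumption, not something stated on the slides. Under those assumptions the table would need:

```latex
\[
\frac{2^{48}\ \text{bytes of address space}}{2^{16}\ \text{bytes per segment}}
  = 2^{32} \approx 4.29 \times 10^{9}\ \text{entries}
\]
```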


  – DMS is infeasible
  – Single-Level SMS is too sparse
  – Multi-Level SMS is too expensive
• Umbra Solution
  – Eliminate empty entries
  – Compact table
  – Walk the table to find the entry
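A compact table with no empty entries can be sketched as an array of (base, size, displacement) records that is walked linearly. This is only an illustration of the idea; `map_entry_t` and `table_walk` are hypothetical names, and the layout of Umbra's actual table is not given on the slides.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t app_base;  /* start of an application memory unit */
    uint64_t app_size;  /* size of the unit in bytes */
    int64_t  disp;      /* shadow address = application address + disp */
} map_entry_t;

/* Walks the compact table. Returns 1 and stores the shadow address if an
 * entry covers addr, or 0 if none does. */
static int table_walk(const map_entry_t *table, size_t n,
                      uint64_t addr, uint64_t *shadow)
{
    for (size_t i = 0; i < n; i++) {
        /* Unsigned subtraction wraps when addr < app_base, so a single
         * comparison checks both bounds. */
        if (addr - table[i].app_base < table[i].app_size) {
            *shadow = addr + (uint64_t)table[i].disp;
            return 1;
        }
    }
    return 0;
}
```

The table grows with the number of mapped units rather than with the size of the address space, at the cost of a walk on every lookup.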





Implementation
• Memory Manager
  – Monitor and control application memory allocation
    • brk, mmap, munmap, mremap
  – Allocate shadow memory
  – Maintain translation table
• Instrumenter
  – Instrument every memory reference
    • Context save
    • Address calculation
    • Address translation
    • Shadow memory update
    • Context restore


[Figure: application segments App 1 and App 2 with their shadow segments Shd 1 and Shd 2]




Unoptimized System
• Small overhead from DynamoRIO
• Slower than SMS-64
  – Need to walk the global translation table
• Why so slow?
  – 41.79% of instructions are memory references
  – For each of these instructions
    • Full context switch
    • Table lookup
    • Call-out instrumentation

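[Figure: slowdown of each configuration, bars in left-to-right order (y-axis 0 to 20; the Unoptimized bar is clipped and labeled ~100). The accompanying diagram shows the global translation table.]

  Configuration                    Slowdown
  SMS-64                           4.7
  DynamoRIO                        1.1
  Unoptimized                      100.0
  Thread-local translation cache   15.8
  Hashtable                        15.2
  Memoization mini-cache           12.0
  Reference uni-cache              8.3
  Context switch reduction         3.1
  Reference grouping               2.5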

Optimization
• Translation Optimization
  – Thread-local translation cache
  – Hashtable lookup
  – Memoization mini-cache
  – Reference uni-cache
• Instrumentation Optimization
  – Context switch reduction
  – Reference grouping
  – 3-stage code layout


1. Thread-Local Translation Cache
• Local translation table per thread
  – Synchronize with global translation table when necessary
  – Avoid lock contention
  – Walk table to find matching entry
    • Walk global table if not found in thread-local cache
• Inlined instrumentation
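The lookup order described above can be sketched as follows. This is a single-threaded illustration with hypothetical names (`table_t`, `tl_translate`, `global_walks`); a real implementation would take a lock around the global walk, which is marked only by a comment here.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 16

typedef struct { uint64_t base, size; int64_t disp; } entry_t;
typedef struct { entry_t e[MAX_ENTRIES]; size_t n; } table_t;

static table_t global_table;   /* shared by all threads */
static unsigned global_walks;  /* counts slow-path walks, for illustration */

static const entry_t *walk(const table_t *t, uint64_t addr)
{
    for (size_t i = 0; i < t->n; i++)
        if (addr - t->e[i].base < t->e[i].size)
            return &t->e[i];
    return NULL;
}

/* Translates through the calling thread's own table first. On a miss it
 * walks the global table and copies the entry into the local table. */
static int tl_translate(table_t *local, uint64_t addr, uint64_t *shadow)
{
    const entry_t *hit = walk(local, addr);
    if (hit == NULL) {
        global_walks++;                    /* a lock would be held here */
        hit = walk(&global_table, addr);
        if (hit == NULL)
            return 0;
        if (local->n < MAX_ENTRIES)
            local->e[local->n++] = *hit;   /* synchronize local with global */
    }
    *shadow = addr + (uint64_t)hit->disp;
    return 1;
}
```

After the first miss, later references to the same unit are served from the local table without touching shared state.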

[Figure: Thread 1 and Thread 2 each keep a thread-local translation cache in front of the shared global translation table]

2. Hashtable Lookup
• Hashtable per thread
• Fixed number of slots
• Hash(addrA) → entry in thread-local cache
  – If match, found
  – If no match, walk the local cache
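A minimal sketch of the hashtable step, assuming a hash on the upper address bits and a power-of-two slot count. The slides do not specify the hash function, slot count, or unit size; `UNIT_SHIFT`, `HASH_SLOTS`, `thread_ctx_t`, and `hash_lookup` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define UNIT_SHIFT 16   /* assumed unit granularity */
#define HASH_SLOTS 64   /* fixed number of slots (assumed, power of two) */
#define CACHE_MAX  32

typedef struct { uint64_t base, size; int64_t disp; } entry_t;

typedef struct {
    entry_t cache[CACHE_MAX];  /* thread-local translation cache */
    size_t  n;
    int     slot[HASH_SLOTS];  /* index into cache, or -1 if empty */
} thread_ctx_t;

static void ctx_init(thread_ctx_t *c)
{
    c->n = 0;
    for (size_t i = 0; i < HASH_SLOTS; i++)
        c->slot[i] = -1;
}

static size_t hash_addr(uint64_t addr)
{
    return (size_t)((addr >> UNIT_SHIFT) & (HASH_SLOTS - 1));
}

static const entry_t *hash_lookup(thread_ctx_t *c, uint64_t addr)
{
    size_t h = hash_addr(addr);
    int idx = c->slot[h];
    if (idx >= 0 && addr - c->cache[idx].base < c->cache[idx].size)
        return &c->cache[idx];             /* match: found */
    for (size_t i = 0; i < c->n; i++) {    /* no match: walk the local cache */
        if (addr - c->cache[i].base < c->cache[i].size) {
            c->slot[h] = (int)i;           /* remember for the next lookup */
            return &c->cache[i];
        }
    }
    return NULL;
}
```

A hit costs one hash and one bounds check; only a mismatch falls back to walking the local cache.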

[Figure: each thread adds a hashtable in front of its thread-local translation cache]

3. Memoization Mini-Cache
• Four-entry table per thread
  – Stack
  – Heap
  – Application (a.out)
  – Units found in last table lookup
• If no match, hashtable lookup
  – 68.93% hit ratio
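The four-entry table can be sketched as a fixed array with one slot per role. The slot names and functions below are hypothetical; the sketch only shows the check order and how the last slot is refreshed after a slower lookup succeeds.

```c
#include <stdint.h>

typedef struct { uint64_t base, size; int64_t disp; } entry_t;

/* One slot each for the stack, the heap, the application image (a.out),
 * and the unit found by the most recent table lookup. */
enum { MC_STACK, MC_HEAP, MC_APP, MC_LAST, MC_SIZE };

typedef struct { entry_t slot[MC_SIZE]; } mini_cache_t;

static void mini_init(mini_cache_t *mc)
{
    for (int i = 0; i < MC_SIZE; i++)
        mc->slot[i] = (entry_t){ 0, 0, 0 };  /* size 0 never matches */
}

/* Returns 1 and stores the shadow address on a hit. On a miss returns 0,
 * and the caller falls back to the hashtable lookup. */
static int mini_lookup(const mini_cache_t *mc, uint64_t addr, uint64_t *shadow)
{
    for (int i = 0; i < MC_SIZE; i++) {
        const entry_t *e = &mc->slot[i];
        if (addr - e->base < e->size) {
            *shadow = addr + (uint64_t)e->disp;
            return 1;
        }
    }
    return 0;
}

/* Records the unit found by a slower lookup in the last slot. */
static void mini_remember(mini_cache_t *mc, const entry_t *found)
{
    mc->slot[MC_LAST] = *found;
}
```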

[Figure: each thread adds a memoization mini-cache in front of its hashtable]

4. Reference Uni-Cache
• Software uni-cache per instruction per thread
  – Last reference unit tag
  – Last translation displacement
• If no match, memoization mini-cache check
  – 99.93% hit ratio
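A one-entry cache per instruction can be sketched as a (tag, displacement) pair. The tag computation (`addr >> UNIT_SHIFT`) and all names here are assumptions for illustration; `demo_slow_lookup` stands in for the next level (the mini-cache check) and simply returns a fixed displacement.

```c
#include <stdint.h>

#define UNIT_SHIFT 16  /* assumed unit granularity */

/* One uni-cache per instrumented instruction per thread. */
typedef struct {
    uint64_t tag;    /* last reference unit tag */
    int64_t  disp;   /* last translation displacement */
    int      valid;
} uni_cache_t;

typedef int64_t (*slow_lookup_fn)(uint64_t addr);

static unsigned slow_calls;  /* counts fall-throughs, for illustration */

/* Stand-in for the next lookup level. */
static int64_t demo_slow_lookup(uint64_t addr)
{
    (void)addr;
    slow_calls++;
    return 0x100000;
}

static uint64_t uni_translate(uni_cache_t *uc, uint64_t addr,
                              slow_lookup_fn slow)
{
    uint64_t tag = addr >> UNIT_SHIFT;
    if (!uc->valid || uc->tag != tag) {  /* miss: ask the next level */
        uc->disp = slow(addr);
        uc->tag = tag;
        uc->valid = 1;
    }
    return addr + (uint64_t)uc->disp;    /* hit: one compare and one add */
}
```

The design relies on a given instruction tending to touch the same unit repeatedly, which matches the 99.93% hit ratio reported on the slide.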

[Figure: each memory-referencing instruction (for example ADD $1, (%RAX); MOV %RBX, 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX) has its own reference uni-cache in front of the per-thread caches]

5. Context Switch Reduction
• Register liveness analysis
  – Use dead register
  – Avoid flags save/restore
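One way to find a dead register is a forward scan over the rest of the basic block: a register that is written before it is read holds a value the application no longer needs. The sketch below is a simplified illustration with hypothetical types, not DynamoRIO's API; registers that are neither read nor written before the block ends are conservatively treated as live. The same scan can be applied to the flags to decide whether a save/restore is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitmasks over 16 general-purpose registers (bit r set = register r). */
typedef struct { uint16_t reads, writes; } insn_t;

/* Returns the lowest-numbered register that is dead at position `at`
 * (written before being read in bb[at..n-1]), or -1 if none is found. */
static int find_dead_reg(const insn_t *bb, size_t n, size_t at)
{
    uint16_t live = 0, dead = 0;
    for (size_t i = at; i < n; i++) {
        /* Read before any write: the old value is still needed. */
        live |= (uint16_t)(bb[i].reads & ~dead);
        /* Written without an earlier read: the old value is dead. */
        dead |= (uint16_t)(bb[i].writes & ~live);
    }
    for (int r = 0; r < 16; r++)
        if (dead & (1u << r))
            return r;
    return -1;
}
```

When such a register exists, the instrumentation can use it as scratch space without spilling and restoring an application register.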

  #/#Instr            SPEC2006
  Memory Reference    41.79%
  Eflag Steal         2.55%
  Register Steal      8.20%

6. Reference Grouping
• One reference cache for multiple references
  – Stack local variables
  – Different members of the same object
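Grouping can be sketched as assigning one cache to references that share a base register and have nearby displacements. The sketch assumes the base register is not redefined within the block and that nearby displacements fall in the same unit; neither is guaranteed in general, and `GROUP_WINDOW`, `ref_t`, and `group_refs` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define GROUP_WINDOW 128  /* assumed maximum displacement spread in a group */

typedef struct { int base_reg; int32_t disp; } ref_t;

/* Assigns a group id to each reference in a basic block. References with
 * the same base register and nearby displacements share an id, and so
 * could share one reference cache check. Returns the number of groups. */
static size_t group_refs(const ref_t *refs, size_t n, int *group_id)
{
    size_t groups = 0;
    for (size_t i = 0; i < n; i++) {
        group_id[i] = -1;
        for (size_t j = 0; j < i; j++) {
            int32_t d = refs[i].disp - refs[j].disp;
            if (refs[j].base_reg == refs[i].base_reg &&
                d > -GROUP_WINDOW && d < GROUP_WINDOW) {
                group_id[i] = group_id[j];
                break;
            }
        }
        if (group_id[i] < 0)
            group_id[i] = (int)groups++;
    }
    return groups;
}
```

For the four instructions shown in the slide's diagram, the three references based on %RAX (displacements 0, 48, and 40) would form one group and the stack push another, so two checks replace four.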

  #/#Instr               SPEC2006
  Memory Reference       41.79%
  Ref Uni-Cache Checks   22.76%

3-stage Code Layout
• Inline stub (<10 instructions)
  – Quick inline check code with minimal context switch
• Lean procedure (~50 instructions)
  – Simple assembly procedure with partial context switch
• Callout (C function)
  – C function with complete context switch


[Figure: the three stages (inline stub, lean procedure, callout) placed before the app instruction. Lookups shown in order: uni-cache check, memoization check, hashtable lookup, local cache lookup, then a C function that performs the global table lookup between two full context switches.]








Client API

  Event Hooks            Description
  client_init            Process initialization
  client_exit            Process exit
  client_thread_init     Thread initialization
  client_thread_exit     Thread exit
  shadow_memory_create   Shadow memory creation
  shadow_memory_delete   Shadow memory deletion
  instrument_update      Insert meta-data update code


Umbra Client: Shared Memory Detection
• Meta-data maintains a bitmap recording which threads have accessed the associated memory

    static void
    instrument_update(void *drcontext, umbra_info_t *umbra_info,
                      mem_ref_t *ref, instrlist_t *ilist, instr_t *where)
    {
        …
        /* lock or [%r1], tid_map -> [%r1] */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0, OPSZ_4);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, label, instr);
    }
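The effect of the inserted `lock or` can be sketched in plain C11 with an atomic OR on a per-unit bitmap. This shows the meta-data update only, not the Umbra client itself; `record_access` and `is_shared` are hypothetical names, and limiting the map to 32 thread bits is an assumption.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One 32-bit shadow word per application unit. Bit t is set once
 * thread t has touched the unit. Returns the updated bitmap. */
static uint32_t record_access(_Atomic uint32_t *shadow, unsigned tid)
{
    uint32_t bit = UINT32_C(1) << (tid % 32);
    /* Atomic read-modify-write, analogous to the "lock or" above. */
    return atomic_fetch_or(shadow, bit) | bit;
}

/* The unit is shared if more than one thread bit is set. */
static int is_shared(uint32_t map)
{
    return (map & (map - 1)) != 0;
}
```

Because the OR is atomic, two threads updating the same shadow word concurrently cannot lose each other's bits.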







Performance Evaluation

[Figure: slowdown relative to native execution]

  Configuration   Slowdown
  DMS-32          1.80
  SMS-32          2.40
  SMS-64          4.67
  Umbra-64        2.49

EMS64: Efficient Memory Shadowing for 64-bit
• Translation
  – addrS = addrA + disp_rc
  – Reference uni-cache hit rate: 99.93%
  – Still need a costly check to catch the 0.07%
    • Reg steal; save flags; compare & jump; restore
• EMS64 (ISMM’10)
  – Speculatively use a disp without check
  – Notified by memory access violation fault for incorrect disp


EMS64 Preliminary Result

[Figure: slowdown relative to native execution]

  Configuration   Slowdown
  DMS-32          1.80
  SMS-32          2.40
  SMS-64          4.67
  Umbra-64        2.49
  EMS-64          1.81

Thanks
• Download
  – http://people.csail.mit.edu/qin_zhao/umbra/
• Q & A

