Qin Zhao (MIT) Derek Bruening (VMware) Saman Amarasinghe (MIT) Umbra: Efficient and Scalable Memory Shadowing CGO 2010, Toronto, Canada April 26, 2010

Umbra: Efficient and Scalable Memory Shadowing


Page 1: Umbra: Efficient and Scalable Memory Shadowing

Qin Zhao (MIT), Derek Bruening (VMware), Saman Amarasinghe (MIT)

Umbra: Efficient and Scalable Memory Shadowing

CGO 2010, Toronto, Canada, April 26, 2010

Page 2: Umbra: Efficient and Scalable Memory Shadowing

Shadow Memory

• Meta-data
  – Track properties of application memory
• Synchronized Update
  – Application data and meta-data

CGO, Toronto, Canada, 4/26/2010 2

[Figure: application memory and shadow memory side by side, each with regions for a.out, libc, heap, and stack.]

Page 3: Umbra: Efficient and Scalable Memory Shadowing

Examples

• Memory Error Detection
  – MemCheck [VEE’07]
  – Purify [USENIX’92]
  – Dr. Memory
  – MemTracker [HPCA’07]
• Dynamic Information Flow Tracking
  – LIFT [MICRO’39]
  – TaintTrace [ISCC’06]
• Multi-threaded Debugging
  – Eraser [TCS’97]
  – Helgrind
• Others
  – Redux [TCS’03]
  – Software Watchpoint [CC’08]

Page 4: Umbra: Efficient and Scalable Memory Shadowing

Issues

• Performance
  – Runtime overhead
    • Example: MemCheck 25x [VEE’07]
• Scalability
  – 64-bit architecture
• Dependence
  – OS
  – Hardware
• Development
  – Implemented with specific analysis
  – Lack of a general framework

Page 5: Umbra: Efficient and Scalable Memory Shadowing

Memory Shadowing System

• Dynamic Instrumentation
  – Context switch (application ↔ shadow)
  – Address calculation
  – Updating meta-data
• Memory Management
  – Memory allocation / free
    • Monitor application memory management
    • Manage shadow memory
  – Mapping translation scheme (addrA → addrS)
    • DMS: Direct Mapping Scheme
    • SMS: Segmented Mapping Scheme

Page 6: Umbra: Efficient and Scalable Memory Shadowing

Direct Mapping Scheme (DMS)

• Single memory region for entire address space
• Translation: $addr_S = addr_A + disp$

    lea [addr] → %r1
    add %r1, disp → %r1

• Issue: address conflict between memA and memS

[Figure: application memory next to its single shadow region.]

[Figure: slowdown relative to native execution: DMS-32 1.80, SMS-32 2.40, SMS-64 4.67; no value shown for DMS-64.]
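The DMS translation above is a single addition, which can be sketched in C as follows. The displacement value `DMS_DISP` and the helper `dms_conflicts` are hypothetical names chosen for illustration; the slides give neither a concrete displacement nor a conflict test.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: distance from application memory to its shadow. */
#define DMS_DISP ((uint64_t)0x20000000u)

/* addr_S = addr_A + disp: the whole translation is one add. */
uint64_t dms_translate(uint64_t addr_app)
{
    assert(addr_app <= UINT64_MAX - DMS_DISP); /* no wrap-around */
    return addr_app + DMS_DISP;
}

/* The issue named on the slide: an application address that itself
 * falls inside the region reserved for shadow memory. */
int dms_conflicts(uint64_t addr_app, uint64_t shadow_base, uint64_t shadow_size)
{
    return addr_app >= shadow_base && addr_app - shadow_base < shadow_size;
}
```

Because one displacement serves the entire address space, the shadow region must be reserved up front, and any application mapping that lands inside it is a conflict.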

Page 7: Umbra: Efficient and Scalable Memory Shadowing

Segmented Mapping Scheme (SMS)

• Shadow segment per application segment
• Translation: $addr_S = addr_A + disp_{seg}$
  – Segment lookup (address indexing)
  – Address translation

    lea [addr] → %r1
    mov %r1 → %r2
    shr %r2, 16 → %r2
    add %r1, disp[%r2] → %r1

[Figure: segment table mapping addrA to addrS, with application segments App 1 and App 2 and shadow segments Shd 1 and Shd 2.]

[Figure: slowdown relative to native execution: DMS-32 1.80, SMS-32 2.40, SMS-64 4.67; no value shown for DMS-64.]
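The four-instruction SMS sequence above corresponds roughly to the C sketch below for a 32-bit address space. The 16-bit shift comes from the slide; the table name `seg_disp` and the helper `sms_map_segment` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_SHIFT 16                        /* from the slide: shr %r2, 16 */
#define SEG_COUNT (1u << (32 - SEG_SHIFT))  /* 65536 units of 64 KB */

/* One displacement per 64 KB unit, indexed by addr >> 16. */
uint32_t seg_disp[SEG_COUNT];

/* Hypothetical helper: record the shadow segment for an application segment. */
void sms_map_segment(uint32_t app_base, uint32_t size, uint32_t shadow_base)
{
    assert(size > 0 && size - 1 <= UINT32_MAX - app_base);
    uint32_t first = app_base >> SEG_SHIFT;
    uint32_t last = (app_base + (size - 1)) >> SEG_SHIFT;
    for (uint32_t i = first; i <= last; i++)
        seg_disp[i] = shadow_base - app_base; /* wraps modulo 2^32 */
}

/* Segment lookup by address indexing, then addr_S = addr_A + disp[seg]. */
uint32_t sms_translate(uint32_t addr_app)
{
    return addr_app + seg_disp[addr_app >> SEG_SHIFT];
}
```

Storing the displacement modulo 2^32 lets a shadow segment sit below its application segment without a signed type.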

Page 8: Umbra: Efficient and Scalable Memory Shadowing

Umbra

• Mapping Scheme
  – Segmented mapping
  – Scale with actual memory usage
• Implementation
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Results
  – Performance evaluation
  – Statistics collection

Page 9: Umbra: Efficient and Scalable Memory Shadowing

Shadow Memory Mapping

• Scaling to 64-bit Architecture
  – DMS
    • Infeasible due to memory layout

[Figure: 64-bit address space layout up to 2^64: user space (a.out, stack) below 2^47, unusable space above it, and kernel space with vsyscall at the top.]

Page 10: Umbra: Efficient and Scalable Memory Shadowing

Shadow Memory Mapping

• Scaling to 64-bit Architecture
  – DMS
    • Infeasible due to memory layout
  – Single-Level SMS
    • Too big (~4 billion entries)

[Figure: table lookup indexed by addrA.]
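The "~4 billion entries" figure can be checked with a short calculation. The sketch below assumes 64 KB units (matching the 16-bit shift in the SMS code) and a 48-bit virtual address space; neither assumption is stated on this slide, and `single_level_entries` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Number of entries in a single-level table with one entry per unit. */
uint64_t single_level_entries(unsigned addr_bits, unsigned unit_shift)
{
    assert(addr_bits > unit_shift && addr_bits - unit_shift < 64);
    return (uint64_t)1 << (addr_bits - unit_shift);
}
```

Under those assumptions, `single_level_entries(48, 16)` is 2^32 = 4,294,967,296, while the 32-bit case `single_level_entries(32, 16)` needs only 65,536 entries.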

Page 11: Umbra: Efficient and Scalable Memory Shadowing

Shadow Memory Mapping

• Scaling to 64-bit Architecture
  – DMS
    • Infeasible due to memory layout
  – Single-Level SMS
    • Too big (~4 billion entries)
  – Multi-Level SMS
    • Even more expensive
    • Fast path on lower 32 GB (MemCheck)

[Figure: table lookup indexed by addrA.]

[Figure: slowdown relative to native execution: DMS-32 1.80, SMS-32 2.40, SMS-64 4.67; no value shown for DMS-64.]

Page 12: Umbra: Efficient and Scalable Memory Shadowing

Shadow Memory Mapping

• Scaling to 64-bit Architecture
  – DMS is infeasible
  – Single-Level SMS is too sparse
  – Multi-Level SMS is too expensive
• Umbra Solution
  – Eliminate empty entries
  – Compact table
  – Walk the table to find the entry
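A compact table with no empty entries, searched by a linear walk, might look like the sketch below. The entry layout (`seg_entry_t`) and function names are assumptions for illustration, not Umbra's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per mapped application segment; unmapped ranges have no entry. */
typedef struct {
    uint64_t app_base; /* start of the application segment */
    uint64_t size;     /* segment length in bytes */
    uint64_t disp;     /* shadow_base - app_base, modulo 2^64 */
} seg_entry_t;

/* Walk the table to find the entry covering addr_app, or NULL. */
const seg_entry_t *table_walk(const seg_entry_t *table, size_t count,
                              uint64_t addr_app)
{
    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        /* Unsigned subtraction also rejects addr_app < app_base. */
        if (addr_app - table[i].app_base < table[i].size)
            return &table[i];
    }
    return NULL;
}

/* Returns 1 and stores the shadow address on success, 0 if unmapped. */
int table_translate(const seg_entry_t *table, size_t count,
                    uint64_t addr_app, uint64_t *addr_shadow)
{
    const seg_entry_t *e = table_walk(table, count, addr_app);
    if (e == NULL)
        return 0;
    *addr_shadow = addr_app + e->disp;
    return 1;
}
```

The table grows with the number of segments the application actually maps rather than with the size of the address space, at the price of a walk on every lookup.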

Page 13: Umbra: Efficient and Scalable Memory Shadowing

Umbra

• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Result
  – Performance evaluation
  – Statistics collection

Page 14: Umbra: Efficient and Scalable Memory Shadowing

Implementation

• Memory Manager
  – Monitor and control application memory allocation
    • brk, mmap, munmap, mremap
  – Allocate shadow memory
  – Maintain translation table
• Instrumenter
  – Instrument every memory reference
    • Context save
    • Address calculation
    • Address translation
    • Shadow memory update
    • Context restore

[Figure: application segments App 1 and App 2 with their shadow segments Shd 1 and Shd 2.]

Page 15: Umbra: Efficient and Scalable Memory Shadowing

Umbra

• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation √
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Result
  – Performance evaluation
  – Statistics collection

Page 16: Umbra: Efficient and Scalable Memory Shadowing

Unoptimized System

• Small overhead from DynamoRIO
• Slower than SMS-64
  – Need to walk the global translation table
• Why so slow?
  – 41.79% of instructions are memory references
  – For each of these instructions
    • Full context switch
    • Table lookup
    • Call-out instrumentation

[Figure: global translation table.]

[Figure: slowdown per configuration; y-axis 0 to 20, with the unoptimized bar clipped and labeled ~100.]

  Configuration               Slowdown
  SMS-64                      4.7
  DynamoRIO                   1.1
  Unoptimized                 100.0
  Local translation cache     15.8
  Hashtable                   15.2
  Memoization mini-cache      12.0
  Reference uni-cache         8.3
  Context switch reduction    3.1
  Reference grouping          2.5

Page 17: Umbra: Efficient and Scalable Memory Shadowing

Optimization

• Translation Optimization
  – Thread-local translation cache
  – Hashtable lookup
  – Memoization mini-cache
  – Reference uni-cache
• Instrumentation Optimization
  – Context switch reduction
  – Reference grouping
  – 3-stage code layout

[Figure: global translation table.]

[Figure: slowdown after each optimization, from ~100 (unoptimized) down to 2.5 (reference grouping).]

Page 18: Umbra: Efficient and Scalable Memory Shadowing

1. Thread-Local Translation Cache

• Local translation table per thread
  – Synchronize with global translation table when necessary
  – Avoid lock contention
  – Walk table to find the matching entry
    • Walk global table if not found in thread-local cache
• Inlined instrumentation

[Figure: Thread 1 and Thread 2, each with a thread-local translation cache in front of the shared global translation table.]

[Figure: slowdown chart; the local translation cache bar reads 15.8, down from ~100 unoptimized.]
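The lookup order described above (lock-free walk of the per-thread table first, locked walk of the global table on a miss) can be sketched with C11 thread-local storage. All names here are hypothetical, the spinlock stands in for whatever lock the real system uses, and invalidation of stale local entries is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SEGS 64

typedef struct { uint64_t base, size, disp; } seg_t;

/* Global table shared by all threads, guarded by a spinlock. */
static seg_t global_tab[MAX_SEGS];
static size_t global_cnt;
static atomic_flag global_lock = ATOMIC_FLAG_INIT;

/* Per-thread copy that is walked without taking any lock. */
static _Thread_local seg_t local_tab[MAX_SEGS];
static _Thread_local size_t local_cnt;

static const seg_t *walk(const seg_t *tab, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr - tab[i].base < tab[i].size)
            return &tab[i];
    return NULL;
}

void global_add(uint64_t base, uint64_t size, uint64_t shadow_base)
{
    while (atomic_flag_test_and_set(&global_lock)) { /* spin */ }
    assert(global_cnt < MAX_SEGS);
    global_tab[global_cnt++] = (seg_t){ base, size, shadow_base - base };
    atomic_flag_clear(&global_lock);
}

/* Returns 1 and stores the shadow address, or 0 if addr is unmapped. */
int local_translate(uint64_t addr, uint64_t *shadow)
{
    const seg_t *e = walk(local_tab, local_cnt, addr);
    if (e == NULL) {
        /* Miss: consult the global table under the lock and copy the entry. */
        seg_t found = { 0, 0, 0 };
        int hit = 0;
        while (atomic_flag_test_and_set(&global_lock)) { /* spin */ }
        const seg_t *g = walk(global_tab, global_cnt, addr);
        if (g != NULL) { found = *g; hit = 1; }
        atomic_flag_clear(&global_lock);
        if (!hit)
            return 0;
        assert(local_cnt < MAX_SEGS);
        local_tab[local_cnt] = found;
        e = &local_tab[local_cnt++];
    }
    *shadow = addr + e->disp;
    return 1;
}
```

After the first miss on a segment, later references from the same thread never touch the lock, which is the contention the slide aims to avoid.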

Page 19: Umbra: Efficient and Scalable Memory Shadowing

2. Hashtable Lookup

• Hashtable per thread
• Fixed number of slots
• Hash(addrA) → entry in thread-local cache
  – If match, found
  – If no match, walk the local cache

[Figure: Thread 1 and Thread 2, each with a hashtable and a thread-local translation cache in front of the global translation table.]

[Figure: slowdown chart; the hashtable bar reads 15.2, down from 15.8 for the local translation cache alone.]
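A fixed-size per-thread hashtable whose slots point into the thread-local cache could be sketched as follows. The hash function, the slot count `HASH_SLOTS`, and the 64 KB unit size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNIT_SHIFT 16   /* assumed 64 KB units */
#define HASH_SLOTS 256  /* assumed fixed slot count, a power of two */

typedef struct { uint64_t base, size, disp; } seg_t;

typedef struct {
    const seg_t *segs;              /* thread-local translation cache */
    size_t nsegs;
    const seg_t *slot[HASH_SLOTS];  /* each slot points into segs, or is NULL */
} thread_xlate_t;

static size_t hash_addr(uint64_t addr)
{
    return (size_t)((addr >> UNIT_SHIFT) & (HASH_SLOTS - 1));
}

static int seg_contains(const seg_t *s, uint64_t addr)
{
    return s != NULL && addr - s->base < s->size;
}

const seg_t *hash_lookup(thread_xlate_t *t, uint64_t addr)
{
    assert(t != NULL);
    size_t h = hash_addr(addr);
    if (seg_contains(t->slot[h], addr))
        return t->slot[h];                  /* match: found */
    for (size_t i = 0; i < t->nsegs; i++) { /* no match: walk the local cache */
        if (seg_contains(&t->segs[i], addr)) {
            t->slot[h] = &t->segs[i];       /* refill the slot for next time */
            return t->slot[h];
        }
    }
    return NULL;
}
```

A slot that holds the wrong segment is simply overwritten after the walk, so the table needs no collision chains.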

Page 20: Umbra: Efficient and Scalable Memory Shadowing

3. Memoization Mini-Cache

• Four-entry table per thread
  – Stack
  – Heap
  – Application (a.out)
  – Units found in last table lookup
• If no match, hashtable lookup
  – 68.93% hit ratio

[Figure: Thread 1 and Thread 2, each with a memoization mini-cache, a hashtable, and a thread-local translation cache in front of the global translation table.]

[Figure: slowdown chart; the memoization mini-cache bar reads 12.0, down from 15.2 for the hashtable.]
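The four-entry mini-cache can be sketched as a tiny array probed before any table lookup. The entry roles follow the slide; the structure, the hit and miss counters, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t base, size, disp; } seg_t;

/* Entry roles named on the slide. */
enum { MC_STACK, MC_HEAP, MC_APP, MC_LAST, MC_ENTRIES };

typedef struct {
    seg_t entry[MC_ENTRIES];
    unsigned long hits, misses;
} mini_cache_t;

/* Probe the four entries; NULL means fall through to the hashtable lookup. */
const seg_t *mini_cache_probe(mini_cache_t *mc, uint64_t addr)
{
    assert(mc != NULL);
    for (int i = 0; i < MC_ENTRIES; i++) {
        /* An empty entry has size 0 and therefore never matches. */
        if (addr - mc->entry[i].base < mc->entry[i].size) {
            mc->hits++;
            return &mc->entry[i];
        }
    }
    mc->misses++;
    return NULL;
}

/* Remember the unit found by the slower lookup in the last-lookup entry. */
void mini_cache_remember(mini_cache_t *mc, const seg_t *found)
{
    assert(mc != NULL && found != NULL);
    mc->entry[MC_LAST] = *found;
}
```

A caller would probe first, and on NULL run the hashtable lookup and pass its result to `mini_cache_remember`.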

Page 21: Umbra: Efficient and Scalable Memory Shadowing

4. Reference Uni-Cache

• Software uni-cache per instruction per thread
  – Last reference unit tag
  – Last translation displacement
• If no match, memoization mini-cache check
  – 99.93% hit ratio

[Figure: a reference uni-cache attached to each memory-referencing instruction, in front of the per-thread memoization mini-cache, hashtable, and thread-local translation cache, backed by the global translation table. Example instructions:]

    ADD $1, (%RAX)
    MOV %RBX, 48(%RAX)
    PUSH %RAX
    ADD 40(%RAX), %RBX

[Figure: slowdown chart; the reference uni-cache bar reads 8.3, down from 12.0 for the memoization mini-cache.]

Page 22: Umbra: Efficient and Scalable Memory Shadowing

5. Context Switch Reduction

• Register liveness analysis
  – Use dead register
  – Avoid flags save/restore

  #/#Instr            SPEC2006
  Memory Reference    41.79%
  Eflag Steal         2.55%
  Register Steal      8.20%

[Figure: per-instruction reference uni-caches in front of the per-thread memoization mini-cache, hashtable, and thread-local translation cache, backed by the global translation table. Example instructions:]

    ADD $1, (%RAX)
    MOV %RBX, 48(%RAX)
    PUSH %RAX
    ADD 40(%RAX), %RBX

[Figure: slowdown chart; the context switch reduction bar reads 3.1, down from 8.3 for the reference uni-cache.]

Page 23: Umbra: Efficient and Scalable Memory Shadowing

6. Reference Grouping

• One reference cache for multiple references
  – Stack local variables
  – Different members of the same object

  #/#Instr                SPEC2006
  Memory Reference        41.79%
  Ref Uni-Cache Checks    22.76%

[Figure: one reference uni-cache shared by several instructions, in front of the per-thread memoization mini-cache, hashtable, and thread-local translation cache, backed by the global translation table. Example instructions:]

    ADD $1, (%RAX)
    MOV %RBX, 48(%RAX)
    PUSH %RAX
    ADD 40(%RAX), %RBX

[Figure: slowdown chart; the reference grouping bar reads 2.5, down from 3.1 for context switch reduction.]

Page 24: Umbra: Efficient and Scalable Memory Shadowing

3-stage Code Layout

• Inline stub (<10 instructions)
  – Quick inline check code with minimal context switch
• Lean procedure (~50 instructions)
  – Simple assembly procedure with partial context switch
• Callout (C function)
  – C function with complete context switch

[Figure: lookup steps laid out across the three stages (inline stub, lean procedure, callout): uni-cache check, memoization check, hashtable lookup, local cache lookup, then a c_function() callout for the global table lookup wrapped in full context switches, followed by the app instruction.]

Page 25: Umbra: Efficient and Scalable Memory Shadowing

Umbra

• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation √
  – DynamoRIO
• Optimization √
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Result
  – Performance evaluation
  – Statistics collection

Page 26: Umbra: Efficient and Scalable Memory Shadowing

Client API

  Event Hooks             Description
  client_init             Process initialization
  client_exit             Process exit
  client_thread_init      Thread initialization
  client_thread_exit      Thread exit
  shadow_memory_create    Shadow memory creation
  shadow_memory_delete    Shadow memory deletion
  instrument_update       Insert meta-data update code

Page 27: Umbra: Efficient and Scalable Memory Shadowing

Umbra Client: Shared Memory Detection

    static void
    instrument_update(void *drcontext, umbra_info_t *umbra_info,
                      mem_ref_t *ref, instrlist_t *ilist, instr_t *where)
    {
        …
        /* lock or [%r1], tid_map → [%r1] */
        opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0, OPSZ_4);
        opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
        instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
        LOCK(instr);
        instrlist_meta_preinsert(ilist, label, instr);
    }

• Meta-data maintains a bitmap recording which threads have accessed the associated memory

Page 28: Umbra: Efficient and Scalable Memory Shadowing

Umbra

• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation √
  – DynamoRIO
• Optimization √
  – Translation optimization
  – Instrumentation optimization
• Client API √
• Experimental Result
  – Performance evaluation
  – Statistics collection

Page 29: Umbra: Efficient and Scalable Memory Shadowing

Performance Evaluation

[Figure: slowdown relative to native execution; y-axis 0.0 to 5.0.]

  Configuration    Slowdown
  DMS-32           1.80
  SMS-32           2.40
  SMS-64           4.67
  Umbra-64         2.49

Page 30: Umbra: Efficient and Scalable Memory Shadowing

EMS64: Efficient Memory Shadowing for 64-bit

• Translation
  – $addr_S = addr_A + disp_{rc}$
  – Reference uni-cache hit rate: 99.93%
  – Still need a costly check to catch the 0.07%
    • Reg steal; save flags; compare & jump; restore
• EMS64 (ISMM’10)
  – Speculatively use a disp without check
  – Notified by memory access violation fault for incorrect disp

Page 31: Umbra: Efficient and Scalable Memory Shadowing

EMS64 Preliminary Result

[Figure: slowdown relative to native execution; y-axis 0.0 to 5.0.]

  Configuration    Slowdown
  DMS-32           1.80
  SMS-32           2.40
  SMS-64           4.67
  Umbra-64         2.49
  EMS-64           1.81

Page 32: Umbra: Efficient and Scalable Memory Shadowing

Thanks

• Download
  – http://people.csail.mit.edu/qin_zhao/umbra/
• Q & A