Page 1: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Hoard: A Scalable Memory Allocator for Multithreaded Applications

Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson

Presented by Ivan Jibaja

(Some slides adapted from Emery Berger’s presentation)


Page 2: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Outline

• Motivation
• Problems in allocator design
  – False sharing
  – Fragmentation
• Existing approaches
• Hoard design
• Experimental evaluation

Page 3: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Motivation

• Parallel multithreaded programs are prevalent
  – Web servers, search engines, database managers, etc.
  – Run on CMP/SMP machines for high performance
• Memory allocation is a bottleneck
  – Prevents scaling with the number of processors

Page 4: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Desired allocator attributes on a multiprocessor system

• Speed
  – Competitive with uniprocessor allocators on 1 CPU
• Scalability
  – Performance scales linearly with the number of processors
• Low fragmentation (fragmentation = max allocated / max in use)
  – High fragmentation → poor data locality → paging
• False sharing avoidance

Page 5: Hoard: A Scalable Memory Allocator for Multithreaded Applications

The problem of false sharing

• Program causes false sharing
  – Allocates a number of objects in one cache line and passes the objects to different threads
• Allocators cause false sharing!
  – Actively:
    • malloc satisfies different threads' requests from the same cache line
  – Passively:
    • free allows a future malloc to produce false sharing

[Figure: processor 1 calls x1 = malloc(s) and processor 2 calls x2 = malloc(s); both objects land in a single cache line, which thrashes between the processors]
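
The active case on the slide above can be sketched in a few lines of Python. This is an illustration only: the 64-byte cache line, the 8-byte object size, the 8192-byte per-thread regions, and every helper name are assumptions, not details from the slides.

```python
CACHE_LINE = 64  # assumed cache-line size in bytes

def cache_line_of(addr):
    """Index of the cache line that holds byte address `addr`."""
    return addr // CACHE_LINE

def shares_line(addr_a, addr_b):
    """True when two objects sit on the same cache line."""
    return cache_line_of(addr_a) == cache_line_of(addr_b)

def shared_heap_addresses(n_threads, size):
    """One shared bump pointer: consecutive requests get adjacent addresses."""
    return [t * size for t in range(n_threads)]

def per_thread_region_addresses(n_threads, region=8192):
    """Each thread allocates from its own (hypothetical) 8 KB region."""
    return [t * region for t in range(n_threads)]

x1, x2 = shared_heap_addresses(2, size=8)   # two threads, one shared heap
y1, y2 = per_thread_region_addresses(2)     # two threads, separate regions
print(shares_line(x1, x2))  # True: writes by the two threads contend for one line
print(shares_line(y1, y2))  # False
```

Giving each thread its own region is the property that per-heap superblocks aim for later in the deck.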

Page 6: Hoard: A Scalable Memory Allocator for Multithreaded Applications

The problem of fragmentation

• Blowup:
  – Increase in memory consumption when the allocator reclaims memory freed by the program but fails to use it for future requests
  – Mainly a problem of concurrent allocators
  – Unbounded (worst case) or bounded (O(P))

Page 7: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Example: Pure Private Heaps Allocator

• Pure private heaps:
  – One heap per processor
  – malloc gets memory from the processor's heap or the system
  – free puts memory on the processor's heap
• Avoids heap contention
• Examples: STL, Cilk

[Figure: processor 1 and processor 2 issue x1 = malloc(s), x2 = malloc(s), x3 = malloc(s), x4 = malloc(s), free(x1), free(x2). Legend: allocated by heap 1; free, on heap 2]

Page 8: Hoard: A Scalable Memory Allocator for Multithreaded Applications

How to Break Pure Private Heaps: Fragmentation

• Pure private heaps:
  – Memory consumption can grow without bound!
• Producer-consumer:
  – Processor 1 allocates
  – Processor 2 frees
  – Freed memory is never available to the producer

[Figure: processor 1 calls x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 calls free(x1), free(x2), free(x3)]
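
A small Python model makes the unbounded growth concrete. It is a sketch under simplifying assumptions: blocks are only counted, never addressed, and the function and variable names are hypothetical.

```python
def pure_private_heaps(rounds):
    """Producer-consumer under pure private heaps.

    Processor 1 mallocs one block per round; processor 2 frees it, so the
    block lands on heap 2, where processor 1 never looks.
    Returns (blocks requested from the OS, free blocks per heap).
    """
    free_on_heap = {1: 0, 2: 0}
    from_os = 0
    for _ in range(rounds):
        # processor 1: malloc from its own heap, else from the system
        if free_on_heap[1] > 0:
            free_on_heap[1] -= 1
        else:
            from_os += 1
        # processor 2: free puts the block on processor 2's heap
        free_on_heap[2] += 1
    return from_os, free_on_heap

from_os, free_on_heap = pure_private_heaps(1000)
print(from_os, free_on_heap[2])  # 1000 1000
```

The program never has more than one live block, yet the memory taken from the OS grows with the number of rounds.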

Page 9: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Example II: Private Heaps with Ownership

• free puts memory back on the originating processor's heap
• Avoids unbounded memory consumption
• Examples: ptmalloc, LKmalloc

[Figure: processor 1 and processor 2 issue x1 = malloc(s), x2 = malloc(s), free(x1), free(x2); each freed block returns to the heap it was allocated from]

Page 10: Hoard: A Scalable Memory Allocator for Multithreaded Applications

How to Break Private Heaps with Ownership: Fragmentation

• Memory consumption can blow up by a factor of P
• Round-robin producer-consumer:
  – Processor i allocates
  – Processor i+1 frees
• Program requires 1 (K) blocks, allocator gets 3 (P*K) blocks

[Figure: processors 1, 2 and 3 issue x1 = malloc(s), x2 = malloc(s), x3 = malloc(s) and free(x1), free(x2), free(x3) in round-robin fashion]
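
The P-fold blowup can be reproduced with a counting model in Python. This is a sketch, not any real allocator's behavior: blocks are counted rather than addressed, and `ownership_round_robin` and its parameters are hypothetical names.

```python
def ownership_round_robin(P, K, rounds=1):
    """Round-robin producer-consumer under private heaps with ownership.

    In each step processor i allocates K blocks, then processor i+1 frees
    them; ownership sends every freed block back to heap i.
    Returns the number of blocks requested from the OS.
    """
    free_on_heap = [0] * P
    from_os = 0
    for _round in range(rounds):
        for i in range(P):
            for _block in range(K):
                if free_on_heap[i] > 0:
                    free_on_heap[i] -= 1   # reuse a block already on heap i
                else:
                    from_os += 1           # heap i is empty: go to the OS
            free_on_heap[i] += K           # freed by processor i+1, returned to heap i
    return from_os

print(ownership_round_robin(P=3, K=1))              # 3
print(ownership_round_robin(P=3, K=1, rounds=100))  # 3
```

At most K blocks are live at any time, but every heap ends up holding K blocks of its own, so the allocator holds P*K. Unlike pure private heaps, the total stops growing after the first round.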

Page 11: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Existing approaches


Page 12: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Uniprocessor Allocators on Multiprocessors

• Fragmentation: Excellent
  – Very low for most programs [Wilson & Johnstone]
• Speed & Scalability: Poor
  – Heap contention
    • A single lock protects the heap
• Can exacerbate false sharing
  – Different processors can share cache lines

Page 13: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Existing Multiprocessor Allocators

• Speed:
  – One concurrent heap (e.g., concurrent B-tree):
    • O(log(#size-classes)) cost per memory operation
    • Too many locks/atomic updates
  – Fast allocators use multiple heaps
• Scalability:
  – Allocator-induced false sharing
  – Other bottlenecks (e.g., nextHeap global in ptmalloc)
• Fragmentation:
  – P-fold increase or even unbounded

Page 14: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Hoard as the solution


Page 15: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Hoard Overview

• P per-processor heaps & 1 global heap
• Each thread accesses only its local heap & the global heap
• Manages memory in page-sized superblocks of same-sized objects (LIFO free-list)
  – Avoids false sharing by not carving up cache lines
  – Avoids heap contention: local heaps allocate & free small blocks from their superblocks
• Avoids blowup by
  – Moving superblocks to the global heap when the fraction of free memory exceeds some threshold

Page 16: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Superblock management

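Emptiness threshold: (u_i ≥ (1 − f)·a_i) ∨ (u_i ≥ a_i − K·S)

Here u_i is the memory in use in heap i, a_i is the memory heap i holds from the OS, S is the superblock size, f is the empty fraction, and K is a number of superblocks.

f = ¼, K = 0

• Multiple heaps → avoid actively induced false sharing
• Block coalescing → avoid passively induced false sharing
• Superblocks transferred are usually empty and transfer is infrequent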
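
The emptiness threshold reads directly as a predicate. The Python sketch below uses the slide's f = ¼ and K = 0 as defaults and assumes S = 8192 bytes (the superblock size quoted in the evaluation setup); the function name is hypothetical.

```python
def invariant_holds(u, a, f=0.25, K=0, S=8192):
    """True when heap i is full enough to keep all of its superblocks.

    u: bytes in use in the heap (u_i); a: bytes the heap holds (a_i).
    """
    return u >= (1 - f) * a or u >= a - K * S

print(invariant_holds(u=6144, a=8192))       # True: exactly (1 - f) of the heap is in use
print(invariant_holds(u=6136, a=8192))       # False: one 8-byte block below the threshold
print(invariant_holds(u=0, a=8192, K=1))     # True: K = 1 tolerates one empty superblock
```

When the predicate is false, the heap is a candidate for handing a superblock to the global heap.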

Page 17: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Hoard pseudo-code

malloc(sz)
  If sz > S/2, allocate the superblock from the OS and return it.
  i ← hash(current thread)
  Lock heap i
  Scan heap i's list of superblocks from most full to least full (for the size class of sz)
  If there is no superblock with free space {
    Check heap 0 (global) for a superblock
    If there is none {
      Allocate S bytes as superblock s & set owner to heap i
    } Else {
      Transfer the superblock s to heap i
      u_0 ← u_0 − s.u; u_i ← u_i + s.u
      a_0 ← a_0 − S; a_i ← a_i + S
    }
  }
  u_i ← u_i + sz; s.u ← s.u + sz
  Unlock heap i
  Return a block from the superblock

free(ptr)
  If the block is "large" {
    Free superblock to OS and return
  }
  Find the superblock s this block comes from
  Lock s
  Lock heap i, the superblock's owner
  Deallocate the block from the superblock
  u_i ← u_i − block size
  s.u ← s.u − block size
  If (i = 0), unlock heap i and superblock s, and return
  If (u_i < a_i − K*S) and (u_i < (1 − f)*a_i) {
    Transfer a mostly-empty superblock s1 to heap 0 (global)
    u_0 ← u_0 + s1.u; u_i ← u_i − s1.u
    a_0 ← a_0 + S; a_i ← a_i − S
  }
  Unlock heap i and superblock s
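
The pseudo-code above can be modelled in Python to watch the bookkeeping at work. This is a single-threaded sketch, not Hoard's implementation: locks and the large-object path are omitted, there is one size class, "pointers" are (superblock, slot) pairs, the thread hash is a plain modulo, and all class and method names are hypothetical. It also adds S to a_i when a fresh superblock arrives, which the slide's pseudo-code leaves implicit.

```python
S, F, K = 8192, 0.25, 0   # superblock bytes, empty fraction f, slack K (assumed values)

class Superblock:
    def __init__(self, block_size, owner):
        self.owner = owner                         # index of the owning heap
        self.u = 0                                 # bytes in use in this superblock
        self.free = list(range(S // block_size))   # LIFO free list of block slots

class Heap:
    def __init__(self):
        self.u = 0             # bytes in use (u_i)
        self.a = 0             # bytes held from the OS (a_i)
        self.superblocks = []

class Hoard:
    def __init__(self, n_heaps, block_size=8):
        self.bs = block_size
        self.heaps = [Heap() for _ in range(n_heaps + 1)]   # heaps[0] is the global heap

    def malloc(self, thread_id):
        i = 1 + thread_id % (len(self.heaps) - 1)   # stand-in for hash(current thread)
        heap, glob = self.heaps[i], self.heaps[0]
        usable = [s for s in heap.superblocks if s.free]
        if usable:
            s = max(usable, key=lambda sb: sb.u)    # fullest superblock with free space
        else:
            s = next((sb for sb in glob.superblocks if sb.free), None)
            if s is None:
                s = Superblock(self.bs, i)          # fresh superblock "from the OS"
            else:
                glob.superblocks.remove(s)          # transfer from the global heap
                glob.u -= s.u
                glob.a -= S
                s.owner = i
            heap.superblocks.append(s)
            heap.u += s.u
            heap.a += S
        slot = s.free.pop()
        s.u += self.bs
        heap.u += self.bs
        return s, slot

    def free(self, ptr):
        s, slot = ptr
        heap, glob = self.heaps[s.owner], self.heaps[0]
        s.free.append(slot)
        s.u -= self.bs
        heap.u -= self.bs
        if s.owner == 0:
            return
        if heap.u < heap.a - K * S and heap.u < (1 - F) * heap.a:
            s1 = min(heap.superblocks, key=lambda sb: sb.u)   # emptiest superblock
            heap.superblocks.remove(s1)
            heap.u -= s1.u
            heap.a -= S
            glob.u += s1.u
            glob.a += S
            s1.owner = 0
            glob.superblocks.append(s1)

    def held(self):
        """Total bytes held across the global and per-processor heaps."""
        return sum(h.a for h in self.heaps)

h = Hoard(n_heaps=2)
for _ in range(10_000):
    h.free(h.malloc(thread_id=0))   # allocate, then free, over and over
print(h.held())  # 8192: a single superblock is recycled through the global heap
```

In this model, repeated allocate-and-free traffic keeps reusing one superblock instead of growing the heap, which is the blowup-avoidance behavior the overview slide describes.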

Page 18: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Heap contention

• Per-processor heap contention
  – 1 allocator thread / multiple threads free
    • Inherently unscalable
  – Pairs of producer/consumer threads
    • malloc/free calls serialized
    • At most 2X slowdown (undesirable but scalable)
  – Empirically, only a small fraction of memory is freed by another thread → contention expected to be low

Page 19: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Heap contention (2)

• Global heap contention
  – Measure # of global heap (GH) lock acquisitions as an upper bound
  – Growing phase:
    • Each thread makes at most k/(f*S/s) acquisitions for k mallocs
  – Shrinking phase:
    • Pathological case where the program frees (1-f) of each superblock and then frees every block in a superblock one at a time
  – Empirically: no excessive shrinking and gradual growth of memory usage → low overall contention
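
To get a feel for the growing-phase bound k/(f*S/s), it can be evaluated numerically. The Python sketch below plugs in the evaluation setup's S = 8K and f = ¼ together with 8-byte blocks; the function name and the choice of k are assumptions for illustration.

```python
def max_global_acquisitions(k, f=0.25, S=8192, s=8):
    """Upper bound from the slide: k / (f * S / s) global-heap lock
    acquisitions for k mallocs of s-byte blocks in the growing phase."""
    return k / (f * S / s)

print(0.25 * 8192 / 8)                   # 256.0 mallocs per global-heap acquisition
print(max_global_acquisitions(100_000))  # 390.625
```

With these values a thread touches the global heap at most about once per 256 mallocs, so 100,000 mallocs need fewer than 400 global lock acquisitions.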

Page 20: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Experimental Evaluation

• Dedicated 14-processor Sun Enterprise
  – 400 MHz UltraSPARC
  – 2 GB RAM, 4 MB L2 cache
  – Solaris 7
  – Superblock size = 8K, f = ¼
• Comparison between
  – Hoard
  – Ptmalloc (GNU libc, multiple heaps & ownership)
  – Mtmalloc (Solaris multithreaded allocator)
  – Solaris (default system allocator)

Page 21: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Benchmarks


Page 22: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Speed


Size classes need to be handled more cleverly

Page 23: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Scalability - threadtest


278% faster than Ptmalloc on 14 cpus

t threads allocate/deallocate 100,000/t 8-byte objects

Page 24: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Scalability – Larson

• "Bleeding" typical in server applications
• Mainly stays within empty fraction during execution
• 18X faster than next best allocator on 14 CPUs

Page 25: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Scalability - BEMengine

• Few times below empty fraction → low synchronization

Page 26: Hoard: A Scalable Memory Allocator for Multithreaded Applications

False sharing behavior

• Active-false: Each thread allocates a small object, writes to it a few times, and frees it

• Passive-false: Allocate objects, hand them to threads that free them, then emulate Active-false

• Both illustrate the effects of contention in the cache coherence mechanism

Page 27: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Fragmentation results

A large number of size classes remain live for the duration of the program and are scattered across blocks

Within 20% of Lea's allocator

Page 28: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Hoard Conclusions

• Speed: Excellent
  – As fast as a uniprocessor allocator on one processor
  – Amortized O(1) cost
  – 1 lock for malloc, 2 for free
• Scalability: Excellent
  – Scales linearly with the number of processors
  – Avoids false sharing
• Fragmentation: Very good
  – Worst case is provably close to ideal
  – Actual observed fragmentation is low

Page 29: Hoard: A Scalable Memory Allocator for Multithreaded Applications

Discussion Points

• If we had to re-evaluate Hoard today, which benchmarks would we use?

• Are there any changes needed to make it work with languages like Java?
