52
Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap Paper by: Chris Lattner and Vikram Adve University of Illinois at Urbana-Champaign Best Paper Award at PLDI 2005 Presented by: Jeff Da Silva CARG - Aug 2 nd 2005

Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

  • Upload
    aulani

  • View
    42

  • Download
    0

Embed Size (px)

DESCRIPTION

Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap. Paper by: Chris Lattner and Vikram Adve University of Illinois at Urbana-Champaign Best Paper Award at PLDI 2005. Presented by: Jeff Da Silva CARG - Aug 2 nd 2005. Motivation. - PowerPoint PPT Presentation

Citation preview

Page 1: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Automatic Pool

Allocation:Improving Performance by

Controlling Data Structure Layout

in the HeapPaper by:

Chris Lattner and Vikram AdveUniversity of Illinois at Urbana-ChampaignBest Paper Award at PLDI 2005

Presented by: Jeff Da SilvaCARG - Aug 2nd 2005

Page 2: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

MotivationMotivation

Computer architecture and compiler research has Computer architecture and compiler research has primarily focused on analyzing & optimizing memory primarily focused on analyzing & optimizing memory access pattern for dense arrays rather than for access pattern for dense arrays rather than for pointer-based data structures.pointer-based data structures. e.g. caches, prefetching, loop transformations, etc.e.g. caches, prefetching, loop transformations, etc.

Why?Why? Compilers have precise knowledge of the runtime layout and Compilers have precise knowledge of the runtime layout and

traversal patterns associated with arrays.traversal patterns associated with arrays. The layout of heap allocated data structures and traversal The layout of heap allocated data structures and traversal

patterns can be difficult to statically predict (generally it’s not patterns can be difficult to statically predict (generally it’s not worth the effort).worth the effort).

Page 3: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

IntuitionIntuition Improving a dynamic data structure’s spatial locality Improving a dynamic data structure’s spatial locality

through program analysis (like shape analysis) is through program analysis (like shape analysis) is probably too difficult and might not be the best probably too difficult and might not be the best approach.approach.

What if you can somehow influence the layout What if you can somehow influence the layout dynamically so that the data structure is allocated dynamically so that the data structure is allocated intelligently and possibly appears like a dense intelligently and possibly appears like a dense array?array?

A new approach: develop a new technique that A new approach: develop a new technique that operates at the “macroscopic” leveloperates at the “macroscopic” level i.e. at the level of the entire data structure rather than individual i.e. at the level of the entire data structure rather than individual

pointers or objectspointers or objects

Page 4: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

What is the problem?What is the problem?List 1Nodes

What the program creates:What the program creates:

List 2Nodes

TreeNodes

Page 5: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

What is the problem?What is the problem?List 1Nodes

List 2Nodes

TreeNodes

What the compiler sees:What the compiler sees:

Page 6: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

What is the problem?What is the problem?List 1Nodes

List 2Nodes

TreeNodes

What we want the program to create and theWhat we want the program to create and thecompiler to see:compiler to see:

What the compiler sees:What the compiler sees:

Page 7: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Their Approach: Segregate the HeapTheir Approach: Segregate the Heap

Step #1: Memory Usage AnalysisStep #1: Memory Usage Analysis Build context-sensitive points-to graphs for programBuild context-sensitive points-to graphs for program We use a fast unification-based algorithmWe use a fast unification-based algorithm

Step #2: Automatic Pool AllocationStep #2: Automatic Pool Allocation Segregate memory based on points-to graph nodesSegregate memory based on points-to graph nodes Find lifetime bounds for memory with escape analysisFind lifetime bounds for memory with escape analysis Preserve points-to graph-to-pool mappingPreserve points-to graph-to-pool mapping

Step #3: Follow-on pool-specific optimizationsStep #3: Follow-on pool-specific optimizations Use segregation and points-to graph for later Use segregation and points-to graph for later

optimizationsoptimizations

Page 8: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Why Segregate Data Structures?Why Segregate Data Structures?

Primary Goal: Primary Goal: Better compiler information & controlBetter compiler information & control Compiler knows where each data structure lives in memoryCompiler knows where each data structure lives in memory Compiler knows order of data in memory (in some cases)Compiler knows order of data in memory (in some cases) Compiler knows type info for heap objects (from points-to info)Compiler knows type info for heap objects (from points-to info) Compiler knows which pools point to which other poolsCompiler knows which pools point to which other pools

Second Goal:Second Goal: Better performance Better performance Smaller working setsSmaller working sets Improved spatial localityImproved spatial locality

Especially if allocation order matches traversal orderEspecially if allocation order matches traversal order

Sometimes convert irregular strides to regular stridesSometimes convert irregular strides to regular strides

Page 9: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Contributions of this PaperContributions of this Paper

1.1. First “region inference” technique for C/C++:First “region inference” technique for C/C++: Previous work Previous work requiredrequired type-safe programs: ML, Java type-safe programs: ML, Java Previous work focused on memory managementPrevious work focused on memory management

2.2. Region inference driven by pointer analysis:Region inference driven by pointer analysis: Enables handling non-type-safe programsEnables handling non-type-safe programs Simplifies handling imperative programsSimplifies handling imperative programs Simplifies further pool+ptr transformationsSimplifies further pool+ptr transformations

3.3. New pool-based optimizations:New pool-based optimizations: Exploit per-pool and pool-specific propertiesExploit per-pool and pool-specific properties

4.4. Evaluation of impact on memory hierarchy:Evaluation of impact on memory hierarchy: We show that pool allocation reduces working setsWe show that pool allocation reduces working sets

Page 10: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

OutlineOutline

Introduction & MotivationIntroduction & Motivation Automatic Pool Allocation TransformationAutomatic Pool Allocation Transformation Pool Allocation-Based OptimizationsPool Allocation-Based Optimizations Pool Allocation & Optimization Performance Pool Allocation & Optimization Performance

ImpactImpact ConclusionConclusion

Page 11: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

ExampleExamplestruct list { list *Next; int *Data; };struct list { list *Next; int *Data; };

list *list *createnodecreatenode(int *Data) {(int *Data) {list *New = list *New = mallocmalloc(sizeof(list));(sizeof(list));New->Data = Num; return New;New->Data = Num; return New;

return New;return New;}}

void void splitclonesplitclone(list *L, list **R1, list **R2) {(list *L, list **R1, list **R2) {if(L==0) {*R1 = *R2 = 0 return;}if(L==0) {*R1 = *R2 = 0 return;}if(some_predicate(L->Data)) {if(some_predicate(L->Data)) {

*R1 = *R1 = createnodecreatenode(L->data);(L->data);splitclonesplitclone(L->Next, &(*R1)->Next);(L->Next, &(*R1)->Next);

} else {} else {*R2 = *R2 = createnodecreatenode(L->data);(L->data);splitclonesplitclone(L->Next, &(*R1)->Next);(L->Next, &(*R1)->Next);

}}}}

Page 12: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

ExampleExample

void void processlistprocesslist(list *L) {(list *L) {list *A, *B, *tmp;list *A, *B, *tmp;

// Clone L, splitting nodes in list A, and B.// Clone L, splitting nodes in list A, and B.splitclonesplitclone(L, &A, &B);(L, &A, &B);processPortionprocessPortion(A); // Process first list(A); // Process first listprocessPortionprocessPortion(B); // Process second list(B); // Process second list

// free A list// free A listwhile (A) {tmp = A->Next; while (A) {tmp = A->Next; freefree(A); A = tmp; }(A); A = tmp; }

// free B list// free B listwhile (B) {tmp = B->Next; while (B) {tmp = B->Next; freefree(B); B = tmp; }(B); B = tmp; }

}}Note that lists A and B use distinct heap memory; it would therefore be beneficial if they were allocated using separate pools of memory.

Page 13: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Pool Alloc Runtime Library InterfacePool Alloc Runtime Library Interface

void void poolcreatepoolcreate(Pool *PD, uint Size, uint Align);(Pool *PD, uint Size, uint Align);Initialize a pool descriptor. (obtains one or more pages of memory using malloc)Initialize a pool descriptor. (obtains one or more pages of memory using malloc)

void void pooldestroypooldestroy(Pool *PD);(Pool *PD);Releases pool memory and destroy pool descriptor.Releases pool memory and destroy pool descriptor.

void *void *poolallocpoolalloc(Pool *PD, uint numBytes);(Pool *PD, uint numBytes);

void void poolfreepoolfree(Pool *PD, void *ptr);(Pool *PD, void *ptr);

void *void *poolreallocpoolrealloc(Pool *PD, void *ptr, uint numBytes);(Pool *PD, void *ptr, uint numBytes);

Interface also uses:Interface also uses:poolinit_bppoolinit_bp(..), (..), poolalloc_bppoolalloc_bp(..), (..), pooldestroy_bppooldestroy_bp(..).(..).

Page 14: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Algorithm StepsAlgorithm Steps

1)1) Generate a Generate a DS GraphDS Graph (Points-to Graphs) for each (Points-to Graphs) for each functionfunction

2)2) Insert code to Insert code to createcreate and and destroydestroy pool descriptors pool descriptors for DS nodes who’s lifetime does not escape a for DS nodes who’s lifetime does not escape a function.function.

3)3) Add pool descriptor Add pool descriptor argumentsarguments for every DS node for every DS node that escapes a function.that escapes a function.

4)4) Replace Replace mallocmalloc and and freefree with calls to with calls to poolallocpoolalloc and and poolfreepoolfree..

5)5) Further refinements and optimizationsFurther refinements and optimizations

Page 15: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Points-To Graph – DS GraphPoints-To Graph – DS Graph

Builds a points-to graph for each function in Builds a points-to graph for each function in Bottom Up [BU] order Bottom Up [BU] order

Context sensitive naming of heap objectsContext sensitive naming of heap objects More advanced than the traditional “allocation callsite” More advanced than the traditional “allocation callsite”

A unification-based approachA unification-based approach Allows for a fast and scalable analysisAllows for a fast and scalable analysis Ensures every pointer points to one unique nodeEnsures every pointer points to one unique node

Field SensitiveField Sensitive Added accuracy Added accuracy

Also used to compute escape infoAlso used to compute escape info

Page 16: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Example – DS GraphExample – DS Graph

list *list *createnodecreatenode(int *Data) {(int *Data) {list *New = list *New = mallocmalloc(sizeof(list));(sizeof(list));New->Data = Num; return New;New->Data = Num; return New;

return New;return New;}}

Page 17: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Example – DS GraphExample – DS Graphvoid void splitclonesplitclone(list *L, list **R1, list **R2) {(list *L, list **R1, list **R2) {

if(L==0) {*R1 = *R2 = 0 return;}if(L==0) {*R1 = *R2 = 0 return;}if(some_predicate(L->Data)) {if(some_predicate(L->Data)) {

*R1 = *R1 = createnodecreatenode(L->data);(L->data);splitclonesplitclone(L->Next, &(*R1)->Next);(L->Next, &(*R1)->Next);

} else {} else {*R2 = *R2 = createnodecreatenode(L->data);(L->data);splitclonesplitclone(L->Next, &(*R1)->Next);(L->Next, &(*R1)->Next);

}}}}

P1 P2

Page 18: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Example – DS GraphExample – DS Graphvoid void processlistprocesslist(list *L) {(list *L) {

list *A, *B, *tmp;list *A, *B, *tmp;// Clone L, splitting nodes in list A, and B.// Clone L, splitting nodes in list A, and B.splitclonesplitclone(L, &A, &B);(L, &A, &B);processPortionprocessPortion(A); // Process first list(A); // Process first listprocessPortionprocessPortion(B); // Process second list(B); // Process second list// free A list// free A listwhile (A) {tmp = A->Next; while (A) {tmp = A->Next; freefree(A); A = tmp; }(A); A = tmp; }// free B list// free B listwhile (B) {tmp = B->Next; while (B) {tmp = B->Next; freefree(B); B = tmp; }(B); B = tmp; }

}}

P1 P2

Page 19: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Example – TransformationExample – Transformation

list *list *createnodecreatenode((Pool *PD,Pool *PD, int *Data) { int *Data) {

list *New = list *New = poolallocpoolalloc((PD,PD, sizeof(list)); sizeof(list));

New->Data = Num; return New;New->Data = Num; return New; return New;return New;

}}

Page 20: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Example – TransformationExample – Transformation

void void splitclonesplitclone((Pool *PD1, Pool *PD2,Pool *PD1, Pool *PD2, list *L, list **R1, list **R2) {list *L, list **R1, list **R2) {

if(L==0) {*R1 = *R2 = 0 return;}if(L==0) {*R1 = *R2 = 0 return;}

if(some_predicate(L->Data)) {if(some_predicate(L->Data)) {

*R1 = *R1 = createnodecreatenode((PD1,PD1, L->data); L->data);splitclonesplitclone((PD1, PD2, PD1, PD2, L->Next, &(*R1)->Next);L->Next, &(*R1)->Next);

} else {} else {

*R2 = *R2 = createnodecreatenode((PD2PD2,, L->data); L->data);splitclonesplitclone((PD1, PD2, PD1, PD2, L->Next, &(*R1)->Next);L->Next, &(*R1)->Next);

}}}}

P1 P2

Page 21: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Example – TransformationExample – Transformationvoid void processlistprocesslist(list *L) {(list *L) {

list *A, *B, *tmp;list *A, *B, *tmp;

Pool PD1, PD2;Pool PD1, PD2;poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);

splitclonesplitclone((&PD1, &PD2,&PD1, &PD2, L, &A, &B); L, &A, &B);

processPortionprocessPortion(A); // Process first list(A); // Process first listprocessPortionprocessPortion(B); // Process second list(B); // Process second list// free A list// free A listwhile (A) {tmp = A->Next; while (A) {tmp = A->Next; poolfreepoolfree((&PD1&PD1, A); A = tmp; }, A); A = tmp; }// free B list// free B listwhile (B) {tmp = B->Next; while (B) {tmp = B->Next; poolfreepoolfree((&PD2&PD2, B); B = tmp; }, B); B = tmp; }

pooldestroy(&PD1);pooldestroy(&PD1);pooldestroy(&PD2);pooldestroy(&PD2);

}}

P1 P2

Page 22: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

More Algorithm DetailsMore Algorithm Details

Indirect Function Call Handling:Indirect Function Call Handling: Partition functions into equivalence classes:Partition functions into equivalence classes:

If F1, F2 have If F1, F2 have common call-sitecommon call-site same class same class Merge points-to graphs for each equivalence classMerge points-to graphs for each equivalence class Apply previous transformation unchangedApply previous transformation unchanged

Global variables pointing to memory nodesGlobal variables pointing to memory nodes Use global pool variable rather than passing them Use global pool variable rather than passing them

around through function args.around through function args.

Page 23: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

More Algorithm DetailsMore Algorithm Details

poolcreate / pooldestroy placement:poolcreate / pooldestroy placement: Move calls earlier/later by analyzing the pool’s lifetimeMove calls earlier/later by analyzing the pool’s lifetime Reduces memory usageReduces memory usage Enables poolfree eliminationEnables poolfree elimination

poolfree eliminationpoolfree elimination Eliminate unnecessary poolfree callsEliminate unnecessary poolfree calls

No allocations between poolfree & pooldestroyNo allocations between poolfree & pooldestroy Behaves like static garbage collectionBehaves like static garbage collection

Page 24: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Example – Example – poolcreate/pooldestroy placementpoolcreate/pooldestroy placement

void void processlistprocesslist(list *L) {(list *L) {list *A, *B, *tmp;list *A, *B, *tmp;

Pool PD1, PD2;Pool PD1, PD2;poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);

splitclonesplitclone((&PD1, &PD2,&PD1, &PD2, L, &A, &B); L, &A, &B);

processPortionprocessPortion(A); // Process first list(A); // Process first listprocessPortionprocessPortion(B); // Process second list(B); // Process second list// free A list// free A listwhile (A) {tmp = A->Next; while (A) {tmp = A->Next; poolfreepoolfree((&PD1&PD1, A); A = tmp; }, A); A = tmp; }// free B list// free B listwhile (B) {tmp = B->Next; while (B) {tmp = B->Next; poolfreepoolfree((&PD2&PD2, B); B = tmp; }, B); B = tmp; }

pooldestroy(&PD1);pooldestroy(&PD1);pooldestroy(&PD2);pooldestroy(&PD2);

}}

void void processlistprocesslist(list *L) {(list *L) {list *A, *B, *tmp;list *A, *B, *tmp;

Pool PD1, PD2;Pool PD1, PD2;poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);

splitclonesplitclone((&PD1, &PD2,&PD1, &PD2, L, &A, &B); L, &A, &B);

processPortionprocessPortion(A); // Process first list(A); // Process first listprocessPortionprocessPortion(B); // Process second list(B); // Process second list// free A list// free A listwhile (A) {tmp = A->Next; while (A) {tmp = A->Next; poolfreepoolfree((&PD1&PD1, A); A = tmp; }, A); A = tmp; }

pooldestroy(&PD1);pooldestroy(&PD1);

// free B list// free B listwhile (B) {tmp = B->Next; while (B) {tmp = B->Next; poolfreepoolfree((&PD2&PD2, B); B = tmp; }, B); B = tmp; }

pooldestroy(&PD2);pooldestroy(&PD2);}}

Page 25: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Example – Example – poolfree Eliminationpoolfree Elimination

void void processlistprocesslist(list *L) {(list *L) {list *A, *B, *tmp;list *A, *B, *tmp;

Pool PD1, PD2;Pool PD1, PD2;poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);

splitclonesplitclone((&PD1, &PD2,&PD1, &PD2, L, &A, &B); L, &A, &B);

processPortionprocessPortion(A); // Process first list(A); // Process first listprocessPortionprocessPortion(B); // Process second list(B); // Process second list// free A list// free A listwhile (A) {tmp = A->Next; while (A) {tmp = A->Next; poolfreepoolfree((&PD1&PD1, A); A = tmp; }, A); A = tmp; }

pooldestroy(&PD1);pooldestroy(&PD1);

// free B list// free B listwhile (B) {tmp = B->Next; while (B) {tmp = B->Next; poolfreepoolfree((&PD2&PD2, B); B = tmp; }, B); B = tmp; }

pooldestroy(&PD2);pooldestroy(&PD2);}}

void void processlistprocesslist(list *L) {(list *L) {list *A, *B, *tmp;list *A, *B, *tmp;

Pool PD1, PD2;Pool PD1, PD2;poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);

splitclonesplitclone((&PD1, &PD2,&PD1, &PD2, L, &A, &B); L, &A, &B);

processPortionprocessPortion(A); // Process first list(A); // Process first listprocessPortionprocessPortion(B); // Process second list(B); // Process second list// free A list// free A listwhile (A) {tmp = A->Next; while (A) {tmp = A->Next; A = tmp; }A = tmp; }

pooldestroy(&PD1);pooldestroy(&PD1);

// free B list// free B listwhile (B) {tmp = B->Next; while (B) {tmp = B->Next; B = tmp; }B = tmp; }

pooldestroy(&PD2);pooldestroy(&PD2);}}

void void processlistprocesslist(list *L) {(list *L) {list *A, *B, *tmp;list *A, *B, *tmp;

Pool PD1, PD2;Pool PD1, PD2;poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD1, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);poolcreate(&PD2, sizeof(list), 8);

splitclonesplitclone((&PD1, &PD2,&PD1, &PD2, L, &A, &B); L, &A, &B);

processPortionprocessPortion(A); // Process first list(A); // Process first listprocessPortionprocessPortion(B); // Process second list(B); // Process second list

pooldestroy(&PD1);pooldestroy(&PD1);

pooldestroy(&PD2);pooldestroy(&PD2);}}

Page 26: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

OutlineOutline

Introduction & MotivationIntroduction & Motivation Automatic Pool Allocation TransformationAutomatic Pool Allocation Transformation Pool Allocation-Based OptimizationsPool Allocation-Based Optimizations Pool Allocation & Optimiztion Performance Pool Allocation & Optimiztion Performance

ImpactImpact ConclusionConclusion

Page 27: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

PAOpts (1/4) and (2/4)PAOpts (1/4) and (2/4)

Selective Pool AllocationSelective Pool Allocation Don’t pool allocate when not profitableDon’t pool allocate when not profitable Avoids creating and destroying a pool descriptor Avoids creating and destroying a pool descriptor

(minor) and avoids significant wasted space when the (minor) and avoids significant wasted space when the object is much smaller than the smallest internal object is much smaller than the smallest internal page.page.

PoolFree EliminationPoolFree Elimination Remove explicit de-allocations that are not neededRemove explicit de-allocations that are not needed

Page 28: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Looking closely: Anatomy of a heapLooking closely: Anatomy of a heap

Fully general malloc-compatible allocator:Fully general malloc-compatible allocator: Supports malloc/free/realloc/memalign etc.Supports malloc/free/realloc/memalign etc. Standard malloc overheads: object header, alignmentStandard malloc overheads: object header, alignment Allocates slabs of memory with exponential growthAllocates slabs of memory with exponential growth By default, all returned pointers are 8-byte alignedBy default, all returned pointers are 8-byte aligned

In memory, things look like (16 byte allocs):In memory, things look like (16 byte allocs):

16-byte16-byteuser datauser data

16-byte16-byteuser datauser data

16-byte16-byteuser datauser data

One 32-byte Cache Line

4-byte object header4-byte padding for

user-data alignment

Page 29: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

PAOpts (3/4): Bump Pointer OptznPAOpts (3/4): Bump Pointer Optzn

If a pool has no poolfree’s:If a pool has no poolfree’s: Eliminate per-object headerEliminate per-object header Eliminate freelist overhead (faster object allocation)Eliminate freelist overhead (faster object allocation)

Eliminates 4 bytes of inter-object paddingEliminates 4 bytes of inter-object padding Pack objects more densely in the cachePack objects more densely in the cache

Interacts with poolfree elimination (PAOpt 2/4)!Interacts with poolfree elimination (PAOpt 2/4)! If poolfree elim deletes all frees, BumpPtr can applyIf poolfree elim deletes all frees, BumpPtr can apply

16-byte16-byteuser datauser data

16-byte16-byteuser datauser data

One 32-byte Cache Line

16-byte16-byteuser datauser data

16-byte16-byteuser datauser data

Page 30: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

PAOpts (4/4): Alignment AnalysisPAOpts (4/4): Alignment Analysis

Malloc must return 8-byte aligned memory:Malloc must return 8-byte aligned memory: It has no idea what types will be used in the memoryIt has no idea what types will be used in the memory Some machines bus error, others suffer performance Some machines bus error, others suffer performance

problems for unaligned memoryproblems for unaligned memory

Type-safe pools infer a type for the pool:Type-safe pools infer a type for the pool: Use 4-byte alignment for pools we know don’t need itUse 4-byte alignment for pools we know don’t need it Reduces inter-object paddingReduces inter-object padding

16-byte16-byteuser datauser data

16-byte16-byteuser datauser data

16-byte16-byteuser datauser data

One 32-byte Cache Line

4-byte object header

16-byte16-byteuser datauser data

Page 31: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

OutlineOutline

Introduction & MotivationIntroduction & Motivation Automatic Pool Allocation TransformationAutomatic Pool Allocation Transformation Pool Allocation-Based OptimizationsPool Allocation-Based Optimizations Pool Allocation & Optimization Performance Pool Allocation & Optimization Performance

ImpactImpact ConclusionConclusion

Page 32: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Implementation & InfrastructureImplementation & Infrastructure

Link-time transformation using the LLVM Link-time transformation using the LLVM Compiler InfrastructureCompiler Infrastructure

Uses LLVM-to-C back-end and the resulting Uses LLVM-to-C back-end and the resulting code is compiled with GCC 3.4.2 –O3code is compiled with GCC 3.4.2 –O3

Evaluated on AMD Athlon MP 2100+Evaluated on AMD Athlon MP 2100+ 64KB L1, 256KB L264KB L1, 256KB L2

Page 33: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Simple Pool Allocation StatisticsSimple Pool Allocation Statistics

Programs from SPEC Programs from SPEC CINT2K, Ptrdist, CINT2K, Ptrdist,

FreeBench & Olden FreeBench & Olden suites, plus unbundled suites, plus unbundled

programsprograms 91

Table 1

Page 34: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Simple Pool Allocation StatisticsSimple Pool Allocation Statistics

DSA is able to infer DSA is able to infer that most static pools that most static pools are type-homogenousare type-homogenous

91

Table 1

Page 35: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Compile TimeCompile Time

Compilation overhead Compilation overhead is less than 3%is less than 3%

Table 3

Page 36: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

175.v

pr

197.p

ars

er-b

300.tw

olf

bc

ft analy

zer

llu-b

ench

chom

p

fpgro

wth

espre

sso

povra

y31

bis

ort

health

mst

perim

ete

r

tsp

Ru

nti

me r

ati

o (

sm

aller

is f

aste

r)Pool Allocation SpeedupPool Allocation Speedup

Several programs unaffected by pool allocation (see paper) Several programs unaffected by pool allocation (see paper) Sizable speedup across many pointer intensive programsSizable speedup across many pointer intensive programs Some programs (ft, chomp) order of magnitude fasterSome programs (ft, chomp) order of magnitude faster

Most programs are 0% Most programs are 0% to 20% faster with pool to 20% faster with pool

allocation aloneallocation alone

Page 37: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

175.v

pr

197.p

ars

er-b

300.tw

olf

bc

ft analy

zer

llu-b

ench

chom

p

fpgro

wth

espre

sso

povra

y31

bis

ort

health

mst

perim

ete

r

tsp

Ru

nti

me r

ati

o (

sm

aller

is f

aste

r)Pool Allocation SpeedupPool Allocation Speedup

Several programs unaffected by pool allocation (see paper) Several programs unaffected by pool allocation (see paper) Sizable speedup across many pointer intensive programsSizable speedup across many pointer intensive programs Some programs (ft, chomp) order of magnitude fasterSome programs (ft, chomp) order of magnitude faster

Two are 10x faster, one Two are 10x faster, one is almost 2x fasteris almost 2x faster

Page 38: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

0

0.2

0.4

0.6

0.8

1

1.2

1.4

175.v

pr

197.p

ars

er-b

300.tw

olf

bc

ft analy

zer

llu-b

ench

chom

p

fpgro

wth

espre

sso

povra

y31

bis

ort

health

mst

perim

ete

r

tsp

Ru

nti

me R

ati

o (

1.0

= B

ase P

oo

l A

llo

cati

on

)

No Pool Allocation

All Optimizations

Pool Optimization Speedup (FullPA)Pool Optimization Speedup (FullPA)

Baseline 1.0 = Run Time with Pool AllocationBaseline 1.0 = Run Time with Pool Allocation

Optimizations help all of these programs:Optimizations help all of these programs: Despite being very simple, they make a big impactDespite being very simple, they make a big impact

Most are 5-15% faster Most are 5-15% faster with optimizations than with optimizations than with Pool Alloc alonewith Pool Alloc alone

PA PA TimeTime

Figure 9(with different baseline)

Page 39: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

0

0.2

0.4

0.6

0.8

1

1.2

1.4

175.v

pr

197.p

ars

er-b

300.tw

olf

bc

ft analy

zer

llu-b

ench

chom

p

fpgro

wth

espre

sso

povra

y31

bis

ort

health

mst

perim

ete

r

tsp

Ru

nti

me R

ati

o (

1.0

= B

ase P

oo

l A

llo

cati

on

)

No Pool Allocation

All Optimizations

Pool Optimization Speedup (FullPA)Pool Optimization Speedup (FullPA)

Baseline 1.0 = Run Time with Pool AllocationBaseline 1.0 = Run Time with Pool Allocation

Optimizations help all of these programs:Optimizations help all of these programs: Despite being very simple, they make a big impactDespite being very simple, they make a big impact

One is 44% faster, other One is 44% faster, other is 29% fasteris 29% faster

PA PA TimeTime

Figure 9(with different baseline)

Page 40: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

0

0.2

0.4

0.6

0.8

1

1.2

1.4

175.v

pr

197.p

ars

er-b

300.tw

olf

bc

ft analy

zer

llu-b

ench

chom

p

fpgro

wth

espre

sso

povra

y31

bis

ort

health

mst

perim

ete

r

tsp

Ru

nti

me R

ati

o (

1.0

= B

ase P

oo

l A

llo

cati

on

)

No Pool Allocation

All Optimizations

Pool Optimization Speedup (FullPA)Pool Optimization Speedup (FullPA)

Baseline 1.0 = Run Time with Pool AllocationBaseline 1.0 = Run Time with Pool Allocation

Optimizations help all of these programs:Optimizations help all of these programs: Despite being very simple, they make a big impactDespite being very simple, they make a big impact

Pool optimizations help Pool optimizations help some progs that pool some progs that pool

allocation itself doesn’tallocation itself doesn’t

PA PA TimeTime

Figure 9(with different baseline)

Page 41: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

0

0.2

0.4

0.6

0.8

1

1.2

1.4

175.v

pr

197.p

ars

er-b

300.tw

olf

bc

ft analy

zer

llu-b

ench

chom

p

fpgro

wth

espre

sso

povra

y31

bis

ort

health

mst

perim

ete

r

tsp

Ru

nti

me R

ati

o (

1.0

= B

ase P

oo

l A

llo

cati

on

)

No Pool Allocation

All Optimizations

Pool Optimization Speedup (FullPA)Pool Optimization Speedup (FullPA)

Baseline 1.0 = Run Time with Pool AllocationBaseline 1.0 = Run Time with Pool Allocation

Optimizations help all of these programs:Optimizations help all of these programs: Despite being very simple, they make a big impactDespite being very simple, they make a big impact

Pool optzns effect can Pool optzns effect can be additive with the pool be additive with the pool

allocation effectallocation effect

PA PA TimeTime

Figure 9(with different baseline)

Page 42: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

0

0.2

0.4

0.6

0.8

1

1.2

175.vpr

197.parser-b

300.twolf

bc ft analyzer

llu-bench

chomp

fpgrowth

espresso

povray31

bisort

health

mst

perimeter

tsp

Ca

ch

e m

iss

ra

tio

L1 Misses

L2 Misses

TLB Misses

Cache/TLB miss reductionCache/TLB miss reduction

Sources:Sources: Defragmented heapDefragmented heap Reduced inter-object Reduced inter-object

paddingpadding Segregating the heap!Segregating the heap!

Miss rate measuredMiss rate measured

with perfctr on AMD with perfctr on AMD Athlon 2100+Athlon 2100+

Figure 10

Page 43: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Pool Optimization StatisticsPool Optimization Statistics

Table 2

Page 44: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Optimization ContributionOptimization Contribution

Figure 11

Page 45: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Pool Allocation ConclusionsPool Allocation Conclusions

1.1. Segregate heap based on points-to graphSegregate heap based on points-to graph Improved Memory Hierarchy PerformanceImproved Memory Hierarchy Performance Give compiler some control over layoutGive compiler some control over layout Give compiler information about localityGive compiler information about locality

2.2. Optimize pools based on per-pool propertiesOptimize pools based on per-pool properties Very simple (but useful) optimizations proposed hereVery simple (but useful) optimizations proposed here Optimizations could be applied to other systemsOptimizations could be applied to other systems

Page 46: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

The EndThe End

Page 47: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Backup SlidesBackup Slides

Page 48: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Table 4Table 4

Page 49: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Table 5Table 5

Page 50: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

list *list *makeListmakeList(int Num) {(int Num) { list *New = list *New = mallocmalloc(sizeof(list));(sizeof(list)); New->Next = Num ? New->Next = Num ? makeListmakeList(Num-1) : 0;(Num-1) : 0; New->Data = Num; return New;New->Data = Num; return New;}}

int int twoListstwoLists( ) {( ) {

list *X = list *X = makeListmakeList(10);(10); list *Y = list *Y = makeListmakeList(100);(100); GL = Y;GL = Y; processListprocessList(X);(X); processListprocessList(Y);(Y); freeListfreeList(X);(X); freeListfreeList(Y);(Y);

}}

Pool Allocation: ExamplePool Allocation: Example

Pool P1; poolinit(&P1);

pooldestroy(&P1);

, &P1)

, Pool* P){poolalloc(P);

, P)

, &P1)

P1

, P2)

Pool* P2

, P2)

P2

Change calls to free into Change calls to free into calls to poolfree calls to poolfree retain retain

explicit deallocationexplicit deallocation

Page 51: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

Different Data Structures Have Different PropertiesDifferent Data Structures Have Different Properties Pool allocation segregates heap:Pool allocation segregates heap:

Roughly into logical data structuresRoughly into logical data structures Optimize using pool-specific propertiesOptimize using pool-specific properties

Examples of properties we look for:Examples of properties we look for: Pool is type-homogenousPool is type-homogenous Pool contains data that only requires 4-byte alignmentPool contains data that only requires 4-byte alignment Opportunities to reduce allocation overheadOpportunities to reduce allocation overhead

buildbuildtraversetraversedestroy destroy

complex complex allocation allocation

patternpattern

Pool Specific OptimizationsPool Specific Optimizations

list: HMRlist* int

head

list: HMRlist* int

head

list: HMRlist* int

head

Page 52: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

BenchmarksBenchmarks

Pointer-intensive SPECINT 2000, Ptrdist, Pointer-intensive SPECINT 2000, Ptrdist, Olden, FreeBench suitesOlden, FreeBench suites

povray, espresso, fpgrowth, llu-bench, povray, espresso, fpgrowth, llu-bench, chompchomp

Benchmarks with custom allocators are not Benchmarks with custom allocators are not evaluated except for parser, which they hand evaluated except for parser, which they hand modified.modified.