28
Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language Programming Systems Lab Microprocessor Technology Labs Intel Corporation *Computer Science Division University of California, Berkeley Cheng Wang, *Wei-Yu Chen, Youfeng Wu, Bratin Saha, Ali Adl-Tabatabai

Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

  • Upload
    helga

  • View
    40

  • Download
    0

Embed Size (px)

DESCRIPTION

Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language. Cheng Wang , *Wei-Yu Chen, Youfeng Wu, Bratin Saha, Ali Adl-Tabatabai. Programming Systems Lab Microprocessor Technology Labs Intel Corporation. *Computer Science Division - PowerPoint PPT Presentation

Citation preview

Page 1: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

Programming Systems LabMicroprocessor Technology Labs

Intel Corporation

*Computer Science DivisionUniversity of California, Berkeley

Cheng Wang, *Wei-Yu Chen, Youfeng Wu, Bratin Saha, Ali Adl-Tabatabai

Page 2: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

2

Motivation

• Existing Transactional Memory (TM) constructs focus on managed Language

• Efficient software transactional memory (STM) takes advantages of managed language features– Optimistic Versioning (direct update memory with backup)– Optimistic Read (invisible read)

• Challenges in Unmanaged Language (e.g. C)– Consistency

• No type safety, first-class exception handling– Function call

• No just-in-time compilation– Stack rollback

• Stack alias– Conflict detection

• Not object oriented

Page 3: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

3

Contributions

• First to introduce comprehensive transactional memory construct to C programming language– Transaction, function called within transaction, transaction

rollback, …

• First to support transactions in a production-quality optimizing C compiler– Code generation, optimization, indirect function calls, …

• Novel STM algorithm and API that supports optimizing compiler in an unmanaged environment– quiescent transaction, stack rollback, …

Page 4: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

4

Outline

• TM Language Construct

• STM Runtime

• Code Generation and Optimization

• Experimental Results

• Related Work

• Conclusion

Page 5: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

5

TM Language Constructs

• #pragma tm_atomic

{

stmt1;

stmt2;

}

• #pragma tm_atomic

{

stmt 1;

#pragma tm_atomic

{

stmt2;

tm_abort();

}

}

• #pragma tm_function

int foo(int);

int bar(int);

#pragma tm_atomic

{

foo(3); // OK

bar(10); // ERROR

}

foo(2) // OK

bar(1) // OK

Page 6: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

6

Consistency Problem

#pragma tm_atomic

{

if(tq->free) {

for(temp1 = tq->free;

temp1->next &&…,

temp1 = temp1->next);

task_struct[p_id].loc_free = tq->free;

tq->free = temp1->next;

temp1->next = NULL;

}

}

#pragma tm_atomic

{

if(tq->free) {

for(temp2 = tq->free;

temp2->next &&…,

temp2 = temp2->next);

task_struct[p_id].loc_free = tq->free;

tq->free = temp2->next;

temp2->next = NULL;

}

}

Memory Fault

NULL

NULL

Not NULL

• Solution: timestamp based aggressive consistent checking

Thread 1 Thread 2

shared free list

local free list

Page 7: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

7

Inconsistency Caused by Privatization

#pragma tm_atomic

{

if(tq->free) {

for(temp1 = tq->free;

temp1->next &&…,

temp1 = temp1->next);

task_struct[p_id1].loc_free = tq->free;

tq->free = temp1->next;

temp1->next = NULL;

}

}

temp1 = task_struct[p_id1].loc_free;

/* process temp */

task_struct[p_id1].loc_free = temp1->next;

temp1->next = NULL;

#pragma tm_atomic

{

if(tq->free) {

for(temp2 = tq->free;

temp2->next &&…,

temp2 = temp2->next);

task_struct[p_id2].loc_free = tq->free;

tq->free = temp2->next;

temp2->next = NULL;

}

}

temp2 = task_struct[p_id2].loc_free;

/* process temp */

task_struct[p_id2].loc_free = temp2->next;

temp2->next = NULL;

NULL

Not NULL

Memory FaultNULL

• Solution: Quiescent Transaction

Thread 1 Thread 2

Page 8: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

8

Quiescent Transaction

#pragma tm_atomic

{

if(tq->free) {

for(temp1 = tq->free;

temp1->next &&…,

temp1 = temp1->next);

task_struct[p_id1].loc_free = tq->free;

tq->free = temp1->next;

temp1->next = NULL;

}

}

temp1 = task_struct[p_id1].loc_free;

/* process temp */

task_struct[p_id1].loc_free = temp1->next;

temp1->next = NULL;

#pragma tm_atomic

{

if(tq->free) {

for(temp2 = tq->free;

temp2->next &&…,

temp2 = temp2->next);

task_struct[p_id2].loc_free = tq->free;

tq->free = temp2->next;

temp2->next = NULL;

}

}

temp2 = task_struct[p_id2].loc_free;

/* process temp */

task_struct[p_id2].loc_free = temp2->next;

temp2->next = NULL;

Not NULL

Consistency Checking FailQuiescent

Thread 1 Thread 2

Page 9: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

9

TM Runtime Issues (Stack Rollback)

#pragma tm_atomic

{

foo()

… // abort

}

foo() {

int a;

bar(&a)

}

bar(int *p) {

*p …

}

Stack Crash

back a

rollback a

• Solution: Selective Stack Rollback

Page 10: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

10

Optimization Issues (Redundant Barrier)#pragma tm_atomic

{

a = b + 1;

…; // may alias a or b

a = b + 1;

}

desc = stmGetTxnDesc();

rec1 = IRComputeTxnRec(&b);

ver1 = IRRead(desc, rec1);

t = b;

IRCheckRead(desc, rec1, ver1);

desc = stmGetTxnDesc();

rec2 = IRComputeTxnRec(&a);

IRWrite(desc, rec2);

IRUndoLog(desc, &a);

a = t + 1;

desc = stmGetTxnDesc();

rec1 = IRComputeTxnRec(&b);

ver1 = IRRead(desc, rec1);

t = b;

IRCheckRead(desc, rec1, ver1);

desc = stmGetTxnDesc();

rec2 = IRComputeTxnRec(&a);

IRWrite(desc, rec2);

IRUndoLog(desc, &a);

a = t + 1;

not redundant

Page 11: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

11

Experiment Setup

• Target System– 16-way IBM eServer xSeries 445, 2.2GHz Xeon– Linux 2.4.20, icc v9.0 (with STM), -O3

• Benchmarks– 3 synthetic concurrent data structure benchmarks• Hashtable, btree, avltree

– 8 SPLASH-2 benchmarks• 4 SPLASH-2 benchmarks spend little time in critical sections

– Fine-grained lock v. coarse-grained lock v. STM• Coarse-grain lock: replace all locks with a single global lock• STM:

– Replace all lock sections with transactions– Put non-transactional conflicting accesses in transactions

Page 12: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

12

Hashtable

• STM scales similarly as fine grain lock

• Manual and compiler STM comparable performance

hashtable (80% update)

0

0.5

1

1.5

2

2.5

3

3.5

0 5 10 15 20

threads

tim

e (s

eco

nd

s)fine lock

coarse lock

manual stm

compiler stm

compile stm -- noconsistency

Page 13: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

13

FMM

• STM is much better than coarse-grain lock

FMM

0

1

2

3

4

5

0 5 10 15 20

threads

tim

e (s

eco

nd

s)

fine lock

stm

coarse lock

no consistency

Page 14: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

14

Splash 2

• STM can be more scalable than locks

raytrace

012345678

0 5 10 15 20

threads

tim

e (s

eco

nd

s)

fine lock

stm

coarse lock

Page 15: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

15

Optimization Benefits

0.8

0.85

0.9

0.95

1

1.05

1.1

1.15

1.2

1.25

barnes cholesky fmm raytrace radiosity geo-mean

No

rma

lize

d E

xe

cu

tio

n T

ime

no optbarrier eliminliningfull optno consistencylock

• The overhead is within 15%, with average only 6.4%

Page 16: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

16

Related Work

• Transactional Memory– [Herlihy, ISCA93]– [Ananian, HPCA05], [Rajwar, ISCA05], [Moore, HPCA06],

[Hammond, ASPLOS04], [McDonald, ISCA06], [Saha, MICRO 06]

• Software Transactional Memory– [Shavit, PODC95], [Herlihy, PODC03], [Harris, ASPLOS04]

• Prior work on TM constructs in managed languages– [Adl-Tabatabai, PLDI06], [Harris, PLDI06], [Carlstrom, PLDI06],

[Ringengerg, ICFP05]

• Efficient STM– [Saha, PPoPP06]

• Time-stamp based approach– [Dice, DISC06], [Riegel, DISC06]

Page 17: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

17

Conclusion

• We solve the key STM compiler problems for unmanaged languages– Aggressive consistency checking– Static function cloning– Selective stack rollback– Cache-line based conflict detection

• We developed a highly optimized STM compiler– Efficient register rollback– Barrier elimination– Barrier inlining

• We evaluated our STM compiler with well-known parallel benchmarks– The optimized STM compiler can achieve most of the hand-coded

benefits– There are opportunities for future performance tuning and

enhancement

Page 18: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

Questions ?

Page 19: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

19

STM Runtime API

TxnDesc* stmGetTxnDesc();

uint32 stmStart(TxnDesc*, TxnMemento*);

uint32 stmStartNested(TxnDesc*, TxnMemento*);

void stmCommit(TxnDesc*);

void stmCommitNested(TxnDesc*);

void stmUserAbort(TxnDesc*);

void stmAbort(TxnDesc*);

uint32 stmValidate(TxnDesc*);

uint32* stmComputeTxnRec(uint32* addr);

uint32 stmRead(TxnDesc*, uint32* txnRec);

void stmCheckRead(TxnDesc*, uint32* txnRec, uint32 version);

void stmWrite(TxnDesc*,uint32* txnRec);

Void stmUndoLog(TxnDesc*, uint32* addr,uint32 size);

Page 20: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

20

Data Structureshashtable (80% update)

0

0.5

1

1.5

2

2.5

3

3.5

0 5 10 15 20

threads

tim

e (

se

co

nd

s)

fine

coarse

manual stm

compiler stm

compile stm -- novalidation

hashtable (20% update)

0

0.5

1

1.5

2

2.5

3

3.5

0 5 10 15 20

threads

tim

e (s

eco

nd

s)

btree (80% update)

0

0.2

0.4

0.6

0.8

1

0 5 10 15 20

threads

tim

e (

seco

nd

s)

fine

coarse

manual stm

compiler stm

compiler stm -novalidation

btree (20% update)

0

0.2

0.4

0.6

0.8

1

0 5 10 15 20

threadsti

me

(sec

on

ds)

avltree (80% update)

0

1

2

3

4

5

6

7

0 5 10 15 20

threads

tim

e (

seco

nd

s)

fine

coarse

manual stm

compiler stm

avltree (20% update)

0

1

2

3

4

5

6

0 5 10 15 20

threads

tim

e (

seco

nd

s)

Page 21: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

21

Example 1

• #pragma tm_atomic

• {

• t = head;

• Head = t->next;

• }

• … = *t;

• #pragma tm_atomic

• {

• s = head;

• *s = …;

• }

Page 22: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

22

Example 2

• #pragma tm_atomic

• {

• t = head;

• head = t->next;

• }

• … = *t;

• #pragma tm_atomic

• {

• s = head;

• *s = …;

• head = s->next;

• }

Page 23: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

23

Example 3

• #pragma tm_atomic

• {

• t = head;

• head = t->next;

• }

• *t = …;

• #pragma tm_atomic

• {

• s = head;

• … = *s;

• head = s->next;

• }

Page 24: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

24

Optimization Issues (Register Checkpointing)

• Checkpointing all the live-in local data does not work with compiler optimizations across transaction boundary

• Checkpointing Code

t2_bkup = t2;

while(setjmp(…)) {

t2 = t2_bkup;

}

stmStart(…)

t1 = 0;

t2 = t1 + t2;

stmCommit(…);

t1 = t3;

t3 = 1;

• Source Code

#pragma tm_atomic

{

t1 = 0;

t2 = t1 + t2;

}

t1 = t3;

t3 = 1;

• Optimized Code

t2_backup = t2;

t1 = 0;

while(setjmp(…)) {

t2 = t2_bkup;

}

stmStart(…);

t2 = t1 + t2;

t1 = t3;

t3 = 1;

stmCommit(…);

can not recover

Abort

Page 25: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

25

TimeStamp based Consistency Checking

#pragma tm_atomic

{

if(tq->free) {

for(temp1 = tq->free;

temp1->next &&…,

temp1 = temp1->next);

task_struct[p_id].loc_free = tq->free;

tq->free = temp1->next;

temp1->next = NULL;

}

}

#pragma tm_atomic

{

if(tq->free) {

for(temp2 = tq->free;

temp2->next &&…,

temp2 = temp2->next);

task_struct[p_id].loc_free = tq->free;

tq->free = temp2->next;

temp2->next = NULL;

}

}

Version 1

Global Timestamp 1

Version 0

Version 1

Global Timestamp 0

Local Timestamp 0

Thread 1 Thread 2

Local Timestamp 0

Page 26: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

26

Checkpointing Approach

#pragma tm_atomic

{

t2 = t1 + t2;

t1 = t3;

t3 = 1;

}

t2_bkup = t2

t3_bkup = t3

t1 = 0

t2 = t2_bkup

t3 = t3_bkup

t1 = 0

retry entry

#pragma tm_atomic

{

t1 = 0

t2 = t1 + t2;

}

t1 = t3;

t3 = 1;

t2_bkup = t2

t3_bkup = t3

t2 = t2_bkup

t3 = t3_bkup

retry entrynormal entry

Optimization

normal entry

Page 27: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

27

Function Clone

• Source Code

#pragma tm_function

void foo(…) {

}

#pragma tm_atomic

{

foo();

(*fp)();

}

• STM Code

<foo-4>:

&foo_tm

<foo>: // normal version

no-op maker

… // normal code

<foo_tm>: // transactional version

… // code for transaction

foo_tm();

if(*fp == “no-op marker”)

(**(fp-4))(); // call foo_tm

else

handle non-TM binary

Point to transactional version

Unique Marker

Page 28: Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language

28

• STM is much better than coarse-grain lock (fine lock ???)

barnes hut

0

2

4

6

8

10

12

0 5 10 15 20

threads

tim

e (s

eco

nd

s)

lock

stm

coarse-lock