Programming, Debugging, Profiling and Optimizing Transactional Memory Applications Department of Computer Architecture Universitat Politècnica de Catalunya – BarcelonaTech Barcelona Supercomputing Center 01 July 2010 Ferad Zyulkyarov PhD Thesis Proposal

Page 1: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Department of Computer Architecture
Universitat Politècnica de Catalunya – BarcelonaTech

Barcelona Supercomputing Center

01 July 2010

Ferad Zyulkyarov

PhD Thesis Proposal

Page 2: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Publications

• Ferad Zyulkyarov, Srdjan Stipic, Tim Harris, Osman Unsal, Adrian Cristal, Ibrahim Hur, Mateo Valero, Discovering and Understanding Performance Bottlenecks in Transactional Applications, PACT'10

• Ferad Zyulkyarov, Tim Harris, Osman Unsal, Adrian Cristal, Mateo Valero, Debugging Programs that use Atomic Blocks and Transactional Memory, PPoPP'10

• Vladimir Gajinov, Ferad Zyulkyarov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory, ICS'09

• Ferad Zyulkyarov, Vladimir Gajinov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server, PPoPP'09

• Ferad Zyulkyarov, Sanja Cvijic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, WormBench - A Configurable Workload for Evaluating Transactional Memory Systems, MEDEA'09

• Ferad Zyulkyarov, Milos Milovanovic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, Memory Management for Transaction Processing Core in Heterogeneous Chip-Multiprocessors, OSHMA'09

• Milos Milovanovic, Osman Unsal, Adrian Cristal, Ferad Zyulkyarov, Mateo Valero, Compiler Support for Using Transactional Memory in C/C++ Applications, INTERACT'07

Page 3: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Work Plan

[Figure: work plan timeline. Task durations shown: 12m, 11m, 21m, 10m, 15m, 9.5m, 7m, 2m. Date marker: 01/10/2010.]

Page 4: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Transactional Memory

    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
        ...
    }
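As a rough analogy for what an atomic block promises, the sketch below performs an update speculatively and commits it only if no other thread changed the data in the meantime, retrying otherwise. It is not a TM implementation and not the system used in this work: the function name `tx_add` and the use of a C11 compare-and-swap on a single word are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: add `delta` to `*cell` in a transaction-like way.
 * Returns the number of attempts; 1 means the first attempt committed. */
int tx_add(_Atomic long *cell, long delta)
{
    assert(cell != NULL);
    int attempts = 0;
    long snapshot = atomic_load(cell);      /* "begin": read a snapshot */
    for (;;) {
        attempts++;
        long updated = snapshot + delta;    /* speculative work on a private copy */
        /* "commit": succeeds only if *cell still equals the snapshot.
         * On failure the snapshot is refreshed with the current value
         * and the loop re-executes, much like an aborted transaction. */
        if (atomic_compare_exchange_strong(cell, &snapshot, updated))
            return attempts;
    }
}
```

A caller would declare a shared `_Atomic long` and call `tx_add(&counter, 1)` from several threads. A real TM system generalizes this commit-or-retry behavior to arbitrary sets of memory locations.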

Page 5: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

The Big Questions

• Is programming with TM easy?
• Is TM competitive with locks?
• Are existing development tools sufficient?

Page 6: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Atomic Quake

• Parallel Quake game server
  – All locks are replaced with atomic blocks
• 27,400 LOC of C code in 56 files
• Rich transactional application
  – 63 atomic blocks
  – Rich uses of atomic blocks
    • Library calls, I/O, error handling, memory allocation, failure atomicity
  – Various transactional characteristics
• A workload to drive research in TM

Page 7: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Is programming with TM easy?

• Yes.
• In large applications that have many shared objects and need efficient fine-grained synchronization.
  – Example: region-based locking in tree data structures and graphs.

Page 8: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Where Do Transactions Fit?

Guarding different types of objects with separate locks:

    switch (object->type) {   /* Lock phase */
    case KEY:    lock(key_mutex);    break;
    case LIFE:   lock(life_mutex);   break;
    case WEAPON: lock(weapon_mutex); break;
    case ARMOR:  lock(armor_mutex);  break;
    }

    pick_up_object(object);

    switch (object->type) {   /* Unlock phase */
    case KEY:    unlock(key_mutex);    break;
    case LIFE:   unlock(life_mutex);   break;
    case WEAPON: unlock(weapon_mutex); break;
    case ARMOR:  unlock(armor_mutex);  break;
    }

The same code with an atomic block:

    atomic {
        pick_up_object(object);
    }

Page 9: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Is TM Competitive with Locks?

• No.
  – 4-5x slowdown on the single-threaded version.
• But TM promises to become competitive because it scales well.
  – Scales OK up to 4 threads.
  – Sudden increase in aborts at 8 threads.

Threads  Transactions  Aborts (Num)  Aborts (%)  Irrevocable
1        36 667        0             0.00%       17
2        75 824        241           0.42%       31
4        166 000       2 612         1.58%       85
8        477 519       76 771        25.50%      237

Page 10: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Are Existing Tools Sufficient?

• No.
• We need:
  – Richer language-level primitives and integration.
  – Mechanisms to handle I/O.
  – Dynamic error handling.
  – Debuggers.
  – Profilers.

Page 11: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Unstructured Use of Locks

Locks:

    for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
        <statements1>
        LOCK(cl_msg_lock[c - svs.clients]);
        <statements2>
        if (!c->send_message) {
            <statements3>
            UNLOCK(cl_msg_lock[c - svs.clients]);
            <statements4>
            continue;
        }
        <statements5>
        if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
            <statements6>
            UNLOCK(cl_msg_lock[c - svs.clients]);
            <statements7>
            continue;
        }
        <statements8>
        if (c->state == cs_spawned) {
            if (frame_threads_num > 1) LOCK(par_runcmd_lock);
            <statements9>
            if (frame_threads_num > 1) UNLOCK(par_runcmd_lock);
        }
        UNLOCK(cl_msg_lock[c - svs.clients]);
        <statements10>
    }

Atomic block:

    bool first_if = false;
    bool second_if = false;
    for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
        <statements1>
        atomic {
            <statements2>
            if (!c->send_message) {
                <statements3>
                first_if = true;
            } else {
                <statements5>
                if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
                    <statements6>
                    second_if = true;
                } else {
                    <statements8>
                    if (c->state == cs_spawned) {
                        if (frame_threads_num > 1) {
                            atomic {
                                <statements9>
                            }
                        } else {
                            <statements9>
                        }
                    }
                }
            }
        }
        if (first_if) {
            <statements4>
            first_if = false;
            continue;
        }
        if (second_if) {
            <statements7>
            second_if = false;
            continue;
        }
        <statements10>
    }

• Extra variables and code
• Complicated conditional logic
• Solution: explicit “commit”

Page 12: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Various Transactional Characteristics

Per-atomic block runtime statistics from Atomic Quake.

ID  TX#     Dynamic length (CPU cycles)               Read set (bytes)                   Write set (bytes)
            Total        Min      Max        Avg      Total       Min    Max     Avg     Total      Min    Max     Avg
56  26,962  172,872,572  288      112,832    6,412    1,328,536   20     104     49      0          0      0       0
60  5,931   5,810,152    224      41,552     980      76,212      12     640     13      928        0      116     0
61  1,095   20,573,540   4,560    49,984     19,208   723,474     88     776     661     90         84     84      84
59  1,042   3,117,844    1,520    39,344     2,999    29,176      5      28      28      16,672     16     16      16
57  1,038   401,502,152  288,704  522,528    387,552  10,963,719  7,614  15,490  10,562  2,592,367  1,680  3,656   2,497
58  1,002   134,949,344  87,056   1,341,504  134,949  5,054,282   3,028  53,566  5,044   931,445    548    11,161  930
15  3       67,660       720      48,176     1,735    96          32     32      32      18         6      6       6
5   2       99,988       592      36,384     1,923    64          32     32      32      10         5      5       5
22  2       43,632       12,176   35,504     21,816   72          36     36      36      128        64     64      64
36  2       40,476       6,800    44,880     20,238   249         108    141     125     55         22     33      28
38  2       71,368       2,144    31,504     4,461    90          44     46      45      26         12     14      13

• Very small transactions
• Very large transactions
• Different execution frequency -> phased behavior.
• Control flow does not reach all atomic blocks.
• Most frequent atomic block is read-only.

Page 13: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Debugging Transactional Applications

• Existing debuggers are not aware of atomic blocks and transactional memory
• New principles and approaches:
  – Debugging atomic blocks atomically
  – Debugging at the level of transactions
  – Managing transactions at debug-time
• Extension for WinDbg to debug programs with atomic blocks

Page 14: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Atomicity in Debugging

• Step over atomic blocks as if they were a single instruction.
• Abstracts whether atomic blocks are implemented with TM or lock inference.
• Good for debugging synchronization errors at the granularity of atomic blocks rather than individual statements inside the atomic blocks.

    <statement 1>
    <statement 2>
    atomic {
        <statement 3>
        <statement 4>
        <statement 5>
        <statement 6>
    }
    <statement 7>
    <statement 8>

[Figure: the listing above shown in two panels, "Non-TM Aware Debugger" and "TM Aware Debugger".]

Debugging becomes frustrating when a transaction aborts.

Page 15: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Isolation in Debugging

• What if we want to debug faulty code within an atomic block?
  – Put a breakpoint inside the atomic block.
  – Validate the transaction.
  – Step within the transaction.
• The user does not observe intermediate results of concurrently running transactions.
  – Switch the transaction to irrevocable mode after validation.

    atomic {
        <statement 1>
        <statement 2>
        <statement 3>
        <statement 4>
    }

Page 16: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Debugging at the Level of Transactions

• Assumes that atomic blocks are implemented with transactional memory.
• Examine the internal state of the TM
  – Read/write set, re-executions, status
• TM specific watch points
  – Break when conflict happens
  – Filters
• Concurrent work with Herlihy and Lev [PACT'09].

Page 17: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

TM Specific Watchpoints

    atomic {
        <statement 1>
        <statement 2>
        <statement 3>
        <statement 4>
    }

Break when conflict happens.

Conflict Information
    Conflicting Threads: T1, T2
    Address: 0x84D2F0
    Symbol: reservation@04
    Readers: T1
    Writers: T2

Filter: break if Address = reservation@04 AND Thread = T2
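The filter on this slide combines two conditions with AND. A minimal sketch of how such a predicate might be evaluated is shown below. The types `conflict_info` and `conflict_filter`, the wildcard constants, and the function `should_break` are hypothetical names invented for this illustration; they are not the WinDbg extension's actual interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record describing one detected conflict (fields mirror the slide). */
typedef struct {
    uintptr_t address;    /* conflicting memory address, e.g. 0x84D2F0 */
    int reader_thread;    /* thread that read the location */
    int writer_thread;    /* thread that wrote the location */
} conflict_info;

/* Hypothetical wildcards: a filter field set to one of these matches anything. */
#define ANY_THREAD  (-1)
#define ANY_ADDRESS ((uintptr_t)0)

typedef struct {
    uintptr_t address;    /* break only on this address, or ANY_ADDRESS */
    int thread;           /* break only if this thread is involved, or ANY_THREAD */
} conflict_filter;

/* Returns true when the debugger should break: every non-wildcard
 * condition of the filter must hold (logical AND). */
bool should_break(const conflict_filter *f, const conflict_info *c)
{
    assert(f != NULL && c != NULL);
    bool address_ok = (f->address == ANY_ADDRESS) || (f->address == c->address);
    bool thread_ok = (f->thread == ANY_THREAD) ||
                     (f->thread == c->reader_thread) ||
                     (f->thread == c->writer_thread);
    return address_ok && thread_ok;
}
```

With the slide's values (address 0x84D2F0, reader T1, writer T2), a filter on that address and thread 2 matches, while the same filter with thread 3 does not.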

Page 18: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Managing Transactions at Debug-Time

• At the level of atomic blocks
  – Debug-time atomic blocks
  – Splitting atomic blocks
• At the level of transactions
  – Changing the state of the TM system (e.g. adding and removing entries from the read/write set, changing the status, aborting)
• Analogous to the functionality of existing debuggers to change the CPU state

Page 19: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Example Debug Time Atomic Blocks

    <statement 1>
    <statement 2>
    <statement 3>
    <statement 4>
    <statement 5>
    <statement 6>
    <statement 7>
    <statement 8>
    <statement 9>
    <statement 10>
    <statement 11>
    <statement 12>
    <statement 13>
    <statement 14>

Page 20: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Example Debug Time Atomic Blocks

    <statement 1>
    <statement 2>
    <statement 3>
    StartDebugAtomic
    <statement 4>
    <statement 5>
    <statement 6>
    <statement 7>
    <statement 8>
    <statement 9>
    EndDebugAtomic
    <statement 10>
    <statement 11>
    <statement 12>
    <statement 13>
    <statement 14>

The user marks the start and the end of the transactions.

Page 21: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Issues of Profiling TM Programs

• TM applications have unanticipated overheads
  – Problem raised by Pankratius [talk at ICSE'09] and Rossbach et al. [PPoPP'10]
• Difficult to profile TM applications without profiling tools and without knowing the implementation of the TM system
  – Experience of optimizing QuakeTM, Gajinov et al. [ICS'09]

Page 22: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Profiling TM Programs

• Design principles
  – Report results at source language constructs
  – Abstract the underlying TM system
  – Low probe effect and overhead
• Profiling techniques
  – Conflict point discovery
  – Identifying conflicting data structures
  – Visualizing transactions

Page 23: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Conflict Point Discovery

• Identifies the statements involved in conflicts
• Provides contextual information
• Finds the critical path

File:Line        #Conf.  Method    Line
Hashtable.cs:51  152     Add       if (_container[hashCode]…
Hashtable.cs:48  62      Add       uint hashCode = HashSdbm(…
Hashtable.cs:53  5       Add       _container[hashCode] = n …
Hashtable.cs:83  5       Add       while (entry != null) …
ArrayList.cs:79  3       Contains  for (int i = 0; i < count; i++)
ArrayList.cs:52  1       Add       if (count == capacity - 1) …
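A table like the one above can be produced by counting conflicts per source location and sorting by count. The sketch below shows one simple way to do that. The names `conflict_profile`, `record_conflict` and `rank_conflict_points`, and the fixed-size table, are assumptions for illustration and do not describe the profiler's actual data structures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_POINTS 64   /* hypothetical capacity of the profile table */

typedef struct {
    const char *file;      /* source file of the conflicting statement */
    int line;              /* source line of the conflicting statement */
    unsigned conflicts;    /* number of conflicts seen at file:line */
} conflict_point;

typedef struct {
    conflict_point points[MAX_POINTS];
    size_t count;
} conflict_profile;

/* Record one conflict at file:line. Returns 0 on success, -1 if the table is full. */
int record_conflict(conflict_profile *p, const char *file, int line)
{
    assert(p != NULL && file != NULL);
    for (size_t i = 0; i < p->count; i++) {
        if (p->points[i].line == line && strcmp(p->points[i].file, file) == 0) {
            p->points[i].conflicts++;
            return 0;
        }
    }
    if (p->count == MAX_POINTS)
        return -1;
    p->points[p->count++] = (conflict_point){ file, line, 1 };
    return 0;
}

/* qsort comparator: larger conflict counts first. */
static int by_conflicts_desc(const void *a, const void *b)
{
    const conflict_point *x = a, *y = b;
    return (x->conflicts < y->conflicts) - (x->conflicts > y->conflicts);
}

/* Sort the profile so the most conflicting statements come first. */
void rank_conflict_points(conflict_profile *p)
{
    assert(p != NULL);
    qsort(p->points, p->count, sizeof p->points[0], by_conflicts_desc);
}
```

The linear search keeps the sketch short. A profiler with low probe effect would more likely log raw events during execution and aggregate them later.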

Page 24: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Call Context

    increment() {
        counter++;
    }

    probability80() {
        probability = random() % 100;
        if (probability < 80) {
            atomic {
                increment();
            }
        }
    }

    probability20() {
        probability = random() % 100;
        if (probability >= 80) {
            atomic {
                increment();
            }
        }
    }

Thread 1 and Thread 2 each run:

    for (int i = 0; i < 100; i++) {
        probability80();
        probability20();
    }

Bottom-up view
+ increment (100%)
  |---- probability80 (80%)
  |---- probability20 (20%)

Top-down view
+ main (100%)
  |---- probability80 (80%)
        |---- increment (80%)
  |---- probability20 (20%)
        |---- increment (20%)

Page 25: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Aborts Graph (Bayes)

[Figure: aborts graph for Bayes with nodes AB1, AB2 and AB3. Edge labels: "Conf: 73%, Wasted: 63%" and "Conf: 20%, Wasted: 29%". Annotation: 72% of wasted work.]

There are 15 atomic blocks and only one of them aborts most.
Which atomic blocks cause AB3 to abort?

Page 26: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Identifying Conflicting Objects

Per-Object View

+ List.cs:1 “list” (42%)
  |--- ChangeNode (20%)
       +---- Replace (12%)
       +---- Add (8%)

    1: List list = new List();
    2: list.Add(1);
    3: list.Add(2);
    4: list.Add(3);
    ...
    atomic {
        list.Replace(2, 33);
    }

[Figure: memory layout of the list (List, 1, 2, 3) at addresses 0x08, 0x10, 0x18, 0x20. Labels: Object Addr 0x20; GC Root 0x08; Instr Addr 0x446290 (List.cs:1). Components: GC, Memory Allocator, DbgEng.]

Page 27: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Transaction Visualizer (Genome)

[Figure: transaction visualizer timeline for Genome. Legend: Garbage Collection, Wait on barrier.]

Aborts occur at the first and last atomic blocks in program order.

Page 28: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Overhead and Probe Effect

Normalized Execution Time

Thrd#   Bayes+  Bayes-  Gen+    Gen-    Intrd+  Intrd-  Labr+   Labr-   Vac+    Vac-    WB+     WB-
1       1.59    1.00    1.27    1.00    1.29    1.00    1.07    1.00    1.26    1.00    0.71    1.00
2       1.00    0.56    0.97    0.67    0.97    0.58    0.64    0.61    0.83    0.59    0.60    0.55
4       0.23    0.23    0.73    0.52    0.91    0.36    0.45    0.46    0.58    0.40    0.41    0.33
8       0.21    0.20    0.73    0.55    1.57    0.38    0.72    0.56    0.53    0.34    0.33    0.22

Abort Rate in %

Thrd#   Bayes+  Bayes-  Gen+    Gen-    Intrd+  Intrd-  Labr+   Labr-   Vac+    Vac-    WB+     WB-
1       0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
2       4.39    4.69    0.07    0.07    3.69    3.51    0.19    0.15    0.80    0.80    0.00    0.00
4       16.29   27.31   0.26    0.36    14.90   13.65   0.35    0.36    2.30    2.45    0.00    0.00
8       53.74   66.08   0.53    0.80    39.64   37.41   0.40    0.47    4.91    5.30    0.02    0.03

+ Profiling enabled
- Profiling disabled

Standard deviation for the difference: 27%
Standard deviation for the difference: 3.88%

Process data offline or during GC.

Page 29: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Optimization Techniques

• Moving statements
• Atomic block scheduling
• Checkpoints and nested atomic blocks
• Pessimistic reads
• Early release

Page 30: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Moving Statements

Will this code execute the same?

    atomic {
        counter++;
        <statement1>
        <statement2>
        <statement3>
    }

    atomic {
        <statement1>
        <statement2>
        <statement3>
        counter++;
    }

No!

Page 31: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Checkpoints

    atomic {
        <statement1>
        <statement2>
        <statement3>
        <statement4>
        <statement5>
        <statement6>
        <statement7>
    }

[Figure: share of conflicts at points inside the atomic block: 2%, 15%, 4%, 79%. Callout: Insert checkpoint.]

Page 32: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Checkpoints

    atomic {
        <statement1>
        <statement2>
        <statement3>
        <statement4>
        <statement5>
        <statement6>
        <checkpoint>
        <statement7>
    }

[Figure: share of conflicts at points inside the atomic block: 2%, 15%, 4%, 79%. Callout: Insert checkpoint.]

Reduced wasted work for the atomic block by 40%.
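To see why a checkpoint can reduce wasted work, consider a deliberately simple model (an assumption of this sketch, not the measurement method behind the slide's figure): every statement costs one unit, a conflict detected at statement k forces re-execution from the last checkpoint at or before k, and without a checkpoint execution restarts from statement 1. The function names and the model are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Statements that must be re-executed when a conflict is detected at
 * statement `conflict_stmt` (1-based). `checkpoint_stmt` is the statement
 * before which a checkpoint was taken; pass 1 to model a transaction
 * without checkpoints (restart from the beginning). */
unsigned wasted_statements(unsigned conflict_stmt, unsigned checkpoint_stmt)
{
    assert(conflict_stmt >= 1 && checkpoint_stmt >= 1);
    /* A checkpoint placed after the conflicting statement cannot help. */
    unsigned restart = (checkpoint_stmt <= conflict_stmt) ? checkpoint_stmt : 1;
    return conflict_stmt - restart;
}

/* Expected wasted statements per abort, where conflict_share[i] is the
 * fraction of conflicts detected at statement i + 1. */
double expected_waste(const double *conflict_share, size_t n, unsigned checkpoint_stmt)
{
    assert(conflict_share != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += conflict_share[i] * wasted_statements((unsigned)(i + 1), checkpoint_stmt);
    return total;
}
```

In this model, a checkpoint placed just before the statement where most conflicts occur removes most of the re-executed work. A real TM system can roll back to a checkpoint only if the conflicting data was first accessed after that checkpoint, which the model ignores.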

Page 33: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

Conclusion

• Study the programmability aspects of TM
• New debugging principles and approaches for TM applications
• New profiling techniques for TM applications
• Profile-guided optimization approaches for TM applications

Page 34: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications

The End