Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

M. Aater Suleman*, Onur Mutlu†, Moinuddin K. Qureshi‡, Yale N. Patt*
*The University of Texas at Austin, †Carnegie Mellon University, ‡IBM Research

Page 1: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

M. Aater Suleman*, Onur Mutlu†, Moinuddin K. Qureshi‡, Yale N. Patt*

*The University of Texas at Austin

†Carnegie Mellon University

‡IBM Research

Page 2: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Background

• To leverage chip multiprocessors (CMPs):
  – Programs must be split into threads
• Mutual exclusion:
  – Threads are not allowed to update shared data concurrently
• Accesses to shared data are encapsulated inside critical sections
• Only one thread can execute a critical section at a given time
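The bullets above can be illustrated with a minimal Python sketch. The `SharedCounter` class and the `run` helper are hypothetical names chosen for this example; they do not come from the talk.

```python
import threading


class SharedCounter:
    """Shared data whose update is encapsulated in a critical section."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Critical section: only one thread at a time executes this body.
        with self._lock:
            self.value += 1


def run(num_threads=4, increments_per_thread=10_000):
    """Split the work into threads that all update the same counter."""
    counter = SharedCounter()

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Because every update happens under the lock, the final value equals `num_threads * increments_per_thread`.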

Page 3: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Example of Critical Section from MySQL

[Figure: Threads 0 to 3 access the shared List of Open Tables, which holds tables A, B, C, D, and E. Thread 3: OpenTables(D, E). Thread 2: CloseAllTables().]

Page 4: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Example Critical Section from MySQL

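[Figure: tables A, B, C, D, and E annotated with thread numbers 0 to 3; the exact mapping is not recoverable from the extracted text.]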

Page 5: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Example Critical Section from MySQL

End of Transaction:
    LOCK_open.Acquire()
    foreach (table opened by thread)
        if (table.temporary)
            table.close()
    LOCK_open.Release()
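A rough Python rendering of this end-of-transaction critical section, for illustration only. The `Table` class and `end_of_transaction` function are hypothetical stand-ins and are not MySQL's actual code; only the name `LOCK_open` comes from the slide.

```python
import threading
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical stand-in for an entry in the list of open tables."""
    name: str
    temporary: bool = False
    is_open: bool = True

    def close(self):
        self.is_open = False


# Stand-in for LOCK_open, which guards the shared list of open tables.
LOCK_open = threading.Lock()


def end_of_transaction(tables_opened_by_thread):
    """Close this thread's temporary tables while holding LOCK_open."""
    closed = []
    with LOCK_open:  # acquire on entry, release on exit
        for table in tables_opened_by_thread:
            if table.temporary:
                table.close()
                closed.append(table.name)
    return closed
```

The whole loop runs under one lock, so every thread that reaches the end of a transaction is serialized here regardless of which tables it touches.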

Page 6: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Contention for Critical Sections

[Figure: two execution timelines (t1 to t7) for Threads 1 to 4, with segments marked Parallel, Critical Section, and Idle. In the second timeline, critical sections execute 2x faster.]

Accelerating critical sections helps not only the thread executing the critical section but also the waiting threads.
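The effect on waiting threads can be seen in a small timeline model. This is a sketch under simplifying assumptions that are not from the talk: every thread alternates a fixed amount of parallel work with one critical section, a single lock guards all critical sections, and the lock is granted in arrival order. The function name `total_time` is hypothetical.

```python
import heapq


def total_time(num_threads, iterations, parallel_time, cs_time):
    """Finish time when each thread alternates parallel work and a critical section."""
    lock_free_at = 0.0
    # Entries: (time the thread reaches the critical section, thread id, iterations left)
    arrivals = [(parallel_time, tid, iterations) for tid in range(num_threads)]
    heapq.heapify(arrivals)
    finish = 0.0
    while arrivals:
        arrive, tid, left = heapq.heappop(arrivals)
        start = max(arrive, lock_free_at)   # idle time if the lock is held
        lock_free_at = start + cs_time      # run the critical section
        finish = max(finish, lock_free_at)
        if left > 1:
            next_arrival = lock_free_at + parallel_time
            heapq.heappush(arrivals, (next_arrival, tid, left - 1))
    return finish
```

With four threads, one iteration, and 2 time units each of parallel work and critical section, the model finishes at time 10. Halving the critical-section time to 1 brings the finish time to 6, because each waiting thread also spends less time idle.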

Page 7: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Impact of Critical Sections on Scalability

• Contention for critical sections increases with the number of threads and limits scalability

[Figure: speedup of MySQL (oltp-1) versus chip area (cores).]
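A rough first-order bound shows why contention caps scalability. It is not taken from the slides, and the symbols are assumptions: $T_1$ is single-thread execution time, $P$ is the number of threads, and $f$ is the fraction of $T_1$ spent in critical sections guarded by one lock. Those critical sections must run one at a time, so:

```latex
T_P \;\ge\; \max\!\left(\frac{T_1}{P},\; f\,T_1\right)
\quad\Longrightarrow\quad
\mathrm{Speedup}(P) \;=\; \frac{T_1}{T_P} \;\le\; \min\!\left(P,\; \frac{1}{f}\right)
```

Under these assumptions, a program that spends 5% of its time in such critical sections cannot exceed a speedup of 20, however many cores are added.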

Page 8: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Outline

• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary

Page 9: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

The Asymmetric Chip Multiprocessor (ACMP)

• Provide one large core and many small cores
• Execute parallel part on small cores for high throughput
• Accelerate serial part using the large core

[Figure: ACMP approach: one large core and 12 Niagara-like small cores.]

Page 10: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Conventional ACMP

EnterCS()
PriorityQ.insert(…)
LeaveCS()

1. P2 encounters a critical section
2. Sends a request for the lock
3. Acquires the lock
4. Executes the critical section
5. Releases the lock

[Figure: cores P1, P2, P3, and P4 connected by the on-chip interconnect; the core executing the critical section is highlighted.]

Page 11: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Accelerated Critical Sections (ACS)

• Accelerate Amdahl’s serial part and critical sections using the large core

[Figure: ACMP approach with ACS: one large core with a Critical Section Request Buffer (CSRB) and 12 Niagara-like small cores.]

Page 12: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Accelerated Critical Sections (ACS)

EnterCS()
PriorityQ.insert(…)
LeaveCS()

1. P2 encounters a critical section
2. P2 sends a CSCALL request to the CSRB
3. P1 executes the critical section
4. P1 sends the CSDONE signal

[Figure: cores P1, P2, P3, and P4 connected by the on-chip interconnect, with the Critical Section Request Buffer (CSRB) attached to P1; the core executing the critical section is highlighted.]

Page 13: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Architecture Overview

• ISA extensions
  – CSCALL LOCK_ADDR, TARGET_PC
  – CSRET LOCK_ADDR
• Compiler/library inserts CSCALL/CSRET
• On a CSCALL, the small core:
  – Sends a CSCALL request to the large core
    • Arguments: lock address, target PC, stack pointer, core ID
  – Stalls and waits for CSDONE
• Large core
  – Critical Section Request Buffer (CSRB)
  – Executes the critical section and sends CSDONE to the requesting core
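The CSCALL/CSDONE handshake can be emulated in software to make the flow concrete. This is only a sketch: the real mechanism is implemented in hardware, and the `CSCall`, `LargeCore`, `cscall`, and `demo` names are hypothetical. A Python callable stands in for TARGET_PC and its arguments stand in for state reached through the stack pointer.

```python
import heapq
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CSCall:
    lock_addr: str                  # lock guarding the critical section
    target: Callable[..., object]   # stands in for TARGET_PC
    args: tuple                     # stands in for data reached via the stack pointer
    core_id: int                    # requesting small core
    done: threading.Event = field(default_factory=threading.Event)  # CSDONE


class LargeCore(threading.Thread):
    """Serves CSCALL requests from its Critical Section Request Buffer."""

    def __init__(self):
        super().__init__(daemon=True)
        self.csrb = queue.Queue()

    def run(self):
        while True:
            req = self.csrb.get()
            if req is None:          # shutdown marker used only by this sketch
                break
            req.target(*req.args)    # execute the critical section
            req.done.set()           # send CSDONE to the requesting core


def cscall(large_core, lock_addr, target, args, core_id):
    """Small-core side: send the request, then stall until CSDONE."""
    req = CSCall(lock_addr, target, args, core_id)
    large_core.csrb.put(req)
    req.done.wait()


def demo(num_small_cores=4, inserts_per_core=50):
    heap = []                        # shared priority queue
    large = LargeCore()
    large.start()

    def small_core(core_id):
        for i in range(inserts_per_core):
            cscall(large, "PriorityQ", heapq.heappush, (heap, (i, core_id)), core_id)

    workers = [threading.Thread(target=small_core, args=(c,))
               for c in range(num_small_cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    large.csrb.put(None)
    large.join()
    return heap
```

In this simplified version the shared heap is touched only by the large-core thread, so no explicit lock is needed. That mirrors the idea of keeping the lock and the shared data at the large core.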

Page 14: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

False Serialization

• ACS can serialize independent critical sections
• Selective Acceleration of Critical Sections (SEL)
  – Saturating counters to track false serialization

[Figure: CSCALL (A), CSCALL (A), and CSCALL (B) requests from the small cores wait in the Critical Section Request Buffer (CSRB) on their way to the large core, with counters shown for A and B.]
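A saturating-counter scheme of this kind might look like the sketch below. The slide names only the idea; the update rule (add the number of waiting requests for other locks, otherwise subtract one), the 4-bit width, and the threshold at half the counter range are all assumptions made for illustration, and the class names are hypothetical.

```python
class SaturatingCounter:
    """Counter that clamps to the range [0, 2**bits - 1]."""

    def __init__(self, bits=4):
        self.max_value = (1 << bits) - 1
        self.value = 0

    def add(self, amount):
        self.value = max(0, min(self.max_value, self.value + amount))


class SelectiveAcceleration:
    """Assumed policy: stop accelerating a lock whose requests often
    share the CSRB with requests for other, independent locks."""

    def __init__(self, bits=4):
        self.bits = bits
        self.threshold = 1 << (bits - 1)   # assumed: half the counter range
        self.counters = {}

    def on_cscall(self, lock_addr, csrb_lock_addrs):
        """Update lock_addr's counter given the locks already waiting in the CSRB."""
        counter = self.counters.setdefault(lock_addr, SaturatingCounter(self.bits))
        others = sum(1 for addr in csrb_lock_addrs if addr != lock_addr)
        counter.add(others if others else -1)

    def accelerate(self, lock_addr):
        """True if critical sections for this lock should go to the large core."""
        counter = self.counters.get(lock_addr)
        return counter is None or counter.value < self.threshold
```

Under this assumed policy, a lock that repeatedly finds unrelated requests in the buffer crosses the threshold and falls back to conventional locking on the small cores. Once the interference stops, the counter decays and acceleration resumes.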

Page 15: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Outline

• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary

Page 16: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Performance Tradeoffs

• Fewer threads vs. accelerated critical sections
  – Accelerating critical sections offsets loss in throughput
  – As the number of cores (threads) on chip increases:
    • Fractional loss in parallel performance decreases
    • Increased contention for critical sections makes acceleration more beneficial
• Overhead of CSCALL/CSDONE vs. better lock locality
  – ACS avoids “ping-ponging” of locks among caches by keeping them at the large core
• More cache misses for private data vs. fewer misses for shared data

Page 17: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures


Cache misses for private data

Private Data: NewSubProblems

Shared Data: The priority heap

PriorityHeap.insert(NewSubProblems)

Puzzle Benchmark

Page 18: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Performance Tradeoffs

• Fewer threads vs. accelerated critical sections
  – Accelerating critical sections offsets loss in throughput
  – As the number of cores (threads) on chip increases:
    • Fractional loss in parallel performance decreases
    • Increased contention for critical sections makes acceleration more beneficial
• Overhead of CSCALL/CSDONE vs. better lock locality
  – ACS avoids “ping-ponging” of locks among caches by keeping them at the large core
• More cache misses for private data vs. fewer misses for shared data
  – Cache misses decrease if a critical section touches more shared data than private data
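The first tradeoff can be checked with simple area arithmetic. The sketch assumes, as in the evaluation setup of this talk, that the large core occupies the area of 4 small cores. The function names are hypothetical.

```python
LARGE_CORE_AREA = 4  # area of the large core, in small-core units


def acmp_small_cores(chip_area):
    """Small cores left on an ACMP/ACS chip once the large core is placed."""
    return chip_area - LARGE_CORE_AREA


def small_core_loss(chip_area):
    """Fraction of an equal-area SCMP's small cores given up for the large core."""
    return LARGE_CORE_AREA / chip_area
```

For a chip area of 16 small cores this gives 12 small cores and a loss of 25%. At 32 it gives 28 small cores and a loss of 12.5%, so the fractional loss shrinks as the chip grows.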

Page 19: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Outline

• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary

Page 20: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Experimental Methodology

[Figure: three equal-area configurations. SCMP: 16 Niagara-like cores. ACMP and ACS: one large core plus 12 Niagara-like cores each.]

SCMP
• All small cores
• Conventional locking

ACMP
• One large core (area-equal to 4 small cores)
• Conventional locking

ACS
• ACMP with a CSRB
• Accelerates critical sections

Page 21: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Experimental Methodology

• Workloads
  – 12 critical-section-intensive applications from various domains
  – 7 use coarse-grain locks and 5 use fine-grain locks
• Simulation parameters:
  – x86 cycle-accurate processor simulator
  – Large core: similar to Pentium-M with 2-way SMT; 2GHz, out-of-order, 128-entry ROB, 4-wide issue, 12-stage
  – Small core: similar to Pentium 1; 2GHz, in-order, 2-wide issue, 5-stage
  – Private 32 KB L1, private 256 KB L2, 8 MB shared L3
  – On-chip interconnect: bi-directional ring

Page 22: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

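Workloads with Coarse-Grain Locks

Equal-area comparison; number of threads = best-performing thread count

• Chip area = 16 cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
• Chip area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores

[Figure: results for the two chip areas; the only values that survive in the extracted text are the labels 210, 150, 210, 150.]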

Page 23: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Workloads with Fine-Grain Locks

Equal-area comparison; number of threads = best-performing thread count

• Chip area = 16 cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
• Chip area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores

Page 24: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Equal-Area Comparisons

Number of threads = number of cores

[Figure: speedup over a small core (y-axis) versus chip area in small cores (x-axis) for SCMP, ACMP, and ACS. Panels: (a) ep, (b) is, (c) pagemine, (d) puzzle, (e) qsort, (f) tsp, (g) sqlite, (h) iplookup, (i) oltp-1, (j) oltp-2, (k) specjbb, (l) webcache.]

Page 25: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

ACS on Symmetric CMP

The majority of the benefit comes from the large core

Page 26: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Outline

• Background
• Mechanism
• Performance Trade-Offs
• Evaluation
• Related Work and Summary

Page 27: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Related Work

• Improving locality of shared data by thread migration and software prefetching (Sridharan+, Trancoso+, Ranganathan+)
  – ACS not only improves locality but also uses a large core to accelerate critical section execution
• Asymmetric CMPs (Morad+, Kumar+, Suleman+, Hill+)
  – ACS accelerates not only Amdahl’s serial bottleneck but also critical sections
• Remote procedure calls (Birrell+)
  – ACS is for critical sections among shared-memory cores

Page 28: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Hiding Latency of Critical Sections

• Transactional memory (Herlihy+)
  – ACS does not require code modification
• Transactional Lock Removal (Rajwar+) and Speculative Synchronization (Martinez+)
  – Hide critical section latency by increasing concurrency; ACS reduces the latency of each critical section
  – Overlap execution of critical sections with no data conflicts; ACS accelerates ALL critical sections
  – Do not improve locality of shared data; ACS improves locality of shared data
• ACS outperforms TLR (Rajwar+) by 18% (details in paper)

Page 29: Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures

Conclusion

• Critical sections reduce performance and limit scalability
• Accelerate critical sections by executing them on a powerful core
• ACS reduces average execution time by:
  – 34% compared to an equal-area SCMP
  – 23% compared to an equal-area ACMP
• ACS improves scalability of 7 of the 12 workloads
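For readers who think in speedups rather than execution-time reductions, the two averages above convert as follows. Here $r$ is the fractional reduction in execution time, a symbol introduced only for this note.

```latex
\text{speedup} = \frac{1}{1 - r}, \qquad
r = 0.34 \;\Rightarrow\; \frac{1}{0.66} \approx 1.52\times, \qquad
r = 0.23 \;\Rightarrow\; \frac{1}{0.77} \approx 1.30\times
```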
