
NUCA Locks

Efficient Synchronization for Non-Uniform Communication Architecture

Zoran Radovic and Erik Hagersten
{zoran.radovic, erik.hagersten}@it.uu.se

Uppsala University, Department of Information Technology

Uppsala Architecture Research Team [UART]

Supercomputing 2002

Synchronization Basics

Locks are used to protect shared critical-section data.

    A := 0
    BARRIER

    LOCK(L)
    A := A + 1
    UNLOCK(L)

    LOCK(L)
    B := A + 5
    UNLOCK(L)
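As a hedged illustration of this slide, the C11 sketch below wraps the two updates in a minimal test&set spin lock. The function names (`lock_acquire`, `lock_release`, `increment_a`, `compute_b`) are illustrative and not from the talk, and it is an assumption that the two LOCK/UNLOCK blocks are meant to run on different threads. The lock makes each update atomic but does not order them, so B ends up as 5 or 6 depending on which block enters first.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared data from the slide; both start at 0 (A := 0 before the barrier). */
static int A = 0;
static int B = 0;

/* L: a minimal test&set spin lock built on C11 atomic_flag. */
static atomic_flag L = ATOMIC_FLAG_INIT;

static void lock_acquire(atomic_flag *l) {
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;  /* spin until the previous holder clears the flag */
}

static void lock_release(atomic_flag *l) {
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* First block on the slide: A := A + 1 inside the critical section. */
void increment_a(void) {
    lock_acquire(&L);
    A = A + 1;
    lock_release(&L);
}

/* Second block on the slide: B := A + 5 inside the critical section. */
void compute_b(void) {
    lock_acquire(&L);
    B = A + 5;
    lock_release(&L);
}
```

If `increment_a` runs first, `compute_b` reads A = 1; in the opposite order it reads A = 0. Both outcomes are legal under the lock.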


Simple Spin Locks

• test_and_test&set (TATAS), ’84
• TATAS with exponential backoff (TATAS_EXP), ’90
• Many variations

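[Figure: processors P1 … Pn, each with a cache ($), share one Memory that holds the lock word "Lock:". The lock word toggles between FREE and BUSY while the waiting processors busy-wait or back off.]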

    TATAS_LOCK(L) {
        while (tas(L)) {        // tas returns the old value: nonzero means BUSY
            while (*L)          // spin on plain reads until the lock looks FREE
                ;
        }
    }

    TATAS_UNLOCK(L) {
        *L = 0;                 // = FREE
    }
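The slide names TATAS with exponential backoff (TATAS_EXP) but shows code only for plain TATAS. The following C11 sketch shows one common way to add the backoff. The constants `BACKOFF_BASE`, `BACKOFF_FACTOR`, and `BACKOFF_CAP` and the helper names are assumptions chosen for illustration, not values from the talk.

```c
#include <assert.h>
#include <stdatomic.h>

enum { FREE = 0, BUSY = 1 };

/* Assumed tuning constants; real values are machine-specific. */
#define BACKOFF_BASE   16u
#define BACKOFF_FACTOR 2u
#define BACKOFF_CAP    4096u

typedef atomic_int tatas_lock;

/* Next backoff length: grow geometrically, but never past the cap. */
static unsigned next_backoff(unsigned b) {
    unsigned grown = b * BACKOFF_FACTOR;
    return grown < BACKOFF_CAP ? grown : BACKOFF_CAP;
}

/* Burn roughly n loop iterations without touching the lock word. */
static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

/* One test&set attempt: returns 1 if the lock was taken, 0 if it was BUSY. */
static int tatas_try(tatas_lock *L) {
    return atomic_exchange_explicit(L, BUSY, memory_order_acquire) == FREE;
}

void tatas_exp_lock(tatas_lock *L) {
    unsigned b = BACKOFF_BASE;
    while (!tatas_try(L)) {
        do {
            spin_delay(b);           /* stay off the interconnect for a while */
            b = next_backoff(b);
        } while (atomic_load_explicit(L, memory_order_relaxed) != FREE);
    }
}

void tatas_exp_unlock(tatas_lock *L) {
    atomic_store_explicit(L, FREE, memory_order_release);
}
```

Each failed attempt lengthens the pause before the next read of the lock word, which reduces traffic under contention at the price of slower handover.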


Performance Under Contention

[Figure: CS cost versus amount of contention, with curves for spin locks and spin locks w/ backoff.]

IF (more contention) THEN less efficient CS …


Making it Scalable: Queues …

• First-come, first-served order
• Starvation avoidance
• Maximal fairness
• Reduced traffic

Queue-based locks

• HW: QOLB ’89
• SW: MCS ’91
• SW: CLH ’93
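To make the queue idea concrete, here is a textbook-style C11 sketch of an MCS-like queue lock. It is not code from the talk, and the type and function names are illustrative. Each waiter spins on a flag in its own queue node and the lock is handed over in first-come, first-served order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* successor in the queue, if any */
    atomic_int locked;                /* 1 while this waiter must keep spinning */
} mcs_node;

typedef struct {
    _Atomic(mcs_node *) tail;         /* last waiter in the queue, or NULL */
} mcs_lock;

void mcs_init(mcs_lock *L) {
    atomic_init(&L->tail, NULL);
}

void mcs_acquire(mcs_lock *L, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, 1);
    mcs_node *pred = atomic_exchange(&L->tail, me);  /* join the queue */
    if (pred != NULL) {
        atomic_store(&pred->next, me);               /* link behind predecessor */
        while (atomic_load(&me->locked))
            ;                                        /* spin on our own node only */
    }
}

void mcs_release(mcs_lock *L, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&L->tail, &expected, NULL))
            return;                                  /* nobody was waiting */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                        /* waiter joined but has not linked yet */
    }
    atomic_store(&succ->locked, 0);                  /* hand over in FIFO order */
}
```

Because the next holder is simply whoever queued first, the handover ignores which node that thread runs in.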


Queue Locks Under Contention

[Figure: CS cost versus amount of contention, with curves for spin locks, spin locks w/ backoff, and queue-based locks.]

Queue-based locks: IF (more contention) THEN constant CS cost …


Non-Uniform Memory Architecture (NUMA)

Many NUMA optimizations have been proposed:

• Page migration
• Page replication

[Figure: two nodes joined by a Switch, each with processors P1 … Pn, caches ($), and its own Memory; the diagram is annotated with 1 and 2 – 10.]


Non-Uniform Communication Architecture (NUCA)

NUCA examples (NUCA ratios):

• 1992: Stanford DASH (~ 4.5)
• 1996: Sequent NUMA-Q (~ 10)
• 1999: Sun WildFire (~ 6)
• 2000: Compaq DS-320 (~ 3.5)
• Future: CMP, SMT (~ 10)

[Figure: two nodes joined by a Switch, each with processors P1 … Pn, caches ($), and its own Memory; the diagram is annotated "NUCA ratio" with 1 and 2 – 10, plus the callout "Our NUCA …".]


Our Goals

Design a scalable spin lock that exploits NUCAs:

• Creating node affinity
    • For lock handover
    • For CS data
• “Stable lock”
• Reducing traffic compared with the test&set locks


Outline

• Background & Motivation
• NUMA vs. NUCA
• The RH Lock
• Performance Results
• Application Study
• Conclusions


Key Ideas Behind RH Lock

• Minimizing global traffic at lock handover
    • Only one thread per node will try to acquire a remotely owned lock
• Maximizing node locality of NUCAs
    • Hand over the lock to a neighbor in the same node
    • Creates locality for the critical section (CS) data as well
    • Especially good for large CS and high contention
• RH lock in a nutshell:
    • Double TATAS_EXP: one node-local lock + one “global”


The RH Lock Algorithm

[Figure: Cabinet 1 (P1 … P16, each with a cache $, plus Memory) and Cabinet 2 (P17 … P32, each with a cache $, plus Memory). Lock1 and Lock2 entries appear in each cabinet and take the values FREE, L_FREE, REMOTE, or a thread ID such as 2 or 19 as the animation steps through acquires and releases around the critical section (CS).]

Acquire: SWAP(my_TID, Lock)

• If (FREE or L_FREE): You’ve got it!
• If “REMOTE”: Spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp backoff)
• else: TATAS(my_TID, Lock) until FREE or L_FREE

Release: CAS(my_TID, FREE) else L_FREE
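The acquire and release rules on this slide can be sketched in C11 roughly as follows. This is a simplified reading of the slide for exactly two nodes, not the authors' published code. The value encodings, the backoff constants, the `rh_` names, and the choice to let local spinners also wake up on REMOTE are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed encodings for the special values; thread IDs are assumed >= 0. */
enum { RH_FREE = -1, RH_L_FREE = -2, RH_REMOTE = -3 };

#define RH_BACKOFF_BASE 64u     /* assumed tuning constants */
#define RH_BACKOFF_CAP  8192u

/* One copy of the lock word per node (two nodes only). */
typedef struct {
    atomic_int copy[2];
} rh_lock;

/* Pause for *b iterations, then double *b up to the cap. */
static void rh_spin(unsigned *b) {
    for (volatile unsigned i = 0; i < *b; i++)
        ;
    if (*b < RH_BACKOFF_CAP)
        *b *= 2;
}

/* The lock starts FREE in node 0 and is seen as REMOTE from node 1. */
void rh_init(rh_lock *L) {
    atomic_init(&L->copy[0], RH_FREE);
    atomic_init(&L->copy[1], RH_REMOTE);
}

void rh_acquire(rh_lock *L, int node, int tid) {
    atomic_int *local = &L->copy[node];
    atomic_int *remote = &L->copy[1 - node];
    unsigned b = RH_BACKOFF_BASE;
    for (;;) {
        int old = atomic_exchange(local, tid);      /* SWAP(my_TID, Lock) */
        if (old == RH_FREE || old == RH_L_FREE)
            return;                                 /* you've got it */
        if (old == RH_REMOTE) {
            /* This thread is now its node's only remote spinner. */
            int expected = RH_FREE;
            while (!atomic_compare_exchange_strong(remote, &expected, RH_REMOTE)) {
                expected = RH_FREE;                 /* CAS(FREE, REMOTE) until FREE */
                rh_spin(&b);
            }
            return;
        }
        /* A neighbor's ID was there: spin locally until a special value shows. */
        while (atomic_load(local) >= 0)
            rh_spin(&b);
    }
}

void rh_release(rh_lock *L, int node, int tid) {
    int expected = tid;
    /* CAS(my_TID, FREE): no neighbor swapped in, so release globally. */
    if (!atomic_compare_exchange_strong(&L->copy[node], &expected, RH_FREE))
        atomic_store(&L->copy[node], RH_L_FREE);    /* else hand over within the node */
}
```

A neighbor that swaps in its own ID while the lock is held makes the holder's CAS fail, so the release writes L_FREE and the lock stays in the node. The remote spinner can only win once a release finds no local waiter and writes FREE.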


Our NUCA: Sun WildFire

[Figure: Sun WildFire (WF): two nodes of 14 processors each, joined by a Switch, each node with caches ($) and its own Memory; the diagram is annotated "NUCA ratio" with 1 and 6.]


NUCA-performance

[Figure: Time [microseconds] (0 to 60) versus Number of Processors (0 to 28) on the 14 + 14 processor system, for TATAS, TATAS_EXP, MCS, CLH, and RH.]

[Figure: Node handoffs [%] (0 to 100) versus Number of Processors (0 to 28), for the same five locks.]
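The slides do not spell out how "Node handoffs [%]" is computed. The sketch below assumes it means the share of lock handovers in which the new holder runs in a different node than the previous holder. The function name and the trace format are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * holder_node[i] is the node of the i-th thread to acquire the lock.
 * Returns the percentage of handovers whose new holder sits in a
 * different node than the previous one (0.0 if there are no handovers).
 */
double node_handoff_percent(const int *holder_node, size_t n) {
    if (n < 2)
        return 0.0;
    size_t crossings = 0;
    for (size_t i = 1; i < n; i++) {
        if (holder_node[i] != holder_node[i - 1])
            crossings++;
    }
    return 100.0 * (double)crossings / (double)(n - 1);
}
```

Under this assumed definition, a trace that strictly alternates between two nodes scores 100, while a trace that stays in one node before moving once scores much lower.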


New Microbenchmark

    for (i = 0; i < iterations; i++) {
        LOCK(L);
        delay(critical_work);   // CS
        UNLOCK(L);
        static_delay();
        random_delay();
    }

• More realistic node handoffs for queue-based locks
• Constant number of processors
• Amount of Critical Section (CS) work can be increased
    • we can control the “amount of contention”


Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs

[Figure: Time [seconds] (3 to 13) versus critical_work (0 to 2000), for TATAS, TATAS_EXP, MCS, CLH, and RH.]

[Figure: Node handoffs [%] (0 to 100) versus critical_work (0 to 2000), for the same five locks.]


Traffic Measurements
New microbenchmark; critical_work = 1500

[Figure: bar chart (scale 0.0 to 1.0) of Local Transactions and Global Transactions for TATAS_EXP, MCS, CLH, and RH.]


Application Performance
Raytrace Speedup

[Figure: Speedup (0 to 9) versus Number of Processors (0 to 28) on WildFire (WF), for TATAS, TATAS_EXP, MCS, and CLH.]


Application Performance
Raytrace Speedup

[Figure: Speedup (0 to 9) versus Number of Processors (0 to 28) on WildFire (WF), for TATAS, TATAS_EXP, MCS, CLH, and RH.]


RH Lock Under Contention

[Figure: CS cost versus amount of contention, with curves for spin locks, spin locks w/ backoff, queue-based locks, and the RH lock.]


Total Traffic: Raytrace

[Figure: bar chart (scale 0.0 to 1.4) of Local Transactions and Global Transactions for TATAS, TATAS_EXP, MCS, CLH, and RH.]


Application Performance
28-processor runs

[Figure: Normalized Speedup (0.0 to 3.0) for Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, Water-Nsq, and Average, with bars for TATAS, TATAS_EXP, MCS, CLH, and RH.]


Conclusions

• First-come, first-served is not desirable for NUCAs
• The RH lock exploits NUCAs by
    • creating locality through CS affinity (stable lock)
    • reducing traffic compared with the test&set locks
• The first lock that performs better under contention
• Global traffic is significantly reduced
• Applications with contended locks scale better with RH locks on NUCAs


Any Drawbacks?

• Proof-of-concept NUCA-aware lock for 2 nodes
• Hard to port to some architectures
    • Memory needs to be allocated/placed in different nodes
• Lock storage is proportional to #NUCA nodes
• Sensitive to starvation
    • “Non-uniform nature” of the algorithm
    • No mechanism for lowering the risk of starvation


Can We Fix It?

We propose a new set of NUCA-aware locks: Hierarchical Backoff Locks (HBO)

HPCA-9: Anaheim, California, February 2003

Teaser …

• Portable
• Scalable to many NUCA nodes
• Only cas atomic operations are used
• Only node_id is needed
• Lowers the risk of starvation
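The teaser gives only bullet points, so the following C11 sketch is a guess at the general shape of a hierarchical backoff lock and not the HPCA-9 algorithm itself. It assumes the lock word holds the holder's node_id, that acquisition uses a single cas, and that waiters in other nodes back off longer than waiters in the holder's node. All names and constants are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define HBO_FREE (-1)            /* assumed encoding: node IDs are >= 0 */

/* Assumed tuning constants: remote waiters get a larger backoff cap. */
#define HBO_BASE        32u
#define HBO_CAP_LOCAL   1024u
#define HBO_CAP_REMOTE  16384u

typedef atomic_int hbo_lock;     /* holds HBO_FREE or the holder's node_id */

static void hbo_spin(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

/* One cas attempt: returns HBO_FREE on success, else the holder's node_id. */
static int hbo_try(hbo_lock *L, int my_node) {
    int seen = HBO_FREE;
    atomic_compare_exchange_strong(L, &seen, my_node);
    return seen;
}

void hbo_acquire(hbo_lock *L, int my_node) {
    unsigned b = HBO_BASE;
    for (;;) {
        int holder = hbo_try(L, my_node);
        if (holder == HBO_FREE)
            return;                                   /* cas(FREE -> my_node) won */
        /* Same-node waiters retry sooner than waiters in other nodes. */
        unsigned cap = (holder == my_node) ? HBO_CAP_LOCAL : HBO_CAP_REMOTE;
        hbo_spin(b);
        b = (b * 2 < cap) ? b * 2 : cap;
    }
}

void hbo_release(hbo_lock *L) {
    atomic_store(L, HBO_FREE);                        /* plain store, no atomic RMW */
}
```

Because only the node_id is stored, the same code works for any number of nodes, which fits the "scalable to many NUCA nodes" bullet.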


UART’s Home Page

http://www.it.uu.se/research/group/uart