Charm++ on NUMA Platforms:
the impact of SMP Optimizations
and a NUMA-aware Load Balancer
Laércio Pilla (UFRGS/Brazil - INRIA)
Christiane Pousa (INRIA)
Daniel Cordeiro (USP/Brazil - INRIA)
Jean-François Méhaut (INRIA)
Outline
• Introduction
• Performance Evaluation of SMP Optimizations of
Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
Motivation for NUMA platforms
• The number of cores per processor is
increasing
• Hierarchical shared memory multiprocessors
• ccNUMA is coming back (NUMA factor)
[Diagram: earlier NUMA machines had four nodes (Node 0–3), each with its memory bank (M0–M3) and a single CPU; now each node holds a multicore processor with several cores.]
NUMA problems
• Remote access
• Optimization of latency
[Diagram: 8 NUMA nodes (Node 0–7), each with its memory bank (M0–M7) and four cores; a core accessing memory on a remote node pays the interconnect latency.]
NUMA problems
• Memory contention
• Optimization of bandwidth
[Diagram: the same 8-node machine; many cores accessing the same memory bank contend for its bandwidth.]
NUMA problems
On NUMA machines,
data distribution
matters!
Charm++ Parallel
Programming System
• Platform independent
• Both shared and distributed memory
• Architecture abstraction
• Programmer productivity
From charm++ site: http://charm.cs.uiuc.edu/research/charm/
Charm++ Parallel
Programming System
• Communications originally implemented
with message passing
• Even on SMP machines
• Currently includes optimizations for SMP
systems
• Chao Mei et al., “Optimizing a parallel runtime system
for multicore clusters: a case study”, in TG ‘10
Charm++ & NUMA
• How do these optimizations perform on
NUMA machines?
• How can we use knowledge of the NUMA
system to improve Charm++
performance?
Charm++ & NUMA
• How do these optimizations perform on
NUMA machines?
• Our evaluation
• How can we use knowledge of the NUMA
system to improve Charm++
performance?
• NUMA-aware load balancer
Outline
• Introduction
• Performance Evaluation of SMP Optimizations
of Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
Evaluation of SMP optimizations
• Different Charm++ versions
• With optimizations
• Without optimizations
• Different architecture compilations (flavors)
• net-linux: distributed memory
• multicore: shared memory
NUMA machines
• AMD Opteron
• 8 nodes x 2 cores
• @ 2.2 GHz
• 2 MB L2 cache
• 32 GB main memory
• Low latency for local
memory access
• Crossbar interconnect
• NUMA factor: 1.2 – 1.5
• Linux 2.6.32.6
[Diagram: 8 nodes (Node 0–7), each with its local memory bank (M0–M7) and two cores (C0–C15).]
NUMA machines
• Intel Xeon X7560
• 4 nodes x 8 cores
• @ 2.27 GHz
• 24 MB shared L3 cache
• 64 GB main memory
• QuickPath
• NUMA factor: 2 - 2.6
• Linux 2.6.32
[Diagram: 4 nodes (Node 0–3), each with its local memory bank (M0–M3) and eight cores (C0–C31).]
Experimental setup
• Exclusive access to the machines
• Minimum of 10 executions
• Low standard deviation (< 5%)
• Different numbers of cores
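The stability criterion above (relative standard deviation below 5% over at least 10 runs) can be checked with a small helper; a minimal C++ sketch, with illustrative names not taken from the talk:

```cpp
#include <cmath>
#include <vector>

// Relative standard deviation (stddev / mean) of repeated execution
// times, used to verify the "< 5%" stability criterion.
double relativeStddev(const std::vector<double>& times) {
  double mean = 0.0;
  for (double t : times) mean += t;
  mean /= times.size();
  double var = 0.0;
  for (double t : times) var += (t - mean) * (t - mean);
  var /= times.size();  // population variance over the sampled runs
  return std::sqrt(var) / mean;
}
```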
Benchmark: Jacobi2D
• Iterative benchmark
• Computations over 2D matrix
• Communications with 4 neighbors
• Stencil (CPU bound)
• Imbalanced
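The computation pattern above is a 5-point stencil; a minimal C++ sketch of one iteration (illustrative only, not the actual Charm++ benchmark code):

```cpp
#include <cstddef>
#include <vector>

// One Jacobi iteration on an N x N grid: each interior point becomes
// the average of its four neighbors; boundary values stay fixed.
std::vector<std::vector<double>>
jacobiStep(const std::vector<std::vector<double>>& a) {
  std::size_t n = a.size();
  auto next = a;  // copy keeps the boundary unchanged
  for (std::size_t i = 1; i + 1 < n; ++i)
    for (std::size_t j = 1; j + 1 < n; ++j)
      next[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] +
                           a[i][j - 1] + a[i][j + 1]);
  return next;
}
```

When the matrix is decomposed into chares, each chare only needs the border rows/columns of its 4 neighbors, which is the communication pattern the slide describes.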
Jacobi2D on Opteron Machine
[Plot: average iteration time (s, lower is better) vs. number of cores (2–16), for the multicore and net-linux flavors, with and without optimizations. No significant difference between versions.]
Jacobi2D on Xeon Machine
[Plot: average iteration time (s, lower is better) vs. number of cores (2–32), same four configurations. Differences are within the error margin.]
Benchmark: kNeighbor
• Synthetic benchmark
• Completely communication bound
• Each chare communicates with k
neighbors
• k = 3
• Message size = 1024 B
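The communication pattern can be illustrated by the destination set of each chare; a sketch assuming a ring of chares each sending to its k following neighbors (illustrative, not the benchmark's source):

```cpp
#include <vector>

// Destinations of chare `self` in a ring of `p` chares, each sending
// a message to its k following neighbors (k = 3 in the experiments).
std::vector<int> neighbors(int self, int p, int k) {
  std::vector<int> out;
  for (int d = 1; d <= k; ++d)
    out.push_back((self + d) % p);  // wrap around the ring
  return out;
}
```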
kNeighbor on Opteron Machine
[Plot: average iteration time (µs, lower is better) vs. number of cores (2–16), same four configurations. Annotated speedups: 1.2 and 9.]
kNeighbor on Xeon Machine
[Plot: average iteration time (µs, lower is better) vs. number of cores (2–32), same four configurations. Annotated speedups: 2 and 4.7.]
Partial conclusions
• Execution times can differ by up to 50%
between Charm++ versions
• Times up to 90% smaller when using
multicore instead of net-linux
• Impact is proportional to the amount of
communication
Outline
• Introduction
• Performance Evaluation of SMP Optimizations of
Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
NUMA-aware Load Balancer
• Use knowledge about the system
• NUMA-factor among nodes
• Collected through libarchtopo
• Communication history
• No knowledge about the chare’s memory
• Improve performance by reducing
communication latency
• Avoid too many chare migrations
NUMA-aware Load Balancer
Calculate the processors' loads
Sort chares by decreasing load
While there are migratable chares:
  Pick the most loaded chare k
  Compute W(k,i) for all processors i
  Migrate k to the processor with the smallest W(k,i)
NUMA-aware Load Balancer

W(k,i) = L(i) + ɑ·( −M(k,i) + Σ j=1..N, j≠i M(k,j)·NF(j,i) )

where:
• L(i): load on the candidate processor (core) i
• ɑ: communication weight (constant)
• M(k,i): intra-core communications of chare k
(extended for intra-NUMA node)
• M(k,j)·NF(j,i): inter-core communications
(extended for inter-NUMA node), weighted by the NUMA factor NF(j,i)
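The greedy step and the cost function above can be sketched together in C++; this is a sketch with illustrative names and data layouts, not the actual NumaLB source:

```cpp
#include <cstddef>
#include <vector>

// Cost W(k,i) of placing chare k on core i: the core's current load
// plus a communication term in which remote traffic is weighted by
// the NUMA factor NF(j,i).
double W(std::size_t k, std::size_t i,
         const std::vector<double>& load,               // L(i)
         const std::vector<std::vector<double>>& comm,  // M(k,j)
         const std::vector<std::vector<double>>& nf,    // NF(j,i)
         double alpha) {
  double c = -comm[k][i];  // reward communication local to core i
  for (std::size_t j = 0; j < load.size(); ++j)
    if (j != i)
      c += comm[k][j] * nf[j][i];  // penalize remote communication
  return load[i] + alpha * c;
}

// Greedy step: the destination of chare k is the core minimizing W(k,i).
std::size_t bestCore(std::size_t k,
                     const std::vector<double>& load,
                     const std::vector<std::vector<double>>& comm,
                     const std::vector<std::vector<double>>& nf,
                     double alpha) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < load.size(); ++i)
    if (W(k, i, load, comm, nf, alpha) < W(k, best, load, comm, nf, alpha))
      best = i;
  return best;
}
```

With ɑ = 0, this degenerates into a purely load-based greedy balancer; larger ɑ makes the NUMA-factor-weighted communication term dominate placement.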
Load Balancer Evaluation
• Benchmarks
• Imbalance
• Jacobi2D
• Poisson3D
• Comparison with different load balancers
• GreedyLB
• GreedyCommLB
Benchmark: Imbalance
• By Isaac Dooley
• Based on Fractography3D
• Iterative benchmark
• Imbalance increases with computations
• Computations over 2D array of chares
• Communications with 4 neighbors
Imbalance on Opteron Machine
[Plot: total time speedup (higher is better) vs. number of cores (8, 16), for No LB, GreedyCommLB, GreedyLB, and NumaLB. Annotated differences: 15% and 5%.]
Imbalance on Xeon Machine
[Plot: total time speedup (higher is better) vs. number of cores (8–32), same load balancers. Differences of ~5% are within the error margin.]
Benchmark: Jacobi2D
• Iterative benchmark
• Computations over 2D matrix
• Communications with 4 neighbors
• Stencil (CPU bound)
• Imbalanced
Jacobi2D on Opteron Machine
[Plot: iteration time speedup (higher is better) vs. number of cores (8, 16), for No LB, GreedyCommLB, GreedyLB, and NumaLB.]
Jacobi2D on Xeon Machine
[Plot: iteration time speedup (higher is better) vs. number of cores (8–32), same load balancers. Annotated differences: 17% and 3%.]
Benchmark: Poisson3D
• By Xavier Besseron and Thierry Gautier
• Solves the Poisson equation on a 3D
domain
• Parallelized by domain decomposition
• Well balanced
Poisson3D on Opteron Machine
[Plot: total time speedup (higher is better) vs. number of cores (8, 16), same load balancers. Labeled speedups: 0.900, 0.797, 0.999, 0.989.]
Poisson3D on Xeon Machine
[Plot: total time speedup (higher is better) vs. number of cores (8–32), same load balancers. Labeled speedups: 0.914, 0.837, 0.711 and 0.995, 0.993, 0.983.]
GreedyLB performance decreased due to
migration overhead
Outline
• Introduction
• Performance Evaluation of SMP Optimizations of
Charm++ on NUMA Machines
• NUMA-aware Load Balancer on Charm++
• Conclusion and Future Work
Conclusion
• SMP optimizations do affect the
performance on NUMA machines
• Up to 50% between versions and 90%
between architecture-specific compilations
• Gains with the NUMA-aware LB
• Speedups of up to 2.8 (compared to no LB)
• Performance close to GreedyLB
• While avoiding excessive migrations
Future Work
• Evolution of NUMA-aware LB
• Consider topology
• Number of hops
• Cache hierarchy
• Memory per chare
• Improve NUMA information discovery
• Initialization overheads
• Run experiments with communication intensive
benchmarks
• Interface for a memory LB?
Charm++ on NUMA Platforms:
the impact of SMP Optimizations
and a NUMA-aware Load Balancer
Thank you