Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases


Page 1: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Can You Get Performance from Xeon Phi Easily?

Lessons Learned from Two Real Cases

Page 2: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Objective

• Check the amount of work needed to use Intel Xeon Phi.

• Minimal modifications using only pragmas.

• Two applications:

– CalcunetW. Test MKL libraries.

– GammaMaps. Test pragmas.

• Two modes:

– Native: compiled to execute only on Xeon Phi

– Offload: uses Host + Xeon Phi

Page 3: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

CalcuNetw: Calculate Measurements in Complex Networks

• Complex networks consist of sets of nodes (vertices) joined together in pairs by links (edges).

• The application calculates, for each network:

– Subgraph Centrality (SC): characterizes the participation of each node in all subgraphs of the network.

– SC odd: accounts only for closed walks of odd length.

– SC even: accounts only for closed walks of even length.

– Bipartivity: the proportion of even closed walks to the total number of closed walks in the network.

– Network Communicability for Connected Nodes, C(p,q): measures how well two nodes in the network communicate.

– Network Communicability C(G): the mean of all the C(p,q).

Mouriño J.C., Estrada E., Gómez A., "CalcuNetw: Calculate Measurements in Complex Networks", Informe Técnico CESGA-2005-003
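The measures listed above can be illustrated with a small sketch. It is not the CalcuNetW code, which the slides say relies on MKL. It assumes the usual matrix-function definitions (SC from the diagonal of exp(A), the odd and even parts from sinh(A) and cosh(A), C(p,q) from the off-diagonal entries of exp(A)) and approximates them with a truncated power series in pure Python. The function name `network_measures`, the `terms` cutoff, and the choice to average C(p,q) over all distinct node pairs are assumptions made for illustration.

```python
import math


def network_measures(adj, terms=60):
    """Approximate walk-based measures of a small undirected graph.

    adj is a symmetric 0/1 adjacency matrix given as a list of lists.
    The series sum_k A^k / k! is truncated after `terms` terms (assumption).
    """
    n = len(adj)
    # term holds A^k / k!, starting from the identity matrix (k = 0)
    term = [[float(i == j) for j in range(n)] for i in range(n)]
    comm = [row[:] for row in term]   # running sum, approximates exp(A)
    sc_even = [1.0] * n               # the k = 0 term belongs to the even part
    sc_odd = [0.0] * n
    for k in range(1, terms):
        # term <- term * A / k
        term = [[sum(term[i][m] * adj[m][j] for m in range(n)) / k
                 for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                comm[i][j] += term[i][j]
            if k % 2 == 0:
                sc_even[i] += term[i][i]
            else:
                sc_odd[i] += term[i][i]
    sc = [e + o for e, o in zip(sc_even, sc_odd)]
    # Share of even closed walks among all closed walks
    bipartivity = sum(sc_even) / sum(sc)
    # Assumption: C(G) is averaged over all distinct pairs p < q (needs n >= 2)
    pairs = [comm[p][q] for p in range(n) for q in range(p + 1, n)]
    mean_comm = sum(pairs) / len(pairs)
    return sc, sc_odd, sc_even, bipartivity, comm, mean_comm


if __name__ == "__main__":
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # bipartite 3-node path
    sc, sc_odd, sc_even, bip, comm, mean_comm = network_measures(path)
    print("SC:", sc)
    print("bipartivity:", bip)
    print("C(G):", mean_comm)
    print("reference SC of middle node:", math.cosh(math.sqrt(2)))
```

A bipartite graph has no closed walks of odd length, so its odd part vanishes and its bipartivity equals 1. A production code would more likely obtain these quantities from an eigendecomposition computed by a tuned library.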

Page 4: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

CalcuNetW

Page 5: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

GammaMaps: A figure-of-merit in Radiation Therapy

[Figure: 3D voxel grids with X, Y, Z axes; label: Dose in voxel i,j,k]

Page 6: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

GammaMaps: A figure-of-merit in Radiation Therapy

Read Doses → Initialise and normalise → Compute Gamma → Store Gamma

• Application in FORTRAN 90

• Parallelised using OpenMP

• Geometric algorithm*

• 512 x 512 x 128 = 33,554,432 voxels

• Auto-vectorization

• Pragmas for offload

* T. Ju, T. Simpson, J. O. Deasy, and D. A. Low, "Geometric interpretation of the γ dose distribution comparison technique: Interpolation-free calculation," Medical Physics, vol. 35, no. 3, p. 879, 2008.
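As a rough illustration of the "Compute Gamma" step, the sketch below evaluates the standard gamma figure-of-merit by brute force: for each reference voxel it takes the minimum, over nearby evaluated voxels, of the combined distance and dose difference scaled by their tolerances. This is deliberately not the interpolation-free geometric algorithm cited above, and it is not the GammaMaps Fortran code. The function name `gamma_map`, the `search` window, and the assumption that both dose grids are already normalised and share one isotropic voxel spacing are hypothetical choices for the example.

```python
import math


def gamma_map(ref, ev, spacing, dose_tol, dist_tol, search=2):
    """Brute-force gamma value for every voxel of `ref` against `ev`.

    ref, ev  : nested lists [x][y][z] of doses with identical shape
    spacing  : voxel size, in the same length unit as dist_tol (assumed isotropic)
    dose_tol : dose-difference criterion, in the same unit as the doses
    dist_tol : distance-to-agreement criterion
    search   : half-width of the search window in voxels (assumption)
    """
    nx, ny, nz = len(ref), len(ref[0]), len(ref[0][0])
    out = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                best = float("inf")
                # Scan the evaluated dose in a window around voxel (i, j, k)
                for a in range(max(0, i - search), min(nx, i + search + 1)):
                    for b in range(max(0, j - search), min(ny, j + search + 1)):
                        for c in range(max(0, k - search), min(nz, k + search + 1)):
                            dist2 = ((a - i) ** 2 + (b - j) ** 2 + (c - k) ** 2) * spacing ** 2
                            diff = ev[a][b][c] - ref[i][j][k]
                            g2 = dist2 / dist_tol ** 2 + diff ** 2 / dose_tol ** 2
                            best = min(best, g2)
                out[i][j][k] = math.sqrt(best)
    return out


if __name__ == "__main__":
    ref = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
    ev = [[[1.03] * 2 for _ in range(2)] for _ in range(2)]
    print(gamma_map(ref, ev, spacing=1.0, dose_tol=0.03, dist_tol=3.0)[0][0][0])
```

Every reference voxel is processed independently of the others, which is what makes the outer loops natural candidates for OpenMP parallelisation and offload in the real application.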

Page 7: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Results of Experiments

Page 8: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Platform

Host
CPU Model           Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Nr. of cores        16
Memory              32788 MB
Operating System    Linux 2.6.32-279.el6.x86_64
Compiler Version    2013U2

Intel Xeon Phi
Model               Beta0 Engineering Sample
Nr. of cores        61 at 1.09GHz
Memory              7936 MB
Operating System    MPSS Gold U1
Compiler Version    2013U2
GDDR Technology     GDDR5
GDDR Frequency      2750000 kHz

• Remote access to Intel systems

• Feb. 2013

Page 9: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Intel Xeon Phi Affinity Policies

[Figure: four panels showing how threads 0 to 7 are placed on four cores (C1 to C4) with four hardware threads each (HT1 to HT4).]

COMPACT - FINE: 0 1 2 3 4 5 6 7

SCATTER - FINE: 0 4 1 5 2 6 3 7

BALANCED - FINE: 0 1 2 3 4 5 6 7

BALANCED - CORE: {0,1} {2,3} {4,5} {6,7}

• Type

– Compact

– Scatter

– Balanced

• Granularity

– Fine or Thread

– Core
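The three affinity types can be mimicked with a simplified placement model. The sketch below only assigns each thread to a core and reproduces the 8-thread, 4-core layout of the slide; it is not the Intel OpenMP runtime's actual algorithm, and the function names are hypothetical. The helper `kmp_affinity` builds a value in the `granularity=...,type` form accepted by the Intel runtime's `KMP_AFFINITY` environment variable.

```python
def place_threads(policy, n_threads, n_cores=4, ht_per_core=4):
    """Return the core index of each thread under a simplified model."""
    if n_threads > n_cores * ht_per_core:
        raise ValueError("more threads than hardware threads")
    if policy == "compact":
        # Fill every hardware thread of a core before moving to the next core
        return [t // ht_per_core for t in range(n_threads)]
    if policy == "scatter":
        # Round-robin over cores, so neighbouring thread ids land on different cores
        return [t % n_cores for t in range(n_threads)]
    if policy == "balanced":
        # Spread threads evenly, keeping neighbouring thread ids on the same core
        base, extra = divmod(n_threads, n_cores)
        cores = []
        for core in range(n_cores):
            cores.extend([core] * (base + (1 if core < extra else 0)))
        return cores
    raise ValueError(f"unknown policy: {policy}")


def by_core(placement, n_cores=4):
    """Group thread ids by the core they were assigned to."""
    groups = [[] for _ in range(n_cores)]
    for thread, core in enumerate(placement):
        groups[core].append(thread)
    return groups


def kmp_affinity(policy, granularity):
    """Build a KMP_AFFINITY value such as 'granularity=fine,scatter'."""
    return f"granularity={granularity},{policy}"


if __name__ == "__main__":
    for policy in ("compact", "scatter", "balanced"):
        print(policy, by_core(place_threads(policy, 8)), kmp_affinity(policy, "fine"))
```

With fine (thread) granularity each thread is bound to one specific hardware thread, while with core granularity it may run on any hardware thread of its assigned core, which is why the BALANCED - CORE panel shows sets such as {0,1} instead of individual positions.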

Page 10: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Results for CalcunetW

Page 11: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

CalcunetW

Page 12: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

CalcunetW

Page 13: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

CalcunetW

Page 14: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Results for GammaMaps

Page 15: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

GammaMaps

Page 16: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Host

[Figure: Elapsed Time (s) versus Nr. of Threads on the host; series: local-compact-core, local-compact-fine, local-scatter-fine, local-scatter-core]

Page 17: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

GammaMaps

Page 18: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Xeon Phi poor I/O

Page 19: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Conclusions

• Using the MKL library is easy and does not require changes in the code.

• Simple pragmas in the code permit fast usage.

• I/O performance issues on Xeon Phi.

• 1 Xeon Phi ~ 1 Xeon E5-2680

• Improving performance requires additional work.

Page 20: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Acknowledgements

The authors would like to thank Intel for providing access to the Intel Xeon Phi coprocessor.

Page 21: Can You Get Performance from Xeon Phi Easily? Lessons Learned from Two Real Cases

Questions

Andrés Gómez

José Carlos Mouriño

Carmen Cotelo

Aurelio Rodríguez

The TEAM