Scaling the SFXC Software Correlator Mark Kettenis IVTW 2016, MIT Haystack Observatory

SFXC Features

• FX software correlator

• Data formats: Mark4, VLBA, Mark5B, VDIF

• Delay model: CALC10 (same as Mark4@JIVE), or external

• WOLA: Hann, Hamming, Cosine, Rectangular

• VEX driven, with JSON configuration file

• Implemented using MPI

• Optionally uses commercial Intel IPP library

• Pulsar binning

• Incoherent or coherent de-dispersion

• Mixed bandwidth correlation / zoom bands

• Phased array mode
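The window options in the WOLA bullet above can be sketched with their textbook definitions. This is a minimal illustration, not SFXC code: the function names, the symmetric window form, and the half-segment overlap are all assumptions made for the example.

```python
import math

def window(name, n):
    """Return n samples of a textbook symmetric window (not taken from SFXC)."""
    if n < 2:
        raise ValueError("need at least two samples")
    x = [i / (n - 1) for i in range(n)]  # runs from 0 to 1 across the segment
    if name == "rectangular":
        return [1.0] * n
    if name == "hann":
        return [0.5 - 0.5 * math.cos(2 * math.pi * t) for t in x]
    if name == "hamming":
        return [0.54 - 0.46 * math.cos(2 * math.pi * t) for t in x]
    if name == "cosine":
        return [math.sin(math.pi * t) for t in x]
    raise ValueError("unknown window: " + name)

def overlapped_segments(samples, n, window_name):
    """Cut samples into windowed segments of length n that overlap by n // 2.

    The 50% overlap is an assumption for this sketch.
    """
    w = window(window_name, n)
    step = n // 2
    return [[s * c for s, c in zip(samples[start:start + n], w)]
            for start in range(0, len(samples) - n + 1, step)]
```

Each returned segment would then be passed to the FFT; the non-rectangular windows trade some sensitivity for lower spectral leakage.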

Paper & Source Code

A. Keimpema, M.M. Kettenis et al., The SFXC software correlator for very long baseline interferometry: algorithms and implementation, Experimental Astronomy, Volume 39, Issue 2, pp. 259-279

arXiv:1502.00467

Source code available through nightly mirror of svn repository. See: http://www.jive.nl/jivewiki/doku.php?id=sfxc

SFXC Algorithm

Sergei Pogrebenko

[Block diagram with time-domain and frequency-domain stages. Blocks: Integer Delay, Fractional Delay, Fringe Rotation, WOLA, FFT, inverse FFT (FFT⁻¹), Cross Multiplication, Accumulation & Spectral Averaging.]
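The FFT, cross multiplication and accumulation blocks named on the SFXC Algorithm slide can be illustrated with a small stand-alone sketch. It is not SFXC code: delay correction, fringe rotation and windowing are left out, a naive DFT stands in for an FFT library, and all function and variable names are invented for the example.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT; a real correlator would call an FFT library."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fx_correlate(a, b, nfft):
    """Average the cross spectrum A(f) * conj(B(f)) over nfft-sample segments."""
    acc = [0j] * nfft
    nseg = len(a) // nfft  # assumes at least one full segment is available
    for s in range(nseg):
        fa = dft(a[s * nfft:(s + 1) * nfft])  # "F": transform each station
        fb = dft(b[s * nfft:(s + 1) * nfft])
        for k in range(nfft):                 # "X": multiply per channel
            acc[k] += fa[k] * fb[k].conjugate()
    return [v / nseg for v in acc]            # accumulation as a time average


# Two copies of one tone in channel 3; station b lags by 0.5 rad of phase.
nfft, chan, phase = 16, 3, 0.5
a = [cmath.exp(2j * cmath.pi * chan * t / nfft) for t in range(4 * nfft)]
b = [v * cmath.exp(-1j * phase) for v in a]
spectrum = fx_correlate(a, b, nfft)
peak = max(range(nfft), key=lambda k: abs(spectrum[k]))
```

In this toy case the cross spectrum peaks in the tone's channel, and its phase there equals the phase offset between the two inputs.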

Algorithm Scaling

Unpacking ~ N_s Δν

Delay correction ~ N_s Δν

Fringe Rotation ~ N_s Δν

FFT ~ N_s log N_f Δν

X ~ N_s^2 Δν
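The proportionalities above can be turned into a rough relative cost model to see when the cross multiplication (X) term starts to dominate. This is only a sketch: the constant factors are unknown and set to 1 here, N_f is taken to be the FFT length, the logarithm is base 2, and the function names are made up for the illustration.

```python
import math

def relative_cost(n_stations, n_fft, bandwidth=1.0):
    """Relative cost per step, following the proportionalities on the slide.

    Every step gets a constant factor of 1, which is an assumption.
    """
    per_station = n_stations * bandwidth
    return {
        "unpacking": per_station,
        "delay correction": per_station,
        "fringe rotation": per_station,
        "fft": per_station * math.log2(n_fft),
        "cross multiplication": n_stations ** 2 * bandwidth,
    }

def x_fraction(n_stations, n_fft):
    """Share of the total relative cost spent in cross multiplication."""
    cost = relative_cost(n_stations, n_fft)
    return cost["cross multiplication"] / sum(cost.values())
```

With these unit weights, `x_fraction(8, 2048)` is about 0.36 and `x_fraction(32, 2048)` about 0.70, which shows the quadratic term taking over as stations are added. The absolute numbers depend entirely on the assumed constants.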

Building up the cluster

Year   Nodes  CPUs             SIMD     Clockspeed (GHz)  Cores  Stations @1Gb/s
2010   16     Xeon E5520       SSE 4.2  2.3               128    ?
2011   16     Xeon E5620       SSE 4.2  2.4               128    7
2012   8      Xeon E5-2670     AVX      2.6               128    13
2015   4      Xeon E5-2630v3   AVX2     2.4               64     16?
2016   4      Xeon E5-2630v3   AVX2     2.4               64     19?
Total  48                                                 512

Parallelization

• Frequency axis

• Polarisation (only if no XPols)

• Time axis
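The three axes listed above can be sketched as an enumeration of independent jobs. The job layout, the polarisation labels "R" and "L", and the round-robin assignment are hypothetical choices for this illustration and do not describe how SFXC actually schedules work.

```python
from itertools import product

def work_units(n_subbands, n_time_slices, polarisations=("R", "L"), cross_pols=False):
    """Enumerate independent jobs along the frequency, polarisation and time axes."""
    # Cross-polarisation products need both polarisations in the same job,
    # so the polarisation axis can only be split when they are not requested.
    pol_groups = [tuple(polarisations)] if cross_pols else [(p,) for p in polarisations]
    return [{"subband": sb, "pols": pols, "time_slice": ts}
            for ts, sb, pols in product(range(n_time_slices), range(n_subbands), pol_groups)]

def assign_round_robin(units, n_workers):
    """Deal jobs out to workers in turn (a placeholder scheduling policy)."""
    return {w: units[w::n_workers] for w in range(n_workers)}
```

For example, 8 sub-bands and 4 time slices give 64 jobs without cross-polarisation products and 32 jobs with them.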

NEXPReS: 4Gb/s e-VLBI

[Diagram labels: stations Ef, Hh, On, Ys; DBBC, FILA10G; Mark5B, VDIF; “Harrobox”, jive5ab; 4 × 128 MHz; 4 × SFXC.]

Distributed Correlation

NEXPReS Demo (June 2013)

• Two sites: PSNC and JIVE

• Simulated data from 4 stations:

• 512 Mbit/s

• Using jive5ab as corner turner

• VDIF/VTP for data transport

Use dedicated bridge nodes for real-time data input

e-EVN at 2Gb/s

• DBBC2 + FILA10G

• With dbbcproxy to allow the correlator to control the data flow

• FILA10G corner turning

• 8k packets vs. 2k packets

• Toss-up between per-packet overhead and corner turning

• Network performance: bps vs pps
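The bps versus pps trade-off above comes down to simple arithmetic: at a fixed bit rate, smaller packets mean more packets per second and more header bytes on the wire. The sketch below assumes that "8k" and "2k" mean 8192-byte and 2048-byte payloads, and the 64-byte per-packet overhead in the usage note is a placeholder, not a measured value.

```python
def packets_per_second(bit_rate, payload_bytes):
    """Packet rate needed to carry bit_rate bits/s of payload."""
    return bit_rate / (8 * payload_bytes)

def wire_rate(bit_rate, payload_bytes, header_bytes):
    """Bits/s on the wire once a fixed per-packet header is added."""
    return packets_per_second(bit_rate, payload_bytes) * 8 * (payload_bytes + header_bytes)
```

At 2 Gb/s this gives roughly 30.5 thousand packets per second for 8192-byte payloads and roughly 122 thousand for 2048-byte payloads. With the assumed 64-byte overhead, `wire_rate(2e9, 2048, 64)` comes to about 2.06 Gb/s.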

Internal Network Bottleneck

[Diagram labels: HP 8212zl and HP 5412zl switches; cluster nodes; 10 Gb/s links; one 1 Gb/s link.]

Multiple Phase Centers

Radcliffe et al.

699 sources in GOODS-N

Two areas:

• 15’ central area

• 20’ outer annulus

Multi-source Self Calibration

arXiv:1601.04452

8k spectral channels for 4% smearing

Multiple Phase Centers

• Decorrelation 2.5% at 4 arcmin with 10000 km baseline

• Subint 50 ms, FFT size 2048

Accelerators

Romein, 2016

[Figure with five panels: (a) FIR filter, (b) FFT, (c) delay/bandpass/transpose, (d) correlate, (e) full pipeline. Horizontal axis: #receivers (128 to 768). Vertical axis: GFLOPS/W. Devices compared: Xeon Phi 7120X, 2x Xeon E5-2697v3, GTX 1080, Titan X, Tesla K40, FirePro S10000, Fury X, DSP (float), DSP (int).]

Fig. 11: Energy efficiency of the individual kernels, as well as the full pipeline.

matching [13]. These papers compare performance, but none of them assesses energy efficiency.

VIII. CONCLUSIONS AND FUTURE WORK

The take-away message of this study is that GPUs are the most (energy) efficient accelerators, at least for a radio-astronomical correlator pipeline, and that the differences in performance and energy efficiency between the accelerators are large. It is difficult to obtain good memory performance on the Xeon Phi. The DSP has architectural features that support signal processing (like circular addressing, loop pipelining, some complex instructions, and multi-level DMA controllers), but these features make it difficult to program and do not make it more energy efficient than GPUs. The custom-built power sensors that we introduced in Sec. VI are highly useful to analyze energy efficiency, at millisecond timescales.

Future work includes a comparison with FPGAs. In a separate work, we study how to efficiently create sky images from correlated data on these accelerators.

ACKNOWLEDGMENTS

This work is supported by the NWO through Open Competition (Triple-A) and NWO-M grants (DAS-4, DAS-5 [14]), and by the Dutch Ministry of EZ and the province of Drenthe through the Dome grant. We thank AMD, Intel, and NVIDIA for their support and for providing us with hardware.

REFERENCES

[1] J. Romein, P. Broekema, J. Mol, and R. van Nieuwpoort, “The LOFAR Correlator: Implementation and Performance Analysis,” in PPoPP’10, Bangalore, India, January 2010, pp. 169–178.

[2] S. Williams, A. Waterman, and D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, April 2009.

[3] J. Fang, H. Sips, L. Zhang, C. Xu, Y. Che, and A. Varbanescu, “Test-Driving Intel Xeon Phi,” in ICPE’14, Dublin, Ireland, March 2014, pp. 137–148.

[4] J. Laros III, P. Pokorny, and D. DeBonis, “PowerInsight – A Commodity Power Measurement Capability,” in Int. Workshop on Power Measurement and Profiling, Arlington, VA, June 2013.

[5] J. Treibig, G. Hager, and G. Wellein, “LIKWID: A Lightweight Performance-oriented Tool Suite for x86 Multicore Environments,” in ICPPW’10, San Diego, CA, September 2010, pp. 207–216.

[6] D. Price, M. Clark, B. Barsdell, R. Babich, and L. Greenhill, “Optimizing performance-per-watt on GPUs in high performance computing,” Computer Science – Research and Development, pp. 1–9, Sept. 2015.

[7] R. van Nieuwpoort and J. Romein, “Correlating Radio Astronomy Signals with Many-Core Hardware,” Int. Journal of Parallel Processing, vol. 39, no. 1, pp. 88–114, February 2011.

[8] R. van Nieuwpoort and J. Romein, “Building Correlators with Many-Core Hardware,” IEEE Signal Processing Magazine, vol. 27, no. 2, pp. 108–117, March 2010.

[9] M. Clark, P. Plante, and L. Greenhill, “Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units,” Int. Jour. of High Performance Computing Applications, vol. 27, no. 2, pp. 178–192, 2013.

[10] L. Fiorin, E. Vermij, J. Lunteren, R. Jongerius, and C. Hagleitner, “An Energy-Efficient Custom Architecture for the SKA1-low Central Signal Processor,” in CF’15, Ischia, Italy, May 2015, pp. 5:1–5:8.

[11] A. Sclocco, H. Bal, J. Hessels, J. Leeuwen, and R. van Nieuwpoort, “Auto-Tuning Dedispersion for Many-Core Accelerators,” in IPDPS’14, Phoenix, AZ, May 2014, pp. 952–961.

[12] G. Teodoro, T. Kurc, J. Kong, L. Cooper, and J. Saltz, “Comparative Performance Analysis of Intel (R) Xeon Phi (TM), GPU, and CPU: A Case Study from Microscopy Image Analysis,” in IPDPS’14, Phoenix, AZ, May 2014, pp. 1063–1072.

[13] T. Tran, Y. Liu, and B. Schmidt, “Bit-parallel Approximate Pattern Matching: Kepler GPU versus Xeon Phi,” Parallel Computing, vol. 54, pp. 128–138, November 2015.

[14] H. Bal et al., “A Medium-Scale Distributed System for Computer Science Research: Infrastructure for the Long Term,” IEEE Computer, vol. 49, no. 5, pp. 54–63, May 2016.

Conclusions

• Division into sub-bands helps scaling our correlators

• Coordination of sample rates is still important

• SIMD improvements mean correlators still benefit from “Moore’s Law”

• Accelerators (GPUs, Xeon Phi) may help if number of stations increases