SFXC Features
• FX software correlator
• Data formats: Mark4, VLBA, Mark5B, VDIF
• Delay model: CALC10 (same as Mark4@JIVE), or external
• WOLA: Hann, Hamming, Cosine, Rectangular
• VEX driven, with JSON configuration file
• Implemented using MPI
• Optionally uses commercial Intel IPP library
• Pulsar binning
• Incoherent or coherent de-dispersion
• Mixed bandwidth correlation / zoom bands
• Phased array mode
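The four WOLA window shapes listed above can be sketched with their textbook definitions. This is only an illustration: the function name `window` and the symmetric (n - 1) normalisation are assumptions, not taken from the SFXC source.

```python
import math


def window(name, n):
    """Return n coefficients of a textbook window (assumed form, not SFXC's code)."""
    if n < 2:
        raise ValueError("need at least 2 points")
    coeffs = []
    for k in range(n):
        x = k / (n - 1)  # position within the segment, from 0 to 1
        if name == "rectangular":
            w = 1.0
        elif name == "hann":
            w = 0.5 - 0.5 * math.cos(2 * math.pi * x)
        elif name == "hamming":
            w = 0.54 - 0.46 * math.cos(2 * math.pi * x)
        elif name == "cosine":
            w = math.sin(math.pi * x)
        else:
            raise ValueError("unknown window: " + name)
        coeffs.append(w)
    return coeffs
```

Each FFT segment would be multiplied element-wise by such a window before transforming; the tapered windows trade spectral resolution for lower sidelobes compared with the rectangular one.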
Paper & Source Code
A. Keimpema, M.M. Kettenis et al., The SFXC software correlator for very long baseline interferometry: algorithms and implementation, Experimental Astronomy, Volume 39, Issue 2, pp. 259-279
arXiv:1502.00467
Source code available through nightly mirror of svn repository. See: http://www.jive.nl/jivewiki/doku.php?id=sfxc
SFXC Algorithm
[Block diagram with time-domain and frequency-domain stages. Block labels: Integer Delay, Fractional Delay, Fringe Rotation, WOLA, FFT, FFT, FFT⁻¹, Cross Multiplication, Accumulation & Spectral Averaging]
Sergei Pogrebenko
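The core "FX" idea (Fourier transform each station's segments, then cross multiply and accumulate) can be sketched in a few lines. This is a toy illustration, not SFXC's implementation: it omits delay correction, fringe rotation and windowing, uses a naive DFT for self-containment, and all function names are hypothetical.

```python
import cmath
import random


def dft(x, sign=-1):
    """Naive O(n^2) DFT; sign=+1 gives the unnormalised inverse transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fx_correlate(a, b, nfft):
    """F step: transform each segment. X step: multiply by the conjugate and accumulate."""
    nseg = min(len(a), len(b)) // nfft
    if nseg == 0:
        raise ValueError("signals shorter than one FFT segment")
    acc = [0j] * nfft
    for s in range(nseg):
        spec_a = dft(a[s * nfft:(s + 1) * nfft])
        spec_b = dft(b[s * nfft:(s + 1) * nfft])
        for k in range(nfft):
            acc[k] += spec_a[k] * spec_b[k].conjugate()
    return [v / nseg for v in acc]  # averaged cross spectrum


def peak_lag(cross_spectrum):
    """Lag (in samples) of the strongest peak of the circular cross-correlation."""
    lags = dft(cross_spectrum, sign=+1)
    n = len(lags)
    k = max(range(n), key=lambda i: abs(lags[i]))
    return k if k <= n // 2 else k - n


def make_delayed_pair(n, delay, seed=1):
    """Return noise a and a copy b delayed by `delay` samples, so b[t] == a[t - delay]."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(n + delay)]
    return base[delay:], base[:n]
```

With this sign convention, a copy `b` that lags `a` by d samples produces a correlation peak at lag -d. In a real correlator the delay model removes almost all of this offset before the cross multiplication.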
Algorithm Scaling
Unpacking ~ Ns Δν
Delay correction ~ Ns Δν
Fringe Rotation ~ Ns Δν
FFT ~ Ns log Nf Δν
X ~ Ns² Δν
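To see how these relations play out as the array grows, here is a small sketch that evaluates them with every proportionality constant set to 1. Those constants, the base-2 logarithm, and the reading of Nf as the FFT size are assumptions, so only the trend is meaningful, not the absolute numbers.

```python
import math


def relative_cost(n_stations, n_fft, bandwidth=1.0):
    """Unit-free cost terms following the scaling relations (all constants assumed 1)."""
    per_station = n_stations * bandwidth
    return {
        "unpacking": per_station,
        "delay_correction": per_station,
        "fringe_rotation": per_station,
        "fft": per_station * math.log2(n_fft),
        "cross_multiplication": n_stations ** 2 * bandwidth,
    }


def cross_fraction(n_stations, n_fft):
    """Share of the total cost spent in the cross multiplication (X) step."""
    costs = relative_cost(n_stations, n_fft)
    return costs["cross_multiplication"] / sum(costs.values())
```

Because only the X term is quadratic in the number of stations, its share of the total grows with Ns while the per-station steps stay linear.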
Building up the cluster
Year   Nodes  CPUs             SIMD     Clock speed (GHz)  Cores  Stations @1Gb/s
2010   16     Xeon E5520       SSE 4.2  2.3                128    ?
2011   16     Xeon E5620       SSE 4.2  2.4                128    7
2012   8      Xeon E5-2670     AVX      2.6                128    13
2015   4      Xeon E5-2630v3   AVX2     2.4                64     16?
2016   4      Xeon E5-2630v3   AVX2     2.4                64     19?
Total  48                                                  512
NEXPReS: 4Gb/s e-VLBI
[Diagram labels: stations Ef, Hh, On, Ys; DBBC, FILA10G; Mark5B, VDIF; “Harrobox”, jive5ab; four 128 MHz bands; four SFXC instances]
Distributed Correlation
NEXPReS Demo (June 2013)
• Two sites: PSNC and JIVE
• Simulated data from 4 stations:
  • 512 Mbit/s
  • Using jive5ab as corner turner
  • VDIF/VTP for data transport
Use dedicated bridge nodes for real-time data input
e-EVN at 2Gb/s
• DBBC2 + FILA10G
• With dbbcproxy to allow the correlator to control the data flow
• FILA10G corner turning
• 8k packets vs. 2k packets
• Toss-up between per-packet overhead and corner turning
• Network performance: bps vs pps
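The bps-versus-pps trade-off can be made concrete with a little arithmetic. This sketch assumes "8k" and "2k" mean payloads of 8192 and 2048 bytes; the function names and the per-packet overhead parameter are illustrative placeholders, not measured values.

```python
def packets_per_second(bit_rate, payload_bytes):
    """Packets per second needed to carry bit_rate (bits/s) of payload."""
    return bit_rate / (8 * payload_bytes)


def wire_efficiency(payload_bytes, overhead_bytes):
    """Fraction of transmitted bytes that is payload, for a given per-packet overhead."""
    return payload_bytes / (payload_bytes + overhead_bytes)


# At 2 Gb/s, the smaller packets mean four times as many packets per second.
pps_8k = packets_per_second(2e9, 8192)
pps_2k = packets_per_second(2e9, 2048)
```

For the same bit rate, 2k packets quadruple the packet rate relative to 8k packets, and for any fixed header size they also carry proportionally more overhead.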
Internal Network Bottleneck
[Network diagram: HP 8212zl and HP 5412zl switches and three cluster nodes, with links labelled 10 Gb/s and 1 Gb/s]
Multiple Phase Centers
Radcliffe et al.
699 sources in GOODS-N
Two areas:
• 15’ central area
• 20’ outer annulus
Multi-source Self Calibration
arXiv:1601.04452
8k spectral channels for 4% smearing
Multiple Phase Centers
• Decorrelation 2.5% at 4 arcmin with 10000 km baseline
• Subint 50 ms
• FFT size 2048
Accelerators
Romein, 2016
[Figure with five panels: (a) FIR filter, (b) FFT, (c) delay/bandpass/transpose, (d) correlate, (e) full pipeline. x-axis: #receivers (128 to 768); y-axis: GFLOPS/W. Devices compared: Xeon Phi 7120X, 2x Xeon E5-2697v3, GTX 1080, Titan X, Tesla K40, FirePro S10000, Fury X, DSP (float), DSP (int)]
Fig. 11: Energy efficiency of the individual kernels, as well as the full pipeline.
matching [13]. These papers compare performance, but none of them assesses energy efficiency.
VIII. CONCLUSIONS AND FUTURE WORK
The take-away message of this study is that GPUs are the most (energy) efficient accelerators, at least for a radio-astronomical correlator pipeline, and that the differences in performance and energy efficiency between the accelerators are large. It is difficult to obtain good memory performance on the Xeon Phi. The DSP has architectural features that support signal processing (like circular addressing, loop pipelining, some complex instructions, and multi-level DMA controllers), but these features make it difficult to program and do not make it more energy efficient than GPUs. The custom-built power sensors that we introduced in Sec. VI are highly useful to analyze energy efficiency, at millisecond timescales.
Future work includes a comparison with FPGAs. In a separate work, we study how to efficiently create sky images from correlated data on these accelerators.
ACKNOWLEDGMENTS
This work is supported by the NWO through Open Competition (Triple-A) and NWO-M grants (DAS-4, DAS-5 [14]), and by the Dutch Ministry of EZ and the province of Drenthe through the Dome grant. We thank AMD, Intel, and NVIDIA for their support and for providing us with hardware.
REFERENCES
[1] J. Romein, P. Broekema, J. Mol, and R. van Nieuwpoort, “The LOFAR Correlator: Implementation and Performance Analysis,” in PPoPP’10, Bangalore, India, January 2010, pp. 169–178.
[2] S. Williams, A. Waterman, and D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, April 2009.
[3] J. Fang, H. Sips, L. Zhang, C. Xu, Y. Che, and A. Varbanescu, “Test-Driving Intel Xeon Phi,” in ICPE’14, Dublin, Ireland, March 2014, pp. 137–148.
[4] J. Laros III, P. Pokorny, and D. DeBonis, “PowerInsight – A Commodity Power Measurement Capability,” in Int. Workshop on Power Measurement and Profiling, Arlington Va, June 2013.
[5] J. Treibig, G. Hager, and G. Wellein, “LIKWID: A Lightweight Performance-oriented Tool Suite for x86 Multicore Environments,” in ICPPW’10, San Diego, CA, September 2010, pp. 207–216.
[6] D. Price, M. Clark, B. Barsdell, R. Babich, and L. Greenhill, “Optimizing performance-per-watt on GPUs in high performance computing,” Computer Science – Research and Development, pp. 1–9, Sept. 2015.
[7] R. van Nieuwpoort and J. Romein, “Correlating Radio Astronomy Signals with Many-Core Hardware,” Int. Journal of Parallel Processing, vol. 39, no. 1, pp. 88–114, February 2011.
[8] R. van Nieuwpoort and J. Romein, “Building Correlators with Many-Core Hardware,” IEEE Signal Processing Magazine, vol. 27, no. 2, pp. 108–117, March 2010.
[9] M. Clark, P. Plante, and L. Greenhill, “Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units,” Int. Jour. of High Performance Computing Applications, vol. 27, no. 2, pp. 178–192, 2013.
[10] L. Fiorin, E. Vermij, J. Lunteren, R. Jongerius, and C. Hagleitner, “An Energy-Efficient Custom Architecture for the SKA1-low Central Signal Processor,” in CF’15, Ischia, Italy, May 2015, pp. 5:1–5:8.
[11] A. Sclocco, H. Bal, J. Hessels, J. Leeuwen, and R. van Nieuwpoort, “Auto-Tuning Dedispersion for Many-Core Accelerators,” in IPDPS’14, Phoenix, AZ, May 2014, pp. 952–961.
[12] G. Teodoro, T. Kurc, J. Kong, L. Cooper, and J. Saltz, “Comparative Performance Analysis of Intel (R) Xeon Phi (TM), GPU, and CPU: A Case Study from Microscopy Image Analysis,” in IPDPS’14, Phoenix, AZ, May 2014, pp. 1063–1072.
[13] T. Tran, Y. Liu, and B. Schmidt, “Bit-parallel Approximate Pattern Matching: Kepler GPU versus Xeon Phi,” Parallel Computing, vol. 54, pp. 128–138, November 2015.
[14] H. Bal et al., “A Medium-Scale Distributed System for Computer Science Research: Infrastructure for the Long Term,” IEEE Computer, vol. 49, no. 5, pp. 54–63, May 2016.
Conclusions
• Division into sub-bands helps scaling our correlators
• Coordination of sample rates is still important
• SIMD improvements mean correlators still benefit from “Moore’s Law”
• Accelerators (GPUs, Xeon Phi) may help if the number of stations increases