Signal Processing on GPUs for Radio Telescopes | GTC...

March 18-21, 2013March 18-21, 2013 1GTC'13GTC'13

Signal Processing on GPUsfor Radio Telescopes

John W. RomeinJohn W. Romein

Netherlands Institute for Radio Astronomy (ASTRON)Dwingeloo, the Netherlands

Overview

radio telescopes motivation processing pipelines signal-processing algorithms

filter, correlator, beam forming, etc. performance

The LOFAR Radio Telescope

largest low-frequency telescope

no dishes

distributed sensor network

~85.000 receivers

LOFAR: A Software Telescope

all-sky view antennas ➜ new science opportunities different observation modes require flexibility

digitally steered concurrent observations supercomputer real time

standard imaging

pulsar survey

known pulsar

epoch of re-ionization

transients

ultra-high energy particles

LOFAR SuperTerp

What's Next? The Square Kilometre Array

world-wide effort unprecedented size

3,000 dishes + aperture array South Africa + Australia

2016‒2019: 10% SKA 2020‒2023: 100% SKA

exascale computing; petascale I/O O(104)‒O(105) > LOFAR

many challenges!

Our Efforts

build/maintain/operate current telescopes (LOFAR, WSRT) research for SKA

Motivation

1)bring GPU technology to LOFAR

2)do accelerator research for the SKA

ASTRON/IBM Dome

P1 algorithms & machines

P2 nano-photonics

P3 access patterns

P4 microservers

P5 accelerators

P6 compressed sampling

P7 real-time communication

research technologies to develop the SKA green computing nano-photonics data & streaming

Accelerator Research

accelerators for (radio-astronomical) signal processing GPUs, Xeon Phi, BG/Q, FPGAs, ...

fundamental understanding of accelerators which properties make architecture (in)efficient? I/O-compute balance? energy efficiency? programmability? architecture-(in)dependent optimizations? devise new algorithms? generic approach to program accelerators, or ad-hoc?

applications:

1) LOFAR2) SKA

LOFAR Data Processing

developing new GPU-based system

Correlator Processing Overview

More Detail

complex piece of software! several pipelines

Implement GPU Kernels

killing two birds

1) develop new LOFAR correlator2) code base for accelerator research

Why We Use OpenCL

OpenCL advantages disadvantagesvendor independent poor library support (e.g., FFT)GPU: runtime compilation cannot use all GPU features (e.g., GPUdirect)GPU: float8, float16, swizzlingCPU: nice C++ interface (exceptions)

float2 input_samples[NR_STATIONS][NR_CHANNELS][NR_TIMES][NR_POLARIZATIONS];

float16 sums, weights;…sums += weights.s3456789ABCDEF012 * sample;

Why We Are Disappointed

OpenCL advantages disadvantagesvendor independent poor library support (e.g., FFT)GPU: runtime compilation cannot use all GPU features (e.g., GPUdirect)GPU: float8, float16, swizzlingCPU: nice C++ interface (exceptions)

float2 input_samples[NR_STATIONS][NR_CHANNELS][NR_TIMES][NR_POLARIZATIONS];

float16 sums, weights;…sums += weights.s3456789ABCDEF012 * sample;❑ NVIDIA dropped OpenCL support in

development tools

❑ visual profiler

NVIDIA Tesla K10 AMD FirePro S10000 Intel Xeon Phi

3072 3584 976 FPUs

4.58 5.91 2.14 SP TFLOPS

320 (ECC off) 480 352 GB/s

225 375 300 W

FirePro S10000:

driver/VBIOS/hardware (?) issues cannot always overlap compute‒PCIe I/O

Xeon Phi:

immature results

From Feasibility Study to Production Software

concentrated mostly on GPU kernels CPU parts: now/later

Implemented Kernels

most important: filter correlator beam former

Processing Pipelines

imaging mode

sky images pulsar modes

in development UHEP mode

detect Ultra-High Energy Particles experimental

imaging mode

Imaging (Correlator) Pipeline

four GPU kernels

Poly-Phase Filter (PPF) Bank

splits frequency band into channels like prism

trades time for frequency resolution

Poly-Phase Filter (PPF) Bank

FIR filter + FFT

1) Finite Impulse Response (FIR) Filter

history & weights (in registers)

no physical shift reorder output for next kernel

uncoalesced writes many FMAs

operational intensity = 6.4 FLOPs / byte

FIR Filter Performance

0 20 40 60 800

2 FirePro S10000Tesla K10

#stations

0 20 40 60 800

FirePro S10000Tesla K10

#stations

K10: drop @41 stations ➜ small #threads

2) FFT

1D complex ➜ complex typically: 64‒256 tweaked “Apple” FFT library

2) FFT Performance

0 20 40 60 800

#stations

0 20 40 60 800

#stations

memory I/O bound

combined kernel: clock correction delay compensation bandpass correction transpose

reduces #device memory accesses

3) Phase & BandPass Correction

3a) Clock Correction

stations: shared clock corrects cable length errors merge with next step (phase delay)

3b) Delay Compensation (a.k.a. Tracking)

track observed source delay telescope data

delay varies due to earth rotation shift samples remainder: rotate phase (= cmul)

0.75 FLOPs / byte

3c) BandPass Correction

0 64 128 192 2560.0

frequency channel

powers in channels unequal

artifact from station processing multiply by channel-dependent weight

0.125 FLOPs / byte

Phase & BandPass Correction

0 20 40 60 800

#stations

0 20 40 60 800

#stations

☹ 0.75 FLOPs / byte memory I/O bound

4) Correlator Kernel

multiply samples from each station pair

integrate ~1s

4) Correlation Triangle

multiply samples from each station pair

integrate ~1s

4) Correlator Implementation

global memory ➜ local memory 1 thread per station pair (dual pol)

1 FLOP / byte (local mem)

4) Optimized Correlator Implementation

increase register reuse 1 thread: 2x2 stations (dual pol)

2 FLOPs / byte (local mem) also: 3x3, 4x4

Correlator Register Usage

3x3 on K10: excessive spilling

max. registers/thread

Tesla K10 63

FirePro S10000 256

#accumulator registers

2x2 32

3x3 72

Correlator #Threads

cannot use all threads in last warp too few threads ➜ low occupancy too many threads ➜ multiple passes

multiple thread blocks

0 20 40 60 800

#stations

1x12x23x3

max #threads

Tesla K10 1,024

FirePro S10000 256

1x1 vs. 2x2 and 3x3 (FirePro S10000)

0 20 40 60 800

2 1x12x23x3

#stations

0 20 40 60 800

400 1x12x23x3

#stations

use whichever performs best

Correlator Performance

0 20 40 60 800

#stations

0 20 40 60 800

#stations

compute bound S10000: multiple passes

Combined Pipeline

#pragma omp parallel num_threads(2 * nrGPUs) { cl::CommandQueue queue(...); cl::Buffer input(...), output(...), ...; #pragma omp for schedule(dynamic) for (int subband = 0; subband < nrSubbands; subband ++) { receive(input, subband); queue.enqueueWriteBuffer(input, ...); queue.enqueueNDRangeKernel(FIR_filter, ...); queue.enqueueNDRangeKernel(FFT, ...); queue.enqueueNDRangeKernel(delay_bandpass, ...); queue.enqueueNDRangeKernel(correlator, ...); queue.enqueueReadBuffer(output, ...); send(output, subband); } }

2 queues/GPU: overlap I/O & computations

1 host thread/queue: easy model

Correlator PipelineTesla K10 Performance Breakdown

1 10 20 30 40 50 60 70 800

10?CorrelatePhase & BandPass CorrectionFFTFIR filter

#stations

most challenging observation mode 8-bit samples

488 subbands (95.3 MHz)

correlations most expensive

need ~11 Tesla K10s

Large #Stations

➜ different approach

#stations #correlations #threads #thread blocks (K10)

LOFAR ~80 ~12,960 ~820 1

SKA ≤3,000 ≤18,000,000 ≤1,125,000 ≤1,000

LOFAR SKA

Large #Stations: Another Strategy

32x32 stations(16x16 threads)

2x2 stations (dual pol)1 thread

192x192 stations(squares + rectangles)

read samples 32+32 stations

☺ 32 FLOPs / byte

Large #Stations: Correlator Performance

0 128 256 384 512 640 768 8960

#stations

0 128 256 384 512 640 768 8960

#stations

to do: Tesla K20X possibly faster

Pulsar Pipelines

1)search unknown pulsars

2)observe known pulsar many narrow beams

credit: Jason Hessels

Pulsar Pipelines

in development

Coherent Beam Forming

CV beam=∑stat

W beam ,stat∗Sstat

add stations

> sensitivity weighed

compensate Δt change phase different Δt ➜ different beam

many beams from same input

Coherent Beam Forming Implementation

CV beam=∑stat

W beam ,stat∗Sstat

global memory ➜ local memory each thread:

1 beam, 1 polarization station-dependent weights in registers

2 passes of 24 stations

☺ 48 stations, 128 beams ➜ 14.2 FLOPs / byte

Coherent Beam Forming Performance

0 32 64 96 1280

2.5 FirePro S10000Tesla K10

#beams

0 32 64 96 1280

#beams

48 stations sawtooth caused by unused threads

Ultra-High Energy Particle (UHEP) Pipeline

UHEP Pipeline

store raw antenna voltages in 1.3s circular buffer

create ~50 beams on moon

detects 1015‒1020.5 eV particle collisions

trigger in one/few adjacent beams: freeze/dump captured antenna voltages within 1.3s ???

highly experimental pipeline

Image Courtesy L. Viatour

UHEP Pipeline

create ~50 beams on moon inverse filter for high time resolution trigger

peak detection anti-coincidence check

freeze raw antenna data buffer max 1.3s latency!

UHEP Pipeline

UHEP Performance Breakdown

48 stations, 64 beams

Beam Forming Transpose Inverse FIR Inverse FFT Trigger0

3 Tesla K10FirePro S10000

Beam Forming Transpose Inverse FIR Inverse FFT Trigger0

400Tesla K10FirePro S10000

UHEP Performance Breakdown

3.93 ms data, 48 stations, 488 subbands

most time spent in beam former

I/O overlap

latency ≪ 100 ms

Tesla K10 FirePro S100000

Input (Beam Former Weights)Input (Samples)?TriggerInv. FFTInv. FIRTransposeBeam Forming

A Wild Idea

OpenCL on Top of CUDA Driver API

fool CUDA RTS

CPU: implement our own “platform” (ICD)

OpenCL library calls ➜ CUDA Driver API Calls limited subset (proof of concept)

GPU: use OpenCL ➜ PTX compiler (clc/clang/llvm)

☺ efficient

☹ does not support full language can use:

☺ visual profiler

☺ cuFFT, GPUdirect

OpenCL on Top of CUDA Driver API

Future Work

To Do (LOFAR Correlator)

GPU kernels: dedispersion, flagging pulsar pipelines CPU code:

work distribution network reordering (FDR IB)

>240 Gb/s optimizations monitoring & control ...

To Do (SKA Research)

Xeon Phi OpenCL on FPGA energy efficiency fully understand all results

Conclusions

many signal-processing GPU kernels new LOFAR correlator research for SKA

OpenCL: vendor independent, elegant, but poor support NVIDIA FirePro S10000 faster then Tesla K10, but immature driver high efficiency on most important kernels

Thanks

support: Intel, NVIDIA grants from Dutch national & province governments LOFAR GPU Correlator team: Alexander van Amesfoort, Wouter Klijn, Marcel

Loose, Jan David Mol

Signal Processing on GPUs for Radio Telescopes | GTC...

Documents

Stephen White Gyroresonance emission in FORWARD & Developments in radio telescopes

Radio telescopes in the Netherlands: Advanced RF ......Radio telescopes in the Netherlands: Advanced RF technologies to observe miniscule celestial radiowaves amidst DAB, DVB and 4G

Radio frequency interference mitigation for software telescopes.rob/masters-theses/Tomasz... · 2014-02-11 · make the telescopes more sensitive. New telescopes use thousands of

Evolution of radio telescopes (Braun 1996)

Radio Telescopes - docs.hisparc.nl · We need special telescopes, radio telescopes to ‘see’ them. With these telescopes we can also see the cosmic background radiation (CBR) left

- Radio Telescopes - - NTNUweb.phys.ntnu.no/~stovneng/FY1303/prosjekt/radio_astronomy.pdf · Radio Astronomy - Radio Telescopes - Kristin Borck Andrea Klubicka ... perfect blackbodies,

Telescopes - Wilfrid Laurier University · Radio Telescopes •A radio telescope is like a giant mirror that reflects radio waves to a focus. 24 IR & UV Telescopes •Infrared and

Lectures on radio astronomy: 1leeuwen/course/Radio... · 1. Introduction to radio astronomy; history, examples of research 2. Single element radio telescopes; description and basic

Detailed Notes - Section 09 Astrophysics - AQA Physics A-level · 3.9.1.3 - Single dish radio telescopes, I-R, U-V and X-ray telescopes Telescopes can be set up to detect varying

Single Dish Radio Telescopes - NCRA website

Radio Astronomy and radio telescopes

8 Radio, EUV and X-ray telescopes

Non-Optical Telescopes Astrophysics Lesson 6. Learning Objectives Single dish radio telescopes, I-R, U-V and X-ray telescopes Similarities and differences

Examples of cyclostationary approaches for phased array radio telescopes

Radio Telescopes and Pulsar Measurements · Radio Telescopes and Pulsar Measurements Astrophysical Studies from Radio- to Gamma-Ray Energies Agnieszka Wozna Garching, 08.11.2002

Pulsars: The radio/gamma-ray Connection Prospects for pulsar studies with AGILE and GLAST Synergy with radio telescopes –Timing and follow-up –Radio vs

Radio Telescopes

techical report-Radio telescopes

Panel Options for Large Precision Radio Telescopes ver N · that most large radio and submillimeter telescopes are carefully designed and surface errors that are close to natural

MUELLER MATRIX PARAMETERS FOR RADIO TELESCOPES AND …cds.cern.ch/record/509701/files/0107352.pdf · MUELLER MATRIX PARAMETERS FOR RADIO TELESCOPES AND THEIR OBSERVATIONAL DETERMINATION