
WP2‐040.130.010‐TD‐001

Revision : 1

2011‐04‐01 Page 1 of 14

Submitted by: A. Karastergiou, UOXF, 2011‐03‐30

Approved by: W. Turner, Signal Processing Domain Specialist, SPDO, 2011‐04‐01

SKA NON IMAGING PROCESSING CONCEPT DESCRIPTION: GPU PROCESSING FOR REAL‐TIME ISOLATED RADIO PULSE DETECTION

Document number: WP2‐040.130.010‐TD‐001

Revision: 1

Author: Aris Karastergiou

Date: 2011‐04‐01

Status: Approved for release


DOCUMENT HISTORY

Revision: A; Date Of Issue: ‐; Engineering Change Number: ‐; Comments: First draft release for internal review

DOCUMENT SOFTWARE

Package Version Filename

Wordprocessor MsWord Word 2003 03j1‐wp2‐040 130 010‐td‐001‐1‐nonimaging‐concept‐description‐2003

Block diagrams

Other

ORGANISATION DETAILS

Name SKA Program Development Office

Physical/Postal Address

Jodrell Bank Centre for Astrophysics

Alan Turing Building

The University of Manchester

Oxford Road

Manchester, UK

M13 9PL

Fax. +44 (0)161 275 4049

Website www.skatelescope.org


TABLE OF CONTENTS

1 INTRODUCTION

1.1 Purpose of the document

2 REFERENCES

3 BACKGROUND

4 THE PROTOTYPE

4.1 Hardware description

4.1.1 NVIDIA Tesla S1070

4.1.2 NVIDIA Fermi M2050 and GeForce GTX

4.2 Software description

4.2.1 GPU modules

4.2.2 Beyond the standard dedispersion algorithms

5 TESTING IN A REAL ENVIRONMENT

6 SUMMARY OF COSTS AND THE FUTURE


LIST OF FIGURES

Figure 1. Schematic diagram of the LOFAR ILS GPU backend. Components in orange make up the backend. Details of the ILS on the right.

Figure 2. A very bright, dispersed IRP from pulsar B0329+54 at 150 MHz from the UK ILS.

Figure 3. The result of the dedispersion module. Intensity, proportional to the radii of the circles, is plotted versus time (x) and DM (y). Pulses from a real pulsar B1508+55 are detected at a DM of ~20. RFI is also seen at DM 0. Events of S/N >5 are shown. This illustrates a real‐time detection of IRPs.



LIST OF ABBREVIATIONS

AA: Aperture Array
Ant.: Antenna
CoDR: Conceptual Design Review
DRM: Design Reference Mission
EoR: Epoch of Reionisation
EX: Example
DM: Dispersion measure
FLOPS: Floating Point Operations per second
FoV: Field of View
GPU: general purpose Graphics Processing Unit
ILS: International LOFAR station
IRP: Isolated Radio Pulses
Ny: Nyquist
Ov: Over sampling
PAF: Phased Array Feed
PrepSKA: Preparatory Phase for the SKA
RFI: Radio Frequency Interference
rms: root mean square
SKA: Square Kilometre Array
SKADS: SKA Design Studies
SPDO: SKA Program Development Office
SSFoM: Survey Speed Figure of Merit
TBD: To be decided
Wrt: with respect to


1 Introduction

1.1 Purpose of the document

The purposes of this document are as follows:

1. Provide a description of a working prototype of a GPU‐based backend for real‐time, rapid‐response time domain radio astronomy and searches for Isolated Radio Pulses (IRPs).

2. Describe the design of a real scientific experiment with the prototype to demonstrate its

advantages and shortcomings.

3. Consider a pathway of application from current pathfinders to SKA‐I and SKA‐II.

2 References

[1] SKA Science Case

[2] The Square Kilometre Array Design Reference Mission: SKA‐mid and SKA‐Lo v 0.4

[3] Science Operations Plan

[4] System Interfaces

[5] Environmental requirements (natural and induced)

[6] SKA strategies and philosophies

[7] Risk Register

[8] Requirements Traceability

[9] Logistic Engineering Management Plan (LEMP)

[10] Risk Management Plan (RMP)

[11] Document Handling Procedure

[12] Project Dictionary

[13] Strategy to proceed to the next phase

[14] WP3 SKA array configuration report

[15] WP3 SKA site RFI environment report

[16] WP3 Troposphere measurement campaign report

[17] SKA Science‐Technology Trade‐off Process (WP2‐005.010.030‐MP‐004)

[18] A. Faulkner, et al., Aperture Arrays for the SKA: the SKADS White Paper, January 2010.

[19] E. de Lera‐Acedo et al., System Noise Analysis of an Ultra Wide Band Aperture Array: SKADS

Memo T28.

[20] SKA Monitoring and Control Strategy WP2‐005.065.000‐R‐001 Issue Draft E

[21] “The Square Kilometre Array”, Peter E. Dewdney, Peter J. Hall, Richard T. Schilizzi, and T.

Joseph L. W. Lazio, Proceedings of the IEEE Vol. 97, No. 8, August 2009

[22] Thompson, A. R., Moran, J. M., and Swenson, G. W. “Interferometry and Aperture Synthesis

in Radio Astronomy” (second edition), Wiley, 1986.

[23] System Engineering Management Plan (SEMP) WP2‐005.010.030‐MP‐001

[24] SKA System Requirement Specification (SRS)

[25] SKA IP Policy Document

[26] International Technology Roadmap for Semiconductors (ITRS), available at www.itrs.net.


3 Background

Pulsar science is one of the two areas that SKA‐I will concentrate on, and the instrument is being optimised for extensive searching and timing campaigns. In recent years, the discovery of Rotating Radio Transients (McLaughlin et al. 2006) and intermittent pulsars (Kramer et al. 2006) has highlighted the fact that some of the most interesting radio pulsars are not regular emitters, which has an impact on the design of search strategies. All short‐duration pulses suffer dispersion and scattering due to propagation in the interstellar medium (ISM). For sporadic emitters, however, a periodicity search is not appropriate and other techniques need to be applied. In addition, there are specific advantages to detecting such "isolated radio pulses" (IRPs) as close to real time as possible, so that immediate triggered follow‐ups can extract maximal information about the nature of the emitters. Apart from opening up the opportunity to detect extremely intermittent pulsars, these techniques will also open a window of discovery on all short‐duration radio bursts of astrophysical origin.

Real‐time processing imposes particular requirements, which need to be satisfied in a pragmatic design. Between 2009 and 2011, we put together a working prototype of a real‐time processing backend for blind searches of IRPs, using international LOFAR stations (ILSs) as test‐beds. This choice was made on scientific grounds, as ILSs score highly on the combination of raw sensitivity, high time resolution and large field of view, allowing some optimism for successful early surveys. ILSs also provide a real‐world working environment for testing hardware and software.

The backend described here uses the high performance of general purpose graphics processing units (GPUs). We have combined this multi‐core architecture with multi‐threaded code in order to achieve the operation counts necessary for real‐time processing of our particular application in the ILS environment. Multiple tests of the backend had been conducted by early 2011, and a test survey is planned for later this year to demonstrate its advantages and shortcomings during a real scientific experiment. ILSs represent a current implementation of aperture array technology, so conclusions drawn from these experiments and tests can be projected reasonably well on to a path towards the SKA.

4 The prototype

4.1 Hardware description

The GPU backend we have developed for ILSs is matched to the datastreams coming out of the LOFAR hardware. These consist of beamformed, raw, complex, 16‐bit data. The details of the LOFAR datastreams are as follows: the analogue streams are sampled at 200 or 160 Msamples/s and channelized to 512 channels using a polyphase filter. The resulting raw complex subbands are 195.3 or 156.2 kHz wide, with a time resolution of 5.12 or 6.4 μs, depending on the clock rate. A total of 244 subbands are beamformed into anything up to 244 beams. The 244 beams, which correspond to 3.2 Gbps, are separated into 4 streams of UDP packets, each of ~800 Mbps. Each stream can be carried over 1 GbE, which is used throughout our backend.
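The stream parameters quoted above can be cross‐checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes that the 512 channels span the Nyquist bandwidth of the real‐sampled input, that each beam carries two polarisations, and that "16‐bit" applies to each of the real and imaginary parts; none of these is stated explicitly in this document. Under these assumptions the payload comes to about 3.05 Gbps in total, or about 760 Mbps per stream, slightly below the quoted 3.2 Gbps and ~800 Mbps, which presumably include packet overhead.

```python
def subband_parameters(sample_rate_hz, n_channels=512):
    """Return (subband width in Hz, time resolution in s).

    Assumes the channels evenly divide the Nyquist bandwidth of a
    real-sampled stream and that the subbands are critically sampled.
    """
    nyquist_bandwidth_hz = sample_rate_hz / 2.0
    width_hz = nyquist_bandwidth_hz / n_channels
    return width_hz, 1.0 / width_hz


def payload_rate_bps(n_subbands, width_hz, n_pol=2, bits_per_component=16):
    """Payload bit rate; n_pol and bits_per_component are assumptions."""
    # Each complex sample has a real and an imaginary component.
    return n_subbands * width_hz * n_pol * 2 * bits_per_component


for clock_hz in (200e6, 160e6):
    width, dt = subband_parameters(clock_hz)
    print(f"{clock_hz / 1e6:.0f} MHz clock: "
          f"{width / 1e3:.2f} kHz subbands, {dt * 1e6:.2f} us resolution")

width_200, dt_200 = subband_parameters(200e6)
total_bps = payload_rate_bps(244, width_200)      # all 244 subbands
stream_bps = payload_rate_bps(61, width_200)      # one of the 4 streams
stream_sky_bandwidth_hz = 61 * width_200          # about 12 MHz
print(f"total {total_bps / 1e9:.2f} Gbps, per stream {stream_bps / 1e6:.1f} Mbps, "
      f"{stream_sky_bandwidth_hz / 1e6:.1f} MHz of sky bandwidth per stream")
```

The per‐stream sky bandwidth of about 12 MHz agrees with the figure used in the cost estimates later in this document.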

Figure 1 shows a schematic diagram of the hardware. The LOFAR components are shown in light blue and the components of our backend in orange. In particular, the processing units are dual‐socket servers with 6‐core Intel Xeon 5650 CPUs (2.66 GHz clock, 32 nm lithography and 95 W power rating).


Figure 1. Schematic diagram of the LOFAR ILS GPU backend. Components in orange make up the backend.

Details of the ILS on the right.

The GPU units are used for real‐time dispersion searches for IRPs. We have built and tested a machine based on NVIDIA Tesla S1070 blades, and we are in the process of building and testing a machine based on NVIDIA Fermi M2050 cards. We are also planning to test NVIDIA GeForce GTX cards, which are not server grade but offer very high processing power for the money. Currently, a single unit of the CPU‐GPU backend can search in real time for millisecond dispersed transients at Δν/ν~0.1, given a 1 GbE stream of data, at a cost of about 9 kEuros for machines with server‐grade GPUs and 4 kEuros for non‐server grade. In the LOFAR case, the 800 Mbps streams of 16‐bit complex samples translate to 12 MHz of sky bandwidth at 150 MHz. This backend can process several thousand dispersion measures (DMs) in real time, depending on the dedispersion technique.

4.1.1 NVIDIA Tesla S1070

The first thing to note is that production of these cards has now ceased. However, they still offer a very competitive platform for GPU computing on a 24/7 operations basis. According to NVIDIA, these

blades nominally offer up to four teraflops of computing performance in a 1U configuration. Each

blade is made up of 4 NVIDIA Tesla C1060 cards, with 240 cores and 4GB of on‐board memory per

card. The PCIe connection between the S1070 blades and the CPU servers has a nominal bandwidth

of up to 6.4 GB/s, which by far exceeds the LOFAR data rate being processed. The on‐board memory

is necessary for storing an array of filterbank data (intensity as function of frequency and time) in

order to process dispersed events which spread across time delays of many tens of seconds at

LOFAR frequencies (see section on dedispersion algorithms). Since searching for IRPs involves

dedispersion at many unknown DMs, and dedispersion at one DM is independent of the results of


dedispersion at other DMs, this process is entirely parallelisable and benefits from multi‐threaded architectures. The S1070s run ~1.3 GHz clocks and use approximately 700 W of power when running. The idle power for an S1070 does not drop below 200 W, so cooling requirements are non‐negligible. For a description of the C1060 cards upon which the S1070 blades are based, visit:

http://www.nvidia.com/docs/IO/43395/BD‐04111‐001_v05.pdf

4.1.2 NVIDIA Fermi M2050 and GeForce GTX

The current generation of supercomputing cards from NVIDIA is called Fermi, and these cards feature more cores and much faster double precision computing than the C1060 cards. In particular, the M2050 has 448 cores and 3 GB of on‐board memory which, although less than on the C1060 cards, is still sufficient to process several tens of seconds of ILS beamformed data. The speed‐up in double precision will make no significant difference to this application. However, the substantial increases in L1 and L2 cache memory are likely to provide substantial improvements over the C1060s. Early tests indicate at least a factor of 2 in performance improvement, which roughly cancels out the current difference in price.

At the other end of the pricing spectrum, NVIDIA's gaming cards have been steadily improving in clock speed, number of cores, on‐board memory and reliability. There are implementations of the GeForce GTX 580 with 3 GB of on‐board memory and 512 cores, at a fraction (~25%) of the cost of the M2050. We will be testing such cards as much as possible over the coming months. Currently, the GTX 580 nominally requires about 250 W of power during operation (closer to 350 W at full load) and about 150 W when idle. These values are similar to those of the M2050s, which share the same chips with the GTX 470 cards. It should be noted that, despite this, NVIDIA only provide single precision support on the GeForce range.

4.2 Software description

The software that runs on the backend consists of modular pipelines, based on the PELICAN framework

developed under PrepSKA. Care has been taken to ensure that each module of the pipeline can

operate at better than real‐time rates. The modules have been developed with the specific aim of

delivering the appropriate data to the GPU module for the dispersion search for IRPs. The framework

is responsible for inter‐modular communications, via TCP, and mechanisms for accessing and

processing the content of the data‐blocks that are passed through. It also contains the appropriate

mechanisms for reading in the UDP datastreams via a software server, which passes down TCP

streams of any size to a flexible number of clients. The clients are responsible for buffering,

processing and writing out to file at the end. The framework and modules are written in C++.

Currently, our pipeline contains the following modules:

‐ UDP data reader and TCP server

‐ Buffer and datablock generator

‐ Polyphase channeliser – 2^N channels per subband

‐ Stokes generator – conversion from complex data to power

‐ RFI clipper – removal of narrowband interference spikes from spectra

‐ Integrator – addition of 2^N time bins

‐ Dedisperser and dispersion search – GPU module including second buffer

‐ File writer – binary data output of chosen stream

The CPU modules are necessary to bring the data into the right form for the GPU processing.
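To make the role of two of these CPU modules concrete, the following minimal sketch shows a Stokes generator followed by an integrator. It is not the PELICAN module code, which is written in C++. The function names, the use of two polarisation streams and the choice to drop an incomplete trailing bin are illustrative assumptions.

```python
def stokes_i(x_pol, y_pol):
    """Total power per sample from two complex polarisation streams."""
    return [x.real ** 2 + x.imag ** 2 + y.real ** 2 + y.imag ** 2
            for x, y in zip(x_pol, y_pol)]


def integrate(power, n):
    """Add 2**n consecutive time bins.

    Trailing samples that do not fill a complete bin are dropped
    (an arbitrary choice made for this sketch).
    """
    factor = 2 ** n
    n_out = len(power) // factor
    return [sum(power[i * factor:(i + 1) * factor]) for i in range(n_out)]


# Synthetic input: constant voltages in both polarisations.
x = [complex(1.0, 1.0)] * 128
y = [complex(0.0, 2.0)] * 128
power = stokes_i(x, y)              # each sample has power 1 + 1 + 0 + 4
summed = integrate(power, 6)        # 2**6 = 64 input bins per output bin

# Integration by 64 takes 81.92 us samples to about 5.24 ms.
integrated_resolution_s = 81.92e-6 * 2 ** 6
print(summed, integrated_resolution_s)
```

Integration reduces the data rate into the GPU module by the same factor of 2^N, at the cost of time resolution.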

4.2.1 GPU modules

The need to develop GPU modules for IRP detection comes from the processing requirements for detecting an IRP of unknown dispersion measure, such as an irregular pulse from a new Rotating Radio Transient or a giant pulse from an as yet undiscovered pulsar.

Figure 2. A very bright, dispersed IRP from pulsar B0329+54 at 150 MHz from the UK ILS.

An example dispersed IRP observed with the CPU‐GPU backend at an ILS is shown in Figure 2. The data have been integrated in time by a factor of 64, from the original 81.92 μs to 5.24 ms. The principle of detecting a dispersed IRP of unknown DM relies on finding the appropriate curve in time‐frequency space along which to integrate over the given bandwidth in order to maximise the signal over the instrumental noise. The cold plasma dispersion law, which describes well the dispersion seen in radio pulses, states that the delay in time of arrival is proportional to the frequency to the power ‐2. The proportionality constant is set by the dispersion measure, which is the column density of free electrons along the line of sight of the observation.
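The dispersion law can be written as a short function. The sketch below uses the standard dispersion constant of about 4.149 × 10^3 s MHz^2 pc^-1 cm^3, which is not quoted in this document. The band edges of 144 and 156 MHz are assumed here to represent a 12 MHz band centred on 150 MHz.

```python
# Standard dispersion constant in s MHz^2 pc^-1 cm^3 (not quoted in this document).
K_DM = 4.148808e3


def dispersion_delay_s(dm, freq_mhz, ref_freq_mhz):
    """Arrival delay at freq_mhz relative to ref_freq_mhz.

    dm is in pc cm^-3. The delay is positive when freq_mhz is below
    ref_freq_mhz, i.e. lower frequencies arrive later.
    """
    return K_DM * dm * (freq_mhz ** -2 - ref_freq_mhz ** -2)


# Sweep across an assumed 144-156 MHz band for a few trial DMs.
for dm in (10.0, 20.0, 100.0):
    sweep = dispersion_delay_s(dm, 144.0, 156.0)
    print(f"DM {dm:5.1f} pc cm^-3: {sweep:.3f} s across the band")

delay_dm100 = dispersion_delay_s(100.0, 144.0, 156.0)
```

For a DM of 100 the sweep across this band is close to 3 s, which is many thousands of 81.92 μs samples and illustrates why the data must be buffered before dedispersion.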

Figure 2 shows a pulse that is bright and visible within individual frequency channels. Finding such a pulse is not difficult, and several techniques can be applied. For example, once a single high S/N point has been found, the next point can be found by searching the neighbouring points in time and frequency for further significant points, until a description of the curve can be built. However, the weakest IRPs that an instrument can detect will be well below the noise level in individual bins, and only integration in frequency will reveal them as significant. The problem then becomes to find the path


of integration that will maximise the S/N. In the specific case of the cold plasma dispersion law,

there is a single degree of freedom, the DM.

The first approach to solving this problem is to apply a brute force technique. This means transforming the incoming data from 3D data of intensity versus frequency and time, I(f,t), to 3D data of intensity versus DM and time, I(DM,t), by applying the appropriate time delay to each frequency channel for each DM to be searched within a given range, and integrating over the frequency dimension. This is approximately an N² algorithm in the number of frequency channels, since the cost per time sample is the number of channels multiplied by the number of DMs, and the latter is typically comparable to or larger than the former. For typical values of 512 frequency channels and 81.92 μs sampling time, the transformation alone requires sustained processing of ~30 GFLOPS to process 5000 DMs. This is the limit of the capabilities of the hardware tested here. The GPU code (CUDA kernel) written so far is a direct translation of the standard CPU dedispersion kernels. Although this provides a substantial improvement over the available CPU codes on similarly priced hardware, this kernel has large margins for optimisation. It takes advantage of the multiple cores on the GPU, but does not yet take advantage of operations the GPU hardware has been optimised for, such as 3D matrix rotation.
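The brute force transformation from I(f,t) to I(DM,t) can be sketched in a few lines. This is a plain Python illustration of the index shifting approach, not the CUDA kernel used on the backend. The function names, the nearest‐sample rounding of the delays and the synthetic four‐channel test data are all assumptions made for the example. The last function reproduces the operation count quoted above, counting one addition per channel, per DM and per time sample.

```python
K_DM = 4.148808e3  # standard dispersion constant, s MHz^2 pc^-1 cm^3


def channel_shifts(dm, freqs_mhz, dt_s):
    """Delay of each channel relative to the highest frequency, in samples."""
    f_ref = max(freqs_mhz)
    return [round(K_DM * dm * (f ** -2 - f_ref ** -2) / dt_s) for f in freqs_mhz]


def dedisperse(data, freqs_mhz, dt_s, dms):
    """Brute force dedispersion: data[channel][time] -> out[dm_index][time]."""
    n_time = len(data[0])
    out = []
    for dm in dms:
        shifts = channel_shifts(dm, freqs_mhz, dt_s)
        n_out = n_time - max(shifts)  # only samples present in every channel
        out.append([sum(data[c][t + shifts[c]] for c in range(len(data)))
                    for t in range(n_out)])
    return out


def additions_per_second(n_chan, n_dm, dt_s):
    """Sustained addition rate needed to keep up with real time."""
    return n_chan * n_dm / dt_s


# Synthetic test: a unit pulse dispersed at DM 10 across four channels.
freqs = [156.0, 152.0, 148.0, 144.0]
dt = 0.01
data = [[0] * 100 for _ in freqs]
for c, shift in enumerate(channel_shifts(10.0, freqs, dt)):
    data[c][5 + shift] = 1

trial_dms = [0.0, 10.0, 20.0]
result = dedisperse(data, freqs, dt, trial_dms)
peaks = [max(series) for series in result]
best_dm = trial_dms[peaks.index(max(peaks))]
rate = additions_per_second(512, 5000, 81.92e-6)
print(peaks, best_dm, f"{rate / 1e9:.1f} G additions/s")
```

Only the trial at the true DM aligns the pulse in all four channels, so its dedispersed time series peaks at the full summed intensity while the other trials stay at the single‐channel level.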

In order to reduce the computational demand, there exist at least two known algorithms which are commonly applied in CPU dedispersion, namely the Taylor tree algorithm and subband dedispersion (from the PRESTO package by S. Ransom). The tree algorithm avoids redundant sums and effectively reduces the computational load to N log N; however, it can only be applied where the relative bandwidth is sufficiently narrow for the dispersion delay to be approximated by a linear function of frequency. In subband dedispersion, the principle is to split the total bandwidth into subbands and perform a coarse dispersion search within each subband. Each coarse DM reduces the frequency channels in each subband to one. Then, to achieve fine DM gridding, a second stage of dedispersion is carried out for each coarse DM at a larger number of DMs. The number of channels involved at the second stage is equal to the number of subbands chosen. It can be shown that the approximation in the algorithm does not significantly affect the result, and the saving in computational effort can be significant, depending on the number of frequency channels.
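A rough operation count shows where the saving from subband dedispersion comes from. The sketch below counts additions per output time sample only and ignores memory traffic. The choice of 32 subbands and 250 coarse DMs is purely illustrative and is not taken from our implementation.

```python
def brute_force_adds(n_chan, n_dm):
    """Additions per output time sample: every channel for every DM."""
    return n_chan * n_dm


def subband_adds(n_chan, n_sub, n_coarse_dm, n_fine_dm):
    """Additions per output time sample for two-stage subband dedispersion."""
    # Stage 1: for each coarse DM, every channel is summed into its subband.
    stage1 = n_coarse_dm * n_chan
    # Stage 2: for each fine DM, the n_sub subband series are shifted and summed.
    stage2 = n_fine_dm * n_sub
    return stage1 + stage2


# 512 channels and 5000 DMs as in the text; 32 subbands and 250 coarse
# DMs are illustrative assumptions.
brute = brute_force_adds(512, 5000)
sub = subband_adds(512, 32, 250, 5000)
gain = brute / sub
print(brute, sub, f"gain ~{gain:.1f}x")
```

With these assumed values the two‐stage scheme needs roughly nine times fewer additions than the brute force search, and the gain grows with the number of frequency channels.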

Figure 3. The result of the dedispersion module. Intensity, proportional to the radii of the circles, is plotted versus time (x) and DM (y). Pulses from a real pulsar B1508+55 are detected at a DM of ~20. RFI is also seen at DM 0. Events of S/N >5 are shown. This illustrates a real‐time detection of IRPs.


It is beyond the scope of this document to provide details on the known dedispersion algorithms, which can be found elsewhere (e.g. the pulsar handbook by Lorimer and Kramer). It must be said, however, that all the above algorithms are examples of incoherent dedispersion, i.e. dedispersion applied to the total power versus time and frequency. The technique that recovers the closest to the original signal, coherent dedispersion, involves convolution of the incoming complex data (voltage with phase information) with a chirp function that represents the inverse of the effect of interstellar dispersion. Coherent dedispersion is known to work well on GPUs from the work of I. Cognard (Nançay) and P. Demorest (NRAO). However, the computational requirements of the convolution make it more suitable for dedispersion at a single, known DM than for dispersion searches over large ranges of DM (several thousands). For the incoherent algorithms above, our tests indicate a relative speedup between GPUs and CPUs of typically two orders of magnitude for equally priced hardware.

4.2.2 Beyond the standard dedispersion algorithms

One way of reducing the cost of GPU dispersion searches is to optimise the algorithms so that they run effectively on less hardware. We are working on this approach in two directions. The first concerns linearization of the problem, which makes it suitable for algorithms such as tree dedispersion. We are designing a new dedispersion kernel that takes advantage of the GPU capability for fast matrix rotation. The effectiveness of GPUs increases dramatically when neighbouring threads access neighbouring memory addresses, because the number of memory transactions is minimised. The index shifting algorithms used for dedispersion today do not take advantage of this aspect, and we are focusing efforts on improving that.
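The memory access point can be illustrated without GPU code by listing the element offsets that a group of neighbouring threads would read. The sketch below assumes that each thread handles one output time sample and that threads are grouped in warps of 32, as on NVIDIA hardware. The two array layouts and all function names are hypothetical and do not describe our kernel.

```python
def offsets_time_contiguous(chan, shift, n_time, threads):
    """Layout data[chan][time]: offset = chan * n_time + time."""
    return [chan * n_time + (t + shift) for t in threads]


def offsets_channel_contiguous(chan, shift, n_chan, threads):
    """Layout data[time][chan]: offset = time * n_chan + chan."""
    return [(t + shift) * n_chan + chan for t in threads]


def strides(offsets):
    """Set of distances between offsets read by neighbouring threads."""
    return {b - a for a, b in zip(offsets, offsets[1:])}


warp = list(range(32))  # thread t computes output time sample t
time_major = offsets_time_contiguous(chan=3, shift=17, n_time=4096, threads=warp)
chan_major = offsets_channel_contiguous(chan=3, shift=17, n_chan=512, threads=warp)
print(strides(time_major), strides(chan_major))
```

In the first layout the 32 reads fall on consecutive elements and can be served together, whatever the per‐channel shift. In the second they are 512 elements apart, so each thread's read touches a different region of memory.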

The second path is through developments in adaptive sampling techniques in information theory.

We are working on a feedback mechanism, which will decide on the next sample in the frequency

versus time domain based on the dedispersed intensity of the previous measurement(s). In an

environment where the noise is well characterised, such algorithms can focus “attention” on

interesting areas of the data, directing most computational power to the relevant areas. We are

developing such an algorithm within the restrictions and special attributes of GPU environments, to

be tested on our ILS backend.

5 Testing in a real environment.

With the sampling rate offered by the ILS, we have identified the useful parameter space to sample for DM searches of bright IRPs. The DM range to be searched relates to the observing frequency. At low frequencies, from tens to a few hundred MHz, the maximum DM at which an IRP can be expected is set by the total scattering caused by the same electrons, which reduces the peak intensity of the pulse to below detectable levels. At higher frequencies, the DM range for which a search is relevant is more closely related to the DM distribution from models of the Galactic electron density distribution. Typical values (in DM units) for a search are a maximum DM of 100 for LOFAR frequencies and potentially several thousand for high radio frequencies. The channelization required for incoherent dedispersion and the DM step are also frequency dependent, with lower frequencies requiring narrower channels and finer DM steps than high frequency searches. At 150 MHz, a typical search for 1 ms wide IRPs requires ~6 kHz frequency channels and a DM step of 0.05 in DM units. This frequency resolution translates directly into a time resolution of 160 μs, which is sufficient to measure the rise and fall of a millisecond IRP. This is another route to setting the maximum DM, as for


higher DM values, finer channelization is required, which increases the time bin duration beyond what is useful for measuring millisecond IRPs.
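The search parameters quoted for 150 MHz can be checked against the dispersion law. The sketch below uses the standard dispersion constant and the first‐order expression for the delay across a narrow channel, neither of which is given in this document. The band edges of 144 and 156 MHz are assumed to represent a 12 MHz band centred on 150 MHz.

```python
K_DM = 4.148808e3  # standard dispersion constant, s MHz^2 pc^-1 cm^3


def channel_smearing_s(dm, chan_width_mhz, freq_mhz):
    """Dispersion delay across one narrow channel (first order in width)."""
    return 2.0 * K_DM * dm * chan_width_mhz / freq_mhz ** 3


def dm_step_smearing_s(dm_step, f_lo_mhz, f_hi_mhz):
    """Residual sweep across the band when the trial DM is off by dm_step."""
    return K_DM * dm_step * (f_lo_mhz ** -2 - f_hi_mhz ** -2)


# ~6 kHz channels at 150 MHz, evaluated at the maximum DM of 100.
smear = channel_smearing_s(100.0, 0.006, 150.0)
# DM step of 0.05 across an assumed 144-156 MHz band.
step = dm_step_smearing_s(0.05, 144.0, 156.0)
# Time resolution implied by ~6 kHz channels.
time_res = 1.0 / 6.0e3

print(f"in-channel smearing {smear * 1e3:.2f} ms, "
      f"DM-step smearing {step * 1e3:.2f} ms, "
      f"time resolution {time_res * 1e6:.0f} us")
```

Both smearing terms come out at roughly 1.5 ms, comparable to the 1 ms pulse width targeted, and the implied time resolution is close to the 160 μs quoted above.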

The backend at the LOFAR‐UK station is operational and can process several thousand DMs in real time, depending on the choice of GPU dedispersion kernel. Figure 3 shows a few seconds of output as an example of the data produced by the dedispersion module. The x and y axes are time and DM, and the diameter of the circles is proportional to the intensity. Circles are plotted whenever the intensity exceeds the noise level by 5 times the RMS. These data were taken during an observation of a bright pulsar (B1508+55), and IRPs from it can be seen as persistent events at a DM just under 20. The conclusion is that at an initial cost of ~5 kEuros per beam of 6 MHz bandwidth (Δν/ν=3%), plus 1 kW of power, a dispersion search can be conducted today using an ILS.

Initial positive testing has motivated a longer, real survey for IRPs using ILSs, which we expect to

complete within the PrepSKA programme. This will provide a clearer understanding of the

interesting parameter space to be searched, and provide long‐term measurements for the mean

processing power we can expect from the GPU hardware, as well as other possible bottlenecks

including smooth and continuous network functions for the distribution of the data, PCIe bandwidth

issues for the IO between CPU and GPU and memory bandwidth issues in the GPUs.

6 Summary of costs and the future

We have put together a complete hardware and software backend to perform real‐time dedispersion for searches of IRPs, which we are currently testing on LOFAR international stations. The cost of the machine can be broken down as follows:

‐ 12‐core Intel Xeon server to perform CPU preprocessing, including buffering of 800 Mbps beamformed data: 3.5 kEuros

‐ NVIDIA GPU cards:

o S1070 (4xC1060), capable of sustained processing of up to ~5000 DM values over a 12 MHz band at 150 MHz with current, non‐optimised dedispersion kernels: 5.5 kEuros

o M2050, capable of processing up to ~2000 DM values over 12 MHz of bandwidth at 150 MHz: 1.5 kEuros

o GeForce GTX card, non‐server grade but potentially equivalent to the M2050 (tests pending): 0.4 kEuros

The power consumption is ~1 kW in total for each of these solutions, per 800 Mbps of raw data bandwidth or 12 MHz of LOFAR sky bandwidth.

It should be noted here that these costs are estimated at a sky frequency of 150 MHz, where 12 MHz of band is roughly 10% of the sky frequency. Cost estimates for other frequencies should be based on the bandwidth to frequency ratio and not on the absolute bandwidth. Also, the CPU host machines play a crucial role by buffering the data (in amounts that optimise IO between CPU and GPUs), which means that many tens of seconds of data (>100 s of ILS data) can be processed at once on current GPU boards with 3‐6 GB of on‐board memory.
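The cost figures in this section can be combined as follows. The sketch assumes that one processing unit consists of one CPU server plus one of the listed GPU options, which is how the totals of about 9 kEuros (server grade) and 4 kEuros (non‐server grade) quoted in section 4.1 appear to arise. The dictionary keys and function names are illustrative.

```python
SERVER_KEUR = 3.5
GPU_KEUR = {"S1070": 5.5, "M2050": 1.5, "GeForce GTX": 0.4}


def unit_cost_keur(gpu):
    """Cost of one CPU server plus one GPU option, in kEuros."""
    return SERVER_KEUR + GPU_KEUR[gpu]


def fractional_bandwidth(bandwidth_mhz, sky_freq_mhz):
    """Bandwidth to frequency ratio on which the estimates are based."""
    return bandwidth_mhz / sky_freq_mhz


costs = {gpu: unit_cost_keur(gpu) for gpu in GPU_KEUR}
ratio = fractional_bandwidth(12.0, 150.0)

for gpu, cost in costs.items():
    print(f"{gpu}: {cost:.1f} kEuros per 12 MHz at 150 MHz")
print(f"fractional bandwidth: {ratio:.2f}")
```

The exact ratio for 12 MHz at 150 MHz is 0.08, which the text rounds to 10%.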


The hardware of the backend is all rack‐mountable, and cooling is provided by standard room air‐conditioning. The total cost of the software effort is currently hard to estimate, because a) we are still in the design phase for a new algorithm and are not sure how much coding will be required, and b) we are currently conducting long (multi‐day) tests to establish the shortcomings of both hardware and code in a real‐life test.

The conclusion to be drawn at this stage is that cheap multi‐core technology, such as that implemented on GPU chips, can be used today at reasonable up‐front and running costs for LOFAR searches of dispersed IRPs. The hardware and running costs today are well matched to the total bandwidth of data that is processed. All indications are that keeping pace with Moore’s law requires a move to massively multi‐core CPUs, and GPU processing is the first step. With the next generation of CPUs, such as the Intel Sandy Bridge, GPU‐type cores will coexist on CPU chips, indicating a new era in CPU architecture. Dispersion searches for IRPs are vital for expanding the parameter space of known pulsars, and can lead to the discovery of other astrophysical events. Based on the above, persisting in the effort to characterise and optimise GPU usage in this field appears to be both useful and necessary.
