RTC_4_AO workshop, Paris 2016 - Low-latency data acquisition to GPUs using FPGA-based 3rd party devices - Denis Perret, LESIA / Observatoire de Paris



Page 1

RTC_4_AO workshop, Paris 2016

Low-latency data acquisition to GPUs using FPGA-based 3rd party devices

Denis Perret, LESIA / Observatoire de Paris

Page 2

RTC for ELT AO

[Diagram: Sensors and telemetry feed the Hard Real Time and Soft Real Time pipelines over 40 GbE networks (high bandwidth, low latency); the Hard Real Time pipeline drives the Deformable Mirror]

Page 3

Using FPGA to reduce acquisition latency

~  Interface: 10 GbE, low-latency acquisition interface

~  Without Direct Memory Access to the GPU RAM: multiple data copies = added latency

[Diagram: 10 GbE NIC -> PCIe -> Host RAM / Host CPU -> PCIe -> GPU RAM / GPU: data is staged through host memory]

Page 4

Using FPGA to reduce acquisition latency

~  Interface: 10 GbE, low-latency acquisition interface

~  With Direct Memory Access to the GPU RAM: a single data transfer = low latency, maximum bandwidth

[Diagram: 10 GbE NIC -> PCIe -> GPU RAM / GPU directly, bypassing Host RAM / Host CPU]

Page 5


No PCIe P2P at all.

[Diagram: encapsulated data flows from the NIC over PCIe into Host RAM; the host CPU application handles the interrupts and the data decapsulation]

~  The NIC DMA engine sends data to host memory and sends an interrupt to the CPU when the buffer is (half) full.

~  The CPU suspends its tasks and saves its current state (context switch).

~  The CPU decapsulates the data (Ethernet, TCP/UDP, GigE Vision…), as sketched below.
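As an illustration of the decapsulation step above, here is a minimal host-side sketch (plain C++ as it would appear in a CUDA host file) that strips the Ethernet, IPv4 and UDP headers of a raw frame to reach the payload. The function name, the absence of VLAN tag and IP-option handling, and the missing GigE Vision (GVSP) parsing are simplifying assumptions, not the implementation used in the presented system.

    #include <arpa/inet.h>   // ntohs
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Locate the UDP payload inside a raw Ethernet frame that the NIC DMA
    // engine has written into host memory. Assumes an untagged IPv4/UDP frame;
    // the GigE Vision (GVSP) header would still follow and is not parsed here.
    static const uint8_t* udp_payload(const uint8_t* frame, size_t frame_len,
                                      size_t* payload_len)
    {
        const size_t eth_hdr = 14;                        // MACs + ethertype
        if (frame_len < eth_hdr + 20 + 8) return nullptr;

        uint16_t ethertype;
        memcpy(&ethertype, frame + 12, 2);
        if (ntohs(ethertype) != 0x0800) return nullptr;   // not IPv4

        const uint8_t* ip = frame + eth_hdr;
        const size_t ip_hdr = (ip[0] & 0x0F) * 4;         // IHL, in 32-bit words
        if (ip[9] != 17) return nullptr;                  // not UDP

        const uint8_t* udp = ip + ip_hdr;
        uint16_t udp_len;
        memcpy(&udp_len, udp + 4, 2);
        udp_len = ntohs(udp_len);
        if (udp_len < 8) return nullptr;

        *payload_len = udp_len - 8;                       // strip the UDP header
        return udp + 8;                                   // pixel (or GVSP) payload
    }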

Page 6


No PCIe P2P at all.

~  The CPU builds a CUDA stream containing GPU operations such as kernel launches and memory transfers (sketched at the end of this page).

~  The stream is sent to the GPU.

~  The pixels are copied to the GPU RAM (GPU DMA, read over PCIe).

[Diagram: the decapsulated data is read from Host RAM over PCIe into GPU RAM; the host CPU application initiates the data and operation-stream transfer]
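A minimal CUDA sketch of this host-staged path: one frame's worth of work, enqueued on a single stream. The process_pixels kernel, the buffer names and the launch geometry are placeholders standing in for the real pipeline, and the device buffers are assumed to have been allocated once at startup.

    #include <cuda_runtime.h>
    #include <cstdint>

    // Placeholder kernel standing in for the real AO computation.
    __global__ void process_pixels(const uint16_t* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = static_cast<float>(in[i]);
    }

    // One frame of the non-P2P path: H2D copy, kernel, D2H copy, all queued on
    // the same CUDA stream so they execute in order on the GPU.
    void run_frame(const uint16_t* host_pixels,   // pinned host buffer, already decapsulated
                   uint16_t* d_pixels, float* d_results,
                   float* host_results, int n, cudaStream_t stream)
    {
        cudaMemcpyAsync(d_pixels, host_pixels, n * sizeof(uint16_t),
                        cudaMemcpyHostToDevice, stream);   // pixels: host RAM -> GPU RAM
        process_pixels<<<(n + 255) / 256, 256, 0, stream>>>(d_pixels, d_results, n);
        cudaMemcpyAsync(host_results, d_results, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);   // results: GPU RAM -> host RAM
        cudaStreamSynchronize(stream);   // or do something else and synchronize later
    }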

Page 7


No PCIe P2P at all.

~  The CPU waits for the computation to end (or does something else).

~  The results are sent to the CPU RAM.

~  The synchronization mechanisms between the GPU and the CPU are hidden (interrupts…).

[Diagram: the computation runs on the GPU and the results flow back over PCIe to Host RAM; the host CPU application does something else and gets interrupted periodically]

Page 8


No PCIe P2P at all.

~  The CPU encapsulates the results (Ethernet, TCP/UDP, …).

~  The CPU initiates the transfer by configuring and launching the NIC DMA engine (reading process).

[Diagram: the encapsulated results flow from Host RAM over PCIe to the NIC; the host CPU application encapsulates the data and initiates the data transfer]

Page 9


With PCIe P2P and Custom NIC

~  The CPU gets the address and size of a dedicated GPU buffer and uses it to configure the NIC DMA engine.

~  The data are written directly to the GPU memory.

~  The GPU continuously detects new data by polling its memory (busy loop), performs the computation and fills a local buffer with the results (see the sketch at the end of this page).

[Diagram: encapsulated data enters the custom NIC, which decapsulates it (TOE) and writes the decapsulated data over PCIe directly into GPU RAM, where a polling kernel picks it up; the host CPU application is minding its own business]
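A minimal sketch of the GPU and host sides of this scheme, assuming the FPGA writes each frame into a dedicated GPU buffer and then bumps a frame counter in GPU RAM. The buffer layout, the process_frame placeholder and the commented-out driver call are assumptions; a real implementation needs the FPGA driver to pin and map the GPU buffer (GPUDirect RDMA) before the DMA engine can target it.

    #include <cuda_runtime.h>
    #include <cstdint>

    // Placeholder for the real per-frame computation.
    __device__ void process_frame(const uint16_t* pixels, float* results, int n)
    {
        for (int i = 0; i < n; ++i) results[i] = static_cast<float>(pixels[i]);
    }

    // Persistent polling kernel: busy-loops on a frame counter that the FPGA
    // DMA engine writes after each frame. Launched as <<<1, 1>>> here for
    // clarity; a real version would use a whole block plus __syncthreads().
    __global__ void polling_kernel(volatile unsigned* frame_flag,
                                   const uint16_t* pixels, float* results,
                                   int n, volatile int* stop)
    {
        unsigned last_seen = 0;
        while (!*stop) {
            if (*frame_flag != last_seen) {        // a new frame landed in GPU RAM
                last_seen = *frame_flag;
                __threadfence();                   // see the FPGA's pixel writes
                process_frame(pixels, results, n);
                __threadfence_system();            // publish the results over PCIe
            }
        }
    }

    // Host side: allocate the GPU buffers and hand their addresses to the FPGA.
    void setup(int n)
    {
        uint16_t* d_pixels;  float* d_results;  unsigned* d_flag;  int* d_stop;
        cudaMalloc(&d_pixels,  n * sizeof(uint16_t));
        cudaMalloc(&d_results, n * sizeof(float));
        cudaMalloc(&d_flag,    sizeof(unsigned));
        cudaMalloc(&d_stop,    sizeof(int));
        cudaMemset(d_flag, 0, sizeof(unsigned));
        cudaMemset(d_stop, 0, sizeof(int));

        // custom_nic_set_dma_target(d_pixels, n * sizeof(uint16_t), d_flag);
        // ^ hypothetical driver call that programs the NIC DMA engine with the
        //   GPU buffer and flag addresses (via GPUDirect RDMA in the driver).

        polling_kernel<<<1, 1>>>(d_flag, d_pixels, d_results, n, d_stop);
    }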

Page 10


Getting back the results: 1st way

~  The CPU gets notified, through a hidden mechanism, that the computation has finished.

~  The CPU initiates the transfer from GPU to FPGA (FPGA DMA engine -> read over PCIe), as sketched below.

[Diagram: the polling kernel produces results in GPU RAM; the custom NIC reads them back over PCIe and sends them out as encapsulated results (TOE); the host CPU application is still minding its own business]
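A host-side sketch of this first way back. The cudaEvent_t is assumed to have been recorded after the last kernel of the frame; fpga_dma_read_from_gpu() is a hypothetical stand-in for the FPGA driver call that programs the FPGA DMA engine with the (GPUDirect-mapped) address of the GPU results buffer.

    #include <cuda_runtime.h>
    #include <cstddef>

    // Stand-in for the hypothetical FPGA driver entry point that launches a DMA
    // read over PCIe from a GPU buffer previously mapped for GPUDirect RDMA.
    static void fpga_dma_read_from_gpu(const void* gpu_results, size_t bytes)
    {
        (void)gpu_results; (void)bytes;   // the real work happens in the FPGA driver
    }

    void send_results_way1(cudaEvent_t frame_done, const float* d_results, size_t bytes)
    {
        // The "hidden" notification: block until the GPU reaches the event
        // (internally interrupt- or poll-based, chosen by the CUDA runtime).
        cudaEventSynchronize(frame_done);

        // The CPU then initiates the way back: the FPGA DMA engine reads the
        // results from GPU RAM over PCIe and the TOE encapsulates them.
        fpga_dma_read_from_gpu(d_results, bytes);
    }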

Page 11


Getting back the results: 2nd way

~  The GPU sends the data directly by writing to the NIC's addressable space (GPU DMA -> write over PCIe).

[Diagram: the GPU writes the results directly into the custom NIC's address space over PCIe; the NIC (TOE) encapsulates and sends them; the host CPU application is still minding its own business]

Page 12


Getting back the results: 3rd way

~  The GPU notifies the FPGA by writing a flag into the FPGA's addressable space (see the sketch below).

~  The FPGA gets the data back (FPGA DMA engine -> read over PCIe).

[Diagram: the GPU writes a flag into the FPGA's address space; the FPGA DMA engine then reads the results from GPU RAM over PCIe and the NIC (TOE) sends them out encapsulated; the host CPU application is still minding its own business]
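A sketch of this third way, under the assumption that the FPGA exposes a doorbell register in one of its PCIe BARs. The host mmaps the BAR from sysfs and registers it with cudaHostRegister(..., cudaHostRegisterIoMemory | cudaHostRegisterMapped) so the GPU gets a device pointer it can write to; the sysfs path, BAR size, register offset and flag value are placeholders.

    #include <cuda_runtime.h>
    #include <cstdint>
    #include <fcntl.h>
    #include <sys/mman.h>

    // GPU side: once the results buffer is filled, one thread rings the FPGA
    // doorbell with an MMIO write; the FPGA then starts its DMA read.
    __global__ void notify_fpga(volatile uint32_t* doorbell, uint32_t frame_id)
    {
        __threadfence_system();   // make the results visible over PCIe first
        *doorbell = frame_id;     // write into the FPGA's addressable space
    }

    // Host side: map the FPGA BAR and return a device pointer to the doorbell.
    volatile uint32_t* map_fpga_doorbell()
    {
        const size_t bar_size = 4096;                                   // placeholder
        int fd = open("/sys/bus/pci/devices/0000:xx:00.0/resource0",    // placeholder path
                      O_RDWR | O_SYNC);
        if (fd < 0) return nullptr;
        void* bar = mmap(nullptr, bar_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) return nullptr;

        cudaHostRegister(bar, bar_size,
                         cudaHostRegisterIoMemory | cudaHostRegisterMapped);
        void* dev_ptr = nullptr;
        cudaHostGetDevicePointer(&dev_ptr, bar, 0);
        return static_cast<volatile uint32_t*>(dev_ptr);  // doorbell at offset 0 (placeholder)
    }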

Page 13

Using FPGA to reduce acquisition latency: first tests

~  Stratix V PCIe development board from PLDA (+ QuickPCIe, QuickUDP IP cores): 42 Gb/s demonstrated from board to GPU; 8.8 Gb/s per 10 GbE link in loopback mode

~  10 GbE camera from Emergent Vision Technologies (8.9 Gb/s to GPU memory), GigE Vision protocol, 1.5 kFPS at 240x240 pixels coded on 10 bits (360 FPS at 2k x 1k on 8 bits or 1k x 1k on 10 bits)

Page 14

Using FPGA to reduce acquisition latency

[Block diagram of the prototype: inside the CUSTOM_NIC, two PHY/UDP blocks feed a DEMUX/DECAPS stage, a vision-protocol-handling block, a data generator, a latency-measurement block and a DM controller (DMC); several DMA engines move pixels, camera commands/answers, measurements and deformable-mirror commands over PCIe 3.0; the host CPU application handles camera control and host RAM; the GPU holds a pixels ring buffer, a polling kernel, computation kernels and a DM command buffer, with a loopback path back to the NIC]

Page 15

P2P from FPGA to GPU (way back still launched by the CPU)

[Plot: latency measurements over roughly 4000 frames for four configurations: no p2p & computation, p2p & computation, no p2p & no computation, p2p & no computation]

Page 16

Interrupts vs polling (on the CPU)

[Test-bench diagram: PC A and PC B linked over 10 GbE; a PLDA "QuickPlay" board and a non-QuickPlay PLDA board with UDP0/UDP1 links, DDR, QPCIE DMA, BAR2, HDL kernels (image generator, latency measurement) and a C kernel computing the CoG (or not); the GPU board with GPU RAM, a polling kernel and computing kernel(s), connected over PCIe]

Page 17

Interrupts vs polling (on the CPU)

[Plot: latency distributions for four cases: optimized interrupt + "stress -c 8", memory polling, non-optimized interrupt, and non-optimized interrupt + "stress -c 8"]

Page 18

Interrupts vs polling (on the CPU)

[Plot (zoomed): latency distributions for polling + isolated CPU, polling + non-isolated CPU, and optimized interrupt]
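For reference, the memory-polling receiver used in the "polling + isolated CPU" case above can be sketched as follows: the receiving thread is pinned to a core that was removed from the general scheduler (e.g. via the isolcpus= kernel parameter) and busy-polls a flag in pinned host memory instead of sleeping on an interrupt. The flag layout, the core number and the commented-out per-frame handler are assumptions.

    #include <pthread.h>
    #include <sched.h>
    #include <cstdint>

    // Busy-poll a frame counter in pinned host memory (written by the NIC DMA
    // engine) from a thread pinned to an isolated core: no interrupt, no
    // context switch, so the wake-up cost is a few cache-line transfers.
    void poll_frames(volatile uint32_t* frame_flag, volatile bool* stop,
                     int isolated_core)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(isolated_core, &mask);       // e.g. a core listed in isolcpus=
        pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);

        uint32_t last_seen = 0;
        while (!*stop) {
            uint32_t seen = *frame_flag;
            if (seen != last_seen) {
                last_seen = seen;
                // handle_frame(seen);        // hypothetical per-frame work
            }
            // A pause/yield instruction could be inserted here to ease pressure
            // on the memory system without giving up the core.
        }
    }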

Page 19

Using FPGA to reduce the load on GPU/PCIe/network…

[Diagram: the N*N sub-images are spread over N dual-port RAMs, each feeding a CoG block and a FIFO; the results are read out by DMA over PCIe into the GPU RAM]

~  FPGAs are well suited for on-the-fly decapsulation, data rearrangement and highly parallelized computation.

~  Less load on the PCIe, smaller buffers in the GPU RAM, lower latency.

~  Could be done in the WFS itself, which would also lower the load on the network (a GPU sketch of the corresponding CoG computation follows).
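For comparison with the FPGA pipeline above, here is a minimal CUDA sketch of the same per-subaperture centre-of-gravity computation, with one thread per subaperture (deliberately naive; a tuned version would use one block per subaperture and a shared-memory reduction). The frame geometry and buffer names are assumptions.

    #include <cuda_runtime.h>
    #include <cstdint>

    // Centre of gravity of each sub x sub subimage of an (n_sub*sub)^2 WFS
    // frame; slopes[s] holds the (x, y) centroid of subaperture s.
    __global__ void cog_kernel(const uint16_t* frame, float2* slopes,
                               int n_sub, int sub)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= n_sub * n_sub) return;

        int sx = (s % n_sub) * sub;          // top-left corner of the subimage
        int sy = (s / n_sub) * sub;
        int width = n_sub * sub;

        float sum = 0.f, sum_x = 0.f, sum_y = 0.f;
        for (int y = 0; y < sub; ++y)
            for (int x = 0; x < sub; ++x) {
                float p = frame[(sy + y) * width + (sx + x)];
                sum   += p;
                sum_x += p * x;
                sum_y += p * y;
            }

        slopes[s] = (sum > 0.f) ? make_float2(sum_x / sum, sum_y / sum)
                                : make_float2(0.f, 0.f);
    }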

Page 20

…or even do everything with FPGAs


~  Exploring the possibilities with high-level development environments:

~  Vendor-specific HLS: Xilinx Vivado, …

~  QuickPlay: based on KPN (Kahn process networks). Has its own HLS and will be able to use other HLS tools. PCIe P2P integration is planned.

~  OpenCL: available for FPGAs (Altera), CPUs and GPUs. Can now handle network protocols: Altera introduced I/O channels that let kernels read and write network streams defined by the board designer. P2P over PCIe should be possible (synchronization…?). Integration of custom HDL blocks?

~  MATLAB to HDL: has anyone tried it? Possibly useful to help develop IP cores.

-> Interactions with HPC tools such as MPI, DDS or CORBA are quite challenging.