Accelerated Data Processing on SoC with FPGA
Marek Vasut <[email protected]>
June 3, 2015
Marek Vasut <[email protected]> Accelerated Data Processing on SoC with FPGA
Marek Vasut
- Software engineer at DENX S.E. since 2011
  - Embedded and Real-Time Systems Services, Linux kernel and driver development, U-Boot development, consulting, training
- Versatile Linux kernel hacker
- Custodian at U-Boot bootloader

Structure of the talk
- Motivation
- Introduction to FPGAs
- Your first FPGA data cruncher
- Interfacing with Linux
- Speeding things up

Why listen to this talk
- Get fresh ideas
- Learn something new
- Reduce the energy envelope of your device
- Process data quickly and efficiently
You won’t learn marketing stuff or random benchmark numbers

FPGA
- Abbreviation for Field Programmable Gate Array
- Programmable logic
- Usually used for:
  - Digital Signal Processing (DSP)
  - Data crunching
  - Custom hardware interfaces
  - ASIC prototyping
  - ...
- Common vendors: Xilinx, Altera, Lattice, Microsemi, ...

Internal structure
Image credit: W. T. Freeman, http://www.vision.caltech.edu/CNS248/Fpga/fpga1a.gif
CC BY 2.5: http://creativecommons.org/licenses/by/2.5/

FPGA and the outside
- FPGA has plenty of I/O options:
  - Regular I/O with configurable voltage levels
  - Differential I/O
  - High-speed SerDes
  - ...
- Usual interface with host:
  - Stand-alone FPGA: usually PCIe, USB, ...
  - FPGA on a CPU bus (PowerPCs, e.g. ML507)
  - Built into CPU (SoCFPGA/Zynq): usually AMBA/AXI

Programming the FPGA
- Each vendor has its own tools: Altera Quartus, Xilinx Vivado
- FPGA tools are often closed source :-(
- FPGA bitstream format is closed :-(
- Basic vendor tools are available free of charge
- They offer sufficient functionality to implement a data cruncher
- Vendor tools are needed for place-and-route and assembler
- Third-party tools for synthesis are available

Comparison to a GPU – I.
            CPU          GPU          FPGA
Toolchain   Open         Closed       Closed
HW design   Proprietary  Proprietary  Your own
HW units    Fixed        Fixed        As needed
I/O         Limited      None         As needed

HDL – Hardware Description Language
- FPGA content is written in HDLs
- HDL: Hardware Description Language
- HDLs are used to model the behavior of logic blocks
- Two major HDLs: VHDL and Verilog
- Tools often allow seamless mixing of HDLs
- Many readily available cores under acceptable licenses:
  - OpenCores: http://opencores.org/
  - OpenCores projects: http://opencores.org/projects
  - CERN Open HW Repo: http://www.ohwr.org/

Modeling behavior
HW Behavior modeling vs. Writing CPU code:
- Vastly different and confusing to software people :-)
- CPU: Programmer implements an algorithm
- FPGA: Programmer implements hardware to run the algorithm

Implicit parallelism
- Everything in a block is executed in parallel
- All conditions in a conditional statement (if, case) are tested in parallel, which differs from C

    if (foo == 1) bar <= 1'b0;
    else bar <= 1'b1;

- Blocks are executed in parallel

    begin
        x <= 1'b0;
        y <= 1'b1;
    end
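For readers coming from software, the effect of the non-blocking `<=` assignments above can be mimicked in a few lines of Python. This is only an illustrative model, not how any simulator is implemented; the function name `clock_edge` and the dictionary-based state are assumptions made for this sketch. All right-hand sides are evaluated against the old state first and only then committed, which is why a swap needs no temporary variable.

```python
def clock_edge(state, assignments):
    """Model one clock edge with Verilog-style non-blocking assignments.

    state: dict mapping signal name -> current value
    assignments: dict mapping signal name -> function(old_state) -> new value
    """
    # Phase 1: evaluate every right-hand side against the OLD state.
    pending = {name: rhs(state) for name, rhs in assignments.items()}
    # Phase 2: commit all results at once.
    new_state = dict(state)
    new_state.update(pending)
    return new_state


# Hypothetical example: "a <= b; b <= a;" swaps the two registers.
swap = {"a": lambda s: s["b"], "b": lambda s: s["a"]}
print(clock_edge({"a": 1, "b": 0}, swap))  # {'a': 0, 'b': 1}
```

Written as two sequential C-style statements (`a = b; b = a;`), both variables would end up holding the old value of `b`.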
Combinatorial vs. Sequential logic
- Combinatorial logic: the immediate value of a signal is determined only by the immediate inputs of the function:

    assign Z = X ^ Y;

- Sequential logic is synchronous to a clock (involves a register):

    always @(posedge clk)
        Z <= DAT;

Verilog example
- Looks like C, based on C, but behaves differently
- Used a lot in Europe
- Example: CRC5, polynomial x^5 + x^2 + x^0
- Example modified from: http://www.asic-world.com/examples/verilog/serial_crc.html

    module crc5 (
        /* SYSTEM I/O */
        input reset,
        input clk,
        /* CRC5 I/O */
        input data,
        output reg [4:0] crc
    );

Verilog example II
    always @(posedge clk) begin
        if (reset) begin
            /* Synchronous reset to all zeroes */
            crc <= 5'b00000;
        end else begin
            /* Shift by one bit, feed (data ^ crc[4]) into taps 0 and 2 */
            crc[0] <= data ^ crc[4];
            crc[1] <= crc[0];
            crc[2] <= crc[1] ^ data ^ crc[4];
            crc[3] <= crc[2];
            crc[4] <= crc[3];
        end
    end
    endmodule
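When simulating a module like this, it helps to have a software reference model to compare against. The following Python sketch mirrors the per-clock update of the Verilog example bit for bit; the function names (`crc5_step`, `crc5_step_shift`, `crc5`) are made up for illustration, and the initial value 0 is taken from the reset branch above. The second function shows the same update in the more familiar shift-and-XOR form, with the x^2 and x^0 taps written as `0b00101`.

```python
def crc5_step(crc, data):
    """Advance the 5-bit CRC register by one clock, consuming one data bit."""
    bit = lambda i: (crc >> i) & 1
    feedback = data ^ bit(4)
    nxt = [
        feedback,           # crc[0] <= data ^ crc[4]
        bit(0),             # crc[1] <= crc[0]
        bit(1) ^ feedback,  # crc[2] <= crc[1] ^ data ^ crc[4]
        bit(2),             # crc[3] <= crc[2]
        bit(3),             # crc[4] <= crc[3]
    ]
    return sum(b << i for i, b in enumerate(nxt))


def crc5_step_shift(crc, data):
    """The same update written as shift-and-XOR with taps 0b00101."""
    feedback = data ^ ((crc >> 4) & 1)
    crc = (crc << 1) & 0x1F
    return crc ^ 0b00101 if feedback else crc


def crc5(bits, init=0):
    """Feed a sequence of bits, one per clock, starting from `init`."""
    crc = init
    for b in bits:
        crc = crc5_step(crc, b)
    return crc


print(crc5([1, 0, 0, 0]))  # 13
```

A testbench can feed the same bit sequence to the HDL module and compare the `crc` output against this model after every clock.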
VHDL example
- Distinctive syntax based on Ada
- More explicit typing system than Verilog
- Used a lot in the USA
- Example: CRC5, polynomial x^5 + x^2 + x^0
- Example from http://outputlogic.com/?page_id=321

    library ieee;
    use ieee.std_logic_1164.all;

    entity crc is
        port ( data_in  : in  std_logic_vector (0 downto 0);
               rst, clk : in  std_logic;
               crc_out  : out std_logic_vector (4 downto 0));
    end crc;

VHDL example II
    architecture imp_crc of crc is
        signal lfsr_q: std_logic_vector (4 downto 0);
        signal lfsr_c: std_logic_vector (4 downto 0);
    begin
        crc_out <= lfsr_q;
        lfsr_c(0) <= lfsr_q(4) xor data_in(0);
        lfsr_c(1) <= lfsr_q(0);
        lfsr_c(2) <= lfsr_q(1) xor lfsr_q(4) xor data_in(0);
        lfsr_c(3) <= lfsr_q(2);
        lfsr_c(4) <= lfsr_q(3);

        process (clk,rst) begin
            if (rst = '1') then
                lfsr_q <= b"11111";
            elsif (clk'EVENT and clk = '1') then
                lfsr_q <= lfsr_c;
            end if;
        end process;
    end architecture imp_crc;

Comparison to a GPU – II.
                    CPU          GPU           FPGA
Languages           All          OpenCL, CUDA  OpenCL, HDLs
Design paradigm     Sequential   Seq/Par       Parallel
Design granularity  Instruction  Instruction   Gate
Opt. possibility    Low          Low           High
Opt. difficulty     Low          Low           High

Development and debugging
- Simulation (on developer’s system)
- Probing (on-target)

Simulation
- Simulation tools:
  - Icarus Verilog: http://iverilog.icarus.com/
  - ghdl: http://home.gna.org/ghdl/
  - ModelSim: http://en.wikipedia.org/wiki/ModelSim/
- Write a testcase for a module in an augmented HDL
- Execute the testcase
- Observe the results:
  - View waveforms
  - Decode and inspect buses
  - Trigger on complex conditions
  - ...

Probing
- Used to observe the design on target
- Think of this as a bus analyzer in the FPGA
- Probing tools (e.g. SignalTap)
- Design is augmented with a probing IP, FPGA is reprogrammed
- Probing is controlled through a debug probe attached to the FPGA (JTAG or similar)
- Probe internal signals, observe waveforms, trigger on complex conditions, ...

Structuring the design
- HDL files: lowest in the hierarchy
- IP block: collection of HDL files with an interface
- FPGA design: collection of IP blocks
- Vendor tools contain tools to assemble IP blocks into an FPGA design, e.g. Altera QSys

Comparison to a GPU – III.
            CPU   GPU               FPGA
Simulation  QEMU  ?                 Icarus, ModelSim
Debugger    GDB   CUDA-GDB, CodeXL  SignalTap

Linux interface
- No standard in-kernel FPGA interface due to the variance of designs
- Attempts do exist:
  - Device Tree Overlay(s) stored in FPGA
  - SDB: http://www.ohwr.org/projects/fpga-config-space
- Usually there are control registers in the FPGA design
- Usually DMA is involved (either on the FPGA or the CPU side)
- Two options for controlling the FPGA:
  - Custom Linux kernel driver
  - Userspace utility

Custom kernel driver
- Driver written to match the particular FPGA bitstream
- Driver can crash the host machine if written badly :-(
- Driver usually exports custom userland I/O
- splice(2)

Userland approach
- Userland accesses the FPGA registers via uio
- The uio is like a restricted devmem
- In case DMA is involved, a kernel module is needed to prepare the data for the DMA (e.g. ensure cache coherency)
- CMA might be used to export a large slab of custom kernel memory to userland
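The register access path through uio can be sketched from userland as follows. This is a minimal sketch under several assumptions: the device node is `/dev/uio0`, the register window is its first mapping and fits in one 4 KiB page, registers are 32-bit little-endian, and the offsets `REG_CTRL` and `REG_STATUS` are hypothetical placeholders for whatever the actual FPGA design defines. Whether the interpreter issues a single 32-bit bus access for each call is not guaranteed, so production code typically does this in C through a volatile pointer.

```python
import mmap
import os
import struct

MAP_SIZE = 4096      # assumption: the register window fits in one page
REG_CTRL = 0x00      # hypothetical control register offset
REG_STATUS = 0x04    # hypothetical status register offset


def open_regs(path="/dev/uio0", size=MAP_SIZE):
    """Map the first memory region of a UIO device (or any file) read/write."""
    fd = os.open(path, os.O_RDWR | os.O_SYNC)
    try:
        return mmap.mmap(fd, size, access=mmap.ACCESS_WRITE)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


def reg_read(regs, offset):
    """Read one 32-bit little-endian register at the given byte offset."""
    return struct.unpack_from("<I", regs, offset)[0]


def reg_write(regs, offset, value):
    """Write one 32-bit little-endian register at the given byte offset."""
    struct.pack_into("<I", regs, offset, value & 0xFFFFFFFF)
```

On a target with such a device, usage would look like `regs = open_regs()` followed by `reg_write(regs, REG_CTRL, 1)` and polling `reg_read(regs, REG_STATUS)`.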
Performance tuning
- The FPGA is typically clocked at 50 to 200 MHz, which is not much
- The fabric is usually rated at much more!
- Synthesize a PLL, which generates a faster clock
- Clock your design from the PLL

Design tricks
- Use combinatorial logic where possible
- Create pipelined designs and make sure the pipeline is saturated
- Synthesize multiple units and compute in parallel
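A back-of-the-envelope cycle count shows why pipelining and replication pay off. The model below is an idealized assumption, not a measurement: every stage takes exactly one clock, there are no stalls, and input is always available so the pipeline stays saturated. The function names are invented for this sketch.

```python
def unpipelined_cycles(items, stages):
    """Each item occupies the whole datapath for `stages` cycles."""
    return items * stages


def pipelined_cycles(items, stages):
    """First result after `stages` cycles, then one result per cycle."""
    if items == 0:
        return 0
    return stages + items - 1


def parallel_pipelined_cycles(items, stages, units):
    """`units` identical pipelines share the work evenly (ceiling division)."""
    per_unit = -(-items // units)
    return pipelined_cycles(per_unit, stages)


print(unpipelined_cycles(1000, 4))            # 4000
print(pipelined_cycles(1000, 4))              # 1003
print(parallel_pipelined_cycles(1000, 4, 4))  # 253
```

The fill latency (`stages - 1` cycles) is paid once, so it only becomes negligible when the pipeline is fed long, uninterrupted streams of data.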
Altera OpenCL
- OpenCL EP 1.0 implementation for Altera FPGAs
- Easier for SW developers
- Only a thin shim must be ported to the FPGA
- Closed source compiler :-(
- Even needs a license :-C

Conclusion
- FPGAs are strong in parallel, pipelined workloads
- FPGAs give the user almost gate-level performance
- FPGAs are manufactured using bleeding-edge processes
- FPGAs deliver excellent performance per watt
- There is no simple unified Linux interface
- Developing FPGA content can be difficult
- The FPGA ecosystem is still rather closed

The End
Thank you for your attention!
Contact: Marek Vasut <[email protected]>