EEL 4930/5934, Fall 09
December 3-4, 2009
Novo-G: Adaptively Custom Reconfigurable Supercomputer
Dr. Alan D. George, Professor of ECE
University of Florida
Dr. Herman Lam, Assoc. Professor of ECE
University of Florida
Abhijeet Lawande, Carlo Pascoe
Research Assistants
CHREC, University of Florida
High Performance Computing
• Uses supercomputers / distributed computers to solve advanced computation problems
• Where? Computational Fluid Dynamics, Astrophysical Simulations, Climate Modeling, …
• How big? 100s of nodes, 1000s of processors
HPC Marketplace
• HPC practitioners often more reactive than proactive
  • Understandably conservative, risk-averse
  • Looking for quick fixes (not always best approach for long-term)
• Accelerators (e.g., GPU, Cell) popular @ SC09
  • But these consume more (energy) to get more (performance)
  • Performance promising for subset of apps (on fixed-logic spectrum)
  • Productivity a significant challenge (common in Age of Parallelism)
  • Sustainability a major concern (single devices approaching 300W!)
• But better solutions born from better methods
  • Goal: high performance, productivity, & sustainability
  • Change in paradigm, mindset, approach
    • “Every generation needs a new revolution” – Jefferson
  • Smarter device and system architectures
    • Adaptive hardware parallelism, more (performance) with less (energy)
  • Better models & solutions apply more broadly than only HPC
Reconfigurable Computing
• Why RC?
  • Performance (parallelism)
  • Power
  • Price
• So what’s the problem?
  • New computing model: revolutionary, potent, complex
  • Adaptive hardware offers many challenges & opportunities
  • Still relatively new and immature field
  • Many open R&D issues, from prog. model to device arch.
Novo-G Concept
• Goals
  • Investigate, develop, evaluate, & showcase:
    • Most powerful RC machine ever fielded for research
    • Innovative suite of productivity tools for app development
    • Impactful set of scalable kernels/apps in key science areas
  • Project & machine name: Novo-G
    • “Novo” is Latin: "to make anew, refresh, revive, change, alter," essence of RC
    • “G” for Genesis (first of a series of Novo machines) or Green
  • Focus on experimental research challenges of RC spanning HPC to HPEC
• Motivations
  • Design productivity is foremost need/challenge for widespread use of RC
  • Challenges accentuated as scale increases (devices, systems, apps)
  • Powerful experimental testbed to support R&D addressing these challenges
• Emphases
  • Performance (system), Productivity (concepts/tools), Impact (apps)
Novo-G Machine
• Cluster of 24+1 servers (compute + head node)
• 96 Altera Stratix-III E260 FPGAs for app acceleration
  • Each w/ 768 18x18 multipliers, 254K logic elements, 204K registers, power <20W
  • e.g., per E260: 768 integer, 192 SPFP, or 85 DPFP multipliers @ ~300MHz (Altera FPC)
• FPGAs housed in four quad-FPGA PCIe x8 GiDEL boards
  • Embedded-style boards; supports both HPEC- & HPC-oriented research
  • 4.25GB memory attached to each app FPGA, 576GB total RAM in Novo-G
• 24 boards housed in 24 Linux compute servers + head node
  • 20Gb/s non-blocking DDR InfiniBand; Gigabit Ethernet
  • 26 (24+2) quad-core 2.26GHz Intel Nehalem Xeon processors w/ QPI
• Funded by U. Florida w/ generous help from Altera & GiDEL
UPDATE: Novo-G will soon double in RC capacity, growing to 192 top-end FPGAs in 48 quad-FPGA boards
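The 576GB total quoted above can be cross-checked from the per-device figures given on these slides (4.25GB per app FPGA, 6GB per compute server, 24GB on the head node). The sketch below is only an arithmetic check. The constant and function names are illustrative and not part of any Novo-G software.

```python
# Arithmetic check of the "576GB total RAM" figure, using numbers quoted on the slides.
# All names here are illustrative (hypothetical), chosen only for this sketch.
APP_FPGAS = 96                # Stratix-III E260 FPGAs before the planned upgrade
GB_PER_FPGA = 4.25            # 2 x 2GB DDR2 SODIMM + 256MB DDR2 per FPGA
COMPUTE_SERVERS = 24
GB_PER_COMPUTE_SERVER = 6     # DDR3 in each compute server
HEAD_NODE_GB = 24             # DDR3 in the head node


def total_ram_gb() -> float:
    """Sum FPGA-attached memory and host memory across the cluster."""
    fpga_gb = APP_FPGAS * GB_PER_FPGA
    host_gb = COMPUTE_SERVERS * GB_PER_COMPUTE_SERVER + HEAD_NODE_GB
    return fpga_gb + host_gb


if __name__ == "__main__":
    print(f"Total RAM: {total_ram_gb():.0f} GB")
```

FPGA-attached memory accounts for 408GB of the total, which is why it dominates the host memory in this machine.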
Novo-G Machine
1 head-node server with:
• 1U rackmount chassis
• Dual Xeon E5520 quad-core CPUs @ 2.26 GHz, 4MB Cache, 5.86 GT/s QPI
• 24GB ECC DDR3, 1333 MHz
• Integrated dual-GigE ports & video
• ICH10R controller for 6 SATA drives
• 3 x 1TB Enterprise SATA2 drives
KVM/LCD unit for head node
* Our cluster vendor is Ace Computers
24 compute servers, each with:
• 4U rackmount chassis with 645W P/S
• Intel Xeon E5520 quad-core CPU
• 6GB ECC DDR3, 1333 MHz
• Integrated dual-GigE ports & video
• 2 GiDEL ProcStar-III PCIe x8 cards
• Mellanox DDR InfiniBand PCIe card
• 250GB SATA2 drive
Not visible (IB & GigE switches, PDUs)
Altera Stratix-III E260 FPGA
• 254,400 Logic Elements
• 768 multipliers (18×18)
• 14,688 Kbits of embedded memory
• 50% less power than Stratix-II
• 65nm technology
Novo-G ProcStar-III Board (one of 48)
[Figure: GiDEL ProcStar-III quad-FPGA board. Labels: 2×2GB = 4GB DDR2 RAM per FPGA; 256MB DDR2 at each FPGA; 25.6 GB/s inter-FPGA bandwidth; 110 lines bi-directional between FPGAs; PCIe x8 interface (4GB/s); JTAG for SignalTap debug; board dimensions 120.8 mm × 312 mm]
GiDEL ProcStar-III Board
• Typical frequencies: 100-325MHz
• DMA channels: 32
• DDR2 module slots: 8
Novo-G Memory & Connectivity
[Figure: memory and connectivity diagram. Labels: head node (24 GB DDR3); compute nodes (6 GB DDR3 each); GigE and InfiniBand between nodes; PCI-Express x8 from compute node to FPGA board; FPGA1 to FPGA4 on the main bus, each with memory (2x 2GB DDR2 SODIMM and 256 MB DDR2, 667MHz) on its memory bus]
Power Consumption of Each Novo-G Server
[Figure: bar chart of power in Watts (axis 0 to 350), Idle vs. Loaded, for three cases: Server Only, Boards Only, Server+Boards. Data labels in extraction order: 111, 47, 210, 52, 57, 117]
Novo-G Energy (each of 24 servers)
• Smith-Waterman application
• Quad-core E5520 Xeon CPU
• 2 GiDEL ProcStar-III boards
• 8 Stratix-III E260 FPGAs total
• 40GB (17×2+6) DDR2/3 RAM
After capacity doubled, total power of Novo-G @ max. load 8KW
Novo-G Tools
• Commercial and open-source tools
  • Digital design tools: Altera, GiDEL, Aldec, Synopsys
  • Cores and libraries: Altera, GiDEL, et al.
  • High-level device design: Altera FP Compiler, Impulse-C, Mitrion-C, LabVIEW (2010)
  • High-level system design: MPI, UPC, SHMEM
  • Additional options in review (ROCCC, et al.)
• Variety of CHREC tools being ported & used for Novo-G
  • Strategic design & prediction: RCML, RCSE, RAT, CMD
  • High-level system design: SHMEM+, SCF
  • Hardware virtualization for fast PAR: IFET
  • App verification & performance analysis: ReCAP
  • Proposed OpenCL over CHREC-IF
  • Assorted kernel & app cores
Industry Partners
Impulse-C Platform Support Package
Impulse-C
• Allows software written in the Impulse-C programming language to run in Novo-G FPGAs
• H/W – S/W partitioning approach
  • S/W processes compiled to executable using GCC
  • H/W processes converted to synthesizable VHDL/Verilog
Platform Support Package (PSP)
• Provides interface between Impulse-C generated H/W and S/W, customized for Impulse-C application
• Currently supports streams and registers
• Future Work:
  • Provide support for shared memory
  • Extend PSP to support multi-FPGA system
Impulse-C apps on Novo-G
• Smith-Waterman
• Back-projection
• European Option Pricing
[Figure: Impulse-C PSP on a Novo-G node, split into hardware and software sides. Labels: CPU; PCIe x8; Stratix-III E260 @ 125MHz; S/W application; Impulse-C API; Impulse-C code (Generate H/W, Generate S/W); PSP; Impulse Generated H/W; Stream and Register interfaces]
[Figure: Impulse-C build flow. Labels: H/W - S/W partitioned Impulse-C code; H/W process code (VHDL); S/W process code (C); Gidel - Impulse Interface (VHDL), from PSP; S/W interface code (C/C++), from PSP; ProcWizard project (PCAF); Gidel wrapper; Header file; Quartus II Synthesis; Bitfile (rbf); GCC Compile; Compiled API; Executable]
Mitrionics Virtual Processor (MVP)
Mitrion-C apps on Novo-G
• AES app for SC09
  • Fully pipelined
  • Fully unrolled
  • Full performance to the theoretical limit of bandwidth
  • Massively parallel
MVP
• Provides abstraction layer between software and FPGA hardware
• Allows software written in the Mitrion-C programming language to run in Novo-G FPGAs
• Has unique architecture that adapts hardware in FPGAs to each program to maximize its performance
Mitrion on Novo-G
• Operational:
  • Hardware interface
  • Mithal API support for GiDEL
• Currently working on:
  • Performance optimization
  • Additional functionality
• Future work:
  • Expand API support for multiple FPGAs on single Novo-G node
  • Support for all 24 nodes and 192 FPGAs
[Figure: Mitrion on a Novo-G node. Labels: E5520 Nehalem Quad-Core Xeon running the Mitrion Accelerated Application; Mitrion Host Abstraction Layer; PCIe x8; Hardware Interface; Stratix-III E260 @ 125MHz hosting the MVP; 128-bit I/O streams; 2GB Mem (two banks); 4 64-bit In-Regs; 4 64-bit Out-Regs]
Planned apps / app areas
• Bioinformatics
• Information retrieval and search engines
• Database acceleration
Sequence Alignment in Bioinformatics
Smith-Waterman (S-W) is an algorithm used to compute the optimal local sequence alignment of two or more character strings.
Needleman-Wunsch (N-W) is for the computation of the optimal global sequence alignment.
In biology, alignments are performed in search of sequence similarities under the assumption that they imply functional, structural, or evolutionary relationships between sequences and their sources.
Contemporary implementations of optimal sequence alignment (whether global, local, or anything in-between) are based on a computation-intensive dynamic programming algorithm that breaks down the process of alignment into a set of recursive computations.
Sequence Alignment in Bioinformatics
Algorithms involve calculation of optimal alignment for all possible subsequences, then choosing the final sequence alignment from the set of sub-alignments.
Equivalent to populating a score matrix and selecting the appropriate cell based on the type of alignment desired
Example of local alignment (S-W)
Query Sequence = “ACGTATGC”
Database Sequence = “ACGAACCCTTGC”
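The score-matrix view described above can be sketched in a few lines of Python. This is a minimal serial illustration, not the Novo-G kernel: the scoring values (match = 2, mismatch = -1, linear gap = -1) and the function name are assumptions chosen for the example, since the slides do not state the scoring scheme. Only two matrix rows are kept at a time, so the extra memory grows with the length of one sequence rather than the full matrix.

```python
def smith_waterman_score(query, database, match=2, mismatch=-1, gap=-1):
    """Return (best_score, query_end, database_end) for the best local alignment.

    Scoring parameters are illustrative assumptions; the slides do not give them.
    End positions are 1-based indices of the best-scoring matrix cell.
    """
    prev = [0] * (len(database) + 1)   # previous row of the score matrix
    best = (0, 0, 0)
    for i, q in enumerate(query, start=1):
        curr = [0]                     # column 0 is always 0 for local alignment
        for j, d in enumerate(database, start=1):
            diag = prev[j - 1] + (match if q == d else mismatch)
            up = prev[j] + gap         # gap in the database sequence
            left = curr[j - 1] + gap   # gap in the query sequence
            score = max(0, diag, up, left)   # clamping at 0 makes the alignment local
            curr.append(score)
            if score > best[0]:
                best = (score, i, j)
        prev = curr
    return best


if __name__ == "__main__":
    # Sequences from the slide's local-alignment example.
    print(smith_waterman_score("ACGTATGC", "ACGAACCCTTGC"))
```

Dropping the clamp at 0 and reading the bottom-right cell instead of the maximum cell turns the same recurrence into a global (N-W style) alignment, which is the sense in which the two algorithms differ only in cell selection.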
Sequence Alignment in Bioinformatics
For two sequences of length A and B, optimum alignment requires the calculation of A∙B scores, with serial implementations operating in O(A∙B) time and O(min{A,B}) space complexity.
As amount of sequence data grows exponentially, the need for faster sequence alignment has fuelled the development of hardware accelerators.
Hardware Approach:
Novo-G Apps: Smith-Waterman (S-W)
• First completed app: S-W
  • Kernel for use in Bio Apps
  • Locally/Optimally align DNA, RNA, or protein sequences
  • Identify regions of similarity; dominant & vital app. in comp. biology
  • Optimal alignment ideal but often replaced with much faster heuristics
• Design: systolic array spanning 4 FPGAs per board
  • 512 PE/FPGA, 2048 PE/board, 1 board/server, 125MHz, see app-note
Execution Time of Serial Baseline on Single 2.4 GHz Opteron Core = 743,460 Seconds (≈8.6 Days)

Novo-G Nodes   Execution Time (Sec)   Speedup vs. Single Core
1              279                    2665
4              70.4                   10561
8              35.6                   20884
12             24.2                   30721
16             18.2                   40849
24(E)          12.38886               60053
Novo-G achieves in ~12 seconds what takes a fast CPU core nearly 9 days!
Speed of S-W on Novo-G comparable to two largest machines on NSF TeraGrid
After our 2x upgrade, fast as both combined!
Yet, Novo-G is 100s of times lower in energy, cooling, cost, size, weight, etc. than TeraGrid
Future Plans:
• Use S-W Kernel in SHRiMP application as replacement for BLAST heuristic
Execution times for 34MB chromosome sequence aligned with 16K 128-character sequences
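The speedup figures in the S-W table follow directly from dividing the serial baseline by each Novo-G execution time. The sketch below reproduces that arithmetic for the measured node counts and also derives parallel efficiency relative to one node, a quantity the slides do not report. The 24(E) entry is left out, and all names are illustrative.

```python
# Reproduce the S-W speedup column from the published times (measured rows only).
BASELINE_S = 743_460          # serial baseline, single 2.4 GHz Opteron core
NOVO_G_TIMES_S = {1: 279, 4: 70.4, 8: 35.6, 12: 24.2, 16: 18.2}


def speedup(nodes: int) -> float:
    """Speedup of Novo-G on `nodes` nodes over the single-core baseline."""
    return BASELINE_S / NOVO_G_TIMES_S[nodes]


def efficiency(nodes: int) -> float:
    """Parallel efficiency relative to the 1-node run (not reported on the slides)."""
    return NOVO_G_TIMES_S[1] / (nodes * NOVO_G_TIMES_S[nodes])


if __name__ == "__main__":
    for n in NOVO_G_TIMES_S:
        print(f"{n:2d} nodes: speedup {speedup(n):8.0f}, efficiency {efficiency(n):.3f}")
```

Under these numbers the efficiency stays above 0.95 through 16 nodes, which is consistent with the near-linear scaling the table suggests.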
Novo-G Apps: Needleman-Wunsch (N-W)
Execution Time of needledist Baseline on Single 2.26GHz Intel Nehalem Core = 55,200 Seconds (≈ 15.3 hours)

Novo-G FPGAs   Execution Time (Sec)   Speedup vs. Single Core
1              50.6                   1091
2              25.9                   2131
4              14.7                   3755
8              8.39                   6579
96                                    78951
192                                   157902

Execution times for distance calculation of 16,777,216 pairs of length 250.
Note: Speedups for 96 and 192 FPGAs are extrapolated values, obtainable with larger data sets.
• N-W Kernel for use in ICBR’s ESPRIT application
  • Globally/Optimally aligns DNA sequences, then computes edit distance
  • Edit distances used to group sequences into operational taxonomic units (OTU)
  • OTUs grouped into tree; tree represents species richness and taxonomy
• Design: systolic array of PEs with I/O FIFOs for streaming
  • Current Design: 250 PEs/FPGA, 1000 PEs/board, 2 boards/server, 125MHz
  • Design only consumes 68% of chip; number of PEs will be increased
  • Compared to S-W, overall design simpler in terms of control signals but N-W PEs vastly more complex
  • Uses special encoding scheme that allows N-W to approach S-W performance
Future Plans:
• Fully integrate N-W Kernel into ESPRIT and create Web App for use by scientific community
USED IN ACTUAL METAGENOMICS RESEARCH!
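The "align globally, then compute edit distance" step can be illustrated with a serial dynamic-programming sketch. It assumes unit costs for substitution, insertion, and deletion (plain Levenshtein distance). The actual scoring and the special encoding used by the Novo-G N-W kernel are not given on the slides, so this is a stand-in rather than the kernel's method, and the function name is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Global-alignment edit distance with unit costs (an assumed cost model).

    Keeps a single previous row, so extra space is O(min(len(a), len(b))).
    """
    if len(a) < len(b):
        a, b = b, a                        # make b the shorter sequence
    prev = list(range(len(b) + 1))         # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                         # distance from a[:i] to empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]                        # bottom-right cell: global alignment


if __name__ == "__main__":
    print(edit_distance("ACGTATGC", "ACGAACCCTTGC"))
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal are independent. That independence is what a systolic array of PEs exploits.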
Novo-G Apps: Real-Time Adaptive Filtering
• Use of ITL optimization in real-time adaptive filtering
  • Filter weights change with every sample through feedback by optimizing value of cost function
  • Current filters minimize mean squared error (MSE) cost functions; ITL cost functions minimize error entropy (EE), yielding better results.
Minimizing EE equivalent to calculating gradient of information potential (IP):
$$\mathrm{IP} \propto \sum_{i} \sum_{j} \exp\!\left(-(e_i - e_j)^2\right)$$
Increasing IP window size (i, j) results in smoother and faster convergence but computation increases as O(n²)
• Novo-G implementation advantages
  • SW implementation (MATLAB) cannot operate in RT for large window size
  • All summed exponential terms independent; HW can compute in parallel
  • First design iteration can compute window size up to (50, 50) on 1 FPGA in single clock cycle
  • Clock rate/speedup limited by sample frequency for RT filtering
Maximum Sample Frequency (kHz)

Window size   Matlab   1 FPGA     Speedup
5             12.99    11,764.7   906
40            4.35     10,204.1   2346
50            3.03     9,803.9    3236
Sample frequencies based on simulated time to calculate a single weight (computation of IP). Does not include FPGA transfer time.
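The O(n²) cost of the information potential comes from summing one exponential term per pair of error samples in the window. The sketch below makes that explicit. It assumes a Gaussian kernel with width `sigma` and a 1/n² normalization, which are common choices in the ITL literature but are not confirmed by the slide, whose exact constants did not survive extraction. The function name is hypothetical.

```python
import math


def information_potential(errors, sigma=1.0):
    """Pairwise-kernel estimate of the information potential over one window.

    Assumptions (not from the slides): Gaussian kernel of width `sigma`,
    normalized by n^2. Every term in the double sum is independent of the
    others, which is what lets hardware evaluate them in parallel.
    """
    n = len(errors)
    total = 0.0
    for e_i in errors:              # n iterations
        for e_j in errors:          # n iterations each: n^2 exponentials in total
            total += math.exp(-((e_i - e_j) ** 2) / (2.0 * sigma ** 2))
    return total / (n * n)


if __name__ == "__main__":
    window = [0.10, -0.05, 0.02, 0.07, -0.01]   # made-up error samples
    print(information_potential(window))
```

A window of 50 samples needs 2,500 exponentials per weight update in this form, which illustrates why the serial MATLAB version falls behind as the window grows.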
Novo-G Apps: Filtered Back-Projection (FBP)
• BP for use in CT image reconstruction
  • 2D object is reconstructed from several 1-D projections
  • Projections obtained by bombarding object with X-ray beam from multiple angles
  • Each pixel on projected image represents total absorption of X-ray along path from source to detector
  • Mathematically, transformation from projection-space into Cartesian coordinates
Design: 512 pipelined processing engines per FPGA
Design       CPU (4 cores)   4 FPGA (VHDL)   4 FPGA (Impulse C)
Total Time   2300 ms         6.81 ms         8.8 ms
Speedup      -               338X            261X
Embarrassingly parallel w.r.t. computation of each pixel as well as projections for each
Processing engines iterate over all pixels and compute partial sum; final image formed in software
Software time complexity is O(n³); hardware design reduces complexity to O(n²)
H/W implementation uses 16-bit fixed point arithmetic; results are visually indistinguishable from DPFP
S/W baseline: C code executed with fixed point on Intel E5520 Nehalem Quad Core Xeon @ 2.26GHz
• Design implemented in both Impulse-C and VHDL to compare performance and productivity
• Performance loss of 1.29x but estimated productivity gain considerably greater
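The O(n³) software cost can be seen in a plain serial back-projection loop: for each projection angle, every pixel accumulates the detector sample that its ray maps to. The sketch below is an illustration under stated assumptions and not the Novo-G design. It assumes parallel-beam geometry and nearest-neighbour detector lookup, and it omits the filtering step, so it performs unfiltered back-projection. The function name and argument layout are hypothetical.

```python
import math


def back_project(sinogram, angles_deg, size):
    """Unfiltered parallel-beam back-projection onto a size x size image.

    sinogram[k][t] is projection k at detector bin t (assumed layout).
    The three nested loops (angles x rows x columns) give the O(n^3) cost.
    """
    n_bins = len(sinogram[0])
    center = (size - 1) / 2.0            # image centre in pixel coordinates
    det_center = (n_bins - 1) / 2.0      # detector centre in bin coordinates
    image = [[0.0] * size for _ in range(size)]
    for projection, angle in zip(sinogram, angles_deg):
        c = math.cos(math.radians(angle))
        s = math.sin(math.radians(angle))
        for y in range(size):
            for x in range(size):
                # Detector coordinate hit by the ray through pixel (x, y).
                t = (x - center) * c + (y - center) * s + det_center
                b = int(round(t))        # nearest-neighbour bin
                if 0 <= b < n_bins:
                    image[y][x] += projection[b]   # partial sum for this pixel
    return image


if __name__ == "__main__":
    for row in back_project([[1, 2, 3], [10, 20, 30]], [0, 90], 3):
        print(row)
```

Every pixel's partial sum is independent of every other pixel's, and so is each angle's contribution to a pixel. Parallel processing engines can exploit that independence along one loop dimension, which is one way to read the slide's reduction from O(n³) to O(n²).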