
EEL 4930/5934, Fall 09

December 3-4, 2009

Novo-G: Adaptively Custom Reconfigurable Supercomputer

Dr. Alan D. George, Professor of ECE

University of Florida

Dr. Herman Lam Assoc. Professor of ECE

University of Florida

Abhijeet Lawande and Carlo Pascoe, Research Assistants

CHREC, University of Florida

High Performance Computing

Uses supercomputers / distributed computers to solve advanced computation problems

Where?
• Computational Fluid Dynamics
• Astrophysical Simulations
• Climate Modeling
• …

How big? 100s of nodes, 1000s of processors

HPC Marketplace

HPC practitioners often more reactive than proactive
• Understandably conservative, risk-averse
• Looking for quick fixes (not always best approach for long-term)

Accelerators (e.g. GPU, Cell) popular @ SC09
• But these consume more (energy) to get more (performance)
• Performance promising for subset of apps (on fixed-logic spectrum)
• Productivity a significant challenge (common in Age of Parallelism)
• Sustainability a major concern (single devices approaching 300W!)

But better solutions born from better methods
• Goal: high performance, productivity, & sustainability
• Change in paradigm, mindset, approach
• “Every generation needs a new revolution” – Jefferson

Smarter device and system architectures
• Adaptive hardware parallelism, more (performance) with less (energy)
• Better models & solutions apply more broadly than only HPC

Reconfigurable Computing

Why RC?
• Performance (parallelism)
• Power
• Price

So what’s the problem?
• New computing model: revolutionary, potent, complex
• Adaptive hardware offers many challenges & opportunities
• Still relatively new and immature field
• Many open R&D issues, from prog. model to device arch.

Novo-G Concept

Goals
• Investigate, develop, evaluate, & showcase:
  • Most powerful RC machine ever fielded for research
  • Innovative suite of productivity tools for app development
  • Impactful set of scalable kernels/apps in key science areas
• Project & machine name: Novo-G
  • “Novo” is Latin: "to make anew, refresh, revive, change, alter," essence of RC
  • “G” for Genesis (first of a series of Novo machines) or Green
• Focus on experimental research challenges of RC spanning HPC to HPEC

Motivations
• Design productivity is foremost need/challenge for widespread use of RC
• Challenges accentuated as scale increases (devices, systems, apps)
• Powerful experimental testbed to support R&D addressing these challenges

Emphases
• Performance (system), Productivity (concepts/tools), Impact (apps)

Novo-G Machine

Cluster of 24+1 servers (compute + head node)
• 96 Altera Stratix-III E260 FPGAs for app acceleration
  • Each w/ 768 18x18 multipliers, 254K logic elements, 204K registers, power <20W
  • e.g. per E260: 768 integer, 192 SPFP, or 85 DPFP multipliers @ ~300MHz (Altera FPC)
• FPGAs housed in 24 quad-FPGA PCIe x8 GiDEL boards
  • Embedded-style boards; supports both HPEC- & HPC-oriented research
• 4.25GB memory attached to each app FPGA, 576GB total RAM in Novo-G
• 24 boards housed in 24 Linux compute servers + head node
  • 20Gb/s non-blocking DDR InfiniBand; Gigabit Ethernet
• 26 (24+2) quad-core 2.26GHz Intel Nehalem Xeon processors w/ QPI

Funded by U. Florida w/ generous help from Altera & GiDEL

UPDATE: Novo-G will soon double in RC capacity, growing to 192 top-end FPGAs in 48 quad-FPGA boards
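The 576GB figure can be cross-checked against the per-component numbers quoted elsewhere in these slides. The sketch below is one plausible breakdown, not something the slides spell out: it assumes the total is FPGA-attached memory plus the DDR3 in the 24 compute servers (6GB each) and the head node (24GB). All constant names are illustrative.

```python
# Figures quoted on the slides (pre-upgrade configuration).
BOARDS = 24                  # quad-FPGA boards, one per compute server
FPGAS_PER_BOARD = 4
COMPUTE_SERVERS = 24
GB_PER_FPGA = 4.25           # memory attached to each app FPGA
GB_PER_COMPUTE_SERVER = 6    # DDR3 per compute server
GB_HEAD_NODE = 24            # DDR3 in the head node
MULTIPLIERS_PER_FPGA = 768   # 18x18 multipliers per Stratix-III E260

fpgas = BOARDS * FPGAS_PER_BOARD
fpga_memory_gb = fpgas * GB_PER_FPGA
# Assumption: host memory counts toward the quoted system total.
host_memory_gb = COMPUTE_SERVERS * GB_PER_COMPUTE_SERVER + GB_HEAD_NODE
total_memory_gb = fpga_memory_gb + host_memory_gb
total_multipliers = fpgas * MULTIPLIERS_PER_FPGA

print(f"FPGAs: {fpgas}")
print(f"FPGA memory: {fpga_memory_gb} GB, host memory: {host_memory_gb} GB")
print(f"Total RAM: {total_memory_gb} GB")
print(f"18x18 multipliers system-wide: {total_multipliers}")
```

Under these assumptions the breakdown lands on the quoted 96 FPGAs and 576GB, which suggests the slide's total includes host DDR3 as well as FPGA-attached DDR2.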

Novo-G Machine

1 head-node server with:
• 1U rackmount chassis
• Dual Xeon E5520 quad-core CPUs @ 2.26 GHz, 4MB Cache, 5.86 GT/s QPI
• 24GB ECC DDR3, 1333 MHz
• Integrated dual-GigE ports & video
• ICH10R controller for 6 SATA drives
• 3 x 1TB Enterprise SATA2 drives

KVM/LCD unit for head node

* Our cluster vendor is Ace Computers

24 compute servers, each with:
• 4U rackmount chassis with 645W P/S
• Intel Xeon E5520 quad-core CPU
• 6GB ECC DDR3, 1333 MHz
• Integrated dual-GigE ports & video
• 2 GiDEL ProcStar-III PCIe x8 cards
• Mellanox DDR InfiniBand PCIe card
• 250GB SATA2 drive

Not visible (IB & GigE switches, PDUs)

Altera Stratix-III E260 FPGA
• 254,400 Logic Elements
• 768 multipliers (18×18)
• 14,688 Kbits of embedded memory
• 50% less power than Stratix-II
• 65nm technology

Novo-G ProcStar-III Board (one of 48)

[Board diagram: four FPGAs, each with 2×2GB = 4GB DDR2 RAM plus 256MB DDR2; 110 bi-directional lines per inter-FPGA link; 25.6 GB/s inter-FPGA bandwidth; PCIe x8 interface (4GB/s); JTAG for SignalTap debug; board dimensions 312 mm × 120.8 mm]

GiDEL ProcStar-III Board
• Typical frequencies: 100-325MHz
• DMA channels: 32
• DDR2 module slots: 8

Novo-G Memory & Connectivity

[Diagram: head node (24 GB DDR3) and compute nodes (6 GB DDR3 each) linked by GigE and InfiniBand; each compute node connects to its FPGA board over PCI-Express x8; FPGA1 through FPGA4 sit on a main bus, and each FPGA has 2x 2GB DDR2 SODIMM and 256 MB DDR2 (667MHz) on its memory bus]

Power Consumption of Each Novo-G Server

[Bar chart: power in watts (axis 0 to 350) for Server Only, Boards Only, and Server+Boards, each shown Loaded and Idle; data labels in extraction order: 111, 47, 210, 52, 57, 117]

Novo-G Energy (each of 24 servers)

Smith-Waterman application
• Quad-core E5520 Xeon CPU
• 2 GiDEL ProcStar-III boards
• 8 Stratix-III E260 FPGAs total
• 40GB (17×2+6) DDR2/3 RAM

After capacity doubled, total power of Novo-G @ max. load 8kW

Novo-G Tools

Commercial and open-source tools
• Digital design tools: Altera, GiDEL, Aldec, Synopsys
• Cores and libraries: Altera, GiDEL, et al.
• High-level device design: Altera FP Compiler, Impulse-C, Mitrion-C, LabVIEW (2010)
• High-level system design: MPI, UPC, SHMEM
• Additional options in review (ROCCC, et al.)

Variety of CHREC tools being ported & used for Novo-G
• Strategic design & prediction: RCML, RCSE, RAT, CMD
• High-level system design: SHMEM+, SCF
• Hardware virtualization for fast PAR: IFET
• App verification & performance analysis: ReCAP
• Proposed OpenCL over CHREC-IF
• Assorted kernel & app cores

Industry Partners

Impulse-C Platform Support Package

Impulse-C
• Allows software written in Impulse-C programming language to run in Novo-G FPGAs
• H/W – S/W partitioning approach
  • S/W processes compiled to executable using GCC
  • H/W processes converted to synthesizable VHDL/Verilog

Platform Support Package (PSP)
• Provides interface between Impulse-C generated H/W and S/W customized for Impulse-C application
• Currently supports streams and registers
• Future work:
  • Provide support for shared memory
  • Extend PSP to support multi-FPGA system

Impulse-C apps on Novo-G
• Smith-Waterman
• Back-projection
• European Option Pricing

[Diagram: Novo-G node running an Impulse-C application. On the CPU, the S/W application uses the Impulse-C API and the software side of the Impulse-C PSP; across PCIe x8, on a Stratix-III E260 @ 125MHz, the hardware side of the PSP exposes streams and registers to the Impulse-generated H/W.]

[Flow diagram: H/W - S/W partitioned Impulse-C code goes through Generate H/W and Generate S/W. Elements on the hardware path: H/W process code (VHDL), Gidel - Impulse interface (VHDL) from PSP, ProcWizard project (PCAF), Gidel wrapper, Quartus II synthesis, bitfile (rbf). Elements on the software path: S/W process code (C), S/W interface code (C/C++) from PSP, header file, GCC compile, compiled API, executable.]

Mitrionics Virtual Processor (MVP)

Mitrion-C apps on Novo-G
• AES app for SC09
  • Fully pipelined
  • Fully unrolled
  • Full performance to the theoretical limit of bandwidth

Massively parallel MVP
• Provides abstraction layer between software and FPGA hardware
• Allows software written in Mitrion-C programming language to run in Novo-G FPGAs
• Has unique architecture that adapts hardware in FPGAs to each program to maximize its performance

Mitrion on Novo-G

Operational:
• Hardware interface
• Mithal API support for GiDEL

Currently working on:
• Performance optimization
• Additional functionality

Future work:
• Expand API support for multiple FPGAs on single Novo-G node
• Support for all 24 nodes and 192 FPGAs

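[Diagram: Novo-G node. A Mitrion accelerated application on the E5520 Nehalem quad-core Xeon calls the Mitrion Host Abstraction Layer, which reaches the hardware interface over PCIe x8. On the Stratix-III E260 @ 125MHz, the hardware interface connects to the MVP through 128-bit I/O streams, 4 64-bit in-regs, and 4 64-bit out-regs, with two 2GB memories attached.]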

Planned apps / app areas
• Bioinformatics
• Information retrieval and search engines
• Database acceleration

Sequence Alignment in Bioinformatics

Smith-Waterman (S-W) is an algorithm used to compute the optimal local sequence alignment of two character strings.

Needleman-Wunsch (N-W) is for the computation of the optimal global sequence alignment.

In biology, alignments are performed in search of sequence similarities under the assumption that they imply functional, structural, or evolutionary relationships between sequences and their sources.

Contemporary implementations of optimal sequence alignment (whether global, local, or anything in-between) are based on a computation-intensive dynamic programming algorithm that breaks down the process of alignment into a set of recursive computations.

Sequence Alignment in Bioinformatics

Algorithms involve calculation of optimal alignment for all possible subsequences, then choosing the final sequence alignment from the set of sub-alignments.

Equivalent to populating a score matrix and selecting the appropriate cell based on the type of alignment desired

Example of local alignment (S-W)

Query Sequence = “ACGTATGC”

Database Sequence = “ACGAACCCTTGC”
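To make the score-matrix view concrete, here is a minimal sketch of S-W scoring for the query and database sequences above. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration; the slides do not state which scoring scheme Novo-G uses, and the function name is hypothetical.

```python
def smith_waterman(query, database, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix.

    Returns (matrix, best score, (row, col) of the best cell).
    Scoring parameters are illustrative assumptions.
    """
    rows, cols = len(query) + 1, len(database) + 1
    H = [[0] * cols for _ in range(rows)]   # row 0 and column 0 stay at zero
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == database[j - 1] else mismatch
            H[i][j] = max(
                0,                      # local alignment: never go negative
                H[i - 1][j - 1] + sub,  # align the two characters
                H[i - 1][j] + gap,      # gap in the database sequence
                H[i][j - 1] + gap,      # gap in the query sequence
            )
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return H, best, best_cell


H, best, cell = smith_waterman("ACGTATGC", "ACGAACCCTTGC")
print("best local score:", best, "at cell", cell)
```

For local alignment the cell selected is the maximum anywhere in the matrix; a global (N-W) alignment would instead read the bottom-right cell and would not clamp scores at zero.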

Sequence Alignment in Bioinformatics

For two sequences of length A and B, optimum alignment requires the calculation of A∙B scores, with serial implementations operating in O(A∙B) time and O(min{A,B}) space complexity.

As amount of sequence data grows exponentially, the need for faster sequence alignment has fuelled the development of hardware accelerators.

Hardware Approach:


Novo-G Apps: Smith-Waterman (S-W)

First completed app: S-W Kernel for use in Bio Apps
• Locally/optimally align DNA, RNA, or protein sequences
• Identify regions of similarity; dominant & vital app. in comp. biology
• Optimal alignment ideal but often replaced with much faster heuristics

Design: systolic array spanning 4 FPGAs per board
• 512 PE/FPGA, 2048 PE/board, 1 board/server, 125MHz, see app-note
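A systolic array suits S-W because every cell on one anti-diagonal of the score matrix depends only on the two previous anti-diagonals, so all of those cells can be computed at once. The software sketch below walks the matrix in that wavefront order and counts the steps. It is an illustration of the general idea only, not the Novo-G design: the scoring values and the function name are assumptions, and a real array would also be limited by its PE count.

```python
def sw_wavefront(query, database, match=2, mismatch=-1, gap=-1):
    """Score S-W one anti-diagonal at a time.

    Returns (best score, number of wavefront steps).
    """
    A, B = len(query), len(database)
    H = [[0] * (B + 1) for _ in range(A + 1)]
    best, steps = 0, 0
    # Cell (i, j) sits on anti-diagonal t = i + j.
    for t in range(2, A + B + 1):
        steps += 1
        # In hardware, each i below would be its own PE (holding query[i-1])
        # working in the same clock cycle as the others.
        for i in range(max(1, t - B), min(A, t - 1) + 1):
            j = t - i
            sub = match if query[i - 1] == database[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best, steps


best, steps = sw_wavefront("ACGTATGC", "ACGAACCCTTGC")
print("best score:", best, "wavefront steps:", steps, "cells:", 8 * 12)
```

With enough PEs to cover the query, the number of steps grows as A + B - 1 rather than A∙B, which is where the hardware gains over a serial loop.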

Execution time of serial baseline on single 2.4 GHz Opteron core = 743,460 seconds (≈8.6 days)

Number of Novo-G nodes in execution    1      4       8       12      16      24(E)
Execution time (sec) of Novo-G         279    70.4    35.6    24.2    18.2    12.38886
Novo-G speedup vs. single core         2665   10561   20884   30721   40849   60053

Novo-G achieves in ~12 seconds what takes a fast CPU core nearly 9 days!

Speed of S-W on Novo-G comparable to two largest machines on NSF TeraGrid

After our 2x upgrade, fast as both combined!

Yet, Novo-G is 100s of times lower in energy, cooling, cost, size, weight, etc. than TeraGrid

Future Plans:
• Use S-W Kernel in SHRiMP application as replacement for BLAST heuristic

Execution times for 34MB chromosome sequence aligned with 16K 128-character sequences
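The speedup row can be reproduced by dividing the serial baseline by each measured execution time. The sketch below does that for the measured node counts and also reports scaling efficiency relative to the 1-node run; the efficiency metric is an addition for illustration and does not appear on the slide.

```python
BASELINE_SECONDS = 743_460  # serial baseline quoted on the slide

# Measured runs only: number of Novo-G nodes -> execution time in seconds.
measured = {1: 279, 4: 70.4, 8: 35.6, 12: 24.2, 16: 18.2}


def speedup(seconds, baseline=BASELINE_SECONDS):
    """Speedup over the single-core baseline."""
    return baseline / seconds


def efficiency(nodes, seconds, one_node_seconds=measured[1]):
    """Fraction of ideal linear scaling relative to the 1-node run."""
    return one_node_seconds / (nodes * seconds)


print(f"baseline = {BASELINE_SECONDS / 86_400:.1f} days")
for nodes, seconds in measured.items():
    print(f"{nodes:>2} nodes: speedup {round(speedup(seconds)):>6}, "
          f"efficiency {efficiency(nodes, seconds):.2f}")
```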

Novo-G Apps: Needleman-Wunsch (N-W)

Execution time of needledist baseline on single 2.26GHz Intel Nehalem core = 55,200 seconds (≈15.3 hours)

Number of Novo-G FPGAs in execution    1      2      4      8      96      192
Execution time (sec) of Novo-G         50.6   25.9   14.7   8.39
Novo-G speedup vs. single core         1091   2131   3755   6579   78951   157902

Execution times for distance calculation of 16,777,216 pairs of length 250. Note: Red cells (the 96- and 192-FPGA columns) are extrapolated values, obtainable with larger data sets.
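The measured speedups follow from dividing the baseline by each execution time. The two extrapolated entries are numerically consistent with scaling the 8-FPGA speedup linearly to 96 and 192 FPGAs; that reading is an inference from the numbers, not a method the slide states. A short sketch under that assumption:

```python
BASELINE_SECONDS = 55_200  # needledist baseline quoted on the slide

# Measured runs: number of FPGAs -> execution time in seconds.
measured = {1: 50.6, 2: 25.9, 4: 14.7, 8: 8.39}


def speedup(seconds):
    """Speedup over the single-core baseline."""
    return BASELINE_SECONDS / seconds


def extrapolate(fpgas, ref_fpgas=8):
    """Assume speedup scales linearly from the largest measured run."""
    return speedup(measured[ref_fpgas]) * fpgas / ref_fpgas


for fpgas, seconds in measured.items():
    print(f"{fpgas:>3} FPGAs: measured speedup {round(speedup(seconds))}")
for fpgas in (96, 192):
    print(f"{fpgas:>3} FPGAs: extrapolated speedup {round(extrapolate(fpgas))}")
```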

N-W Kernel for use in ICBR’s ESPRIT application
• Globally/optimally aligns DNA sequences, then computes edit distance
• Edit distances used to group sequences into operational taxonomic units (OTUs)
• OTUs grouped into tree; tree represents species richness and taxonomy

Design: systolic array of PEs with I/O FIFOs for streaming
• Current design: 250 PEs/FPGA, 1000 PEs/board, 2 boards/server, 125MHz
• Design only consumes 68% of chip; number of PEs will be increased
• Compared to S-W, overall design simpler in terms of control signals, but N-W PEs vastly more complex
• Uses special encoding scheme that allows N-W to approach S-W performance

Future Plans:
• Fully integrate N-W kernel into ESPRIT and create web app for use by scientific community

USED IN ACTUAL METAGENOMICS RESEARCH!
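The edit distance that the N-W kernel produces can be illustrated with the classic unit-cost global-alignment recurrence. This is a minimal sketch: the actual scoring used by ESPRIT and the special encoding scheme mentioned above are not described in the slides, so unit costs and the function name are assumptions. Only two rows are kept, which matches the O(min{A,B}) space bound noted earlier.

```python
def edit_distance(a, b):
    """Unit-cost edit distance via global-alignment dynamic programming."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter sequence as the row
    prev = list(range(len(b) + 1))       # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # distance to empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete ca
                curr[j - 1] + 1,         # insert cb
                prev[j - 1] + (ca != cb) # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGTATGC", "ACGAACCCTTGC"))
```

Pairwise distances of this kind are what a clustering step would consume when grouping sequences into OTUs.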

Novo-G Apps: Real-Time Adaptive Filtering

Use of ITL optimization in real-time adaptive filtering
• Filter weights change with every sample through feedback by optimizing value of cost function
• Current filters minimize mean squared error (MSE) cost functions; ITL cost functions minimize error entropy (EE), yielding better results
• Minimizing EE equivalent to calculating gradient of information potential (IP), a double sum of exponential terms over pairs of error samples $e_j$, $e_i$:

  $\sum_i \sum_j \exp(\ldots e_j \ldots e_i \ldots)$

• Increasing IP window size (i, j) results in smoother and faster convergence, but computation increases as O(n^2)

Novo-G implementation advantages
• SW implementation (MATLAB) can't operate in RT for large window size
• All summed exponential terms independent; HW can compute in parallel
• First design iteration can compute window size up to (50, 50) on 1 FPGA in single clock cycle
• Clock rate/speedup limited by sample frequency for RT filtering

Maximum sample frequency (kHz)

Window size    5          40         50
MATLAB         12.99      4.35       3.03
1 FPGA         11,764.7   10,204.1   9,803.9
Speedup        906        2346       3236

Sample frequencies based on simulated time to calculate a single weight (computation of IP). Does not include FPGA transfer time.
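To show why the work grows as O(n^2) and why the terms parallelize, here is a minimal sketch of an information-potential calculation over a window of error samples. It assumes the Gaussian-kernel form commonly used in ITL, with an unnormalized kernel and a kernel width `sigma`; neither the kernel nor its width is given in the slides, and the function name is hypothetical.

```python
import math


def information_potential(errors, sigma=1.0):
    """Mean of exp(-(e_i - e_j)^2 / (2 sigma^2)) over all sample pairs.

    Assumed Gaussian-kernel form; sigma is an illustrative kernel width.
    """
    n = len(errors)
    total = 0.0
    # n * n independent terms: each could be evaluated by its own hardware unit.
    for e_i in errors:
        for e_j in errors:
            d = e_i - e_j
            total += math.exp(-(d * d) / (2.0 * sigma * sigma))
    return total / (n * n)


window = [0.30, -0.10, 0.05, 0.20, -0.25]
print("IP:", information_potential(window))
print("terms for window of 50:", 50 * 50)
```

Tightly clustered errors give an IP close to 1 and widely spread errors give a lower value, so driving the IP up corresponds to concentrating the error distribution.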


Novo-G Apps: Filtered Back-Projection (FBP)

BP for use in CT image reconstruction
• 2-D object is reconstructed from several 1-D projections
• Projections obtained by bombarding object with X-ray beam from multiple angles
• Each pixel on projected image represents total absorption of X-ray along path from source to detector
• Mathematically, transformation from projection-space into Cartesian coordinates

Design: 512 pipelined processing engines per FPGA

Design        CPU (4 cores)   4 FPGA (VHDL)   4 FPGA (Impulse C)
Total time    2300 ms         6.81 ms         8.8 ms
Speedup       -               338X            261X

• Embarrassingly parallel w.r.t. computation of each pixel as well as projections for each
• Processing engines iterate over all pixels and compute partial sum; final image formed in software
• Software time complexity is O(n^3); hardware design reduces complexity to O(n^2)
• H/W implementation uses 16-bit fixed-point arithmetic; results are visually indistinguishable from DPFP
• S/W baseline: C code executed with fixed point on Intel E5520 Nehalem quad-core Xeon @ 2.26GHz
• Design implemented in both Impulse-C and VHDL to compare performance and productivity
• Performance loss of 1.29x but estimated productivity gain considerably greater
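The back-projection step described above can be sketched in software as a triple loop over pixel rows, pixel columns, and projection angles, which is where the O(n^3) software cost comes from. This is a simplified illustration with assumed conventions: parallel-beam geometry, nearest-bin lookup, floating point rather than the 16-bit fixed point of the hardware, and no filtering step. The function name and argument layout are hypothetical.

```python
import math


def back_project(sinogram, angles, n):
    """Unfiltered parallel-beam back-projection onto an n x n image.

    sinogram[k][t] is the projection value at detector bin t for
    angle angles[k] (radians).
    """
    bins = len(sinogram[0])
    centre = (n - 1) / 2.0
    det_centre = (bins - 1) / 2.0
    image = [[0.0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            x, y = col - centre, centre - row   # pixel position about image centre
            acc = 0.0
            # Each angle contributes independently: a partial sum per projection.
            for k, theta in enumerate(angles):
                t = int(round(x * math.cos(theta) + y * math.sin(theta) + det_centre))
                if 0 <= t < bins:
                    acc += sinogram[k][t]
            image[row][col] = acc
    return image


# Usage: 8 angles, a 5 x 5 image, and a sinogram that is bright only at the centre bin.
angles = [k * math.pi / 8 for k in range(8)]
sino = [[1.0 if t == 4 else 0.0 for t in range(9)] for _ in angles]
img = back_project(sino, angles, 5)
print(img[2][2], img[0][0])
```

Because every pixel and every angle is independent, the inner accumulation maps naturally onto many parallel processing engines, each producing partial sums that are combined afterwards.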