
A Scalable FPGA-based Multiprocessor

Arun Patel¹, Christopher A. Madill²,³, Manuel Saldaña¹, Christopher Comis¹, Régis Pomès²,³, Paul Chow¹

Presented By: Arun Patel and Christopher Madill

{apatel, cmadill}@eecg.toronto.edu

IEEE 2006 Conference on Field-Programmable Custom Computing Machines

Napa Valley, California
April 25th, 2006

¹ Department of Electrical and Computer Engineering, University of Toronto
² Department of Structural Biology and Biochemistry, The Hospital for Sick Children
³ Department of Biochemistry, University of Toronto


Introduction

– FPGAs can accelerate many computing tasks by up to 2 or 3 orders of magnitude

– Supercomputers and computing clusters have been designed to improve computing performance.

– Our work focuses on developing a powerful computing cluster based on a scalable network of FPGAs

– Initial design will be tailored for performing Molecular Dynamics simulations


Molecular Dynamics

– Combines empirical force calculations with Newton’s equations of motion.

– Predict the time trajectory of small atomic systems.

– Computationally demanding.

1. Calculate interatomic forces.

2. Calculate the net force.

3. Integrate Newton’s equations of motion.

$$a = F\,m^{-1}$$

$$r(t + \delta t) = r(t) + \delta t\,v(t) + 0.5\,\delta t^{2}\,a(t)$$

$$v(t + \delta t) = v(t) + 0.5\,\delta t\,[a(t) + a(t + \delta t)]$$
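The three steps above can be sketched as one integration step in the velocity form of the Verlet scheme, which is what the two update equations on this slide describe. This is a minimal illustration rather than the TMD implementation: the function names and the one-dimensional harmonic oscillator used as a test system are assumptions made for the example.

```python
def velocity_verlet_step(r, v, a, dt, accel):
    """Advance one time step; accel(r) returns the acceleration F(r) / m."""
    r_new = r + dt * v + 0.5 * dt**2 * a    # r(t + dt)
    a_new = accel(r_new)                    # a(t + dt), from forces at the new position
    v_new = v + 0.5 * dt * (a + a_new)      # v(t + dt)
    return r_new, v_new, a_new


def simulate_oscillator(steps=1000, dt=0.01, k=1.0, m=1.0):
    """Hypothetical test system: 1-D harmonic oscillator with F = -k * r."""
    def accel(r):
        return -k * r / m

    r, v = 1.0, 0.0
    a = accel(r)
    for _ in range(steps):
        r, v, a = velocity_verlet_step(r, v, a, dt, accel)
    energy = 0.5 * m * v**2 + 0.5 * k * r**2
    return r, v, energy
```

Calling `simulate_oscillator()` integrates to t = 10 and should keep the total energy close to its initial value of 0.5, which is a quick sanity check for any integrator of this kind.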


Molecular Dynamics

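$$U = \sum_{\text{All Bonds}} k_b (l - l_o)^2 + \sum_{\text{All Angles}} k_\theta (\theta - \theta_o)^2 + \sum_{\text{All Torsions}} A\,[1 + \cos(n\tau - \phi)] + \sum_{\text{All Pairs}} 4\epsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right] + \sum_{\text{All Pairs}} \frac{q_1 q_2}{r}$$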
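As a rough illustration of how the terms of a potential energy function of this kind are evaluated, the sketch below writes each term as a small Python function. The function and parameter names are assumptions chosen for the example, the Coulomb term is written without a unit-conversion constant, and nothing here is taken from the TMD code.

```python
import math


def bond_energy(k_b, l, l_o):
    """Harmonic bond-stretch term: k_b * (l - l_o)^2."""
    return k_b * (l - l_o) ** 2


def angle_energy(k_theta, theta, theta_o):
    """Harmonic angle-bend term: k_theta * (theta - theta_o)^2."""
    return k_theta * (theta - theta_o) ** 2


def torsion_energy(A, n, tau, phi):
    """Periodic torsion term: A * [1 + cos(n * tau - phi)]."""
    return A * (1.0 + math.cos(n * tau - phi))


def lennard_jones_energy(epsilon, sigma, r):
    """Pair term: 4 * epsilon * [(sigma / r)^12 - (sigma / r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb_energy(q1, q2, r):
    """Electrostatic pair term: q1 * q2 / r (no unit-conversion constant)."""
    return q1 * q2 / r
```

The two pair terms run over all atom pairs, which is why the nonbonded part dominates the cost as the system grows.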


Why Molecular Dynamics?

1. Inherently Parallelizable

2. Computationally Demanding

30 CPU Years


Example of MD at Work


Motivation for Architecture

• Majority of hardware accelerators achieve ~10²-10³x improvement over S/W by
– Pipelining a serially-executed algorithm
- or -
– Performing operations in parallel

• Such techniques do not address large-scale computing applications (such as MD)
– Much greater speedups are required (10⁴-10⁵)
– Not likely with a single hardware accelerator

• Ideal solution for large-scale computing?
– Scalability of modern HPC platforms
– Performance of hardware acceleration


Large-Scale Computing Solutions

• Class 1 Machines
– Supercomputers or clusters of workstations
– ~10-10⁵ interconnected CPUs

• Class 2 Machines
– Hybrid network of CPU and FPGA hardware
– FPGA acts as external co-processor to CPU
– Programming model still evolving

• Class 3 Machines
– Network of FPGA-based computing nodes
– Recent area of academic and industrial focus

[Figure: one Interconnection Network diagram per machine class]


The “TMD” Machine

• An investigation of a Class 3 architecture
– Designed for applications that exhibit high compute-to-communication ratio
– Made possible by integration of microprocessors, high-speed communication interfaces into modern FPGA packages

• Design Features
– Distributed memory model
– Low-latency point-to-point interconnection network
– Provides abstraction of uniform, extensible FPGA fabric to system designers
– Constructed entirely using commodity FPGA components
– Does not address shared memory, external I/O issues


TMD “Computing Tasks” (1/2)

• Computing Tasks
– Applications are defined as collection of computing tasks
– Tasks communicate by passing messages

• Task Implementation Flexibility
– Software processes executing on embedded microprocessors
– Dedicated hardware computing engines

[Figure: a Task implemented as a Computing Engine or Embedded Microprocessor (Class 3), or as a Processor on CPU Node (Class 1)]


TMD “Computing Tasks” (2/2)

• Computing Task Granularity
– Tasks can vary in size and complexity
– Not restricted to one task per FPGA

[Figure: FPGAs and Tasks, with tasks labeled A through M]


TMD Communication Infrastructure

• Tier 1: Intra-FPGA Communication
– Point-to-Point FIFOs are used as communication channels
– Asynchronous FIFOs isolate clock domains
– Application-specific network topologies can be defined

• Tier 2: Inter-FPGA Communication
– Multi-gigabit serial transceivers used for inter-FPGA communication
– Fully-interconnected network topology using 2N*(N-1) pairs of traces

• Tier 3: Inter-Cluster Communication
– Commercially-available switches interconnect cluster PCBs
– Built-in features for large-scale computing: fault-tolerance, scalability
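To get a feel for how the Tier 2 wiring grows with the number of FPGAs, the short sketch below evaluates the 2N*(N-1) expression from the slide next to the number of point-to-point links in a fully-interconnected network, N*(N-1)/2. The function names are assumptions for the example, and the code only restates the arithmetic without modelling any particular board.

```python
def trace_pairs(n_fpgas):
    """Pairs of traces for N fully-interconnected FPGAs, using the slide's 2N*(N-1)."""
    return 2 * n_fpgas * (n_fpgas - 1)


def point_to_point_links(n_fpgas):
    """Number of distinct FPGA-to-FPGA links in a fully-interconnected network."""
    return n_fpgas * (n_fpgas - 1) // 2


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, point_to_point_links(n), trace_pairs(n))
```

Dividing one expression by the other gives four pairs of traces per link, and both counts grow quadratically with N, which is one reason a separate switched tier is used between cluster PCBs.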


Inter-Task Communication

• Based on Message Passing Interface (MPI)
– Popular message-passing standard for distributed applications
– Implementations available for virtually every HPC platform

• TMD-MPI
– Subset of MPI standard developed for TMD architecture
– Software library for tasks implemented on embedded microprocessors
– Hardware Message Passing Engine (MPE) for hardware computing tasks
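The message-passing model can be illustrated with two tasks that exchange data only through point-to-point FIFO channels. This is a plain-Python sketch using threads and `queue.Queue`. It does not use MPI or TMD-MPI, and the `Channel` class and task functions are hypothetical names introduced for the example.

```python
import queue
import threading


class Channel:
    """Hypothetical point-to-point FIFO channel between two tasks."""

    def __init__(self):
        self._fifo = queue.Queue()

    def send(self, message):
        self._fifo.put(message)

    def recv(self):
        return self._fifo.get()  # blocks until a message arrives


def worker_task(inbox, outbox):
    """Receives vectors, replies with their sum of squares, stops on None."""
    while True:
        message = inbox.recv()
        if message is None:
            break
        outbox.send(sum(x * x for x in message))


def run_demo():
    to_worker, from_worker = Channel(), Channel()
    worker = threading.Thread(target=worker_task, args=(to_worker, from_worker))
    worker.start()
    to_worker.send([1.0, 2.0, 3.0])
    result = from_worker.recv()
    to_worker.send(None)  # shutdown message
    worker.join()
    return result
```

Because the two tasks share nothing but the channels, either one could in principle be replaced by a different implementation that honours the same messages, which is the property the TMD task model relies on.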


TMD-MPI Software Implementation

[Figure: layered stack between Application and Hardware: MPI Application Interface, Point-to-Point MPI Functions, Send/Receive Implementation, FSL Hardware Interface]

• Layer 4: MPI Interface
– All MPI functions implemented in TMD-MPI that are available to the application.

• Layer 3: Collective Operations
– Barrier synchronization, data gathering and message broadcasts.

• Layer 2: Communication Primitives
– MPI_Send and MPI_Recv methods are used to transmit data between processes.

• Layer 1: Hardware Interface
– Low level methods to communicate with FSLs for both on and off-chip communication.
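The layering can be sketched in a few lines: collective operations are written purely in terms of send and receive, which in turn sit on a low-level FIFO read/write. The sketch below is an assumption-laden illustration in plain Python, not the TMD-MPI source. The `Comm` class, its method names, and the use of `queue.Queue` in place of FSL hardware are all hypothetical.

```python
import queue
import threading


class Comm:
    """Hypothetical per-task communicator over one FIFO per (source, dest) pair."""

    def __init__(self, rank, size, fifos):
        self.rank, self.size, self._fifos = rank, size, fifos

    # Layer 1: hardware interface (queues stand in for FIFO links).
    def _write(self, dest, word):
        self._fifos[(self.rank, dest)].put(word)

    def _read(self, source):
        return self._fifos[(source, self.rank)].get()

    # Layer 2: communication primitives.
    def send(self, data, dest):
        self._write(dest, data)

    def recv(self, source):
        return self._read(source)

    # Layer 3: collective operations built only from send and recv.
    def bcast(self, data, root=0):
        if self.rank != root:
            return self.recv(root)
        for rank in range(self.size):
            if rank != root:
                self.send(data, rank)
        return data

    def gather(self, value, root=0):
        if self.rank != root:
            self.send(value, root)
            return None
        return [value if r == root else self.recv(r) for r in range(self.size)]

    def barrier(self, root=0):
        self.gather(None, root)  # everyone reports in
        self.bcast(None, root)   # root releases everyone


def run_tasks(size, task):
    """Run task(comm) on `size` threads and return each task's result."""
    fifos = {(s, d): queue.Queue()
             for s in range(size) for d in range(size) if s != d}
    results = [None] * size

    def runner(rank):
        results[rank] = task(Comm(rank, size, fifos))

    threads = [threading.Thread(target=runner, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def example_task(comm):
    scale = comm.bcast(10 if comm.rank == 0 else None, root=0)
    comm.barrier()
    return comm.gather(scale * comm.rank, root=0)
```

With four tasks, `run_tasks(4, example_task)` leaves the gathered list on rank 0 and `None` on the other ranks. Only the two Layer 1 methods would need to change to target a different transport.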


TMD Application Design Flow

• Step 1: Application Prototyping
– Software prototype of application developed
– Profiling identifies compute-intensive routines

• Step 2: Application Refinement
– Partitioning into tasks communicating using MPI
– Each task emulates a computing engine
– Communication patterns analyzed to determine network topology

• Step 3: TMD Prototyping
– Tasks are ported to soft-processors on TMD
– Software refined to utilize TMD-MPI library
– On-chip communication network verified

• Step 4: TMD Optimization
– Intensive tasks replaced with hardware engines
– MPE handles communication for hardware engines

[Figure: one diagram per step: Application Prototype; Process A, Process B, Process C; A, B, C; B]


MD Software Implementation

[Figure: Atom Store processes send positions (r→) to Force Engine processes, which return forces (F→), over an Interconnection Network; label: mpiCC]

Design Flow

– Testing and validation

– Parallel design

– Software to hardware transition
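The Atom Store and Force Engine split can be sketched as follows: the atom store holds the positions, hands each force engine a subset of atoms, and collects the net force on every atom. This is a serial Python illustration under stated assumptions. The round-robin partitioning, the function names, and the simple unit-charge pair force are all choices made for the example and are not taken from the TMD design.

```python
import math


def pair_force(r_i, r_j):
    """Force on atom i due to atom j for unit like charges (illustrative term only)."""
    d = [a - b for a, b in zip(r_i, r_j)]
    dist = math.sqrt(sum(c * c for c in d))
    return [c / dist**3 for c in d]


def force_engine(positions, atom_ids):
    """Net force on each atom in atom_ids from all other atoms."""
    forces = {}
    for i in atom_ids:
        net = [0.0, 0.0, 0.0]
        for j, r_j in enumerate(positions):
            if j != i:
                f = pair_force(positions[i], r_j)
                net = [n + c for n, c in zip(net, f)]
        forces[i] = net
    return forces


def atom_store_step(positions, n_engines):
    """Distribute atoms round-robin over the engines and collect the net forces."""
    forces = [None] * len(positions)
    for engine in range(n_engines):
        atom_ids = range(engine, len(positions), n_engines)
        for i, f in force_engine(positions, atom_ids).items():
            forces[i] = f
    return forces
```

Each call to `force_engine` depends only on the positions it is given, so the calls are independent of one another and could run as separate tasks that receive positions and send back forces as messages.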


Current Work

[Figure: two XC2VP100 FPGAs, each with a PPC-405, hosting Atom Store and Force Engine tasks. Atom Store: TMD-MPI + ppc-g++. Force Engine: C++ → HDL + TMD-MPE + Synthesis]

• Replace software processes with hardware computing engines


Future Work – Phase 2

TMD Version 2 Prototype


Future Work – Phase 3

The final TMD architecture will contain a hierarchical network of FPGA chips


Acknowledgements

SOCRN

David Chui

Christopher Comis

Sam Lee

Dr. Paul Chow

Andrew House

Daniel Nunes

Manuel Saldaña

Emanuel Ramalho

Dr. Régis Pomès

Christopher Madill

Arun Patel

Lesley Shannon

TMD Group: Past Members: