Efficient Implementation of a String Matching Algorithm for SRC and Cray Reconfigurable Computers

Esam El-Araby¹, Mohamed Taher¹, Tarek El-Ghazawi¹, Mohamed Abouellail¹, Nandakishore Sastry², and Kris Gaj²

¹The George Washington University, ²George Mason University


El-Araby 2 1017 / MAPLD2005

Outline

Introduction

SRC Hardware & Software

Cray XD1 Hardware & Software

String Matching Algorithms

Implementation Methodology

Results and Comparisons

Conclusions


Introduction

[Figure: a microprocessor system (µPs, each with its own µP memory, joined by an interface and connected to I/O) shown beside a reconfigurable processor system (FPGAs, each with its own FPGA memory, joined by an interface and connected to I/O).]


SRC Architecture (Hi-Bar™ Based Systems)

Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier

Up to 256 input and 256 output ports with two tiers of switch

Common Memory (CM) has a controller with DMA capability

Controller can perform other functions such as scatter/gather

Up to 8 GB DDR SDRAM supported per CM node

[Figure: SRC-6 system built around the SRC Hi-Bar switch. µP nodes (µP, memory, SNAP™, PCI-X), MAP® reconfigurable processors (with chaining GPIO), and Common Memory nodes attach to the switch; Gigabit Ethernet etc. connects the system to customers' existing networks (storage area network, local area network, wide area network, disk).]


SRC Reconfigurable Processor


SRC Programming Environment

[Figure: µP system paired with HLL (C); FPGA system paired with HDL (VHDL).]


SRC Programming Environment (cont'd)

[Figure: compilation flow. Application sources (.c or .f files) go to the µP compiler and the MAP compiler, which produce object (.o) files; user macro sources in HDL (.vhd or .v files) and the .v files from the MAP compiler go through logic synthesis (.edf files) and place & route (.bin files, the configuration bitstreams); the linker combines the results into the application executable.]


SRC Programming Environment (cont'd)

Program in C or Fortran:

Main program calls Function_1(a, d, e) and Function_2(d, e, f)

Function_1 is built from Macro_1(a, b, c), Macro_2(b, d), Macro_2(c, e)

Function_2 is built from Macro_3(s, t), Macro_1(n, b), Macro_4(t, k)

FPGA contents after the Function_1 call: Macro_1 takes a and produces b and c; two instances of Macro_2 turn b into d and c into e


Cray XD1 System Architecture (One Chassis)

Compute: 12 AMD Opteron 32/64-bit x86 processors, High Performance Linux

RapidArray Interconnect: 12 communications processors, 1 Tb/s switch fabric

Active Management: dedicated processor

Application Acceleration: 6 co-processors

[Figure: RapidArray components in a Cray XD1 chassis; the FPGA and 2nd RAP are on the Expansion Module.]


Cray XD1 Application Acceleration Interfaces

XC2VP30-50 running at up to 200 MHz

4 QDR II RAM with over 400 HSTL-I I/O at 200 MHz DDR (400 MTransfers/s)

16-bit simplified HyperTransport I/F at 400 MHz DDR (800 MTransfers/s)

QDR and HT I/F take up <20% of XC2VP30. The rest is available for user applications

[Figure: Virtex-II Pro FPGA. User logic connects through the RapidArray Transport core (TX, RX) to the RAP, and through the QDR RAM interface core to four QDR II SRAMs, each with ADDR(20:0), D(35:0), and Q(35:0) signals.]


Cray XD1 Development Flow

[Figure: hardware flow and software flow side by side; the hardware side is labeled as the standard hardware flow.]


Cray XD1 Hardware Development Flow

[Figure: the standard flow shown alongside additional high-level tools.]


Design Methodology using Cray XD1

Write application in C for system microprocessor

Identify computation-intensive routine(s)

Generate a bitstream using Cray cores (RT & QDR II) and language of choice:
  Create module in HDL (Verilog, VHDL)
  Create module using high-level language tools
  Validate module
  Synthesize (XST, Leonardo, Synplify Pro)
  Create bitstream using Xilinx place & route tools

Replace routines with Cray API calls

Run Application


String Matching - Introduction

String matching: detecting the occurrence of a particular substring, called the pattern, in another string, called the text

Types of string matching: exact string matching and approximate string matching

Exact string matching: the pattern must occur in the text completely, that is, unbroken and with no irrelevant data between any letters

Numerous applications: network intrusion detection systems (NIDS), text editing, etc.

Approximate string matching: the pattern rarely matches the text completely

Finds application in computational biology (DNA matching), image detection, handwriting recognition, etc.
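To make the exact-matching definition concrete, here is a minimal Python sketch of a naive scan. It is only an illustration of the definition, not the method used in this work; the function name `find_exact` and the sample strings are assumptions chosen for the example.

```python
def find_exact(text, pattern):
    """Return every start index where `pattern` occurs unbroken in `text`.

    Hypothetical helper for illustration: a naive scan that compares the
    pattern against every window of the text.
    """
    if not pattern:
        return []
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        # An exact match requires the whole pattern, with nothing in between.
        if text[start:start + len(pattern)] == pattern:
            hits.append(start)
    return hits


print(find_exact("GAATCCATAC", "AT"))   # [2, 6]
print(find_exact("GAATCCATAC", "ATT"))  # []
```

Approximate matching relaxes the "unbroken" requirement by allowing substitutions and gaps, which is why it needs a scoring scheme rather than a simple equality test.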


DNA Matching Basics

Why align two protein or DNA sequences?
  Determine whether they are descended from a common ancestor (homologous)
  Infer a common function
  Locate functional elements
  Infer protein structure, if the structure of one of the sequences is known

Problem: find the best pairwise alignment of GAATC and CATAC

Candidate alignments:

GAATC    GAATC-    GAAT-C    GAAT-C    -GAAT-C    GA-ATC
CATAC    CA-TAC    C-ATAC    CA-TAC    C-A-TAC    CATA-C

We need a way to measure the quality of a candidate alignment

Alignment scores consist of two parts: a substitution matrix and a gap penalty


DNA Matching Basics (cont'd)

Scoring aligned bases

Purines: A, G. Pyrimidines: C, T

Transition (purine to purine, or pyrimidine to pyrimidine): cheap. Transversion (purine to pyrimidine, or the reverse): expensive

A hypothetical substitution matrix:

       A     C     G     T
A     10    -5     0    -5
C     -5    10    -5     0
G      0    -5    10    -5
T     -5     0    -5    10

GAAT-C
CA-TAC
-5 + 10 + ? + 10 + ? + 10 = ?

Scoring gaps

Linear gap penalty: every gap receives a score of d

GAAT-C
CA-TAC    d = -4
-5 + 10 + -4 + 10 + -4 + 10 = 17

Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e

G--AATC
CATA--C    d = -4, e = -1
-5 + -4 + -1 + 10 + -4 + -1 + 10 = 5
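The two worked scores on this slide can be checked with a short Python sketch. It assumes the hypothetical substitution matrix reads as 10 for a match, 0 for a transition, and -5 for a transversion, which is what the slide's values give; the function names `substitution_score` and `score_alignment` are illustrative, not taken from the presentation.

```python
PURINES = {"A", "G"}  # the pyrimidines are C and T


def substitution_score(x, y):
    """Score two aligned bases with the slide's hypothetical matrix."""
    if x == y:
        return 10                           # match
    if (x in PURINES) == (y in PURINES):
        return 0                            # transition (cheap)
    return -5                               # transversion (expensive)


def score_alignment(top, bottom, gap_open=-4, gap_extend=None):
    """Score a gapped pairwise alignment given as two equal-length rows.

    With gap_extend=None every gap column scores gap_open (linear penalty).
    Otherwise the first column of a gap run scores gap_open and each
    following column of the same run scores gap_extend (affine penalty).
    """
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have equal length")
    if gap_extend is None:
        gap_extend = gap_open
    total = 0
    previous_gap_row = None  # row that held the gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            total += gap_extend if row == previous_gap_row else gap_open
            previous_gap_row = row
        else:
            total += substitution_score(x, y)
            previous_gap_row = None
    return total


print(score_alignment("GAAT-C", "CA-TAC"))                    # 17 (linear, d = -4)
print(score_alignment("G--AATC", "CATA--C", gap_extend=-1))   # 5 (affine, d = -4, e = -1)
```

Both results agree with the slide's sums, which supports reading gap scores as negative values that are added to the total.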


Approximate String Matching Algorithm (Smith-Waterman Algorithm)

Read sequences A and B into two arrays

Set the traceback and similarity matrices to size (|A|+1) × (|B|+1)

Set the first row and column of the similarity matrix to 0

Initialize the traceback arrays to -1 (default value)

Compute SimilarityMatrix[i][j], where gap_penalty is a negative score such as d = -4:

$$
F(i,j) = \max
\begin{cases}
0 \\
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + \text{gap\_penalty} \\
F(i,\,j-1) + \text{gap\_penalty}
\end{cases}
$$

Update the traceback array

Repeat the compute and update steps until the similarity matrix is complete

Trace back for the best alignments

NOTE: The traceback array carries the coordinates of one of the three cells involved in the calculation of cell [i][j] in the similarity matrix
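The flowchart can be sketched in software as follows. This is a plain Python illustration under stated assumptions, not the authors' FPGA design: it uses the hypothetical substitution matrix from the earlier slide (10 / 0 / -5), a linear gap score of -4 that is added to the neighboring cell, and `None` instead of -1 as the default traceback entry. All function names are illustrative.

```python
PURINES = {"A", "G"}


def substitution_score(x, y):
    """Hypothetical matrix: match 10, transition 0, transversion -5."""
    if x == y:
        return 10
    return 0 if (x in PURINES) == (y in PURINES) else -5


def smith_waterman(a, b, gap=-4):
    """Return (best local score, aligned piece of a, aligned piece of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Similarity matrix with a zero first row and column.
    F = [[0] * cols for _ in range(rows)]
    # Traceback matrix: coordinates of the cell that produced F[i][j].
    trace = [[None] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            candidates = [
                (0, None),                                              # restart
                (F[i - 1][j - 1] + substitution_score(a[i - 1], b[j - 1]),
                 (i - 1, j - 1)),                                       # diagonal
                (F[i - 1][j] + gap, (i - 1, j)),                        # gap in b
                (F[i][j - 1] + gap, (i, j - 1)),                        # gap in a
            ]
            F[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)
    # Follow the stored coordinates back from the best cell.
    top, bottom = [], []
    i, j = best_cell
    while trace[i][j] is not None:
        pi, pj = trace[i][j]
        top.append(a[i - 1] if pi == i - 1 else "-")
        bottom.append(b[j - 1] if pj == j - 1 else "-")
        i, j = pi, pj
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("GAATC", "CATAC"))  # (26, 'AT-C', 'ATAC')
```

Each cell depends only on its left, upper, and upper-left neighbors, so all cells on one anti-diagonal are independent of each other. That independence is what makes the recurrence a natural fit for an array of processing elements.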


Implementation Schemes in SRC

[Figure: three implementation schemes (software only, software/hardware, hardware only), built from a C function for the µP, a C function for the MAP, a VHDL macro, and VHDL, and mapped onto the µP system and the FPGA system.]


Operational Environment

Operational Scenarios for Cray XD1

µP-Initiated Transfers

FPGA-Initiated Transfers

Write-Only Transfers


Performance Results

Rate = (FPGA freq.) × (cycles/cell) × (# SWPEs)

Opteron implementation (SSEARCH34)*: 100 million cell updates per second (CUPS)

Cray Inc. implementation*:
  Current unoptimized design: 80 MHz × 1 × 32 = 2.56 billion CUPS (GCUPS)
  With optimization: 100 MHz × 1 × 50 = 5.0 GCUPS
  With future Virtex-4 FPGA: 100 MHz × 1 × 150 = 15 GCUPS
  25x speedup vs. Opteron

Our implementation:
  SRC-6, current unoptimized design: 100 MHz × 1 × (16 × 16) = 25.6 GCUPS; 10x speedup vs. Cray, 256x speedup vs. Opteron
  Cray XD1, current unoptimized design: 200 MHz × 1 × (16 × 16) = 51.2 GCUPS; 20x speedup vs. Cray, 512x speedup vs. Opteron

*CUG'05, New Mexico, May 2005
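The arithmetic behind these figures can be reproduced with a few lines of Python. The slide labels the middle factor "cycles/cell"; it equals 1 in every case, so this sketch treats it as cell updates per cycle per processing element, which is an assumption. The function name `cups` and the dictionary labels are illustrative.

```python
OPTERON_CUPS = 100_000_000  # 100 million CUPS, the SSEARCH34 figure from the slide


def cups(freq_hz, cells_per_cycle, num_pes):
    """Peak cell updates per second for an array of processing elements."""
    return freq_hz * cells_per_cycle * num_pes


# (clock in Hz, cell updates per cycle per element, number of elements)
designs = {
    "Cray Inc., unoptimized": (80_000_000, 1, 32),
    "Cray Inc., optimized": (100_000_000, 1, 50),
    "Cray Inc., Virtex-4": (100_000_000, 1, 150),
    "SRC-6, unoptimized": (100_000_000, 1, 16 * 16),
    "Cray XD1, unoptimized": (200_000_000, 1, 16 * 16),
}

for name, params in designs.items():
    rate = cups(*params)
    print(f"{name}: {rate / 1e9:.2f} GCUPS, "
          f"{rate / OPTERON_CUPS:.1f}x vs. Opteron")
```

These are peak rates derived from clock frequency and array size. The computed ratio for the unoptimized Cray Inc. design is 25.6, which the slide reports as 25x.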


Conclusions

The Smith-Waterman sequence alignment algorithm has been implemented on both SRC-6 and Cray XD1 systems

Similarities and differences are highlighted with regard to:
  System hardware architecture
  Ease of programming: programming model, development time, hardware/software libraries
  Performance: the speed-up vs. microprocessor is reported

Primary bottlenecks limiting the performance of both systems are recognized

The capability to share and port applications between the SRC and Cray systems is explored