
Accelerating Machine-Learning Algorithms on FPGAs using Pattern-Based Decomposition


Karthik Nagarajan & Brian Holland & Alan D. George & K. Clint Slatton & Herman Lam

Received: 28 April 2008 / Revised: 21 November 2008 / Accepted: 23 December 2008 / Published online: 16 January 2009
© 2009 Springer Science + Business Media, LLC. Manufactured in The United States

Abstract  Machine-learning algorithms are employed in a wide variety of applications to extract useful information from data sets, and many are known to suffer from superlinear increases in computational time with increasing data size and number of signals being processed (data dimension). Certain principal machine-learning algorithms are commonly found embedded in larger detection, estimation, or classification operations. Three such principal algorithms are the Parzen window-based, non-parametric estimation of Probability Density Functions (PDFs), K-means clustering, and correlation. Because they form an integral part of numerous machine-learning applications, fast and efficient execution of these algorithms is extremely desirable. FPGA-based reconfigurable computing (RC) has been successfully used to accelerate computationally intensive problems in a wide variety of scientific domains to achieve speedup over traditional software implementations. However, this potential benefit is quite often not fully realized because creating efficient FPGA designs is generally carried out in a laborious, case-specific manner requiring a great amount of redundant time and effort. In this paper, an approach using pattern-based decomposition for algorithm acceleration on FPGAs is proposed that offers significant increases in productivity via design reusability. Using this approach, we design, analyze, and implement a multi-dimensional PDF estimation algorithm using Gaussian kernels on FPGAs. First, the algorithm's amenability to a hardware paradigm and expected speedups are predicted. After implementation, actual speedup and performance metrics are compared to the predictions, showing speedup on the order of 20× over a 3.2 GHz processor. Multi-core architectures are developed to further improve performance by scaling the design. Portability of the hardware design across multiple FPGA platforms is also analyzed. After implementing the PDF algorithm, the value of pattern-based decomposition to support reuse is demonstrated by rapid development of the K-means and correlation algorithms.

Keywords  FPGA . Design patterns . Machine learning . Pattern recognition . Hardware acceleration . Performance prediction

1 Introduction

Machine learning is a general term that has been used to refer to a large and diverse set of methods for extracting patterns from data. In general, it encompasses portions of statistical and adaptive signal processing, probabilistic decision theory, such as Bayes' Rule, and meta-heuristic strategies, such as the genetic algorithm. As computational capacity has increased over the past few decades, machine-learning algorithms have become widely used for problems such as detection, estimation, and classification, involving diverse data types, including time series, image, and video

J Sign Process Syst (2011) 62:43–63
DOI 10.1007/s11265-008-0337-9

K. Nagarajan (*) : B. Holland : A. D. George : K. C. Slatton : H. Lam
NSF Center for High-Performance Reconfigurable Computing (CHREC), Electrical and Computer Engineering Department, University of Florida, Gainesville, FL 32611, USA
e-mail: [email protected]

B. Holland
e-mail: [email protected]

A. D. George
e-mail: [email protected]

K. C. Slatton
e-mail: [email protected]

H. Lam
e-mail: [email protected]


(spatio-temporal) data [1]. Raw audio, image, and video data do not explicitly contain high-level information and thus generally require an abstraction process to extract useful information from the data. This process is often accomplished via a sequence of data segmentation algorithms [1, 2]. Clustering algorithms are the most common type of algorithms used for segmenting data points into groups and employ various measures of similarity. Probabilistic algorithms can then be employed to represent the segmented data as Probability Density Functions (PDFs) rather than single data points (e.g., cluster centers). Such data representations are essential in applications involving decision-making based on probabilistic reasoning [3–5]. Alternatively, certain multimedia analysis tools use statistical methods to model relationships between segmented datasets as a function of the separation between them (dissimilarity). These methods are useful for applications such as video content retrieval and indexing [6], video segmentation, and automatic speech recognition systems [7, 8] that benefit from quantitative measures of the statistical relationships between features.

Many of the machine-learning algorithms mentioned above are primarily based on a data-driven approach wherein the computational burden is heavily impacted by the volume of data being processed. The Parzen-based non-parametric PDF estimation algorithm is one such algorithm that is computationally intensive, yet necessary for many applications in pattern recognition, such as fingerprint recognition, Bayesian classification, feature extraction [5], bioinformatics [3], networking [9], stock market prediction [10, 11], and image processing [4]. The high computational burden arises because most problems involve large volumes of data and require complex computations to be performed in a high-dimensional space. To mitigate this problem, application researchers have investigated ways to approximate the solution by either solving the problem under a reduced dimensional space or providing alternative methods like the Fast Gauss Transform (FGT) [12]. The FGT addresses the dimensionality issue to a certain extent by reducing the number of computations in each dimension. Irrespective of the approximation method used, the computation involved still grows exponentially with increasing dimensionality. Thus, fast execution of the PDF algorithm is important for solving high-dimensional problems without structural (algorithmic) approximations and within reasonable timeframes. Further, according to Amdahl's law [13], significant improvements to the overall execution time of the multimedia analysis tool are obtained only when significant portions of it are improved. In other words, overall efficiency gains are most likely to occur when the principal (or most often used) portions of the machine-learning tasks are improved.

An emerging method for accelerating these algorithms involves the use of reconfigurable computing (RC) based on Field-Programmable Gate Array (FPGA) technology. FPGAs lie between general-purpose processors and ASICs (application-specific ICs) on the spectrum of processing elements in that they are highly flexible (like processors) and also have potential for high performance (hardware acceleration). Commonly used, deterministic benchmarking cores, such as Fast Fourier Transforms (FFTs) [14], convolution, Lower-Upper triangular matrix Decomposition (LUD) [15], and Basic Linear Algebra Subprograms (BLAS), have been effectively implemented on FPGAs by virtue of their inherent multi-level functional and data parallelism. In addition to these benchmarking cores, significant performance gains have been achieved with FPGAs by selecting, developing, and testing diverse applications from various fields in signal and image processing.

However, developing FPGA designs is often time-consuming, and not every algorithm is amenable to hardware acceleration. Preliminary analyses should be performed to predict the algorithm's amenability to a hardware paradigm before undertaking a lengthy development process. In contrast to the general computing domain, there is a relatively modest variety of FPGA-based hardware platforms from which a designer can choose. These platforms primarily vary in the type of FPGA they house and the manner in which data is communicated between the FPGA and host processor; each platform can vary with respect to the number of constraints on the application design and performance. For example, decisions regarding bit precision, design architecture, memory utilization, and storage depend on the FPGA's resource availability and the size of the hardware design. The presence of dedicated arithmetic blocks also dictates the implementation options for certain mathematical instructions. Thus, it is best to determine the expected performance attainable at an early stage when migrating an algorithm to an FPGA platform by employing simple performance prediction tools.

Another barrier to successfully using FPGA-based RC technology for algorithm acceleration is the prohibitive time and effort required to develop the necessary FPGA core designs in a custom, case-specific manner. A more desirable approach would be to recognize and exploit common patterns in terms of computational structure and data flow across a variety of algorithms and prove that those design patterns are indeed suitable for use in other FPGA implementations. The primary goal is hence to develop a concise set of design patterns that can be formally used to represent the decomposition of an algorithm for parallel implementation. If the hardware design of an algorithm can be represented at a high level graphically via composite patterns (mixture of computation and communication patterns), then other algorithms having



a similar high-level representation can reuse the existing design with only minimal customization. Such a high-level description is also desirable to aid inexperienced FPGA designers (e.g., domain scientists) so they can easily comprehend and reuse existing hardware designs while developing new applications. Pattern-based descriptions of designs could be used to effectively disseminate and share hardware cores as well. Such an approach can hence lead to a significant increase in productivity via design reusability. Several important machine-learning algorithms, such as PDF estimation using Parzen windows, K-means clustering, correlation, and information-theoretic measures, have very similar underlying algorithmic and dataflow structures and thus could potentially benefit from such an approach. In general, the similarity in dataflow in many machine-learning algorithms is due to the fact that application data (input to the learning system) is compared with several representative values (parameters of the learning system) to perform inference. In addition to being structurally similar, most of these algorithms are often applied to very large volumes of data and impose difficult constraints on memory.

This paper describes the pattern-based decomposition, analysis, and implementation of a scalable and portable architecture for multi-dimensional, non-parametric PDF estimation using Gaussian kernels on FPGAs. The work also showcases the rapid development of two other algorithms, in particular K-means clustering and correlation, by reusing the structural and communication architecture developed for PDF estimation due to similarities in the composite patterns describing the respective algorithm decompositions.

The remainder of this paper is organized as follows. Section 2 discusses background on some related signal processing algorithms that are amenable for implementation on FPGAs, prior work on machine learning for multimedia analysis, and the performance prediction methods and design patterns for Reconfigurable Computing (RC) used in this paper. The case-study algorithms and their underlying similarities are discussed in Section 3. An overview of design patterns and the motivations for pattern-based algorithm decomposition are given in Section 4. Preliminary analyses, the design methodology, and the basic architecture for PDF estimation are presented in Section 5, with discussions on results and performance analysis. Section 6 illustrates the concept of architecture reuse with K-means clustering and correlation used as case-study algorithms. We conclude the work in Section 7 by reviewing insights gained and potential for future work.

2 Background

Previous works have discussed the importance of non-parametric PDF estimation, K-means clustering, and correlation in multimedia analysis. The authors in [7] employ cross-modal correlations for clustering features in an image-audio dataset. Clustering and feature extraction methods are applied for video content retrieval and analysis in [16]. A novel video segmentation technique was presented in [2] based on correlating audio and video data. A simplified probability-based methodology using histograms was proposed in [6] for video analysis. Hence, a large number of applications can benefit from fast computations of PDF estimation, K-means clustering, and correlation, especially in time-critical missions. Previous works have also discussed the feasibility of targeting some structurally similar algorithms for FPGAs. In [17], the authors use a Gaussian kernel estimator as one of their case studies in presenting a MATLAB-to-VHDL application mapper. It is a subset of the algorithm considered here, in that our work employs Gaussian kernels to estimate general PDFs and is hence more computationally intensive. In [18], the authors present simpler algorithmic variants of the K-means algorithm by considering Manhattan and Max distances instead of Euclidean distance for an efficient hardware implementation. Significant acceleration of the algorithm with acceptable accuracy was achieved in clustering a sample dataset. This work is different from [18] in that Euclidean distance is used for implementing K-means (which is the preferred method in most machine-learning applications) and, moreover, the primary motivation is to accelerate a set of machine-learning algorithms used in different stages of multimedia data processing by reusing hardware designs, leading to shorter design times. Previous work in parallel processing for large data volumes is also relevant to this work. An object-recognition application was developed in [19] by traversing huge volumes of data in a parallel fashion. The paper showcased how a data-parallel programming model can be used to solve a general set of data-intensive algorithms. In many of these examples, the applications could achieve significant performance gains if there was a way to rapidly compute multi-dimensional PDFs. PDF estimation enables the computation of metrics such as noise, error rates, and uncertainties that are very important for researchers in the application domain [3–5, 20].

The authors in [21] and [22] developed several signal-processing applications on FPGAs in order to obtain speedup and discussed tradeoffs in solution accuracy and resource utilization. However, these papers only predict performance factors in a relatively limited setting. Various performance prediction tools have been proposed [23, 24] that base their techniques on parameterizing the algorithm and target FPGA platform. Algorithms are decomposed and analyzed to determine their total size (in terms of resources) and computational density. Computational platforms are primarily characterized by their memory structure and capacity and interconnect bandwidth. In particular, the RC



Amenability Test (RAT) presented in [23] is a simple methodology that suggests a step-by-step procedure to predict the performance, in terms of speedup, of a specific design for an application on a specific platform before coding begins. This pre-implementation test also helps the designer to understand the strengths and weaknesses of a particular algorithm towards parallel implementation. Communication and computation parameters in RAT are quantified based upon algorithm and FPGA platform characteristics. An estimate of performance in terms of execution time in hardware is then derived from the analytical models embedded in RAT. Comparing the hardware prediction against a known execution time for a software baseline leads to the overall speedup estimation. In this work, we use RAT to predict performance of an algorithm on multiple FPGA platforms.
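As a rough illustration of the analytical style of model described above, the following sketch estimates speedup from a handful of platform and algorithm parameters. The function name, the simple non-overlapped communication/computation model, and all numeric values are our own illustrative assumptions, not the exact RAT formulation from [23].

```python
# Hypothetical sketch of a RAT-style analytical speedup estimate.
# The non-overlapped t_comm + t_comp model below is an assumption
# for illustration only.

def rat_speedup(bytes_transferred, bandwidth_bytes_per_s,
                total_ops, fpga_clock_hz, ops_per_cycle,
                t_software_s):
    """Estimate FPGA speedup as software time over (I/O + compute) time."""
    t_comm = bytes_transferred / bandwidth_bytes_per_s    # host<->FPGA transfer
    t_comp = total_ops / (fpga_clock_hz * ops_per_cycle)  # pipelined compute
    t_rc = t_comm + t_comp                                # no comm/comp overlap
    return t_software_s / t_rc

# Example: 200 MB transferred at 1 GB/s, 5e10 operations at 100 MHz with
# 64 parallel operations per cycle, against a 160 s software baseline.
speedup = rat_speedup(200e6, 1e9, 5e10, 100e6, 64, 160.0)
```

Such a back-of-the-envelope estimate is what allows an amenability decision to be made before any HDL is written.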

Building a good reconfigurable design requires skill and effort, and is often accompanied by a steep learning curve for designers without prior FPGA design experience. In [25], the authors emphasize the importance of identifying and cataloging an exhaustive list of design patterns to solve recurring challenges in reconfigurable computing and increase productivity. In this work, we validate the utility of design patterns by identifying and classifying primitive patterns relevant to this work. The patterns are then quantified and applied for algorithm decomposition and performance prediction using RAT.

3 Machine-learning Algorithms for Segmentation

Knowledge of the algorithm's data flow and computational complexity is essential in order to make strategic decisions during design development. In this section, we provide an overview of the algorithms used in the case studies in this work and investigate the commonalities in their structures.

3.1 PDF Estimation

The common parametric forms of PDFs (e.g., Gaussian, Binomial, Rayleigh distributions) represent mathematical idealizations and, as such, are often not well matched to densities encountered in practice. In particular, most of these classical parametric densities are unimodal (having a single local maximum), whereas many practical problems involve multimodal densities. Furthermore, because of correlations among the data features, it is unusual for high-dimensional density functions to be accurately represented as a simple product of one-dimensional densities, the latter of which are significantly easier to compute. Thus, an estimate of the joint (multi-dimensional), non-parametric PDF (i.e., a PDF that assumes no particular functional form) is often required. The computational complexity of the Parzen window-based PDF estimation algorithm is O(Nn^d), where N is the total number of data points, n is the number of discrete points at which the PDF along a dimension is estimated (i.e., bins), and d is the number of dimensions. Table 1 illustrates the computational burden incurred in terms of time elapsed (where code is executed on a 3.2 GHz single-core Xeon processor) in computing PDFs of increasing dimension (number of signals considered) using the Parzen window technique and Gaussian kernels. The entries for the estimated elapsed time were made by scaling the number of computations exponentially corresponding to increasing dimensions. The total number of data samples processed was set at 204,800 (N) and the support size on each dimension, n, was set as 256. Mathematically, the probability that point i falls in a d-dimensional space is given by,

p(i) = \frac{1}{n_1 \cdots n_d} \sum_{j_1=1}^{n_1} \cdots \sum_{j_d=1}^{n_d} \phi(x_i, x_{j_1}, h) \cdots \phi(y_i, y_{j_d}, h) \qquad (1)

where h is the bin size, which acts as a tuning parameter for the resolution of the PDF estimate, the set (x_j, …, y_j) represents the d subsets of bin centers at which the PDF is estimated, and the set (x_i, …, y_i) represents the i-th input data point in a d-dimensional space, where i ranges from 1 to N.

The kernel function φ could be as simple as a histogram function, a more general rectangular kernel, or the widely favored Gaussian kernel. The first two cases fall under the class of naïve estimators. In the first case (histogram function), the data range is divided into a set of successive and non-overlapping intervals (bins), and the frequencies of occurrence in the bins are plotted against the discretized data range. The rectangular kernel case is similar except that overlapping intervals are permitted. In either case, the bin size should be chosen such that a sufficient number of observations falls into each bin. The resulting PDF estimate depends on the bin size as well as the discretized dataset's range and is discontinuous at the bin boundaries.
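As a concrete sketch of the naïve histogram estimator just described, the snippet below bins a sample dataset and normalizes the counts so that the piecewise-constant estimate integrates to one. The function name, bin count, and test data are illustrative choices of ours.

```python
import numpy as np

def histogram_pdf(data, n_bins):
    # Successive, non-overlapping bins; counts are normalized so the
    # piecewise-constant estimate integrates to 1 over the data range.
    counts, edges = np.histogram(data, bins=n_bins)
    widths = np.diff(edges)
    pdf = counts / (counts.sum() * widths)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, pdf

rng = np.random.default_rng(0)
centers, pdf = histogram_pdf(rng.normal(size=10_000), n_bins=32)
```

The resulting estimate is exactly the discontinuous-at-bin-boundaries curve described above, which is what motivates the kernel generalization.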

Although naïve estimators yield discontinuous results, the construction can be easily generalized to achieve continuous PDF estimates by employing different kernel functions. The smooth reproduction of the underlying

Table 1  Time elapsed on a 3.2 GHz single-core Xeon processor in estimating PDFs using Parzen windows.

d    Time elapsed        Factor increase in elapsed time
1    0.578 s             1
2    158.75 s            ~275
3    ~12 h (est.)        ~275^2
4    ~139 days (est.)    ~275^3



Gaussian process in Fig. 1a clearly shows the advantages of Gaussian kernels in this aspect and is thus the motivation for their use here. The mathematical formula for a Gaussian kernel is defined in Eq. (2),

\phi(x_i, x_j, h) = \frac{1}{\sqrt{2\pi}\,h} \, e^{-\frac{(x_i - x_j)^2}{2h^2}}, \qquad (2)

where h plays the role of the standard deviation. The resulting expression for the 1-D Parzen window PDF estimate is,

p(x_j) = \frac{1}{\sqrt{2\pi}\,h\,n} \sum_{i=1}^{n} e^{-\frac{(x_i - x_j)^2}{2h^2}} \qquad (3)

Computing exponentials is challenging in hardware because it requires significant hardware resources. To make the algorithm more suitable for hardware, a truncated second-order Taylor series expansion replaces the exponential function for the development of the core in hardware, as defined in Eq. (4).

f(x_i, x_j) = \begin{cases} \frac{1}{\sqrt{2\pi}\,h}\left(1 - \frac{(x_i - x_j)^2}{2h^2}\right), & \text{for } 1 - \frac{(x_i - x_j)^2}{2h^2} \ge 0 \\ 0, & \text{otherwise} \end{cases} \qquad (4)

The degree to which this quadratic approximates the true Gaussian kernel in Eq. (2) decreases for data points that yield large values of |x_i − x_j|. But that condition is avoided by giving zero weight to points that cause the argument 1 − (x_i − x_j)^2/2h^2 to be less than zero, which occurs when |x_i − x_j| is greater than √2·h. The plot in Fig. 1a for the Gaussian kernel employs the approximation given by Eq. (4). Additional management of approximation error is

achieved by pre-scaling the kernel to unity variance prior to computations on the FPGA. The datasets x_i and x_j are scaled by √2·h before being transferred from the processor to the FPGA, which reduces the factor 1 − (x_i − x_j)^2/2h^2 to 1 − (x_i − x_j)^2, thus making the computational error due to the use of the Taylor series representation on the FPGA independent of the particular choice of h. The dynamic range of the dataset can further be reduced by considering only the higher-order bits during computation. This method is a simple task to implement since FPGAs are extremely efficient at performing bit-level manipulations.
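The truncated kernel and the pre-scaling step can be sketched as follows; the function name and sample values are ours, and the division by √2·h stands in for the host-side scaling described above.

```python
import numpy as np

def taylor_kernel(xs, ys, h):
    # Host-side pre-scaling: after dividing by sqrt(2)*h, the kernel
    # argument reduces to 1 - (xi - xj)^2, independent of h.
    xs = xs / (np.sqrt(2) * h)
    ys = ys / (np.sqrt(2) * h)
    arg = 1.0 - (xs[:, None] - ys[None, :]) ** 2
    # Zero weight wherever |xi - xj| > sqrt(2)*h (arg < 0), per Eq. (4).
    return np.where(arg >= 0, arg, 0.0) / (np.sqrt(2 * np.pi) * h)

k = taylor_kernel(np.array([0.0]), np.array([0.0, 0.1, 10.0]), h=0.5)
```

Note the clamped branch is exactly the cheap comparison the FPGA evaluates in place of the exponential.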

3.2 K-Means Clustering

In any data analysis problem, given a training set of points, unsupervised learning algorithms are often applied to extract structure from them. Typical examples of unsupervised learning tasks include the problems of image and text segmentation [1]. A simple way to represent data is to specify the similarity between pairs of objects. If two objects share the same structure, it should be possible to reproduce the data from the same prototype. This idea underlies clustering methods, which form a rich subclass of unsupervised algorithms. The K-means algorithm is a popular unsupervised clustering algorithm that partitions N data points into m clusters by finding m mean vectors μ1, μ2, …, μm representing the centroids of the clusters (see Fig. 1b for an illustration of data points represented by two features partitioned into four clusters). It is an iterative algorithm that tries to minimize the total intra-cluster variance, or the squared error function

V = \sum_{j=1}^{m} \sum_{x_i \in S_j} (x_i - \mu_j)^2 \qquad (5)

[Figure 1: panel (a) plots probability of data in a bin vs. number of bins for histogram, rectangular, and Taylor-approximation kernels; panel (b) plots Feature 2 vs. Feature 1 for data points partitioned into clusters 1–4.]

Figure 1  Illustration of (a) 1-D PDF estimation using three different kernels, and (b) K-means clustering. For the Gaussian kernel, the Taylor series approximation is used. In general, PDFs can be highly non-Gaussian (e.g., bimodal) as shown here. Parzen estimates using rectangular, and in particular Gaussian, kernels generally provide estimates of the true underlying PDF that are far superior to the histogram function. In the case of data clustering, it is desired to partition complex data sets into groups that share similar feature values.



where there are m clusters S_j, j = 1, 2, …, m, and μ_j is the centroid of all the points x_i in S_j.

To begin with, the data points are partitioned into m initial sets, either at random or using some heuristic approach. The centroid of each set is then computed, and a new partition is constructed by associating each data point with the closest centroid. The centroids are then recalculated for the new clusters. This process is repeated until data points no longer switch clusters or, alternatively, centroids remain unchanged. The computational complexity of the algorithm is O(NdmT), where d is the number of features representing the data point and T is the number of iterations.
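A minimal software sketch of the iteration just described (Lloyd's iteration with Euclidean distance, the variant implemented in this work) follows; the random initialization strategy and the two-cluster test data are illustrative assumptions of ours.

```python
import numpy as np

def kmeans(points, m, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), m, replace=False)]
    for _ in range(iters):
        # Assign each point to its closest centroid (Euclidean distance).
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster emptied.
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(m)])
        if np.allclose(new, centroids):   # centroids unchanged: converged
            break
        centroids = new
    return centroids, labels

pts = np.concatenate([np.random.default_rng(3).normal(0, 0.3, (100, 2)),
                      np.random.default_rng(4).normal(5, 0.3, (100, 2))])
centroids, labels = kmeans(pts, m=2)
```

The distance matrix `d` is again an all-pairs interaction between the data points and the representative values (here, centroids), the same structure as the Parzen kernel evaluation.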

3.3 Correlation

Correlation computes a measure of similarity between two datasets (e.g., audio signals) as one is shifted past the other in time. Correlation is often used to measure the delay between two signals having the same shape, with one delayed relative to the other. Autocorrelation is the correlation of a signal with itself, while cross-correlation is the correlation between two different signals. The correlation result reaches a maximum at the time when the two signals match best. For example, if the two signals are identical, this maximum is reached at t = 0 (i.e., no delay). Autocorrelation is useful for finding repeating patterns in a signal, such as determining the presence of a periodic signal that has been buried under noise, or identifying the fundamental frequency of a signal that does not actually contain that frequency component but implies it with many harmonic frequencies. Mathematically, the correlation between two discrete signals x_i and y_i with n samples each, where τ is the delay and m is the maximum delay, is given by,

R(\tau) = \sum_{i=1}^{n} x_i \, y_{i+\tau}, \quad 0 \le \tau \le m,\ m < n \qquad (6)

3.4 Algorithm Similarity

Although the algorithms overviewed in the previous sections address different problems, involve different computations, and appear at different stages during data analysis, the underlying data flow within each algorithm has significant similarity. The data flow in these algorithms follows an exhaustive data-permutation pattern wherein the interaction between every sample in two input vectors I1 and I2 affects the value of every sample in an output vector O.

In PDF estimation, it is between the data points (I1) and the points at which the PDF is estimated (I2). In K-means clustering, it is the Euclidean distance between the data points (I1) and the cluster centers (I2), and in correlation it is the multiplication of two signals (I1 and I2) shifted over different time delays. Fig. 2 illustrates the dataflow and computations involved in each of the three algorithms in a common framework.

4 Pattern-Based Algorithm Decomposition

Designers who have achieved success in attaining orders-of-magnitude increases in performance (e.g., speed and power) through hardware acceleration of algorithms are in a select group and are typically highly skilled at exploiting FPGA-based systems and related tools (e.g., hardware description languages (HDLs), logic design, and performance prediction tools). To significantly increase the productivity of FPGA design and development in general, we must improve the productivity of designers who are not FPGA experts. High-level language tools, popularly known as application mappers (e.g., Impulse C, AccelDSP, Carte C, C2H), translate high-level language code to HDL versions, thereby improving productivity. Unfortunately, performance in such scenarios is highly dependent on the mapper's efficiency in extracting parallelism from the algorithm structure. The mapper also dictates algorithm decomposition for a parallel implementation, restricting the designer's freedom to explore different decomposition strategies.

Another way of improving design productivity is to reuse and leverage existing hardware designs. One of the primary motivations behind this work is to represent FPGA designs at a higher level of abstraction, thus making them more readable and therefore reusable. Design patterns [26] were introduced in the software engineering domain as common solutions to recurring problems and have successfully formed the basis for creating and exploring designs, reusing solutions for similar software applications. More recently, there have been efforts to apply pattern-based design to FPGA-based reconfigurable computing [27, 28]. In [25], the authors identified and cataloged an extensive list of design patterns that can be used to solve a broad range of problems spanning diverse fields. By decomposing algorithms in terms of simple design patterns, a developer is able to understand the dataflow and structure behind the parallel implementation without delving into the details of a hardware circuit. In our work, we have identified a list of primitive design patterns and categorized them in a manner most appropriate for the design of machine-learning algorithms and for efficiently exploring the designs using performance prediction tools like RAT [23]. Two of the most important categorizations of patterns and their examples are shown in Table 2: computation patterns and communication patterns. While computation patterns deal with extracting and implementing parallelism in the algorithm, communication patterns provide different means of regulating data flow to keep the computation patterns busy. Note that the patterns listed in Table 2 are primitive patterns. A variety of composite patterns can then be constructed from these primitive patterns to solve potentially any complex problem. For example, a wavefront pattern [25] can be constructed using a datapath replication pattern with each parallel datapath comprising pipeline patterns of different lengths.

4.1 Pattern Description

The pipeline and datapath replication patterns are described in this section using a standard format suggested in [25]. These two primitive patterns are highly important in that any composite pattern for the purpose of expressing and implementing parallelism can be constructed from a combination of pipeline and datapath replication patterns.

4.1.1 Pipeline Pattern

- Intent: Used for implementing programming structures wherein successive/intermediate instructions overlap in execution. The instruction throughput is increased by executing a number of instructions per unit of time, resulting in fewer net clock cycles spent during processing.

- Motivation: The primary motivation is to achieve faster processing speeds. When a program runs on a processor, the basic assumption is that each instruction in the program is executed before the execution of a subsequent instruction is begun. However, in many cases the instruction throughput could be increased if the program allows the possibility to execute more instructions per unit of time. Each instruction is computed and linked into a 'chain' so each instruction's output is an input to a subsequent instruction.

- Applicability: This pattern is intended for program structures where different instructions in the program execution could operate over different data concurrently.

- Participants: A communication pattern regulates dataflow; partial/intermediate results may need to be accumulated or buffered for subsequent usage.

Table 2 List of primitive patterns and their description.

Computation patterns:
1. Pipeline — Increases throughput by overlapping execution of independent computations
2. Datapath replication — Exploits parallelism by duplicating computational kernels
3. Loop fusion — Alleviates potential pipeline stalls in nested loops by fusing them
4. Tree [27] — Provides computational structure for implementing tree-based algorithms
5. Mesh [27] — Provides templates for implementing image-based or 2-D problems
6. Look-up tables [28] — Reduces run-time computation of complex operators by simpler lookup ops

Communication patterns:
1. Scatter — Distributes data among many computational kernels
2. Gather — Collects data (results) from many computational kernels
3. Broadcast — Replicates data to many computational kernels
4. Round robin — Scatters equally-sized blocks of data in sequential order among all kernels
5. Memory resolution — Resolves memory contention problems by inserting pipeline stages & buffers

[Figure 2: each algorithm runs nested loops over I1 and I2, feeding a computation kernel that updates output O. PDF estimation: compute φ(I1,I2), then p(I2); I1 = data points, I2 = points at which the PDF is estimated. K-means clustering: compute (I1−I2)², assign cluster, update centers; I1 = data points, I2 = cluster set and their centers. Correlation: compute I1 × I2(τ), then R(I1,I2); I1, I2 = datasets, τ = time lag.]

Figure 2 Dataflow and computational structure of PDF estimation, K-means clustering, and correlation.


- Collaborations: A communication pattern regulates dataflow and keeps data fed to the pipeline.

- Consequences

1. When all instructions in the program execution are not independent, the pipeline control logic (i.e., the controller) must insert a stall or wasted clock cycle into the pipeline until the dependency is resolved.

2. While pipelining can theoretically increase performance over a non-pipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality, design limitations will not always lead to ideal execution.

3. The instruction latency in a non-pipelined design is slightly lower than in a pipelined equivalent. This is due to the fact that extra flip-flops must be added to the datapath of a pipelined design.

- Known uses: Any algorithm/program that has independent instructions in its structure (e.g., PDF estimation, correlation).

4.1.2 Datapath Replication Pattern

- Intent: Used for exploiting computational parallelism observed in sequential programming structures such as computational loops.

- Motivation: The primary motivation is to achieve faster processing speeds. Parallelism observed in computations can be efficiently exploited using custom logic. Computational structures such as loops iterate over the same set of processing instructions multiple times. In case of zero data dependencies between iterations (or limited data dependency), the processing instructions can be replicated in hardware to perform multiple iterations of the loop in parallel.

- Applicability: This pattern is intended for exploiting parallelism in computational structures such as loops. An important consideration for implementing the pattern is data dependencies between loop iterations. Depending on the individual case, the implementation may need intermediate buffers or communication links between parallel implementations of computational kernels.

- Participants: Apart from the replicated kernels, a communication pattern regulates dataflow, with possible requirements of additional buffers.

- Collaborations: Communication patterns regulate dataflow and determine the total number of iterations required on the kernel instances implemented in hardware.

- Consequences

1. Since the pattern implements an area-time tradeoff, higher processing speeds achieved via parallel instances of the kernel come at the cost of an increased implementation footprint in hardware.

2. Additional overhead in terms of parallel data communication and control logic is present. Depending on the FPGA platform and implementation, the problem may be communication-bound, limiting the number of parallel kernels that can be fed with data in parallel.

- Known uses: PDF estimation, molecular dynamics, K-means clustering, sorting algorithms.
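The effect of datapath replication can be sketched in software by partitioning the seed vector I2 into chunks of k and letting k kernel instances each process a share (a behavioral analogy of our making, not the hardware itself); the structure mirrors the scatter of seeds and broadcast of data samples described above:

```python
def replicated_kernels(I1, I2, k, kernel):
    """Emulate k replicated datapaths: I2 is scattered in chunks of k seeds,
    and every data sample in I1 is broadcast to all active kernels."""
    O = [0.0] * len(I2)
    for base in range(0, len(I2), k):           # scatter: k seeds per round
        chunk = I2[base:base + k]
        for xi in I1:                           # broadcast: one sample to all
            for offset, x in enumerate(chunk):  # k kernels (parallel in HW)
                O[base + offset] += kernel(xi, x)
    return O

phi = lambda xi, x: 1 - (xi - x) ** 2
# Replication changes the schedule (the area-time tradeoff), not the values:
# any choice of k yields the same output as the plain double loop.
out_k2 = replicated_kernels([0.1, 0.2], [0.0, 0.25, 0.5, 0.75], 2, phi)
out_k4 = replicated_kernels([0.1, 0.2], [0.0, 0.25, 0.5, 0.75], 4, phi)
assert out_k2 == out_k4
```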

4.2 Quantifying Design Patterns

Since design patterns provide a formalized way to represent algorithm decomposition and the associated communication for parallel implementation, they can be quantified and used in analytical models like RAT to predict the algorithm's amenability to hardware acceleration (to be described in Section 5.1), before undertaking a lengthy (and possibly fruitless) development process. We can analyze the impact of each design pattern in terms of throughput and/or overhead contribution to design performance. In this work, throughput of a design pattern is defined as the maximum number of operations that it can perform per clock cycle. Latency is defined as the number of cycles spent in data transfers before any useful computation is performed. For example, a pipeline pattern has a throughput equal to the number of pipeline stages and contributes an equal cycle count on latency (i.e., clock cycles spent in filling up the pipeline stages). The throughput, latency, and graphical notation of patterns relevant to this work are given in Table 3. Having decomposed the algorithm as a composite of primitive patterns, the net throughput and latency can be inferred from the structure and behavior of the constructed composite pattern.
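Under these definitions, the net throughput and latency of a composite pattern can be tallied mechanically from Table 3's per-pattern values. A small sketch of our own:

```python
def composite_metrics(Tp, Td, Ls, Lg):
    """Net throughput and latency of a scatter -> k pipelined datapaths ->
    gather composite, using the per-pattern values of Table 3
    (Lp equals Tp, the number of pipeline stages)."""
    throughput = Tp * Td        # ops/cycle: pipeline stages x replications
    latency = Ls + Tp + Lg      # cycles: Ls + Lp + Lg, with Lp = Tp
    return throughput, latency

# 1-D PDF design: 3-stage pipeline, 8 datapaths, 256-deep scatter/gather
t, l = composite_metrics(Tp=3, Td=8, Ls=256, Lg=256)
print(t, l)  # 24 515, matching the numbers quoted in Section 5.1
```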

5 A Scalable and Portable Architecture for PDF Estimation

Having investigated and analyzed the commonalities shared by the three case-study algorithms in this work (see Section 3), we suggest that an efficient FPGA design developed for one of the algorithms, say PDF estimation, can be effectively reused for computing the others. For this reason, an in-depth explanation of the design and performance evaluation of a suitable architecture for only the PDF estimation algorithm will be given in the following sub-sections. Then, in Section 6, we will discuss how the architecture developed for PDF estimation can be effectively adapted by reusing most of its design for developing other similar algorithms.


The general architecture of the 1-D PDF estimation algorithm based on pattern-based decomposition is shown in Fig. 3. A computational kernel updates the PDF value at a particular point x (i.e., a bin) based on the value of a particular data sample xi (where xi ∈ I1 and x ∈ I2). The Parzen window technique is an embarrassingly parallel algorithm where the PDF values at multiple points can be evaluated at the same time by replicating the computational kernel (datapath replication pattern). The kernels in the parallel datapath are seeded with different values in I2 via a scatter communication pattern. Thus, data samples can be broadcast across kernels that can then process data independently in parallel (see Fig. 3). Each kernel can also benefit from a pipelined implementation while performing all the operations (see Eq. 4) required on every data sample (pipeline pattern). The PDF values are computed in φ(I1,I2), accumulated in p(I2), and stored in output memory O. Load balancing and reduction of communication and synchronization requirements are necessary to ensure that hardware outperforms its sequential software counterpart. In estimating multi-dimensional PDFs, computation of the kernel functions φ in each dimension is independent of the others and hence performed in a parallel fashion within each pipeline. Internal registering for each bin keeps a running total of all processed data; these cumulative totals comprise the final estimate of the PDF.

The development stages in the system design are highlighted in Fig. 4 and explained briefly in the following bullets for a 1-D PDF design. The same can be extended for higher-dimensional PDF estimation with suitable modifications made to the computational kernel.

- Performance Prediction—Based on a preliminary design of the algorithm (in the form of design patterns shown in Fig. 3a) and the basic resources available for a selected FPGA-based computing platform (in the form of parameter values to quantify the design patterns), the attainable speedups are predicted using RAT. Numerical

[Figure 3: a decomposition — load I1 & I2; loop L1 over I1; loop L2 over I2; compute φ(I1,I2); accumulate p(I2); end loops; store O. b design — FPGA memory holds I1 (512 samples) and I2 (512 bins); a scatter pattern seeds k = 8 parallel kernels, I1 is broadcast (bcast) to all kernels, pipeline stages 1 & 2 compute φ(I1,I2), stage 3 accumulates p(I2), and a gather pattern writes the results to output memory O.]

Figure 3 Pattern-based a decomposition and b design of the 1-D PDF estimation algorithm.

Table 3 Throughput, latency, and notations of primitive patterns.

Pattern               Throughput (ops/cycle)        Latency (cycles)
Pipeline              Tp = no. of pipeline stages   Lp = no. of pipeline stages
Datapath replication  Td = no. of replications      N/A
Broadcast             N/A                           N/A
Scatter               N/A                           Ls = buffer length
Gather                N/A                           Lg = buffer length

(The notation column of the original table contains the graphical symbols Tp, Td, and bcast used in Fig. 3.)


analysis was conducted and a suitable fixed-point implementation was chosen because probability values lie between 0 and 1, negating the need for the dynamic range features of a floating-point format. Details of RAT's performance prediction are provided in Section 5.1.

- Kernel and core design—The basic task of a kernel in the 1-D case is the computation of the kernel function φ = 1 − (xi − x)². A core contains a number of kernels (k), with each kernel in the design performing the aforementioned computation for a different x (i.e., a different seed). The computation increases in complexity while estimating higher-dimensional PDFs (refer to Section 5.3). The parameter k is chosen based upon the preliminary resource analyses performed earlier. Details concerning the kernel and core designs are given in Section 5.2.

- Test bench and simulation—Memory instantiations for x and xi are made, test bench files are generated, and functional simulations are performed to validate the design.

- Overall system design, verification, and visualization—Integration of the core with the host processor (middleware design) is developed, followed by verification of the computed PDF.
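The kernel function φ = 1 − (xi − x)² arises from truncating the Taylor series of the Gaussian exponential. A quick numerical check (our sketch, with a unit-variance, unnormalized Gaussian assumed for illustration) shows the truncation is accurate near the center of the window and degrades toward its edges:

```python
import math

def phi_exact(xi, x):
    """Gaussian kernel (unnormalized, unit variance assumed for illustration)."""
    return math.exp(-((xi - x) ** 2))

def phi_truncated(xi, x):
    """First-order Taylor truncation used by the FPGA kernel: exp(-u) ~ 1 - u."""
    return 1 - (xi - x) ** 2

for d in (0.05, 0.2, 0.5):
    err = abs(phi_exact(0.0, d) - phi_truncated(0.0, d))
    print(f"distance {d}: truncation error {err:.4f}")
```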

5.1 Performance Prediction

Although FPGAs have much to offer in terms of flexibility and performance, they are not amenable for all algorithms. RAT [23] is a simple methodology developed for predicting the performance of a specific algorithm design on a specific platform. By parameterizing a particular design strategy and a platform selection into RAT, the developer can analyze and predict likely speedups attainable. For RAT, speedup is defined as the ratio of execution time on a relevant general-purpose processor (tGPP) to the execution time on an FPGA (tRC).

speedup = tGPP / tRC    (7)

For ease of predicting speedup, the RAT analytic models take the form of a worksheet, reproduced in Table 4, and feature two important steps. The first step deals with estimating the communication burden (tcomm) involved in transferring data in and out of the FPGA. The entries in the worksheet related to this step are the communication parameters (throughputideal, αwrite, αread) and the dataset parameters. The second step involves the estimation of time spent in performing computation (tcomp) over the data transferred (Nelements) on the FPGA. In the original formu-

[Figure 4: performance prediction → kernel & core design → test bench → functional simulation → system design → data verification → visualization.]

Figure 4 Development stages in system design.

Table 4 RAT input parameters for analysis of 1-D and 2-D PDF estimation algorithms.

Dataset parameters               1-D PDF    2-D PDF
Nelements, input (elements)      512        1024
Nelements, output (elements)     1          65536
Nbytes/element (bytes/elem)      4          4

Communication parameters         1-D PDF    2-D PDF
throughputideal (MB/s)           1000       1000
αwrite (0 < α < 1)               0.37       0.37
αread (0 < α < 1)                0.16       0.16
tcomm (sec)                      6.0E-6     1.6E-3

Computation parameters           1-D PDF    2-D PDF
Nops/element (ops/elem)          768        393216
Td (ops/cycle)                   8          16
LatencyNET (cycles)              515        131072
Tp (ops/cycle)                   3          3
fclock (MHz)                     150        100
tcomp (sec)                      1.2E-4     4.3E-2

Software parameters              1-D PDF    2-D PDF
tGPP (sec)                       0.578      158.8
Niter (iterations)               400        400

Predicted speedup                11.5       9.0


lation of RAT, the parameters Nops/element, fclock, throughputproc, and Nelements determine computation time. The total time spent on the FPGA (tRC) to execute the entire algorithm is calculated (see [23] for details) as,

tRC = Niter · (tcomm + tcomp)

tcomm = tread + twrite

tread/write = (Nelements × Nbytes/element) / (αread/write × throughputideal)

tcomp = (Nelements × Nops/element) / (fclock × throughputproc)    (8)

The memory available on the FPGA or the board hosting the FPGA is often much smaller than what is required for storing all of the application data. Niter denotes the number of iterations of communication and computation required to process all available data. The authors in [23] suggest that the parameter throughputproc be approximately chosen based upon the number of data samples that can be processed in parallel. This is not trivial to estimate in many cases and might lead to suboptimal predictions if carelessly chosen. A pattern-based design embeds the parallelism (throughput) and the accompanying overhead (latency) during algorithm decomposition and helps better parameterize RAT while estimating tcomp. Latency effects in pipelines and buffers are easily extracted from the patterns, as opposed to manual interpretation. For example, in the PDF design (see Fig. 3), tcomp is computed as,

tcomp = (Latencynet + (Nelements × Nops/element) / throughput) × (1 / fclock)

Latencynet = Ls + Lp + Lg

throughput = Tp × Td    (9)

The worksheet in Table 4 shows the input parameters for

the RAT analysis for the 1-D and 2-D PDF estimation algorithms. The communication parameters model a Nallatech H101-PCIXM card containing a Xilinx V4LX100 user FPGA connected to a Xeon host processor over a 133 MHz PCI-X bus. Although the entire application involves 204,800 data samples, each iteration of the 1-D PDF estimation on the FPGA will involve only a portion of that data (512 data samples, or 1/400 of the total set) because of memory constraints on the FPGA. A corresponding input buffer size of 512 is chosen for entry in the worksheet. Since the computation is performed in a two-dimensional space for a 2-D PDF, twice the number of data samples (in blocks of 512 words for each dimension) is sent to the FPGA. The number of output elements in the RAT table corresponds to the number of outputs for every call to the FPGA. In the 1-D case, the estimated PDF values at 256 points are returned from the FPGA after all calls to the FPGA are complete (i.e., after Niter calls). There is sufficient block RAM on the FPGA to hold the results of the 1-D PDF algorithm. Since the PDF output elements need not be sent for each call, we account for the communication time by distributing the total number of output elements over Niter calls. From Table 4, we understand that the FPGA is called 400 times, resulting in an average value of <1 output element per call. We round this to the next largest integer (1 in this case). In the 2-D case, the PDF values are computed over 256×256 (i.e., 65536 in the worksheet) points and are sent back to the host for every call made to the FPGA due to insufficient block RAM to store results between calls.

The computational density of the algorithm is defined in terms of Nops/element. Each data sample in the 1-D PDF case requires three operations at each of the 256 points at which the PDF is estimated, resulting in 768 operations. In the 2-D PDF algorithm, each data sample requires 6 operations at each of the 256×256 points, leading to 393,216 operations. The computational throughputs for the 1-D and 2-D designs are estimated using FPGA clock frequencies of 150 MHz and 100 MHz, respectively. The throughput of the design is determined based on the constructed composite pattern (see Fig. 3)—eight parallel datapaths, each comprising a 3-stage pipeline, lead to a net throughput of 24 ops/cycle. Ls and Lg equal 256 (i.e., the size of I2 used to seed the parallel datapaths) and Lp equals three, leading to a net latency of 515 cycles. For the 2-D PDF design, the latencies Ls and Lg increase to 65536 (256×256 seeds) while the number of replicated datapaths equals 16 (i.e., eight for each dimension). The software baseline (tGPP) for computing speedup values was measured from an optimized C program (optimization flag set to O3) executed on a 3.2 GHz single-core Xeon processor using single-precision floating point. The truncated Taylor series expansion was used in place of the exponential in the C program as well. RAT predictions are then estimated to determine an attainable execution time and later compared against experimentally measured software results to compute speedup. The predicted speedups for the 1-D and 2-D PDF algorithms were 11.5 and 9.0, respectively.
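The worksheet arithmetic is simple enough to reproduce directly. The sketch below (ours) plugs the 1-D PDF parameters of Table 4 into Eqs. 8 and 9; the results land close to the worksheet's tcomm ≈ 6.0E-6 s, tcomp ≈ 1.2E-4 s, and predicted speedup ≈ 11.5, with small differences attributable to rounding in the published table:

```python
def rat_predict(N_in, N_out, N_bytes, tp_ideal, a_wr, a_rd,
                N_ops, Tp, Td, latency, f_clock, t_gpp, N_iter):
    """RAT speedup prediction per Eq. 8, with the pattern-based tcomp of Eq. 9."""
    t_write = (N_in * N_bytes) / (a_wr * tp_ideal)
    t_read = (N_out * N_bytes) / (a_rd * tp_ideal)
    t_comm = t_write + t_read
    throughput = Tp * Td                                    # Eq. 9
    t_comp = (latency + N_in * N_ops / throughput) / f_clock
    t_rc = N_iter * (t_comm + t_comp)                       # Eq. 8
    return t_comm, t_comp, t_gpp / t_rc

# 1-D PDF parameters from Table 4
t_comm, t_comp, speedup = rat_predict(
    N_in=512, N_out=1, N_bytes=4, tp_ideal=1000e6, a_wr=0.37, a_rd=0.16,
    N_ops=768, Tp=3, Td=8, latency=515, f_clock=150e6,
    t_gpp=0.578, N_iter=400)
print(t_comm, t_comp, speedup)
```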

5.2 1-D PDF Design

A design of a 1-D PDF relies heavily on the availability of dedicated arithmetic units and memory blocks. Not only should the available resources be efficiently used, but the design should also scale well with application complexity. Taking these points into account, a multi-core design with a key design parameter k (the number of kernels or parallel datapaths in a core) is proposed. The support size of the PDF is 256 along one dimension (I2) and the number of data samples processed for every call to the FPGA is 512 (I1). The multi-core aspect of the design is developed with consideration toward future extensions to multi-FPGA systems. Larger FPGAs would be able to house more kernels and hence process more data in parallel, leading to faster computations. In the following sections, single- and dual-core designs are described, along with a discussion of scalability with respect to the number of cores.

5.2.1 Single-Core Design

The FPGA receives data (I1 and I2) over an interconnect (e.g., PCI-X, PCI-Express, RapidIO) and the core accesses data from the FPGA memory. The basic blocks in the design are pictorially represented in Fig. 6a and the host-centric execution flow is illustrated in Fig. 7a. A set of data samples (I1) is first loaded onto the FPGA, which is then signaled to start the computation. As the FPGA processes data, the host processor polls for a "process complete" signal from the FPGA. The FPGA sends the "process complete" signal once the computation is finished, and the host sends in the next set of data samples (I1) until all data are processed. Data and control flows to the parallel kernels housed in the core are regulated by a finite-state machine explained as follows (see Fig. 5). The set of seeds x scattered to the k parallel kernels in the core are represented by I2, the data samples xi by I1, and the PDF values by p in the state machine description.

- Idle—Remain in idle state while activation signal GO=0; transition to scatter state when GO=1.

- Scatter—Distribute k values of I2 and p(I2) to the parallel kernels; transition to compute state.

- Compute—Broadcast data samples I1 sequentially, compute φ(I1,I2), and update p(I2); transition to scatter state to load subsequent k values of I2, else move to gather state.

- Gather—Update p in FPGA memory; transition to done state.

- Done—Send "process complete" signal done=1 to host; transition to idle state when GO=0.
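The control flow above can be modeled as a simple software state machine (a behavioral sketch of ours, not the HDL; the host handshake of the Idle and Done states is elided). It walks the seeds through scatter/compute rounds of k seeds each, then gathers:

```python
def run_fsm(I1, I2, k, kernel):
    """Software model of the Idle-Scatter-Compute-Gather-Done controller."""
    p = {x: 0.0 for x in I2}
    state, cnt, trace = "scatter", 0, []
    while state != "done":
        trace.append(state)
        if state == "scatter":            # distribute the next k seeds
            seeds = I2[cnt:cnt + k]
            state = "compute"
        elif state == "compute":          # broadcast all samples to k kernels
            for xi in I1:
                for x in seeds:
                    p[x] += kernel(xi, x)
            cnt += k
            state = "scatter" if cnt < len(I2) else "gather"
        elif state == "gather":           # write results back to memory
            state = "done"
    trace.append("done")
    return p, trace

p, trace = run_fsm([0.1], [0.0, 0.5, 1.0, 1.5], k=2,
                   kernel=lambda xi, x: 1 - (xi - x) ** 2)
# Two scatter/compute rounds are needed to cover 4 seeds with k = 2.
assert trace.count("scatter") == 2
```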

5.2.2 Dual-Core Design

The architectural details and the execution flow of the dual-core design are illustrated in Figs. 6b and 7b, respectively. The state machine for a single-core design and the corresponding core are replicated, with changes made to the data flow in the software control program at the host end. Since multiple cores share the same interconnect, arbitration for bus access is performed while loading data onto the on-chip memory of the FPGA and during the subsequent polling for the "process complete" signal. The FPGA systems used in this work have a host processor for explicitly managing data movement in and out of the FPGA. Data is written to the first core and, while the first core processes the data, the next set of data samples is loaded to the second core to process. The host then polls the first core for the "process complete" signal as the second core processes its data. There could be a situation where the host cannot poll one of the cores while it is loading data to the other core. The core that is not being polled would sit idle during this contention period.

5.3 2-D PDF Design

In the 2-D case, the PDF is evaluated at a matrix of points or bins (n1 rows of x values × n2 columns of y values). The number of computations grows from (N − n)² + c to (N − n1)² + (N − n2)² + c, and the basic kernel computation expands to 1 − ((xi − x)² + (yi − y)²), where (xi, yi) represents one data sample. Despite the added complexity of the 2-D PDF algorithm, the increased quantity of parallelizable operations

[Figure 5: reset → Idle; Idle → Scatter on GO=1 (Scatter takes k cycles); Scatter → Compute (512 cycles); Compute → Scatter while CNT(I2) ≤ 256, else → Gather (256 cycles); Gather → Done (done=1); Done → Idle when GO=0. CNT denotes count.]

Figure 5 State transition diagram for a single-core 1-D PDF design.


makes this algorithm still amenable to the RC paradigm, assuming sufficient quantities of hardware resources are available. To perform the 2-D PDF estimation, the basic architecture designed for the 1-D case is extended with modifications made primarily to the kernel design (see Fig. 8) and the control flow. Due to the limited amount of FPGA memory, the PDF matrix is computed one column at a time (i.e., for all x and one y, represented by p(I2x,y)). This partial PDF matrix is returned to the host before a subsequent call to the FPGA is made. The kernels in the 2-D PDF design are initialized with two sets of seeds (x and y).

The finite-state machine regulating the data and control flow is illustrated in Fig. 9 and explained as follows. The set of seed values x and y scattered to the parallel kernels (k in number) in the core are represented by I2x and I2y, the data samples xi and yi by I1x and I1y, and the PDF column values by p(I2x,y) in the state machine description.

- Idle—Remain in idle state while activation signal GO=0; transition to scatter state when GO=1.

- Scatter—Distribute y, k values of I2x, and p(I2x,y) to the parallel kernels; transition to compute state.

- Compute—Broadcast data samples I1x and I1y concurrently and update p(I2x,y); transition to scatter state to load subsequent k values of I2x, else move to gather state.

- Gather—Update p(I2x,y) (i.e., a column of the PDF matrix) in FPGA memory; transition to done state.

- Done—Set the "process complete" signal done=1; when GO=0, transition to scatter state if CNT(I2y) ≤ 256 (point to the next element in y), else move to idle state.

[Figure 6: a single-core design — block RAMs holding data samples xi, seeds (bins) x, and pdf(x) values connect through dual ports (A and B) to the k parallel kernels and to the host. b dual-core design — a control router connects the host to two cores, each containing k parallel kernels and its own block RAMs for xi, x, and pdf(x). Legend: data bus, address bus, control signals.]

Figure 6 Design details of a single-core and b dual-core 1-D PDF.

[Figure 7: WRITE — host writes input data to FPGA; WAIT — host polls and waits as FPGA processes data; READ — host reads output data from FPGA. a single-core: WRITE, WAIT, …, READ strictly in sequence. b dual-core: the WRITE/WAIT phases of the two cores are interleaved in time so that one core computes while the other is being loaded or read.]

Figure 7 Host-centric execution flows for a single-core and b dual-core designs.


5.4 Experimental Results

In this section, experimental results with the PDF FPGA designs are presented and analyzed. The experimental platform used throughout these experiments contains the Nallatech H101 board with a V4LX100 FPGA. This Xilinx chip has dedicated DSP48 arithmetic blocks capable of performing rapid 18-bit multiplies and multiply-accumulates compared to logic-based MAC cores. The board communicates with the host processor over a PCI-X interconnect. A 32-bit wide communication channel is used, of which 9 bits were allocated for the fixed-point fractional segment. Function-level simulation was performed in ActiveHDL from Aldec, and the H101 implementation was rendered using DIMEtalk from Nallatech.

5.4.1 Speedup and Resource Utilization

One of the primary objectives in this work is to obtain speedups in execution by exploiting parallelism at the hardware level, as compared to sequential execution on a high-end CPU. The single-core 1-D and 2-D PDF designs operated at FPGA clock speeds of 150 MHz and 100 MHz, respectively. The number of kernels, k, in the core was set to 8. The basic criterion was to develop an efficient design where the resources (e.g., DSP48s, block RAMs, slices) are uniformly consumed. It should be noted that, since the algorithm is embarrassingly parallel, improved performance can be achieved by scaling the design, i.e., processing more data in parallel by increasing k until on-chip resources are exhausted.

Actual speedup and resource utilization factors for the 1-D and 2-D PDF algorithms are presented in Table 5. It can be seen from the comparisons made in Table 6 that the actual speedups obtained are reasonably close to the predicted speedups using RAT (originally shown in Table 4). The deviation between the values is primarily due to inaccuracies in the estimate of tcomm values. Although the channel efficiency values (αwrite, αread) chosen for our RAT analysis are a reasonable assumption for large data transfers, they tend to be lower for smaller data transfers as in this scenario. The tcomp predictions are relatively close to the experimental values, validating the advantage of employing design patterns for algorithm decomposition and performance prediction. The number of DSP48s scales linearly with the number of kernels k in the core. In the 2-D PDF scenario, each kernel uses twice the number of DSP48s as was consumed in the 1-D case because the computation is conducted along two dimensions. Although the total amount of FPGA memory consumed by the PDF core is moderate, a significant percentage of it is consumed by DIMEtalk's interface to buffer and transfer data to and from the host and FPGA.

[Figure 8: FPGA memory supplies data samples I1x and I1y (512 each) and seeds I2x and y to eight parallel kernels; pipeline stages 1 & 2 compute φ(I1x,I2x) and φ(I1y,y), and stage 3 accumulates p(I2x,y) into output memory O.]

Figure 8 Architecture design for 2-D PDF estimation algorithm.

[Figure 9: reset → Idle; Idle → Scatter on GO=1 (Scatter takes k cycles); Scatter → Compute (512 cycles); Compute → Scatter while CNT(I2x) ≤ 256, else → Gather (256 cycles); Gather → Done (done=1); Done → Scatter when GO=0 & CNT(I2y) ≤ 256, else → Idle when GO=0 & CNT(I2y) > 256.]

Figure 9 State transition diagram for a single-core 2-D PDF design.


5.4.2 Data Verification

Even though significant speedup was shown in the previous section, speedup is meaningless if the results are not accurate. In this section, we verify the accuracy of the data provided by our FPGA designs. Data samples from a bimodal Gaussian mixture distribution and from a two-dimensional uni-modal Gaussian distribution were generated in MATLAB to verify the functionality of the 1-D and 2-D PDF designs. PDF values were computed by generating 800 KB (N=204,800) of data samples (n=256 for 1-D PDF; n1=256 and n2=256 for 2-D PDF). The computed PDF values were read and the error in the solution was computed as the difference between the FPGA and GPP estimates. Maximum errors E of 3.8% and 0.57% were obtained for the 1-D and 2-D algorithms, respectively (see Eq. 10), which are reasonable for most real-time applications that involve decision-making based on probabilistic reasoning. The errors in the estimates are due to the Taylor series truncation of the exponential function rather than fixed-point effects (the exponential function was not truncated in the GPP estimates for this study). The resulting PDFs shown in Figs. 10 and 11 were plotted in MATLAB for verification.

$$E = \max_x \frac{\left| p(x)_{RC} - p(x)_{GPP} \right|}{p(x)_{GPP}} \qquad (10)$$
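The verification procedure reduces to comparing two Parzen estimates under Eq. 10; a small sketch, where the Taylor truncation order and the kernel cutoff are assumptions standing in for the hardware's exponential approximation (not the paper's exact configuration), and the data sizes are shrunk for illustration:

```python
import numpy as np
from math import factorial

def parzen_pdf(samples, grid, h, exp_fn=np.exp):
    """1-D Parzen-window PDF estimate with a Gaussian kernel of width h."""
    u = 0.5 * ((grid[:, None] - samples[None, :]) / h) ** 2
    return exp_fn(-u).sum(axis=1) / (len(samples) * h * np.sqrt(2.0 * np.pi))

def exp_taylor(x, order=8, cutoff=-4.0):
    """Truncated Taylor series for exp(x), zeroed below a cutoff and clamped
    non-negative; a stand-in for the FPGA's approximation (assumed settings)."""
    s = sum(x**k / factorial(k) for k in range(order + 1))
    return np.where(x < cutoff, 0.0, np.clip(s, 0.0, None))

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 2048)
grid = np.linspace(-3.0, 3.0, 256)
p_gpp = parzen_pdf(samples, grid, h=0.3)                    # reference (exact exp)
p_rc = parzen_pdf(samples, grid, h=0.3, exp_fn=exp_taylor)  # truncated exp
E = np.max(np.abs(p_rc - p_gpp) / p_gpp)                    # error metric, Eq. 10
```

As in the paper's verification, the residual E here comes entirely from the truncated exponential, since both estimates use the same floating-point arithmetic.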

5.4.3 Scaling to Multiple Cores

Since the PDF algorithm is embarrassingly parallel, significant speedup can be achieved by scaling the design to multiple cores. In this work, dual-core architectures were developed and evaluated to study inherent characteristics for scalability on single- or multiple-device systems. The dual-core results for the 1-D and 2-D cases are summarized and compared to the single-core implementation in Table 7. Design frequencies of 150 MHz and 70 MHz were obtained for the 1-D and 2-D algorithm designs, respectively. The reduction in frequency was primarily because of increased routing delays attributed to a larger design. Consumption of a greater percentage of DSP48 slices is also bound to increase routing delays, as the DSP48 slices are located at specific areas across the IC. An important point to note here is that speedup does not double in a dual-core design (as compared to the single-core design) due to interconnect contention, as discussed in Section 5.2.2, and a reduced frequency in the 2-D PDF case. These parameters would potentially differ from one platform to another depending upon the system interconnect bandwidth and its architecture. Also, the dual-core 1-D PDF design offered a higher increase factor in speedup when compared to the dual-core 2-D PDF design because time spent on data communication

Table 5 Speedup & utilization for single-core designs.

Description       1-D PDF   2-D PDF
DSP48s (%)        8         16
BRAM (%)          12        15
Slices (%)        11        13
Actual speedup    7.8       7.1

Table 6 Performance factors for single-core designs.

Description      Predicted            Actual
                 1-D PDF   2-D PDF   1-D PDF   2-D PDF
tcomm (sec)      6.0E-6    1.6E-3    2.5E-5    1.1E-2
tcomp (sec)      1.2E-4    4.3E-2    1.4E-4    4.4E-2
tRC (sec)        5.1E-2    1.7E+1    7.4E-2    2.2E+1
Speedup          11.5      9.0       7.8       7.1

Figure 10 1-D PDF estimates: GPP vs. FPGA (GPP double-precision, FPGA fixed-point [32,9], and GPP single-precision; probability at each interval vs. the intervals at which the PDF is estimated).

Figure 11 2-D PDF estimates: GPP vs. FPGA (FPGA estimate and GPP estimate shown side by side).


could be effectively hidden under time spent on computation in the 1-D PDF case. This outcome was not achievable in the 2-D PDF designs due to longer communication times (columns of the PDF matrix are sent back to the host after every call to the FPGA) and, more importantly, a reduced frequency of operation.

5.4.4 Portability Considerations

In addition to having scalability impacts, application characteristics also contribute greatly to platform selection and performance. In this work, the optimum choice of the number of kernels per core, k, should be made based upon the interconnect bandwidth and the FPGA resources available. An important byproduct of the RAT methodology, not mentioned earlier, is the pair of computation and communication utilization factors given in Eq. 11.

$$util_{comm} = \frac{t_{comm}}{t_{comm} + t_{comp}}, \qquad util_{comp} = \frac{t_{comp}}{t_{comm} + t_{comp}} \qquad (11)$$

These metrics indicate the nature of a particular design in terms of whether it is communication-bound or computation-bound. If there are a variety of platforms to choose from, the designer could make use of these metrics to modify architecture selection for achieving optimum results. The primary resource-limiting factor in increasing the value of k in this algorithm design is the number of dedicated multiplier blocks in the FPGA. While platforms with FPGAs having more multipliers (e.g., the Cray XD1 with Virtex-II Pro FPGAs) and faster theoretical interconnects (e.g., RapidArray in the XD1) might intuitively suggest improvements in speedup by performing more computations in the PDF algorithm in parallel, these could be negated by poor read and write efficiencies (αread, αwrite) during data communication. This case was particularly true with the Cray XD1 system. Due to extremely low read speeds on CPU-initiated transfers from FPGA-host memory (~4 MB/s for small data transfers), RAT predicted a smaller speedup (see Table 8) in migrating the 2-D PDF algorithm to the XD1 system even though it housed a theoretically faster interconnect. In comparison to the Nallatech platform, the utilization factors in Table 8 illustrate the fact that in the XD1 more time is spent in data communication as compared to algorithm computation for the 2-D PDF design. On the contrary, since the computed PDF values are sent to the host after all FPGA iterations are complete, fewer read operations are required in the 1-D PDF design, resulting in the XD1 offering more than double the speedup achieved on the Nallatech platform.
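Eq. 11 and the bound classification it supports take only a few lines; a sketch using the actual single-core 1-D PDF times from Table 6 (the classification threshold is the obvious one, comparing the two factors):

```python
def utilization(t_comm, t_comp):
    """Communication and computation utilization factors (Eq. 11)."""
    total = t_comm + t_comp
    return t_comm / total, t_comp / total

# Actual single-core 1-D PDF times on the Nallatech platform (Table 6).
util_comm, util_comp = utilization(2.5e-5, 1.4e-4)
kind = "communication-bound" if util_comm > util_comp else "computation-bound"
print(f"util_comm = {util_comm:.2f}, util_comp = {util_comp:.2f} -> {kind}")
```

The resulting util_comm of roughly 0.15 lands near the 0.14 reported for this design in Table 8.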

6 Composite Patterns for Design Reuse

As discussed earlier, it is often the case that most multimedia analysis tools employ different algorithms at different stages of data analysis. To be highly productive in creating efficient hardware designs for each of those algorithms, it is essential that we reuse existing hardware designs. One of the goals of design patterns is to provide the framework to realize this concept. In this section, we exploit the commonalities between the algorithms described in Section 3.4 and the similarities in the composite patterns describing their decompositions by reusing the structural and communication architecture developed for PDF estimation. Also, state transition diagrams (see Figs. 5 and 9) primarily deal with managing data flow and activating the computations to execute at appropriate times. This task is particularly cumbersome to design and debug, and any means of reducing this burden would be helpful to designers. Due to similarities in algorithm decomposition, the state transition diagram developed for PDF estimation can be effectively reused for the other two algorithms.

6.1 K-means Clustering

K-means clustering can be decomposed in a similar fashion to that of PDF estimation. Computation of the Euclidean distance between every data sample and the k cluster centers can be done in parallel. The minimum distance, indicating the cluster closest to the data sample, can then be obtained hierarchically by comparing the Euclidean distances pair-wise in a parallel fashion (see Fig. 12). The basic update rule in K-means clustering for the cluster to which the new sample belongs can be

Table 7 Speedup & utilization for dual-core designs.

Description       Single-core          Dual-core
                  1-D PDF   2-D PDF   1-D PDF   2-D PDF
DSP48s (%)        8         16        16        33
BRAM (%)          12        15        15        21
Slices (%)        11        13        16        22
Actual speedup    7.8       7.1       13.4      8.3

Table 8 Speedup & utilization for single-core designs across platforms.

Description              Nallatech            Cray XD1
                         1-D PDF   2-D PDF   1-D PDF   2-D PDF
utilcomm                 0.14      0.20      0.03      0.38
utilcomp                 0.86      0.80      0.97      0.62
RAT predicted speedup    11.5      9.0       13.0      3.7
Actual speedup           7.8       7.1       20.6      4.0


defined as Δcenter = η (sample_new − center_old), where η is the learning rate [29]. In this work, the cluster centers are updated in the following manner, as and when a new data sample is assigned to a particular cluster:

$$center_{new} = \frac{center_{old} + sample_{new}}{2} \qquad (12)$$

This approximate update rule (where η = 1/2) is often used for on-line clustering and is applicable to real-time scenarios in which fast reaction to changing statistics is often desirable. The update method reduces the computational requirements to addition and right-shift operations and eliminates the more resource-consuming division operation. While this offers an efficient hardware implementation, it can lead to slower convergence for broadly distributed clusters due to its relatively high sensitivity to each new sample. In such cases, the traditional approach proposed by MacQueen [29] can be used, where η = 1/nj (here, nj is the number of data points in class Sj). This choice for η achieves a more gradual updating of cluster centers as the number of data samples presented to the algorithm grows. The design developed in this work can be extended to employ MacQueen's approach with minimal resource usage by the inclusion of a multiplier and look-up table (for storing η values) to implement the update rule. All the computations involved in the algorithm can further be implemented as a pipeline. Analyzing the composite pattern for decomposing K-means in Fig. 12, it can be inferred that the architecture developed for PDF estimation can be effectively reused for K-means clustering with minimal modifications, as explained in the following section. The cluster centers (I2) are used to seed the parallel datapaths in Fig. 3a (for the PDF estimation algorithm) and the data samples (I1) are fed into the pipelines sequentially to determine their cluster association. The number of parallel datapaths and the length of I2 are both increased from 8 (in the architecture of PDF estimation) to 64, corresponding to the number of clusters. Upon convergence, the updated cluster centers are read back by the host. Fig. 12 shows the architecture for performing K-means clustering on a 1-D dataset (N = 102,400 and number of iterations for convergence of the algorithm T = 100) over 64 clusters. In the first pipeline stage, the Euclidean distance between the cluster centers and a data sample is computed. In the second stage, the cluster closest to the sample is inferred, and in the third stage, the inferred cluster center is updated according to Eq. 12. It can be inferred that only the pipeline computations are modified in Fig. 3b to implement the clustering algorithm while retaining most of the communication fabric. More importantly, the connection to the middleware design, which is platform-specific and often a time-consuming process, remains unchanged.
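One pass of the three-stage pipeline described above (parallel distances, a pairwise-minimum tree, then the η = 1/2 update of Eq. 12) can be modeled in software; a sketch of the dataflow, not the RTL, assuming the number of clusters is a power of two as in the 64-cluster design:

```python
import numpy as np

def kmeans_step(sample, centers):
    """Process one 1-D data sample: stage 1 computes squared distances to all
    k centers in parallel; stage 2 finds the minimum via a pairwise (tree)
    reduction; stage 3 applies Eq. 12, center_new = (center_old + sample)/2.
    Requires len(centers) to be a power of two. Returns the winning index."""
    d = (centers - sample) ** 2              # stage 1: k parallel datapaths
    idx = np.arange(len(centers))
    while len(d) > 1:                        # stage 2: log2(k) comparison levels
        half = len(d) // 2
        keep = d[:half] <= d[half:]
        d = np.where(keep, d[:half], d[half:])
        idx = np.where(keep, idx[:half], idx[half:])
    j = int(idx[0])
    centers[j] = 0.5 * (centers[j] + sample)  # stage 3: add + right shift in HW
    return j
```

For example, with centers [0, 10, 20, 30], a sample of 12 is assigned to cluster 1 and moves that center to 11, mirroring one cycle of the hardware pipeline.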

For the 1-D case (i.e., d = 1), the architecture reduces the computational complexity of the algorithm from O(NmT) to O(NT), where m is the number of clusters. This reduction in complexity is attainable as long as there are enough multiplier resources (m multipliers) on the FPGA. The experimental speedup and utilization for the K-means clustering algorithm on the Nallatech platform are shown in Table 9. The speedup obtained is considerably low for the extent of parallelism extracted in the algorithm (i.e., all operations required on a data sample are performed in one cycle). The speedup is primarily affected by the overhead time spent in transferring data from the CPU to the FPGA over a relatively small-bandwidth, low-efficiency interconnect. While only 400 iterations (Niter in the RAT worksheet) of FPGA communication and computation were required for PDF estimation, a total of 100×200 iterations are required for K-means clustering (200 for processing all available data and 100 for algorithm convergence), leading to a considerable increase in

Figure 12 Architecture for K-means clustering, where k corresponds to the number of clusters, I1 is the set of data points, and I2 are the cluster centers.


total RC time (tRC = Niter × (tcomm + tcomp)). The utilcomm factor for K-means clustering is slightly higher in comparison to that obtained for 1-D PDF estimation, indicating a longer time spent in data communication.
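The iteration counts quoted above follow directly from the problem sizes; a quick check of the arithmetic driving tRC (assuming 512 samples per FPGA call, as in the PDF architecture):

```python
# t_RC = N_iter * (t_comm + t_comp); the iteration counts come from the
# problem sizes: K-means streams N = 102,400 samples in 512-sample calls
# and repeats for T = 100 convergence iterations.
N, T, samples_per_call = 102_400, 100, 512
n_iter_kmeans = T * (N // samples_per_call)   # 100 x 200 = 20,000 calls
n_iter_pdf = 400                              # quoted for PDF estimation
print(n_iter_kmeans / n_iter_pdf)             # 50x more FPGA calls
```

This 50-fold increase in per-call communication overhead is what depresses the K-means speedup relative to PDF estimation despite the comparable per-sample parallelism.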

6.2 Correlation

The architecture developed for PDF estimation is also a good fit for computing correlation on an FPGA. The computations involved in calculating the cross-correlation value at different lags τ can be decomposed and mapped in a similar way as that of PDF estimation for a parallel implementation (see Fig. 13b and 13c).

The similarity in dataflow between the two algorithms cannot easily be deciphered until we analyze a decomposition strategy for correlation via design patterns. Consider two vectors X = [x1, x2, x3, …, x512] and Y = [y1, y2, y3, …, y512] whose cross-correlation needs to be computed. A number of product terms involved in the computation can be performed in parallel. In particular, the xi values (I2) seed the parallel datapaths in Fig. 3a while the yi values (I1) are fed sequentially in a pipeline flow. Due to resource

Table 9 Speedup & utilization for PDF estimation, K-means clustering, and correlation (single-core designs).

Description              1-D PDF   2-D PDF   K-means   Correlation
DSP48s (%)               8         16        66        66
BRAM (%)                 12        15        9         11
Slices (%)               11        13        9         11
utilcomm                 0.14      0.20      0.18      0.21
RAT predicted speedup    11.5      9.0       4.3       9.6
Actual speedup           7.8       7.1       3.2       8.1

Figure 13 a Raster-like computation structure, b decomposition, and c design of cross-correlation (xi values (I2) scattered across datapaths, yi values (I1) broadcast, with multiply-accumulate per datapath).


limitations (multipliers) on the FPGA, correlation values are computed over 64 parallel datapaths (i.e., in blocks of 64 values of xi) and accumulated until all xi values are accounted for. While accumulating values every cycle, the write address of output memory O is incremented by one to match the raster-like pattern observed in computing correlation at various lags (see Fig. 13a). Since the underlying architecture is scalable and portable (see Sections 5.4.3 and 5.4.4), the designs can be extended to solve larger-scale problems (i.e., clustering data points into more clusters or computing correlation over longer datasets) and across various platforms having bigger FPGAs with minimal effort. The experimental speedup and utilization for the correlation algorithm on the Nallatech platform are shown in Table 9.
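The blocked multiply-accumulate flow of Fig. 13 can be modeled directly; a sketch of the dataflow (block size and vector lengths shrunk for illustration), which reproduces the full cross-correlation at all lags:

```python
import numpy as np

def blocked_xcorr(x, y, block=64):
    """Cross-correlation R at lags -(len(y)-1)..(len(x)-1), accumulated in
    blocks of `block` x-values as in the 64-datapath FPGA design."""
    n_x, n_y = len(x), len(y)
    R = np.zeros(n_x + n_y - 1)
    for start in range(0, n_x, block):       # one pass per block of x (I2)
        xb = x[start:start + block]
        for j in range(n_y):                 # y (I1) streamed in sequentially
            # the output address advances by one each cycle (raster pattern)
            for i, xi in enumerate(xb):      # `block` parallel multipliers
                R[start + i + (n_y - 1 - j)] += xi * y[j]
    return R

x = np.arange(8.0)
y = np.array([1.0, 2.0, 3.0])
assert np.allclose(blocked_xcorr(x, y, block=4), np.correlate(x, y, mode="full"))
```

The index arithmetic in the inner loop is the software analogue of the incrementing write address of output memory O; the result matches NumPy's full-mode cross-correlation regardless of block size.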

7 Conclusions

In this paper we have identified the need for fast and efficient execution of certain principal machine-learning algorithms used often in multimedia analysis tools. In proposing FPGA-based reconfigurable computing as a suitable technology for hardware acceleration, we analyzed various challenges faced while designing and developing algorithms on FPGAs. Algorithm decomposition, performance prediction, and design reuse for parallel implementation all need to be performed in an efficient and structured manner to develop successful FPGA designs productively. To address these challenges, we proposed an approach for pattern-based decomposition of algorithms for FPGA design and development.

Significant performance improvements in terms of speedup were obtained by using an FPGA-accelerated implementation of the Parzen window-based PDF estimation algorithm decomposed using primitive RC design patterns. Dual-core architectures were developed and evaluated on a single-device system by exploiting the scalability in the algorithm. Key design parameters were identified for tuning the architecture to suit different platforms as well. Precision effects were investigated, and data verification along with error statistics suggested a sufficient fixed-point configuration for the algorithm. We also validated the benefit of composite patterns for design reuse. With the successful design of a scalable and portable architecture for the PDF algorithm, rapid hardware development for the K-means clustering and correlation algorithms was possible due to the similarity of their underlying design patterns. Further, the work also showcased the benefit of quantifying design patterns in efficiently exploiting a performance prediction tool, RAT, to predict an algorithm's amenability to a hardware platform before undertaking a lengthy development process.

The architecture developed in this work was designed with consideration toward future extensions to multi-FPGA systems. Further research is warranted in understanding the potential improvements that can be achieved and bottlenecks that might have to be addressed while migrating designs to multi-FPGA systems. Directions for future work also include investigating and developing design patterns for solving other problem sets that have underlying similarity in their algorithms. This would be a critical step in enabling FPGA-based RC for solving a general class of computationally intensive problems. This work has shown that one such general architecture exists for a class of machine-learning algorithms.

Acknowledgement This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422. The authors gratefully acknowledge vendor equipment and/or tools provided by Xilinx, Aldec, Cray, and Nallatech that helped make this work possible.

References

1. Camastra, F., & Vinciarelli, A. (2008). Machine Learning for Audio, Image and Video Analysis—Theory and Applications. Springer-Verlag London Limited.

2. Dimitrova, N., Zhang, H., Shahraray, B., Sezan, I., Huang, T., & Zakhor, A. (2002). Applications of video-content analysis and retrieval. Journal of Multimedia, 9(3), 42–55.

3. Lillo, F., Basile, S., & Mantegna, R. N. (2002). Comparative genomics study of inverted repeats in bacteria. Bioinformatics, 18(7), 971–979.

4. Pitie, F., Kokaram, A. C., & Dahyot, R. (2005). N-dimensional probability density function transfer and its application to color transfer. Proc. 10th International Conference on Computer Vision, Beijing, 1434–1439, Oct. 2005.

5. Kay, S. M., Nuttall, A. H., & Baggenstoss, P. M. (2003). Multidimensional probability density function approximations for detection, classification, and model order selection. IEEE Transactions on Signal Processing, 49(10), 2240–2252.

6. Chang, Y., Zeng, W., Kamel, I., & Alonso, R. (1996). Integrated image and speech analysis for content-based video indexing. Proc. Multimedia Computing and Systems, Japan, 306–313.

7. Zhang, H., Zhuang, Y., & Wu, F. (2007). Cross-modal correlation learning for clustering on image-audio dataset. Proc. of the 15th International Conference on Multimedia, Augsburg, Germany, 273–276.

8. Goecke, R., Millar, J. B., Zelinsky, A., & Robert-Ribes, J. (2001). Analysis of audio-video correlation in vowels in Australian English. Intl. Conf. on Audio-Visual Speech Processing, Aalborg, Denmark, 115–120.

9. Ilyas, M. (1987). General probability density function of packet service times for computer networks. Electronics Letters, 23(1), 31–32.

10. Bliss, R. R., & Panigirtzoglou, N. (2002). Testing the stability of implied probability density functions. Journal of Banking & Finance, 26(2), 381–422.

11. Scheicher, M., & Glatzer, E. (2003). Modelling the implied probability of stock market movements. Working Paper Series 212, European Central Bank.

12. Greengard, L., & Strain, J. (1991). The fast Gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1), 79–94.


13. Culler, D. E., & Singh, J. P. (1999). Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann.

14. Hemmert, K. S., & Underwood, K. D. (2005). An analysis of the double-precision floating-point FFT on FPGAs. IEEE Symp. on Field-Programmable Custom Computing Machines, Washington, DC, 171–180, Apr.

15. Govindu, G., Choi, S., Prasanna, V., Daga, V., Gangadharpalli, S., & Sridhar, V. (2004). A high-performance and energy-efficient architecture for floating-point based LU decomposition on FPGAs. IEEE Symposium on Parallel and Distributed Processing, Santa Fe, NM, 149, Apr.

16. Jiang, H., Lin, T., & Zhang, H. Video segmentation with the assistance of audio content analysis. International Conference on Multimedia and Expo, New York, NY, 1507–1510.

17. Kim, J. S., Mangalagiri, P., Irick, K., Vijaykrishnan, N., Kandemir, M., Deng, L., et al. (2007). TANOR: A tool for accelerating N-body simulations on reconfigurable platform. Proc. of the 17th Int. Conf. on Field Programmable Logic and Applications, Amsterdam, 68–73, Aug.

18. Leeser, M., Theiler, J., Estlick, M., & Szymanski, J. J. (2000). Design tradeoffs in a hardware implementation of the K-means clustering algorithm. Proc. of Sensor Array and Multichannel Signal Processing Workshop, 520–524.

19. Frohlich, I., Gabriel, A., Kirschner, D., Lehert, J., Lins, E., Petri, M., et al. (2002). Pattern recognition in the HADES spectrometer: An application of FPGA technology in nuclear and particle physics. Proc. International Conference on Field-Programmable Technology (FPT), Singapore, 443–444, Dec.

20. Neo, S., Goh, H., Ng, W. Y., Ong, J., & Pang, W. (2007). Real-time online multimedia content processing: Mobile video optical character recognition and speech synthesizer for the visual impaired. Proc. Intl. Convention on Rehabilitation Engineering and Assistive Technology, Singapore, 201–206.

21. Schmit, H., & Thomas, D. (1995). Hidden Markov modeling and fuzzy controllers in FPGAs. Proc. Symp. on FPGAs for Custom Computing Machines, Napa, CA, 214–221, Apr.

22. VanCourt, T., & Herbordt, M. (2005). Three dimensional template correlation: Object recognition in 3D voxel data. Proc. Computer Architecture for Machine Perception, Washington, DC, 153–158.

23. Holland, B., Nagarajan, K., Conger, C., Jacobs, A., & George, A. D. (2007). RAT: A methodology for predicting performance in application design migration to FPGAs. Proc. of High-Performance Reconfigurable Computing Technologies & Applications Workshop (HPRCTA 2007), SC'07, Reno, NV, Nov. 11.

24. Steffen, C. P. (2007). Parametrization of algorithms and FPGA accelerators to predict performance. Proc. of Reconfigurable System Summer Institute (RSSI), Urbana, IL, 17–20.

25. DeHon, A., Adams, J., DeLorimier, M., Kapre, N., Matsuda, Y., Naeimi, H., et al. (2004). Design patterns for reconfigurable computing. IEEE Symp. on Field Programmable Custom Computing Machines, Napa Valley, CA, 13–23.

26. Gamma, E., Johnson, R., Helm, R., Vlissides, J. M., & Booch, G. (1994). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, 416.

27. Anvik, J., MacDonald, S., Szafron, D., Schaeffer, J., Bromling, S., & Tan, K. (2002). Generating parallel programs from the wavefront design pattern. Proc. Intl. Workshop on High-Level Parallel Programming Models and Supportive Environments, Fort Lauderdale, FL, 104–111.

28. Gribbon, K. T., Bailey, D. G., & Johnston, C. T. (2005). Design patterns for image processing algorithm development on FPGAs. IEEE TENCON, 1–6.

29. Mashor, M. Y. (1998). Improving the performance of K-means clustering algorithm to position the centres of RBF networks. International Journal of the Computer, the Internet and Management, 6(2).

Karthik Nagarajan received the B.E. degree in Electronics and Communication Engineering from the University of Madras, Chennai, India in 2003, and the M.S. degree in Electrical Engineering from the University of Florida, Gainesville in 2005. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering, University of Florida, and a Graduate Research Assistant in the Adaptive Signal Processing Laboratory (ASPL) and the NSF Center for High-performance Reconfigurable Computing (CHREC). His research interests include pattern recognition, information theory, graphical models and multiscale estimation concepts. He is also an active researcher in the field of high-performance and reconfigurable computing on FPGAs with focus upon design and development, performance analysis and productivity improvements.

Brian Holland is a fourth-year Ph.D. student at the University of Florida and a research assistant at the NSF Center for High-performance Reconfigurable Computing (CHREC). He received his B.S. degree in computer engineering from Clemson University in 2005 and his M.S. degree in computer engineering from the University of Florida in 2006. His research interests include reconfigurable computing, high-level languages for FPGA design, and strategic methods for parallel algorithm design and migration.


Alan D. George is Professor of Electrical and Computer Engineering at the University of Florida, where he serves as the Director of the NSF Center for High-performance Reconfigurable Computing (CHREC) and the High-performance Computing and Simulation (HCS) Research Laboratory. He received the B.S. degree in Computer Science and the M.S. in Electrical and Computer Engineering from the University of Central Florida, and the Ph.D. in Computer Science from Florida State University. Dr. George's research interests focus upon high-performance architectures, networks, systems, and applications in reconfigurable, parallel, distributed, and fault-tolerant computing. He is a senior member of IEEE and SCS, a member of ACM and AIAA, and can be reached by e-mail at [email protected].

K. Clint Slatton received B.S. and M.S. degrees in aerospace engineering and M.S. and Ph.D. degrees in electrical engineering, all from the University of Texas (UT) in Austin, TX in 1993, 1997, 1999, and 2001, respectively. From 2002 to 2003, he was a Postdoctoral Fellow with the Center for Space Research at UT, where he worked on multiscale data fusion techniques for interferometric synthetic aperture radar (InSAR) and airborne laser swath mapping (ALSM) measurements. He also worked at the NASA Jet Propulsion Laboratory in the Radar Sciences Section and was a recipient of the NASA Graduate Student Research Program Fellowship. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering and the Department of Civil and Coastal Engineering at the University of Florida in Gainesville, FL. He is the Director of the Adaptive Signal Processing Laboratory and a co-investigator for the National Center for Airborne Laser Mapping (NCALM), for which he develops information-theoretic segmentation methods for ALSM data in complex environments, such as forests. He was named a 2006 winner of the Presidential Early Career Award for Scientists and Engineers (PECASE) for work on predicting signal propagation in highly cluttered environments using remotely sensed geometry from ALSM. His research interests include remote sensing applications of ALSM and InSAR, high-dimensional data segmentation, multiscale data fusion, graphical models, and photonics, with sponsored projects from NSF, NASA, the US Army, the US Navy, and NOAA. Dr. Slatton is a member of the AGU and also of Sigma Xi, Tau Beta Pi, and Sigma Gamma Tau.

Herman Lam is an Associate Professor of Electrical and Computer Engineering at the University of Florida. He has over 20 years of research and development experience in the area of database management, service-oriented architecture, and distributed computing. Currently, his work is focused on reconfigurable computing, and he is a research member of the NSF Center for High-Performance Reconfigurable Computing (CHREC). He has authored or co-authored over 85 refereed journal and conference articles and one textbook. In the past several years, Dr. Lam has been on the program committees of the International Conference on Web Services (ICWS) and the International Conference on Web Information Systems Engineering, and a reviewer for a number of workshops and conferences, including the International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA'08) and the 2008 International Conference on ReConFigurable Computing and FPGAs. He is a member of the editorial board of the International Journal of Business Process Integration and Management.
