
Chapter 3

The Content-Based Image Retrieval Project

Bryan Catanzaro and Kurt Keutzer

1 Introduction

The Content-Based Image Retrieval Project was one of Par Lab's five motivating applications. In this project, we rethought key algorithms in computer vision and machine learning, designing them for efficiency on parallel machines. Our parallel implementations achieved significant speedups of 20-100× over existing sequential algorithms. This work showed that parallelism can enable new usage models: allowing computationally intensive algorithms to be applied in areas where they were not previously feasible. In this chapter, we will outline the history of the project, highlighting two papers that illustrate our approach and results.

In May 2006, well after the initiation of the "Proto-Par Lab" meetings, but about six months before the publication of the "Berkeley View" paper [2], Carole Dulong visited Berkeley from Intel. Carole was working to develop a set of representative parallel workloads in an advanced applications laboratory at Intel and was interested in Berkeley's related efforts to identify a core set of computations. She came to Berkeley with a characterization of her workloads relative to the computational dwarfs we were developing; these dwarfs later evolved into the set of computational patterns discussed in the Introduction of this book and in the chapter on Patterns. Carole's visit reinforced two opinions already circulating among Par Lab researchers. The first was the importance of characterizing representative computations in order to guide the design of future parallel architectures. The second was the necessity of focusing on wide parallelism, with more than 32 processor cores. While the Berkeley group was already heading in this direction, Carole added an interesting industrial perspective: in her experience, programmers could cope with evolving sequential code up to about 32 processors, but beyond that point ad hoc methods of managing threads began to unravel.

A third outcome of Carole's visit was kindling an interest in multimedia applications at Berkeley, in particular, content-based image retrieval. While application-driven research was already a direction among Par Lab researchers, Carole made a convincing case for the particular value of parallelism in media analysis tools and presented her efforts on the Personal Image Retrieval Organizer (PIRO) [3].

2 PIRO

PIRO was motivated by the observation that our life histories are increasingly digitized, which opens up new possibilities and motivations for searching and organizing digital media. Without content-based image retrieval, search of large media databases functions well only if the media is explicitly tagged by the user—for example, with names, events, and places. Tagging every piece of media one owns is time consuming, which compounds with the increasing rate of digital media creation to make managing and utilizing media databases very difficult. What is the value of the needle in the haystack? We must have practical content-based media retrieval systems in order to make use of the enormous volumes of data that we generate, since data that can't be found is not useful. PIRO aims to solve this problem by enabling content-based image retrieval for personal image databases. Given a few similar images of interest, perhaps containing the same people or in the same location, PIRO finds other similar images. It does this by training a classifier tailored to find images like the ones the user selected, and then classifying all images in the database, ranking them according to similarity to the user's query.

In our own efforts in content-based image retrieval we followed the basic model in PIRO. In particular, we viewed this problem as a typical machine learning application with two main parts: feature extraction and classification. Our aim was to show that the computational power afforded by parallel computers could enable detailed and intelligent content-based image retrieval systems.

Computationally, there were several interesting pieces to this application. Firstly, the images had to have features extracted from them as they were added to the database. These features summarized the essential characteristics of each image, such as color distribution and textures. Secondly, when the user made a query, a classifier had to be trained, using the features present in the images of interest. To complete the query, all images in the database were classified by this classifier, again making use of the features previously extracted.

This application was a perfect proxy for many machine learning workloads. Data is summarized through feature extraction, which reduces large amounts of data into a few salient features. For example, each image, which may contain millions of pixels, is summarized into a few hundred numbers describing the most important features of the image. After feature extraction, a classifier is trained using some example data in order to learn a boundary separating interesting items from non-interesting items. For example, a classifier could be trained to distinguish between pictures of trees and pictures of flowers, given a feature space that gave information about color distribution and texture in the image. Once the classifier is trained, it is applied to unknown items of data, outputting a score for each data item that measures how interesting it is according to the boundary the classifier learned earlier.
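To make this workflow concrete, here is a minimal Python sketch of such a query pipeline. It is only an illustration of the two-stage structure: scikit-learn's SVC stands in for the parallel SVM trainer described later in this chapter, extract_features is a placeholder for the feature-extraction stage, and the sketch assumes negative examples are supplied (for instance, sampled at random from the database), which is not spelled out in the text above.

```python
# Sketch of a CBIR query: extract features, train a classifier on a few
# user-selected examples, then rank every image in the database by score.
# SVC is a stand-in for the parallel SVM trainer; extract_features and the
# negative-example handling are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

def extract_features(image):
    # Placeholder: summarize an image (millions of pixels) into a short
    # feature vector, e.g. an intensity histogram.
    return np.histogram(image, bins=64, range=(0, 256), density=True)[0]

def query(database_images, positive_images, negative_images):
    X_db = np.array([extract_features(im) for im in database_images])
    X_train = np.array([extract_features(im)
                        for im in list(positive_images) + list(negative_images)])
    y_train = np.array([1] * len(positive_images) + [0] * len(negative_images))

    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    clf.fit(X_train, y_train)

    # Rank the whole database by similarity to the user's query.
    scores = clf.decision_function(X_db)
    return np.argsort(-scores)   # indices of the most similar images first
```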

3 Classifiers

We decided to work on this problem from the bottom up, focusing on individual components of the entire application. The first one we tackled was the classifier. There are many different kinds of classifiers that one could use; PIRO used two, a k-nearest neighbors classifier and support vector machines (SVMs). We decided to examine SVMs [5], since they are general purpose and have been successfully applied in a wide variety of fields. SVMs are engineered to learn the most general, simplest classifier possible given the training data set, while still avoiding overfitting to the particulars of that training set. This gives them good generalizability when classifying new, previously unseen data.

Training SVMs can be very computationally intensive, so we decided to investigate the problem. We examined a popular SVM training algorithm known as sequential minimal optimization (SMO). Despite its name, we found that this algorithm can be effectively parallelized when training with many example points. We improved the algorithm by implementing a hybrid first- and second-order heuristic to guide the optimization. Our parallel implementation achieved a 20× speedup over the standard SMO algorithm commonly used. As mentioned earlier, we wanted to focus on highly parallel algorithms running on highly parallel systems, in order to ensure our work scales far into the future. For this reason, we chose to implement our parallel algorithms on graphics processors (GPGPUs), which encourage algorithm design with tens of thousands of concurrent threads, ensuring that we didn't simply add a few threads to a sequential program. We also examined SVM classification, which we cast as a dense linear algebra problem to achieve large speedups over a sequential implementation.
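To illustrate why SVM classification maps naturally onto dense linear algebra, the following NumPy sketch evaluates an RBF-kernel SVM decision function as a pair of dense matrix operations. It is a small illustration, not our released GPU code; the variable names are assumptions of this sketch.

```python
# Sketch: an RBF-kernel SVM decision function expressed as dense linear
# algebra. X_sv are support vectors, coef = alpha_i * y_i are the dual
# coefficients, b is the bias, X are the points to classify.
import numpy as np

def svm_decision(X, X_sv, coef, b, gamma):
    # Pairwise squared distances via ||x - s||^2 = ||x||^2 - 2 x.s + ||s||^2,
    # computed with one dense matrix product.
    sq = (np.sum(X**2, axis=1)[:, None]
          - 2.0 * X @ X_sv.T
          + np.sum(X_sv**2, axis=1)[None, :])
    K = np.exp(-gamma * sq)      # kernel matrix, shape (n_test, n_sv)
    return K @ coef + b          # one matrix-vector product per query batch
```

The bulk of the work is the construction of the kernel matrix K and the subsequent matrix-vector product, both of which parallelize well on wide hardware.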

We submitted our work to the International Conference on Machine Learning (ICML), arguably the foremost conference for machine learning, and it was published at the 2008 conference. We have included this paper in this chapter. We also published our code as an open source project, and to date we have had 3000 downloads of our software, which continues to be used. The ICML paper begins with a description of the SVM training and classification algorithms, describes the working set selection heuristic we proposed, and then discusses our parallel implementation on graphics processors.

4 Contour Detection

Our success in speeding up SVMs was essential in building credibility with other application researchers and domain experts. We showed others that computationally intensive problems can be made dramatically more efficient when re-architected for highly parallel hardware. For example, early on we had approached Jitendra Malik, a leading computer vision researcher at Berkeley, about the potential for parallelizing vision applications; however, it was only after our success with SVMs that he proactively approached us with a computer vision problem. Malik and his group had developed an image contour detector that was more accurate than any other yet published; however, it was extremely computationally intensive, requiring several minutes to run, even on small images. This severely limited the applications in which it could be used: for example, running the image contour detector, known as Global Probability of Boundary or gPb [7], interactively on a personal computer, or applying it to every image in a large image database, was simply infeasible. Given our experience with PIRO, and our success with classification, we were also interested in examining feature extraction problems, as they constituted the other half of the PIRO application. We decided to turn our attention to this algorithm next.

There are many different kinds of features that can be extracted from images. As we mentioned earlier, the simplest features are things like color histograms. But features can also be more semantic, like using face recognition to automatically label images as containing particular people. One of the fundamental image processing operations common to many higher-level feature extraction techniques is segmentation. And the first step towards creating good image segmentations is good contour detection, which labels each pixel as to whether it represents an edge in the image. gPb produces the highest quality contours when compared against human-labeled ground truth data, rejecting local noise in the image by using global information to find only the most important edges. Professor Malik contacted us with the hope that if we could significantly improve the efficiency of this approach, gPb would find much wider usage.

Besides simply parallelizing the gPb calculation, we re-architected it to be more algorithmically efficient. Firstly, we transformed a very costly local histogram calculation to use integral images, which are naturally more parallel and work-efficient. gPb exploits global information in the image through spectral partitioning, implemented by an eigensolver. This part of the algorithm was also very computationally intensive, so we took advantage of the interdisciplinary and collaborative nature of the Par Lab to discuss the eigensolver with Professor Jim Demmel, who recommended that we try a simpler eigensolver algorithm that avoids costly re-orthogonalization steps. At his advice, we tried a technique known as the Cullum-Willoughby test to avoid problematic spurious eigenvalues arising from loss of orthogonality in the eigensolver, and it dramatically boosted the efficiency of our eigensolver.

Our parallel implementation on graphics processors, combined with these algorithmic improvements, reduced gPb runtime from minutes to 2 seconds. We described this work in a paper published at a leading computer vision conference, the International Conference on Computer Vision, in 2009. This paper is also included in this chapter. Our results greatly increased the applicability of gPb, ultrametric contour maps, and other feature extraction techniques that require high-quality image contours. Similarly to the SVM work, we open sourced our code, which has been downloaded over 3300 times and continues to be used.

5 From Patterns to Frameworks: Copperhead

Our experience with SVMs and contour detection convinced us of the importance of data parallelism, one of the parallel patterns we were developing. Although data parallelism is ubiquitous in both algorithms and parallel architectures, implementing data parallel programs is unnecessarily complicated. We developed a framework called Copperhead to improve programmer productivity while still providing good efficiency [4]. Copperhead compiles and runs a data parallel subset of Python on parallel hardware, such as GPUs. The Copperhead work was an early realization of the SEJITS approach to designing high-productivity programming environments, and is included in the SEJITS chapter of this book. The examples we used to show the productivity and efficiency provided by Copperhead came from the CBIR project, including a complete SVM training application.

6 Related Work and Future Directions

Beyond the two papers we include in this chapter, the CBIR effort at the Par Lab had several other sub-projects, primarily focused on feature extraction for images and videos, as well as image recognition. In [8], we examined implementation tradeoffs on parallel mobile platforms of the widely used SIFT feature extraction algorithm, a key building block of many image classification algorithms. In [10] and [11], we investigated high-quality optical flow algorithms for video segmentation. Finally, in [9], we extended the contour detector to create ultrametric contour map segmentations. We used these high-quality segmentations to classify the content of images. For example, if composed into a larger CBIR system, this would allow users to ask questions about the semantic content of an image—for example, to search for images that contain a swan.

Future work involves combining the parallel feature extraction and classification primitives we have developed into a complete media retrieval system, similar to the original PIRO, but with improved capabilities. We have shown that parallelism can be practically applied to the high-quality algorithms found in image classification, but there are still many more algorithms to investigate.

We are also continuing to apply and extend a patterns-oriented approach to computer vision applications. Initially we thought that application developers would be able to naturally decompose their applications directly into computational patterns. Our experience with computer vision domain experts showed us that this was not the case. To remedy this problem, in his doctoral dissertation, Bor-Yiing Su mined computer vision code and research literature and identified a set of application patterns that deal broadly with the problems of preprocessing images, extracting features from them, and classifying them. This work on identifying application patterns has been further extended to create 22 application patterns that encompass both computer vision and multimedia processing. This is further discussed in the paper PyCASP: A Pattern-Oriented Framework for Parallel Programming, found in the patterns chapter of this volume.

7 Summary

The focus of our work has been to apply our pattern-driven approach to implementing parallel software to improve the performance of key applications on a new generation of parallel processors, particularly those processors offering large amounts of parallelism. As a result, we worked with application experts on real applications published in application conferences. Both the ICML paper and the ICCV paper in this chapter contain performance comparisons between previously used, sequential implementations and our algorithmically improved, parallel implementations. Some [6] have argued that such publications are misleading, as they do not represent traditional architectures in the best light. As we point out in [1], application researchers and domain experts are interested in getting the best performance, given some fixed effort. Application researchers are interested in the hardware that will best highlight their application. They are not interested in carefully adjudicating how close to the best possible performance they might have come on alternative architectures. We are not trying to show the superiority of one architecture versus another; instead, our work shows that algorithmic improvements, coupled with efficient parallel implementation, can deliver significant performance gains for these algorithms, especially on very parallel processors. We explain in more detail how these speedup numbers should be interpreted in [1].

Parallelism and careful algorithmic design brought large performance improvements to our applications, compared with the widely used sequential versions from which we started. We believe computer vision and machine learning applications such as CBIR will continue to demand large amounts of computation, and will be increasingly important in years to come. The efficiency gains we realized change the way in which these algorithms can be applied, bringing high-quality image analysis and classification within reach of many more applications.

Bibliography

[1] M. Anderson, B. Catanzaro, J. Chong, E. Gonina, K. Keutzer, C.-Y. Lai, M. Murphy, D. Sheffield, B.-Y. Su, and N. Sundaram. Considerations when evaluating microprocessor platforms. In Hot Topics in Parallel Computing 2011, Berkeley, 2011.

[2] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.

[3] J. Y. Bouguet, C. Dulong, I. Kozintsev, and Y. Wu. Requirements for benchmarking personal image retrieval systems. In Electronic Imaging 2006, International Society for Optics and Photonics, 2006.

[4] B. Catanzaro, M. Garland, and K. Keutzer. Copperhead: Compiling an embedded data parallel language. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP '11), pages 47–56. ACM, New York, NY, 2011.

[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.


[6] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100× GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. ACM SIGARCH Computer Architecture News, 38(3):451–460, June 2010.

[7] M. Maire, P. Arbeláez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. In CVPR, pages 1–8, June 2008.

[8] M. Murphy, K. Keutzer, and H. Wang. Image feature extraction for mobile processors. In IEEE International Symposium on Workload Characterization (IISWC), pages 138–147, 2009.

[9] B.-Y. Su, T. Brutch, and K. Keutzer. A parallel region-based object recognition system. In 2011 IEEE Workshop on Applications of Computer Vision (WACV), pages 81–88, 2011.

[10] N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by GPU-accelerated large displacement optical flow. In European Conference on Computer Vision (ECCV), pages 438–451, 2010.

[11] N. Sundaram and K. Keutzer. Long term video segmentation through pixel level spectral clustering on GPUs. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 475–482, 2011.


Efficient, High-Quality Image Contour Detection

Bryan Catanzaro, Bor-Yiing Su, Narayanan Sundaram, Yunsup Lee, Mark Murphy, and Kurt Keutzer

©2009 Institute of Electrical and Electronic Engineers, Inc. All rights reserved. Reprinted with the permission of the IEEE. The original article appears as "Efficient, High Quality Image Contour Detection" by Bryan Catanzaro, Bor-Yiing Su, Narayanan Sundaram, Yunsup Lee, Mark Murphy, and Kurt Keutzer, in the Proceedings of the 2009 International Conference on Computer Vision, pages 2381–2388, Kyoto, Japan, September 27, 2009. Personal use of this material is permitted. However, permission to reuse this material for any other purpose must be obtained from the IEEE. DOI 10.1109/ICCV.2009.5459410

Abstract

Image contour detection is fundamental to many image analysis applications, including image segmentation, object recognition and classification. However, highly accurate image contour detection algorithms are also very computationally intensive, which limits their applicability, even for offline batch processing. In this work, we examine efficient parallel algorithms for performing image contour detection, with particular attention paid to local image analysis as well as the generalized eigensolver used in Normalized Cuts. Combining these algorithms into a contour detector, along with careful implementation on highly parallel, commodity processors from Nvidia, our contour detector provides uncompromised contour accuracy, with an F-metric of 0.70 on the Berkeley Segmentation Dataset. Runtime is reduced from 4 minutes to 1.8 seconds. The efficiency gains we realize enable high-quality image contour detection on much larger images than previously practical, and the algorithms we propose are applicable to several image segmentation approaches. Efficient, scalable, yet highly accurate image contour detection will facilitate increased performance in many computer vision applications.

1 Introduction

We present a set of parallelized image processing algorithms useful for highly accurate image contour detection and segmentation. Image contour detection is closely related to image segmentation, and is an active area of research, with significant gains in accuracy in recent years. The approach outlined in [11], called gPb, achieves the highest published contour accuracy to date, but does so at high computational cost. On small images of approximately 0.15 megapixels, gPb requires 4 minutes of computation time on a high-end processor. Many applications, such as object recognition and image retrieval, could make use of such high quality contours for more accurate image analysis, but are still using simpler, less accurate image segmentation approaches due to their computational advantages.

At the same time, the computing industry is experiencing a massive shift towards parallel computing, driven by the capabilities and limitations of modern semiconductor manufacturing [2]. The emergence of highly parallel processors offers new possibilities to algorithms which can be parallelized to exploit them. Conversely, new algorithms must show parallel scalability in order to guarantee increased performance in the future. In the past, if a particular algorithm was too slow for wide application, there was reason to hope that future processors would execute the same code fast enough to make it practical. Unfortunately, those days are now behind us, and new algorithms must now express large amounts of parallelism if they hope to run faster in the future.

In this paper, we examine efficient parallel algorithms for image contour detection, as well as scalable implementation on commodity, manycore parallel processors, such as those from Nvidia. Our image contour detector, built from these building blocks, demonstrates that high quality image contour detection can be performed in a matter of seconds rather than minutes, opening the door to new applications. Additionally, we show that our algorithms and implementation scale with increasing numbers of processing cores, pointing the way to continued performance improvements on future processors.


2 The gPb Detector

As mentioned previously, the highest quality image contour detector currently known, as measured by the Berkeley Segmentation Dataset, is the gPb detector. The gPb detector consists of many modules, which can be grouped into two main components: mPb, a detector based on local image analysis at multiple scales, and sPb, a detector based on the Normalized Cuts criterion. An overview of the gPb detector is shown in Figure 1.

Figure 1: The gPb detector. [Diagram: Image → Convert (CIELAB) → Textons (k-means) → Local cues → Combine (mPb) → Intervening Contour (W) → Generalized Eigensolver → Combine (sPb) → Combine, Thin, Normalize → Contours (gPb).]

The mPb detector is constructed from brightness, color and texture cues at multiple scales. For each cue, the detector from [13] is employed, which estimates the probability of boundary Pb_{C,σ}(x, y, θ) for a given image channel, scale, pixel, and orientation by measuring the difference in image channel C between two halves of a disc of radius σ centered at (x, y) and oriented at angle θ. The cues are computed over four channels: the CIELAB 1976 L channel, which measures brightness, and the A and B channels, which measure color, as well as a texture channel derived from texton labels [12]. The cues are also computed over three different scales [σ/2, σ, 2σ] and eight orientations in the interval [0, π). The mPb detector is then constructed as a linear combination of the local cues, where the weights α_ij are learned by training on an image database:

$$\mathrm{mPb}(x, y, \theta) = \sum_{i=1}^{4} \sum_{j=1}^{3} \alpha_{ij}\, \mathrm{Pb}_{C_i,\sigma_j}(x, y, \theta) \tag{2.1}$$

The mPb detector is then reduced to a pixel affinity matrix W, whose elements W_ij estimate the similarity between pixel i and pixel j by measuring the intervening contour [9] between pixels i and j. Due to computational concerns, W_ij is not computed between all pixels i and j, but only for pixels which are near to each other. In this case, we use Euclidean distance as the constraint, meaning that we only compute W_ij for all i, j such that ||(x_i, y_i) − (x_j, y_j)|| ≤ r; otherwise we set W_ij = 0. In this case, we set r = 5.

This constraint, along with the symmetry of the intervening contour computation, ensures that W is a symmetric, sparse matrix (see Figure 5), which guarantees that its eigenvalues are real, significantly influencing the algorithms used to compute sPb. Once W has been constructed, sPb follows the Normalized Cuts approach [16], which approximates the NP-hard normalized cuts graph partitioning problem by solving a generalized eigensystem. To be more specific, we must solve the generalized eigenproblem:

$$(D - W)v = \lambda D v, \tag{2.2}$$

where D is a diagonal matrix constructed from W: D_ii = Σ_j W_ij. Only the k + 1 eigenvectors v_j with smallest eigenvalues are useful in image segmentation and need to be extracted. In this case, we use k = 8. The smallest eigenvalue of this system is known to be 0, and its eigenvector is not used in image segmentation, which is why we extract k + 1 eigenvectors. After computing the eigenvectors, we extract their contours using Gaussian directional derivatives at multiple orientations θ, to create an oriented contour signal sPb_{v_j}(x, y, θ). We combine the oriented contour signals together based on their corresponding eigenvalues:

$$\mathrm{sPb}(x, y, \theta) = \sum_{j=2}^{k+1} \frac{1}{\sqrt{\lambda_j}}\, \mathrm{sPb}_{v_j}(x, y, \theta) \tag{2.3}$$


The final gPb detector is then constructed by linear combination of the local cue information and the sPb cue:

$$\mathrm{gPb}(x, y, \theta) = \gamma \cdot \mathrm{sPb}(x, y, \theta) + \sum_{i=1}^{4} \sum_{j=1}^{3} \beta_{ij}\, \mathrm{Pb}_{C_i,\sigma_j}(x, y, \theta) \tag{2.4}$$

where the weights γ and β_ij are also learned via training. To derive the final gPb(x, y) signal, we maximize over θ, threshold to remove pixels with very low probability of being a contour pixel, skeletonize, and then renormalize.

3 Algorithmic Exploration

3.1 Local Cues

Computing the local cues for all channels, scales, and orientations is computationally expensive. There are two major steps: computing the local cues, and then smoothing them to remove spurious edges. We found significant efficiency gains in modifying the local cue computation to utilize integral images, so we will detail how this was accomplished.

3.1.1 Explicit Local Cues

Given an input channel, orientation, and scale, the local cue computation involves building two histograms per pixel, which describe the input channel's intensity in the opposite halves of a circle centered at that pixel, with the orientation describing the angle of the diameter of the half-discs, and the scale determining the radius of the half-discs. The two histograms are optionally blurred with a Gaussian, normalized, and then compared using the χ² distance metric:

$$\chi^2(x, y) = \frac{1}{2} \sum_i \frac{(x_i - y_i)^2}{x_i + y_i} \tag{3.1}$$

If the two histograms are significantly different, there is likely to be an edge at that pixel, orientation and scale. When computing these histograms by explicitly summing over half-discs, computation can be saved by noticing that the computation for each orientation overlapped significantly with other orientations, so the histograms were computed for wedges of the circle, and then assembled into the various half-disc histograms necessary for each orientation. However, this approach does not consider that the circle overlapped with circles centered at neighboring pixels. Additionally, this approach recomputes the histograms completely for each of the different scales, and the computation necessary is a function of the scale radius itself, meaning that larger scales incur significantly more computational cost than smaller scales. Furthermore, parallel implementations of this approach are complicated by the data-dependent nature of constructing histograms, which incurs higher synchronization costs than algorithms with static data dependency patterns.
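For reference, the comparison of Equation 3.1 amounts to the following NumPy sketch for a single pair of half-disc histograms; it is only an illustration of the formula, not the parallel local-cue kernel.

```python
# NumPy sketch of the chi-squared comparison of Equation 3.1 between two
# half-disc histograms (optionally blurred and normalized); eps guards
# against empty bins.
import numpy as np

def chi_squared(hist_a, hist_b, eps=1e-12):
    diff = hist_a - hist_b
    denom = hist_a + hist_b + eps
    return 0.5 * np.sum(diff * diff / denom)
```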

Figure 2: Approximating a half-disc with a rectangle.

3.1.2 Integral Images

To alleviate these problems, we turned to the well-known technique of integral images [10]. Integral images allow us to perform sums over rectangles in O(1) time instead of O(N) time, where N is the number of pixels in the rectangle.


To construct an integral image, one computes I from an image F as

$$I(x, y) = \sum_{x'=1}^{x} \sum_{y'=1}^{y} F(x', y') \tag{3.2}$$

Computing the sum of a shape then involves summing as many entries from the integral image as there are corners in the shape. For example, a rectangle with extent ranging from (x_1, y_1) to (x_2, y_2) is summed as follows:

$$\sum_{x=x_1}^{x_2} \sum_{y=y_1}^{y_2} F(x, y) = I(x_1 - 1, y_1 - 1) - I(x_1 - 1, y_2) - I(x_2, y_1 - 1) + I(x_2, y_2) \tag{3.3}$$

We use integral images to compute histograms of the half-discs discussed previously. To do so, we approximate each half-disc as a rectangle of equal area. Although integral images can be used efficiently for summing shapes other than rectangles, we found that this approximation worked well.
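As a concrete illustration of Equations 3.2 and 3.3, here is a brief NumPy sketch of building an integral image and summing an axis-aligned rectangle with four lookups; the per-bin histogram images and the rotation handling described next are omitted, and the zero-padding convention is an assumption of this sketch.

```python
# NumPy sketch of Equations 3.2 and 3.3: build an integral image with a
# one-pixel zero border so the four-corner lookup needs no bounds checks,
# then sum any axis-aligned rectangle in O(1).
import numpy as np

def integral_image(F):
    I = np.cumsum(np.cumsum(F, axis=0), axis=1)
    return np.pad(I, ((1, 0), (1, 0)), mode="constant")   # I[0, :] = I[:, 0] = 0

def rect_sum(I, x1, y1, x2, y2):
    # Sum of F[x1:x2+1, y1:y2+1] using the padded integral image
    # (0-based indices; the +1 shift accounts for the zero border).
    return (I[x2 + 1, y2 + 1] - I[x1, y2 + 1]
            - I[x2 + 1, y1] + I[x1, y1])
```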

We then compute an integral image for each bin of the histogram, similarly to [17]. Complicating the use of integral images in this context is the fact that integral images can only compute sums of rectangles, whereas we need to compute sums of rotated rectangles. Computing integral images for rotated images has been tried previously, but was restricted to special angles, as in [7].

Our approach to rotated integral images reduces rotation artifacts and can handle arbitrary angles, based on the use of Bresenham lines [5]. The problem associated with computing integral images on rotated images is that standard approaches to rotating an image interpolate between pixels. This is not meaningful for texton labels: since the labels are arbitrary integers without a partial ordering, bin n bears no relation to bin n + 1, and therefore bin n + 0.5 has no meaning. Nearest neighbor interpolation does not require interpolating the pixel values, but under rotation it omits some pixels, while counting others multiple times, introducing artifacts. To overcome this, we rotate the image using Bresenham lines. This method ensures a one-to-one correspondence between pixels in the original image and pixels in the rotated image, at the expense of introducing some blank pixels. The effect can be seen in Figure 3. Bresenham rotation does introduce some discretization of the rotation angle, but this discretization tends to zero as the image size increases¹.

The Bresenham rotation produces images that are larger than the original image, but bounded at (w + h)² pixels; this bound is encountered at θ = π/4.

Although Bresenham rotation introduces some computational inefficiencies due to empty pixels, it is more accurate than nearest neighbor interpolation, since pixels are not missed or multiply counted during the image integration, as occurs using nearest neighbor interpolation. Therefore, we use it in our local cue detector.
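For readers unfamiliar with Bresenham lines [5], the sketch below shows the classic integer line-stepping algorithm on which the rotation scheme is built; the bookkeeping for assembling a rotated image from these lines (blank pixels, angle discretization) is not shown here.

```python
# Sketch of the classic Bresenham line algorithm [5]: enumerate the integer
# pixels on the line from (x0, y0) to (x1, y1) using only integer arithmetic.
# The rotation scheme walks such lines so each source pixel maps to exactly
# one destination pixel (plus some blank pixels), instead of interpolating.
def bresenham_line(x0, y0, x1, y1):
    points = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return points
```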

Integral images for computing histograms over rectangles remove some of the computational complexity of the local cues extraction. The explicit method for image histogram creation has complexity O(Nr²st), where N is the number of pixels, r is the radius of the half-disc being extracted, s is the number of scales, and t is the number of orientations. It should be noted that some detectors might wish to scale r² with N, making the complexity O(N²st). Using integral images reduces the complexity of histogram construction to O(Nst).

3.2 Eigensolver

The generalized eigenproblem needed for Normalized Cuts is the most computationally intensive part of the gPb algorithm. Therefore, an efficient eigensolver is necessary for achieving high performance. We have found that a Lanczos-based eigensolver using the Cullum-Willoughby test without reorthogonalization provides the best performance on the eigenproblems generated by Normalized Cuts approaches. We also exploit the special structure and properties of the graph Laplacian matrices generated for the Normalized Cuts algorithm in our eigensolver. Before explaining our improvements, we present the basic algorithm used for solving these eigenproblems.

3.2.1 Lanczos Algorithm

The generalized eigenproblem from Normalized Cuts can be transformed into a standard eigenproblem [16]: Av = λv, with A = D^{-1/2}(D − W)D^{-1/2}.
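As a brief illustration of this transformation, the following SciPy sketch forms A from a sparse, symmetric affinity matrix W; it assumes every pixel has at least one nonzero affinity, and the production code performs the equivalent computation on the GPU rather than through SciPy.

```python
# SciPy sketch: form A = D^{-1/2} (D - W) D^{-1/2} from a sparse, symmetric
# affinity matrix W. Assumes every row of W has a nonzero sum.
import numpy as np
import scipy.sparse as sp

def normalized_laplacian(W):
    d = np.asarray(W.sum(axis=1)).ravel()        # D_ii = sum_j W_ij
    D = sp.diags(d)
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    return D_inv_sqrt @ (D - W) @ D_inv_sqrt     # symmetric, positive semi-definite
```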

¹ More analysis found in supplementary material


Figure 3: Bresenham rotation: Rotated image with θ = 18° clockwise, showing "empty pixels" in the rotated image.

The matrix A is Hermitian, positive semi-definite, and its eigenvalues are well distributed. Additionally, we only need a few of the eigenvectors, corresponding to the smallest k + 1 eigenvalues. Considering all the issues above, the Lanczos algorithm is a good fit for this problem [3], and is summarized in Figure 4. The complete eigenproblem has complexity O(n³), where n is the number of pixels in the image, but the Lanczos algorithm is O(mn) + O(mM(n)), where m is the maximum number of matrix-vector products, and M(n) is the complexity of each matrix-vector product, which is O(n) in our case. Empirically, m is O(n^{1/2}) or better for normalized cuts problems [16], meaning that this algorithm scales at approximately O(n^{3/2}) for our problems.

For a given symmetric matrix A, the Lanczos algorithm proceeds by iteratively building up a basis V, which is used to project the matrix A into a tridiagonal matrix T. The eigenvalues of T are computationally much simpler to extract than those of A, and they converge to the eigenvalues of A as the algorithm proceeds. The eigenvectors of A are then constructed by projecting the eigenvectors of T against the basis V. More specifically, v_j denotes the Lanczos vector generated by each iteration, V_j is the orthogonal basis formed by collecting all the Lanczos vectors v_1, v_2, . . . , v_j in column-wise order, and T_j is the symmetric j × j tridiagonal matrix with diagonal equal to α_1, α_2, . . . , α_j, and upper diagonal equal to β_1, β_2, . . . , β_{j−1}. S and Θ form the eigendecomposition of matrix T_j. Θ contains the approximation to the eigenvalues of A, while S in conjunction with V approximates the eigenvectors of A: x_j = V_j s_j.

There are three computational bottlenecks in the Lanczos algorithm: matrix-vector multiplication, reorthogonalization, and the eigendecomposition of the tridiagonal matrix T_j. We discuss reorthogonalization for Normalized Cuts problems in Section 3.2.2, and the matrix-vector multiplication problem in Section 3.2.3. We solve the third bottleneck by diagonalizing T_j infrequently, since it is only necessary to do so when checking for convergence, which does not need to be done at every iteration.

3.2.2 Reorthogonalization and the Cullum-Willoughby Test

In perfect arithmetic, the basis V_j constructed by the Lanczos algorithm is orthogonal. In practice, however, finite floating-point precision destroys orthogonality in V_j as the iterations proceed. Many Lanczos algorithms preserve orthogonality by selectively reorthogonalizing new Lanczos vectors v_j against the existing set of Lanczos vectors V_{j−1}. However, this is very computationally intensive. An alternative is to proceed without reorthogonalization, as proposed by Cullum and Willoughby [6].


Algorithm: Lanczos
Input:  A (symmetric matrix), v (initial vector)
Output: Θ (Ritz values), X (Ritz vectors)

 1  Start with r ← v
 2  β₀ ← ‖r‖₂
 3  for j ← 1, 2, . . . , until convergence
 4      v_j ← r / β_{j−1}
 5      r ← A v_j
 6      r ← r − v_{j−1} β_{j−1}
 7      α_j ← v_j* r
 8      r ← r − v_j α_j
 9      Reorthogonalize if necessary
10      β_j ← ‖r‖₂
11      Compute Ritz values T_j = S Θ Sᵀ
12      Test bounds for convergence
13  end for
14  Compute Ritz vectors X ← V_j S

Figure 4: The Lanczos algorithm.
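As an illustration of Figure 4, here is a compact NumPy sketch of the Lanczos iteration run without reorthogonalization (the variant ultimately used here); convergence testing and Ritz-vector extraction are reduced to a fixed iteration count for brevity, and the sketch assumes no breakdown (every β_j > 0).

```python
# NumPy sketch of the Lanczos iteration of Figure 4, without
# reorthogonalization. Returns the tridiagonal coefficients (alpha, beta)
# and the basis V; Ritz values/vectors come from the eigendecomposition of
# the tridiagonal matrix they define.
import numpy as np

def lanczos(A, v, num_iters):
    n = len(v)
    V = np.zeros((n, num_iters))
    alpha = np.zeros(num_iters)
    beta = np.zeros(num_iters)
    r = v.copy()
    beta_prev = np.linalg.norm(r)        # beta_0
    v_prev = np.zeros(n)
    for j in range(num_iters):
        V[:, j] = r / beta_prev          # v_j
        r = A @ V[:, j]                  # A works as a dense or sparse matrix
        r = r - v_prev * beta_prev
        alpha[j] = V[:, j] @ r
        r = r - V[:, j] * alpha[j]
        beta[j] = np.linalg.norm(r)
        v_prev = V[:, j]
        beta_prev = beta[j]
    return alpha, beta, V
```

The Ritz values are then the eigenvalues of the tridiagonal matrix with diagonal alpha and off-diagonals beta[:-1], which can be computed cheaply whenever convergence is checked.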

We have found that this alternative offers significant advantages for Normalized Cuts problems in image segmentation and image contour detection.

When V_j is not orthogonal, spurious and duplicate Ritz values will appear in Θ, which need to be identified and removed. This can be done by constructing T̃, the tridiagonal matrix obtained by deleting the first row and first column of T_j. The spurious eigenvalues of T_j can then be identified by investigating the eigenvalues of T̃. An eigenvalue is spurious if it exists in T_j only once and exists in T̃ as well. For more details, see [6]. Because the lower eigenvalues of affinity matrices encountered in the Normalized Cuts approach to image segmentation are well distributed, we can adopt the Cullum-Willoughby test to screen out spurious eigenvalues. This approach improved eigensolver performance by a factor of 20× over full reorthogonalization, and 5× over selective reorthogonalization, despite requiring significantly more Lanczos iterations.
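A minimal sketch of this test, operating on the tridiagonal coefficients (alpha, beta) produced by a Lanczos run such as the sketch above; the tolerance handling is simplified and the function name is illustrative.

```python
# Sketch of the Cullum-Willoughby test: an eigenvalue of T_j is flagged as
# spurious if it is a simple (non-repeated) eigenvalue of T_j that also
# appears in the matrix obtained by deleting T_j's first row and column.
import numpy as np

def cullum_willoughby_filter(alpha, beta, tol=1e-8):
    j = len(alpha)
    T = np.diag(alpha) + np.diag(beta[:j - 1], 1) + np.diag(beta[:j - 1], -1)
    theta = np.linalg.eigvalsh(T)                 # Ritz values of T_j
    theta_hat = np.linalg.eigvalsh(T[1:, 1:])     # eigenvalues of the deleted matrix
    kept = []
    for t in theta:
        multiplicity = np.sum(np.abs(theta - t) < tol)
        in_deleted = np.any(np.abs(theta_hat - t) < tol)
        if multiplicity == 1 and in_deleted:
            continue                              # spurious: drop it
        kept.append(t)
    return np.array(kept)
```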

This approach to reorthogonalization can be generally applied to all eigenvalue problems solved as part of the normalized cuts method for image segmentation. In general, the eigenvalues corresponding to the different cuts (segmentations) are well spaced out at the low end of the eigenspectrum. For the normalized Laplacian matrices with dimension N, the eigenvalues lie between 0 and N (a loose upper bound), since tr[A] = Σ_i λ_i = N and λ_i ≥ 0. Since the number of eigenvalues is equal to the number of pixels in the image, one might think that as the number of pixels increases, the eigenvalues will be more tightly clustered, complicating convergence analysis using the Cullum-Willoughby test. However, we have observed that this clustering is not too severe for the smallest eigenvalues of matrices derived from natural images, which are the ones needed by normalized cuts. As justification for this phenomenon, we observe that very closely spaced eigenvalues at the smaller end of the eigenspectrum would imply that several different segmentations with different numbers of segments are equally important, which is unlikely in natural images, where the segmentation, for a small number of segments, is usually distinct from other segmentations. In practice, we have observed that this approach works very well for Normalized Cuts image segmentation computations.

3.2.3 Sparse Matrix Vector Multiplication (SpMV)

The Lanczos algorithm requires repeatedly multiplying the matrix by dense vectors; given a randomly initialized vector v₀, this process generates the sequence of vectors A·v₀, A²·v₀, . . . . As the matrix is very large (N × N, where N is the number of pixels in the image), and the multiplication occurs in each iteration of the Lanczos algorithm, this operation accounts for approximately 2/3 of the runtime of the serial eigensolver.

SpMV is a well-studied kernel in the domain of scientific computing, due to its importance in a number of sparse linear algebra algorithms. A naïvely written implementation runs far below the peak throughput of most processors. The poor performance is typically due to low-efficiency memory access to the matrix as well as the source and destination vectors.

Figure 5: Example W matrix.

The performance of SpMV depends heavily on the structure of the matrix, as the arrangement of non-zeroes within each row determines the pattern of memory accesses. The matrices arising from Normalized Cuts are all multiply banded matrices, since they are derived from a stencil pattern where every pixel is related to a fixed set of neighboring pixels. Figure 5 shows the regular, banded structure of these matrices. It is important to note that the structure arises from the pixel-pixel affinities encoded in the W matrix, but the A matrix arising from the generalized eigenproblem retains the same structure. Our implementation exploits this structure in a way that will apply to any stencil matrix.

In a stencil matrix, we can statically determine the locations of non-zeroes. Thus, we need not explicitly store the row and column indices, as is traditionally done for general sparse matrices. This optimization alone nearly halves the size of the matrix data structure, and doubles performance on nearly any platform. We store the diagonals of the matrix in consecutive arrays, enabling high-bandwidth unit-stride accesses, and reduce the indexing overhead to a single integer per row. Utilizing similar optimizations as described in [4], our SpMV routine achieves 40 GFlop/s on matrices derived from the intervening contour approach, with r = 5, leading to 81 nonzero diagonals.
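For illustration, here is a small NumPy sketch of SpMV in this diagonal layout, where each stored diagonal is a contiguous array and only its offset is kept; it mirrors the idea rather than the CUDA kernel, and the data layout (row-indexed, zero-padded diagonals) is an assumption of the sketch.

```python
# NumPy sketch of SpMV for a matrix stored by diagonals: `diagonals` is a
# list of 1-D arrays of length n (zero-padded where a diagonal runs off the
# matrix, indexed by row), and offsets[k] is the column-minus-row offset of
# diagonals[k]. Only one integer offset per diagonal is stored.
import numpy as np

def dia_spmv(diagonals, offsets, x):
    n = len(x)
    y = np.zeros(n)
    for diag, off in zip(diagonals, offsets):
        if off >= 0:
            # rows 0 .. n-off-1 read columns off .. n-1
            y[:n - off] += diag[:n - off] * x[off:]
        else:
            # rows -off .. n-1 read columns 0 .. n+off-1
            y[-off:] += diag[-off:] * x[:n + off]
    return y
```

Because every diagonal is read with unit stride and the column index is implied by the offset, the memory traffic is dominated by useful data, which is what makes this layout attractive for banded stencil matrices.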

4 Implementation and Results

Our code was written in CUDA [15], and comprises parallel k-means, convolution, and skeletonization routines in addition to the local cues and eigensolver routines. Space constraints prohibit us from detailing these routines, but it is important to note that our routines require CUDA architecture 1.1, with increased performance on the k-means routines on processors supporting CUDA architecture 1.2. The code from our implementation is freely available at http://parlab.eecs.berkeley.edu/research/damascene.

4.1 Accuracy

4.1.1 Berkeley Segmentation Dataset

Firstly, we need to show that our algorithms have not degraded the contour quality. We evaluate the quality of our contour detector using the BSDS benchmark [14]. As shown in Figure 6, we achieve the same F-metric (0.70) as the gPb algorithm [11], and the quality of our P-R curve is also very competitive with the curve generated by the gPb algorithm. Figure 7 illustrates contours for several images generated by our contour detector.

4.1.2 Larger Images

To investigate the accuracy of our contour detector on larger images, we repeated this precision-recall test on 4 images. To generate this data, we hand-labeled 4 images multiple times to create a human ground truth test for larger images².

² Labeled images and results in supplementary material


Figure 6: Precision Recall Curve for our Contour Detector. [Plot of Precision vs. Recall; curves: This work (F=0.70) and gPb (F=0.70).]

Figure 7: Selected image contours.

We used our existing contour detector, without retraining or changing the scales, and found that the contour detector worked on larger images as well, with an indicated F-metric of 0.75. Obviously, our test set was very small, so we are not claiming that this is the realistic F-metric on larger images; rather, we are simply showing that the detector provides reasonable results on larger images.

4.2 Runtime

To compare runtimes, we use the published gPb code, running on an Intel Core i7 920 (2.66 GHz) with 4 cores and 8 threads. The original gPb code is written mostly in C++, coordinated by MATLAB scripts, and uses MATLAB's eigs eigensolver, which is based on ARPACK and is reasonably optimized. We found MATLAB's eigensolver performed similarly to TRLan [18] on Normalized Cuts problems.

Although most of the computation in gPb was done in C++, there was one routine which was implemented in MATLAB and performed unacceptably: the convolutions required for local cue smoothing. In order to make our runtime comparisons fair, we wrote our own parallel convolution routine, taking full advantage of SIMD and thread parallelism on the Intel processor, and report the runtime using our convolution routine instead of the one which accompanies the gPb code.

To be conservative in our comparisons of our fully parallelized implementation with the serial gPb detector, we also took advantage of thread-level parallelism in our Intel convolution routine, and allowed MATLAB to parallelize the eigensolver over our 8-threaded Core i7 processor. This means that a completely serial version of gPb would be somewhat slower than the version we compare against.

Comparisons between gPb and this work are found in Table 1.


Table 1: Runtimes in seconds (0.15 MP image).

Component       gPb (Core i7)   This work (GTX 280)   Speedup
Preprocess           0.090             0.001            90×
Textons              8.58              0.159            54×
Local Cues          53.18              0.569            93×
Smoothing            0.59              0.270            2.2×
Int. Contour         6.32              0.031           204×
Eigensolver        151.2               0.777           195×
Post Process         2.7               0.006           450×
Total              236.7               1.822           130×

Table 2: GPU Scaling. Runtimes in seconds.

Processor     Preprocess   Textons   Local Cues   Smoothing   Int. Contour   Eigensolver   Post Process   Total
8600M GT         0.010      10.337      7.761       2.983        0.300          7.505         0.041       28.962
9800 GX2         0.003       2.311      1.226       0.530        0.056          1.329         0.009        5.497
GTX 280          0.001       0.159      0.569       0.270        0.031          0.777         0.006        1.822
Tesla C1060      0.002       0.178      0.584       0.267        0.03           1.166         0.006        2.243

4.3 Algorithmic Improvements

To isolate the algorithmic efficiency gains from the implementation efficiency gains, we examine the performance of the local cues extraction and the eigensolver.

Table 3: Local Cues Runtimes on GTX 280.

Local cues     Explicit Method   Integral Images
Runtime (s)         4.0               0.569

The explicit local cues method utilizes a parallelized version of the same histogram building approach found in gPb: it explicitly counts all pixels in each half-disc, for each orientation and scale. As shown in Table 3, the integral image approach is about 7× more efficient than the explicit method.

Table 4 shows the effect of various reorthogonalization strategies. Full reorthogonalization ensures that every new Lanczos vector v_j is orthogonal to all previous vectors. Selective reorthogonalization monitors the loss of orthogonality in the basis and performs a full reorthogonalization only when the loss of orthogonality is numerically significant to within machine floating-point tolerance. The strategy we use, as outlined earlier, is to forgo reorthogonalization, and use the Cullum-Willoughby test to remove spurious eigenvalues due to loss of orthogonality. As shown in the table, this approach provides a 20× gain in efficiency.

Table 4: Eigensolver Runtimes on GTX 280.

Eigensolver reorthogonalization    Full    Selective   None (C-W)
Runtime (s)                       15.83      3.60         0.78

4.4 Scalability

We ran our detector on a variety of commodity, single-socket graphics processors from Nvidia, with widely varying degrees of parallelism. These experiments were performed to demonstrate that our approach scales to a wide variety of processors. The exact specifications of the processors we used can be found in Table 5.

Table 5: Processor Specifications.

Processor model   Cores (Multiprocessors)   Memory Bandwidth (GB/s)   Clock Frequency (GHz)   Available Memory (MB)
8600M GT                     4                       12.8                     0.92                     256
9800 GX2                    16                       64                       1.51                     512
GTX 280                     30                      141.7                     1.30                    1024
C1060                       30                      102                       1.30                    4096

Figure 8: Performance scaling with parallelism (0.15 MP images). [Plot of images per second vs. number of cores, 0–32.]

Figure 8 shows how the runtime of our detector scales with increasingly parallel processors, with more details in Table 2. Each of the 4 processors we evaluated is represented on the plot of performance versus the number of cores. We have two processors with the same number of cores, but different amounts of memory bandwidth, which explains the different results at 30 cores. Clearly, our work efficiently capitalizes on parallel processors, which gives us confidence that performance will continue to increase on future generations of manycore processors.

Figure 9 demonstrates the runtime dependence on input image size. These experiments were all run on the Tesla C1060 processor, since we require its large memory capacity to compute contours on the larger images. Runtime dependence is mostly linear in the number of pixels over this range of image sizes.

5 Conclusion

In this work, we have demonstrated how the careful choice of parallel algorithms along with implementation on manycore processors can enable high quality, highly efficient image contour detection. We have detailed how one can use integral images to improve efficiency by replacing histogram construction with parallel prefix operations, even under arbitrary rotations. We have also shown how eigenproblems encountered in Normalized Cuts approaches to image segmentation can be efficiently solved by the Lanczos algorithm with the Cullum-Willoughby test.

Combining these contributions to create a contour detector, we show that runtime can be reduced over 100×, whilestill providing equivalent contour accuracy. We have also shown how our routines allow us to find image contours forlarger images, and detailed how our detector scales across processors with widely varying amounts of parallelism. This

Bryan Catanzaro, Bor-Yiing Su, Narayanan Sundaram et al. 91

Page 16: The Content-Based Image Retrieval Project...The Content-Based Image Retrieval Project Bryan Catanzaro and Kurt Keutzer 1 Introduction The Content Based Image Retrieval Project was

The Berkeley Par Lab: Progress in the Parallel Computing Landscape

0 5

10 15 20

0.E+00 5.E+05 1.E+06 2.E+06

Tim

e (s

econ

ds)

Pixels

Image Size Scalability

Figure 9: Runtime scaling with increased image size

makes us confident that future, even more parallel manycore processors will continue providing increased performance on image contour detection.

Future work includes using the components we have developed in other computer vision problems. It is possible that doing more image analysis with our optimized components will allow for yet higher image contour detection quality. Our contours could be integrated into a method which produces image segments, such as [1], which can be more natural in some applications, such as object recognition [8]. Other possibilities are also open, such as video segmentation. We believe that the efficiency gains we realize will allow high-quality image segmentation approaches to be more widely utilized in many contexts.

6 Acknowledgements

Thanks to Michael Maire for suggesting we investigate integral images, and Pablo Arbeláez for assisting us with the gPb algorithm. Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding, and by matching funding from U.C. Discovery (Award #DIG07-10227). We acknowledge the support of the Gigascale Systems Research Center, funded under the Focus Center Research Program, a Semiconductor Research Corporation Program.

Bibliography

[1] P. Arbeláez, M. Maire, and J. Malik. From contours to regions: An empirical evaluation. In CVPR, 2009.

[2] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[3] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, 2000.

[4] N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Supercomputing ’09, Nov. 2009.

[5] J. Bresenham. Algorithm for computer control of a digital plotter. IBM Systems Journal, 4(1):25–30, 1965.

[6] J. K. Cullum and R. A. Willoughby. Lanczos Algorithms for Large Symmetric Eigenvalue Computations. Vol. I: Theory. SIAM, 2002.

[7] S. Du, N. Zheng, Q. You, Y. Wu, M. Yuan, and J. We. Rotated haar-like features for face detection with in-plane rotation. LNCS, 4270/2006:128–137, 2006.

[8] C. Gu, J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, 2009.




[9] T. Leung and J. Malik. Contour continuity in region based image segmentation. In Proc. ECCV, LNCS 1406, pages 544–559. Springer-Verlag, 1998.

[10] R. Lienhart and J. Maydt. An extended set of haar-like features for rapid object detection. In Proc. IEEE Conf. on Image Processing, pages 155–162, New York, USA, 2002.

[11] M. Maire, P. Arbeláez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. CVPR, pages 1–8, June 2008.

[12] J. Malik, S. Belongie, J. Shi, and T. Leung. Textons, contours and regions: Cue integration in image segmentation. In ICCV ’99, page 918, Washington, DC, USA, 1999. IEEE Computer Society.

[13] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using brightness and texture, 2002.

[14] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV 2001, volume 2, pages 416–423, July 2001.

[15] Nvidia. Nvidia CUDA, 2007. http://nvidia.com/cuda.

[16] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, Aug 2000.

[17] M. Villamizar, A. Sanfeliu, and J. Andrade-Cetto. Computation of rotation local invariant features using the integral image for real time object detection. In Int’l. Conf. on Pattern Recognition, 2006.

[18] K. Wu and H. Simon. Thick-restart Lanczos method for large symmetric eigenvalue problems. SIAM Journal on Matrix Analysis and Applications, 22(2):602–616, 2001.



Fast Support Vector Machine Training and Classification on Graphics Processors

Bryan Catanzaro, Narayanan Sundaram, Kurt Keutzer

This article originally appeared as “Fast Support Vector Machine Training and Classification on Graphics Processors” by Bryan Catanzaro, Narayanan Sundaram, and Kurt Keutzer, in ICML ’08: Proceedings of the 25th International Conference on Machine Learning, pages 104–111, July 2008. Reprinted by permission of the authors.

Abstract

Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance implementations of machine learning algorithms. We describe a solver for Support Vector Machine training running on a GPU, using the Sequential Minimal Optimization algorithm and an adaptive first and second order working set selection heuristic, which achieves speedups of 9-35× over LIBSVM running on a traditional processor. We also present a GPU-based system for SVM classification which achieves speedups of 81-138× over LIBSVM (5-24× over our own CPU-based SVM classifier).

1 Introduction

Driven by the capabilities and limitations of modern semiconductor manufacturing, the computing industry is currently undergoing a massive shift towards parallel computing [1]. This shift brings dramatically enhanced performance to those algorithms which can be adapted to parallel computers.

One set of such algorithms are those used to implement Support Vector Machines [7]. Thanks to their robust generalization performance, SVMs have found use in diverse classification tasks, such as image recognition, bioinformatics, and text processing. Yet, training Support Vector Machines and using them for classification remains very computationally intensive. Much research has been done to accelerate training time, such as Osuna’s decomposition approach [17], Platt’s Sequential Minimal Optimization (SMO) algorithm [18], Joachims’ SVM light [13], which introduced shrinking and kernel caching, and the working set selection heuristics used by LIBSVM [9]. Despite this research, SVM training time is still significant for large training sets.

In this paper, we show how Support Vector Machine training and classification can be adapted to a highly parallel, yet widely available and affordable computing platform: the graphics processor, or more specifically, the Nvidia GeForce 8800 GTX, and detail the performance gains achieved.

The organization of the paper is as follows. Section 2 describes the SVM training and classification problems briefly. Section 3 gives an overview of the architectural and programming features of the GPU. Section 4 presents the details of the implementation of the parallel SMO approach on the GPU. Section 5 explains the implementation details of the SVM classification problem. We present our results in Section 6 and conclude in Section 7.

2 Support Vector Machines

We consider the standard two-class soft-margin SVM classification problem (C-SVM), which classifies a given data point x ∈ R^n by assigning a label y ∈ {−1, 1}.




2.1 SVM Training

Given a labeled training set consisting of a set of data points xi, i ∈ {1, ..., l} with their accompanying labels yi, i ∈ {1, ..., l}, the SVM training problem can be written as the following Quadratic Program:

max_α F(α) = Σ_{i=1}^{l} αi − (1/2) α^T Q α
subject to 0 ≤ αi ≤ C, ∀i ∈ {1, ..., l}
           y^T α = 0        (2.1)

where xi ∈ R^n is training data point i, yi ∈ {−1, 1} is the label attached to point xi, and the αi are weights, one for each training point, which are optimized to determine the SVM classifier. C is a parameter which trades classifier generality for accuracy on the training set, and Qij = yiyjΦ(xi, xj), where Φ(xi, xj) is a kernel function. We consider the standard kernel functions shown in Table 1.

Table 1: Standard kernel functions.

LINEAR        Φ(xi, xj) = xi · xj
POLYNOMIAL    Φ(xi, xj; a, r, d) = (a xi · xj + r)^d
GAUSSIAN      Φ(xi, xj; γ) = exp{−γ ||xi − xj||^2}
SIGMOID       Φ(xi, xj; a, r) = tanh(a xi · xj + r)
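For concreteness, the kernels in Table 1 can be written directly as small functions; the sketch below is an illustrative NumPy version for a single pair of points (the function names and signatures are ours, not the paper's GPU code).

```python
import numpy as np

# Illustrative single-pair kernel evaluations matching Table 1.
def linear(xi, xj):
    return np.dot(xi, xj)

def polynomial(xi, xj, a, r, d):
    return (a * np.dot(xi, xj) + r) ** d

def gaussian(xi, xj, gamma):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid(xi, xj, a, r):
    return np.tanh(a * np.dot(xi, xj) + r)
```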

2.1.1 SMO Algorithm

The SVM Training problem can be solved by many methods, each with different parallelism implications. We have implemented the Sequential Minimal Optimization algorithm [18], with a hybrid working set selection heuristic making use of the first order heuristic proposed by [14] as well as the second order heuristic proposed by [9].

The SMO algorithm is a specialized optimization approach for the SVM quadratic program. It takes advantage of the sparse nature of the support vector problem and the simple nature of the constraints in the SVM QP to reduce each optimization step to its minimum form: updating two αi weights. The bulk of the computation is then to update the Karush-Kuhn-Tucker optimality conditions for the remaining set of weights and then find the next two weights to update in the next iteration. This is repeated until convergence. We state this algorithm briefly, for reference purposes.

Algorithm 1: Sequential Minimal Optimization.
Input: training data xi, labels yi, ∀i ∈ {1..l}
Initialize: αi = 0, fi = −yi, ∀i ∈ {1..l}
Initialize: bhigh, blow, ihigh, ilow
Update αihigh and αilow
repeat
    Update fi, ∀i ∈ {1..l}
    Compute: bhigh, ihigh, blow, ilow
    Update αihigh and αilow
until blow ≤ bhigh + 2τ

For the first iteration, we initialize bhigh = −1, ihigh = min{i : yi = 1}, blow = 1, and ilow = min{i : yi = −1}. During each iteration, once we have chosen ihigh and ilow, we take the optimization step:

α′ilow = αilow + yilow(bhigh − blow)/η        (2.2)
α′ihigh = αihigh + yilowyihigh(αilow − α′ilow)        (2.3)

where η = Φ(xihigh, xihigh) + Φ(xilow, xilow) − 2Φ(xihigh, xilow). To ensure that this update is feasible, α′ilow and α′ihigh must be clipped to the valid range 0 ≤ αi ≤ C.
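A minimal sketch of this optimization step, assuming NumPy arrays for α, y, and the training data and a kernel callable like those above; the clipping follows the text, and the helper name is ours rather than the paper's.

```python
import numpy as np

def smo_step(alpha, y, X, kernel, i_high, i_low, b_high, b_low, C):
    """One SMO update (equations 2.2-2.3), clipping both weights to [0, C]."""
    eta = (kernel(X[i_high], X[i_high]) + kernel(X[i_low], X[i_low])
           - 2.0 * kernel(X[i_high], X[i_low]))
    a_low_new = alpha[i_low] + y[i_low] * (b_high - b_low) / eta
    a_low_new = float(np.clip(a_low_new, 0.0, C))
    a_high_new = alpha[i_high] + y[i_low] * y[i_high] * (alpha[i_low] - a_low_new)
    a_high_new = float(np.clip(a_high_new, 0.0, C))
    alpha[i_low], alpha[i_high] = a_low_new, a_high_new
    return alpha
```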




The optimality conditions can be tracked through the vector fi = Σ_{j=1}^{l} αjyjΦ(xi, xj) − yi, which is constructed iteratively as the algorithm progresses. After each α update, f is updated for all points. This is one of the major computational steps of the algorithm, and is done as follows:

f′i = fi + (α′ihigh − αihigh)yihighΦ(xihigh, xi) + (α′ilow − αilow)yilowΦ(xilow, xi)        (2.4)

In order to evaluate the optimality conditions, we define index sets:

Ihigh = {i : 0 < αi < C} ∪ {i : yi > 0, αi = 0} ∪ {i : yi < 0, αi = C}        (2.5)
Ilow = {i : 0 < αi < C} ∪ {i : yi > 0, αi = C} ∪ {i : yi < 0, αi = 0}        (2.6)

Because of the approximate nature of the solution process, these index sets are computed to within a tolerance ε, e.g. {i : ε < αi < (C − ε)}.

We can then measure the optimality of our current solution by checking the optimality gap, which is the difference between bhigh = min{fi : i ∈ Ihigh} and blow = max{fi : i ∈ Ilow}. When blow ≤ bhigh + 2τ, we terminate the algorithm.
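The sketch below computes the index sets, the optimality gap endpoints, and the corresponding argmin/argmax indices (which coincide with the first order working set choice of Section 2.1.2) on the host with NumPy; in the paper this is the reduction performed on the GPU, and the function name and ε handling are our own assumptions.

```python
import numpy as np

def optimality_state(f, alpha, y, C, eps=1e-6):
    """Return b_high, b_low and their indices from I_high, I_low (eqs. 2.5-2.6);
    convergence holds when b_low <= b_high + 2*tau."""
    interior = (alpha > eps) & (alpha < C - eps)
    high_mask = interior | ((y > 0) & (alpha <= eps)) | ((y < 0) & (alpha >= C - eps))
    low_mask = interior | ((y > 0) & (alpha >= C - eps)) | ((y < 0) & (alpha <= eps))
    b_high = float(np.min(f[high_mask]))
    b_low = float(np.max(f[low_mask]))
    i_high = int(np.flatnonzero(high_mask)[np.argmin(f[high_mask])])
    i_low = int(np.flatnonzero(low_mask)[np.argmax(f[low_mask])])
    return b_high, b_low, i_high, i_low
```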

2.1.2 Working Set Selection

During each iteration, we need to choose ihigh and ilow, which index the α weights which will be changed in the following optimization step. The first order heuristic from [14] chooses them as follows:

ihigh = arg min{fi : i ∈ Ihigh}        (2.7)
ilow = arg max{fi : i ∈ Ilow}        (2.8)

The second order heuristic from [9] chooses ihigh and ilow to optimize the unconstrained SVM functional. An optimal approach to this problem would require examining all (l choose 2) candidate pairs, which would be computationally intractable. To simplify the problem, ihigh is instead chosen as in the first order heuristic, and then ilow is chosen to maximally improve the objective function while still guaranteeing progress towards the constrained optimum from problem (2.1). More explicitly:

ihigh = arg min{fi : i ∈ Ihigh}        (2.9)
ilow = arg max{∆Fi(α) : i ∈ Ilow, fihigh < fi}        (2.10)

After choosing ihigh, we compute, for all i ∈ {1..l}:

βi = fihigh − fi        (2.11)
ηi = Φ(xihigh, xihigh) + Φ(xi, xi) − 2Φ(xihigh, xi)        (2.12)
∆Fi(α) = βi^2 / ηi        (2.13)

We then find the maximum ∆Fi over all valid points (i ∈ Ilow) for which we are guaranteed to progress towards the constrained optimum (fihigh < fi).
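As an illustration of the second order choice of ilow, a host-side NumPy sketch is shown below; in the paper the kernel row comes from the GPU cache rather than being recomputed, and the names and mask argument here are assumptions.

```python
import numpy as np

def second_order_ilow(f, i_high, X, kernel, low_mask):
    """Pick i_low maximizing beta_i^2 / eta_i (eqs. 2.11-2.13) over valid points."""
    k_hh = kernel(X[i_high], X[i_high])
    beta = f[i_high] - f
    eta = np.array([k_hh + kernel(x, x) - 2.0 * kernel(X[i_high], x) for x in X])
    eta_safe = np.where(eta > 0, eta, 1.0)          # guard against division by zero
    delta_F = np.where(eta > 0, beta ** 2 / eta_safe, -np.inf)
    valid = low_mask & (f > f[i_high])              # progress condition f_ihigh < f_i
    return int(np.argmax(np.where(valid, delta_F, -np.inf)))
```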

2.1.3 Adaptive Heuristic

The second order heuristic utilizes more information from the SVM training problem, and so it generally reduces the number of iterations necessary during the solution process. However, it is more costly to compute. In our GPU implementation, the geometric mean of iteration time over our benchmark set using the second order heuristic increased by 1.9× compared to the first order heuristic. On some benchmarks, the total number of iterations decreased sufficiently to provide a significant speedup overall, but on others, the second order heuristic is counterproductive for our GPU implementation.

To overcome this problem, we implemented an adaptive heuristic that chooses between the two selection heuristics dynamically, with no input or tuning from the user. The adaptive heuristic periodically samples progress towards convergence as a function of wall-clock time using both heuristics, then chooses the more productive heuristic.

This sampling occurs every l/10 iterations, and during each sample, the heuristic under test is executed for two phases of 64 iterations each. The average optimality gap in each of these phases is computed, and then the rate of progress is estimated by dividing the change in the optimality gap over the two phases by the time it has taken to execute them. The same sampling process is then performed with the other heuristic, and the best heuristic is then used until the next sampling period.
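A sketch of the sampling logic, under the assumption of a `run_phase(heuristic, n_iters)` callback that advances the solver and returns the average optimality gap over that phase; the timing and rate computation follow the description above, but the interface is hypothetical.

```python
import time

def pick_heuristic(run_phase, heuristics, phase_iters=64):
    """Run two phases per heuristic and keep the one reducing the gap fastest."""
    best_h, best_rate = None, -float("inf")
    for h in heuristics:
        t0 = time.perf_counter()
        gap_first = run_phase(h, phase_iters)
        gap_second = run_phase(h, phase_iters)
        elapsed = time.perf_counter() - t0
        rate = (gap_first - gap_second) / elapsed   # gap reduction per second
        if rate > best_rate:
            best_h, best_rate = h, rate
    return best_h
```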




2.2 SVM Classification

The SVM classification problem is as follows: for each data point z which should be classified, compute

z = sgn{ b + Σ_{i=1}^{l} yiαiΦ(xi, z) }        (2.14)

where z ∈ R^n is a point which needs to be classified, and all other variables remain as previously defined.

From the classification problem definition, it follows immediately that the decision surface is defined by referencing a subset of the training data, or more specifically, those training data points for which the corresponding αi > 0. Such points are called support vectors.

Generally, we classify not just one point, but a set of points. We exploit this for better performance, as explained in Section 5.

3 Graphics Processors

Graphics processors are currently transitioning from their initial role as specialized accelerators for triangle rasterization to general purpose engines for high throughput floating-point computation. Because they still service the large gaming industry, they are ubiquitous and relatively inexpensive.

GPU architectures are specialized for compute-intensive, memory-intensive, highly parallel computation, and therefore are designed such that more resources are devoted to data processing than caching or control flow. State of the art GPUs provide up to an order of magnitude more peak IEEE single-precision floating-point performance than their CPU counterparts. Additionally, GPUs have much more aggressive memory subsystems, typically endowed with more than 10× higher memory bandwidth than a CPU. Peak performance is usually impossible to achieve on general purpose applications, yet capturing even a fraction of peak performance yields significant speedup.

GPU performance is dependent on finding high degrees of parallelism: a typical computation running on the GPU must express thousands of threads in order to effectively use the hardware capabilities. As such, we consider it an example of future “many-core” processing [1]. Algorithms for machine learning applications will need to consider such parallelism in order to utilize many-core processors. Applications which do not express parallelism will not continue improving their performance when run on newer computing platforms at the rates we have enjoyed in the past. Therefore, finding large scale parallelism is important for compute performance in the future. Programming for GPUs is then indicative of the future many-core programming experience.

3.1 Nvidia GeForce 8800 GTX

In this project, we employ the NVIDIA GeForce 8800 GTX GPU, which is an instance of the G80 GPU architecture, and is a standard GPU widely available on the market. Pertinent facts about the GPU platform can be found in Table 2. We refer the reader to the Nvidia CUDA reference manual for more details [16].

Table 2: Nvidia GeForce 8800 GTX characteristics.

# OF STREAM PROCESSORS             128
PEAK GENERAL PURPOSE IEEE SP       346 GFlops
MULTIPROCESSOR LOCAL STORE SIZE    16 kB
CLOCK RATE                         1.35 GHz
MEMORY CAPACITY                    768 MB
MEMORY BANDWIDTH                   86.4 GB/s
CPU-GPU BANDWIDTH                  3.2 Gbit/s




3.2 CUDA

Nvidia provides a programming environment for its GPUs called the Compute Unified Device Architecture (CUDA). The user codes in annotated C++, accelerating compute intensive portions of the application by executing them on the GPU.


Figure 1: Logical organization of the GeForce 8800.

Figure 1 illustrates how the GPU appears to the programmer. The programmer organizes the computation into grids, which are organized as a set of thread blocks. The grids run sequentially on the GPU, meaning that all computation in the grid must finish before another grid is invoked. As mentioned, grids contain thread blocks, which are batches of threads that execute together, sharing local memories and synchronizing at programmer specified barriers. A maximum of 512 threads can comprise a thread block, which puts a limit on the scope of synchronization and communication in the computation. However, enormous numbers of blocks can be launched in parallel in the grid, so that the total number of threads that can be launched in parallel is very high. In practice, we need a large number of thread blocks to ensure that the compute power of the GPU is efficiently utilized.

4 SVM Training Implementation

Since GPUs need a large number of threads to efficiently exploit parallelism, we create one thread for every data point in the training set. For the first phase of the computation, each thread computes f′i from equation (2.4). We then apply a working set selection heuristic to select the next points which will be optimized. The details are explained in the following section.

4.1 Map Reduce

At least since the LISP programming language, programmers have been mapping independent computations onto partitioned data sets, using reduce operations to summarize the results. Recently, Google proposed a Map Reduce variant for processing large datasets on compute clusters [8]. This algorithmic pattern is very useful for extracting parallelism, since it is simple to understand, and maps well to parallel hardware, given the inherent parallelism in the map stage of the computation.

The Map Reduce pattern has been shown to be useful for many machine learning applications [5], and is a natural fit for our SVM training algorithm. For the first order heuristic, the computation of f′i for all points is the map function, and the search for blow, bhigh, ilow and ihigh is the reduction operation. For the second order heuristic, there are two Map Reduce stages: one to compute f′i, bhigh and ihigh, and another where the map stage computes ∆Fi for all points, while the reduce stage computes blow and ilow.

Because the CUDA programming model has strict limitations on synchronization and communication between thread blocks, we organize the reductions in two phases, as shown in Figure 2. The first phase does the map computation, as well as a local reduce within a thread block. The second phase finishes the global reduction. Each phase of this process is implemented as a separate call to the GPU.





Figure 2: Structuring the Map Reduce.

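The two-phase structure can be mimicked on the host as below: each simulated thread block reduces its slice to a (value, index) pair, and a second pass reduces the per-block results. This NumPy sketch only illustrates the shape of the reduction; on the GPU each phase is a separate kernel launch, and the block size and function name here are assumptions.

```python
import numpy as np

def two_phase_argmin(f, block_size=256):
    """Find (min value, index) of f with a block-local reduce then a global reduce."""
    block_vals, block_idx = [], []
    for start in range(0, len(f), block_size):      # phase 1: map + local reduce
        stop = min(start + block_size, len(f))
        local = start + int(np.argmin(f[start:stop]))
        block_vals.append(float(f[local]))
        block_idx.append(local)
    best = int(np.argmin(block_vals))               # phase 2: global reduce
    return block_vals[best], block_idx[best]
```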

4.2 Implementation Details

4.2.1 Caching

Since evaluating the kernel function Φ(·) is the dominant part of the computation, it is useful to cache as much as possible from the matrix of kernel function evaluations Kij = Φ(xi, xj) [13]. We compute rows of this matrix on the fly, as needed by the algorithm, and cache them in the available memory on the GPU.

When updating the vector f, we need access to two rows of K, since we have changed exactly two entries in α. In our system, the CPU checks to see which of these two rows, if any, are present in the cache. If a row is not present, the CPU voids the least recently used row of the cache, and assigns it to the new row which is needed. For the rows which hit in the cache, the GPU avoids doing the kernel evaluations. Otherwise, the GPU writes out the appropriate row or rows after computing the kernel values. When using the second order heuristic, the computation of ∆F references the row of K corresponding to ihigh, which guarantees that the next update of f will have a cache hit for its access to the same row.
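A host-side sketch of such a least-recently-used row cache is shown below, assuming an in-memory dictionary keyed by row index; the class name and capacity parameter are ours, and in the paper the rows live in GPU memory and the kernel evaluations run in parallel on the device.

```python
from collections import OrderedDict
import numpy as np

class KernelRowCache:
    """LRU cache of rows of K, where K[j][i] = kernel(x_i, x_j)."""
    def __init__(self, X, kernel, max_rows):
        self.X, self.kernel, self.max_rows = X, kernel, max_rows
        self.rows = OrderedDict()

    def get_row(self, j):
        if j in self.rows:                      # cache hit: skip kernel evaluations
            self.rows.move_to_end(j)
            return self.rows[j]
        if len(self.rows) >= self.max_rows:     # evict the least recently used row
            self.rows.popitem(last=False)
        row = np.array([self.kernel(x, self.X[j]) for x in self.X])
        self.rows[j] = row
        return row
```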

4.2.2 Data Movement

Programming the GPU requires manually copying data from the host computer to the GPU and vice versa, and it also requires manually copying data from the GPU’s global memory to the fast local stores. As mentioned previously, if the cache does not contain a particular row of K corresponding to the point xj, that row will need to be generated, which means that we need to compute Φ(xi, xj) ∀i ∈ {1..l}. Since the vector xj is shared between all computations, we load it into the GPU’s local store. This is key to performance, since accessing the local store is orders of magnitude faster than accessing the global memory.

4.3 Related Work

There have been previous attempts to parallelize the SVM training problem. The most similar to ours is [4], which parallelizes the SMO algorithm on a cluster of computers using MPI. Both our approach and their approach use the concurrency inherent in the KKT condition updates as the major source of parallelism. However, in terms of implementation, GPUs present a completely different model than clusters, and hence the amount of parallelism exploited, such as the number of threads, granularity of computation per thread, memory access patterns, and data partitioning are very different. We also implement more sophisticated working set selection heuristics.

Many other approaches for parallelizing SVM training have been presented. The cascade SVM [11] is another proposed method for parallelizing SVM training on clusters. It uses a method of divide and conquer to solve large SVM problems. [21] parallelize the underlying QP solver using the Parallel Gradient Projection Technique. Work has been done on using a parallel Interior Point Method for solving the SVM training problem [20]. [6] proposes a method where several smaller SVMs are trained in a parallel fashion and their outputs weighted using an Artificial Neural




Network. [10] implement a gradient based solution for SVM training, which relies on data parallelism in computing the gradient of the objective function for an unconstrained QP optimization at its core. Some of these techniques, for example the training set decomposition approaches like the Cascade SVM, are orthogonal to the work we describe, and could be applied to our solver. [3] give an extensive overview of parallel SVM implementations. We implemented the parallel SMO training algorithm because of its relative simplicity, yet high performance and robust convergence characteristics.

5 SVM Classification Implementation

We approached the SVM classification problem by making use of Map Reduce computations as well as vendor supplied Basic Linear Algebra Subroutines, specifically the Matrix Matrix Multiplication routine (SGEMM), which calculates C′ = αAB + βC, for matrices A, B, and C and scalars α and β. For the Linear, Polynomial, and Sigmoid kernels, calculating the classification value involves finding the dot product between all test points and the support vectors, which is done through SGEMM. For the Gaussian kernel, we use the simple identity ||x − y||^2 = x · x + y · y − 2x · y to recast the computation into a Matrix Matrix multiplication, where the SGEMM computes Dij = −γ||zi − xj||^2 = 2γ(zi · xj) − γ(zi · zi + xj · xj), for a set of unknown points z and a set of support vectors x. We then apply a map reduce computation to combine the computed D values to get the final result.

Continuing the Gaussian example, the map function exponentiates Dij element-wise and multiplies each column of the resulting matrix by the appropriate yjαj. The reduce function sums the rows of the matrix and adds b to obtain the final classification for each data point as given by equation (2.14). Other kernels require similar Map Reduce calculations to finish the classification.
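The Gaussian case can be written compactly as below, with NumPy's matrix multiply standing in for SGEMM and the element-wise map and row-sum reduce done on the host; the function name and argument layout are assumptions, not the paper's interface.

```python
import numpy as np

def classify_gaussian(Z, SV, y_sv, alpha_sv, b, gamma):
    """Batch classification (eq. 2.14) with the Gaussian kernel via a GEMM."""
    zz = np.sum(Z * Z, axis=1)[:, None]                # z_i . z_i
    xx = np.sum(SV * SV, axis=1)[None, :]              # x_j . x_j
    D = 2.0 * gamma * (Z @ SV.T) - gamma * (zz + xx)   # D_ij = -gamma * ||z_i - x_j||^2
    K = np.exp(D)                                      # map: element-wise exponentiation
    scores = K @ (y_sv * alpha_sv) + b                 # reduce: weighted row sums plus b
    return np.sign(scores)
```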

6 Results

The SMO implementation on the GPU is compared with LIBSVM, as LIBSVM uses Sequential Minimal Optimization for SVM training. We used the Gaussian kernel in all of our experiments, since it is widely employed.

6.1 Training

We tested the performance of our GPU implementation versus LIBSVM on the datasets detailed in Tables 3 and 4.

Table 3: Datasets - References and training parameters.

DATASET       C     γ
ADULT [2]     100   0.5
WEB [18]      64    7.8125
MNIST [15]    10    0.125
USPS [12]     10    2^−8
FOREST [2]    10    0.125
FACE [19]     10    0.125

The sizes of the datasets are given in Table 4. References for the datasets used and the (C, γ) values used for SVM training are provided in Table 3.

We ran LIBSVM on an Intel Core 2 Duo 2.66 GHz processor, and gave LIBSVM a cache size of 650 MB, which is larger than our GPU implementation was allowed. CPU-GPU communication overhead was included in the solver runtime, but file I/O time was excluded for both our solver and LIBSVM. Table 5 shows results from our solver. File I/O varies from 1.2 seconds for USPS to about 12 seconds for the Forest dataset. The CPU-GPU data transfer overhead was also very low. The time taken to transfer the training data to the GPU and copy the results back was less than 0.6 seconds, even for our largest dataset (Forest).

Since any two solvers give slightly different answers on the same optimization problem, due to the inexact nature of the optimization process, we show the number of support vectors returned by the two solvers as well as how close the final values of b were for the GPU solver and LIBSVM, which were both run with the same tolerance value τ = 0.001.




Table 4: Dataset size.

DATASET    # POINTS    # DIMENSIONS
ADULT      32,561      123
WEB        49,749      300
MNIST      60,000      784
USPS       7,291       256
FOREST     561,012     54
FACE       6,977       381

As shown in the table, the deviation in the number of support vectors between the two solvers is less than 2%, and the deviation in the offset b is always less than 0.1%. Our solver provides equivalent accuracy to the LIBSVM solver, which will be shown again in the classification results section.

Table 5: SVM training convergence comparison.

DATASET    NUMBER OF SVS (GPU ADAPTIVE)    NUMBER OF SVS (LIBSVM)    DIFFERENCE IN b (%)
ADULT      18,674                          19,058                    −0.004
WEB        35,220                          35,232                    −0.01
MNIST      43,730                          43,756                    −0.04
USPS       684                             684                       0.07
FOREST     270,351                         270,311                   0.07
FACE       3,313                           3,322                     0.01

Table 6: SVM training results.

            GPU 1ST ORDER          GPU 2ND ORDER          GPU ADAPTIVE           LIBSVM                  SPEEDUP (×)
DATASET     ITER.      TIME (S)    ITER.      TIME (S)    ITER.      TIME (S)    ITER.      TIME (S)     (ADAPTIVE)
ADULT       114,985    30.15       40,044     30.46       64,446     26.92       43,735     550.2        20.4
WEB         79,749     174.17      81,498     290.23      70,686     163.89      85,299     2422.46      14.8
MNIST       68,055     475.42      67,731     864.46      68,113     483.07      76,385     16,965.79    35.1
USPS        6,949      0.596       3,730      0.546       4,734      0.576       4,614      5.092        8.8
FOREST      2,070,867  4571.17     236,601    1441.08     450,506    2023.24     275,516    66,523.53    32.9
FACE        6,044      1.30        4,876      1.30        5,535      1.32        5,342      27.61        20.8

Table 6 contains performance results for the two solvers. We see speedups in all cases from 9× to 35×. For reference, we have shown results for the solvers using both heuristics statically. Examining the data shows that the adaptive heuristic performs robustly, surpassing or coming close to the performance of the best static heuristic on all benchmarks.

6.2 Classification

Results for our classifier are presented in Table 8. We achieve 81-138× speedup over LIBSVM on the datasets shown. As with the solver, file I/O times were excluded from overall runtime. File I/O times vary from 0.4 seconds for the Adult dataset to about 6 seconds for the MNIST dataset.




6.2.1 Optimizations to CPU Based Classifier

LIBSVM classifies data points serially. This effectively precludes data locality optimizations and produces significant slowdown. It also represents data in a sparse format, which can cause overhead as well.

To optimize the CPU classifier, we performed the following:

1. We changed the data structure used for storing the support vectors and test vectors from a sparse indexed set to a dense matrix.

2. To maximize performance, we used BLAS routines from the Intel Math Kernel Library to perform operations similar to those mentioned in Section 5.

3. Wherever possible, loops were parallelized (2-way for the dual-core machine) using OpenMP.

These optimizations improved the classification speed on the CPU by a factor of 3.4-28.3×. The speedup numbers for the different datasets are shown in Table 8. It should be noted that the GPU version is better than the optimized CPU versions by a factor of 4.9-23.9×.

For some insight into these results, we note that the optimized CPU classifier performs best on problems with a large number of input space dimensions, which helps make the SVM classification process compute bound. For problems with a small number of input space dimensions, the SVM classification process is memory bound, meaning it is limited by memory bandwidth. Since the GPU has much higher memory bandwidth, as noted in Section 3, it is even more attractive for such problems.

We tested the combined SVM training and classification process for accuracy by using the SVM classifier produced by the GPU solver with the GPU classification routine, and used the SVM classifier provided by LIBSVM’s solver to perform classification with LIBSVM. Thus, the accuracy of the classification results presented in Table 7 reflects the overall accuracy of the GPU solver and GPU classifier system. The results are identical, which shows that our GPU-based SVM system is as accurate as traditional CPU-based methods.

Table 7: Accuracy of GPU SVM classification vs. LIBSVM.

DATASET    GPU ACCURACY    LIBSVM ACCURACY
ADULT      6619/8000       6619/8000
WEB        3920/4000       3920/4000
MNIST      2400/2500       2400/2500
USPS       1948/2007       1948/2007
FACE       23665/24045     23665/24045

7 Conclusion

This work has demonstrated the utility of graphics processors for SVM classification and training. Training time is reduced by 9-35×, and classification time is reduced by 81-138× compared to LIBSVM, or 5-24× over our own CPU-based SVM classifier. These kinds of performance improvements can change the scope of SVM problems which are routinely solved, increasing the applicability of SVMs to difficult classification problems. For example, training a classifier for an input data set with almost 600,000 data points and 50 dimensions takes only 34 minutes on the GPU, compared with over 18 hours on the CPU.

The GPU is a very low cost way to achieve such high performance: the GeForce 8800 GTX fits into any modern desktop machine, and currently costs $300. Problems which used to require a compute cluster can now be solved on one’s own desktop. New machine learning algorithms that can take advantage of this kind of performance, by expressing parallelism widely, will provide compelling benefits on future many-core platforms.




Table 8: Performance of GPU SVM classifier compared to LIBSVM and Optimized CPU classifier.

           LIBSVM      CPU OPTIMIZED CLASSIFIER              GPU CLASSIFIER
DATASET    TIME (S)    TIME (S)    SPEEDUP (×) vs LIBSVM     TIME (S)    SPEEDUP (×) vs LIBSVM    SPEEDUP (×) vs CPU OPTIMIZED CODE
ADULT      61.307      7.476       8.2                       0.575       106.6                    13.0
WEB        106.835     15.733      6.8                       1.063       100.5                    14.8
MNIST      269.880     9.522       28.3                      1.951       138.3                    4.9
USPS       0.777       0.229       3.4                       0.00958     81.1                     23.9
FACE       88.835      5.191       17.1                      0.705       126.0                    7.4

8 Acknowledgements

The authors acknowledge the support of the Gigascale Systems Research Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program. Bryan Catanzaro is also supported by a National Science Foundation Graduate Research Fellowship. The authors thank the anonymous reviewers for their comments and suggestions.

Bibliography

[1] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[2] A. Asuncion and D. Newman. UCI machine learning repository, 2007.

[3] L. Bottou, O. Chapelle, D. DeCoste, and J. Weston. Large-Scale Kernel Machines. The MIT Press, 2007.

[4] L. Cao, S. Keerthi, C.-J. Ong, J. Zhang, U. Periyathamby, X. J. Fu, and H. Lee. Parallel sequential minimal optimization for the training of support vector machines. IEEE Transactions on Neural Networks, 17:1039–1049, 2006.

[5] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 281–288. MIT Press, Cambridge, MA, 2007.

[6] R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5):1105–1114, 2002.

[7] C. Cortes and V. Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, 1995.

[8] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI ’04: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, Berkeley, CA, USA, 2004. USENIX Association.

[9] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training support vector machines. J. Mach. Learn. Res., 6:1889–1918, 2005.

[10] L. V. Ferreira, E. Kaskurewicz, and A. Bhaya. Parallel implementation of gradient-based neural networks for SVM training. International Joint Conference on Neural Networks, Apr 2006.




[11] H. P. Graf, E. Cosatto, L. Bottou, I. Dourdanovic, and V. Vapnik. Parallel support vector machines: The cascade SVM. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 521–528. MIT Press, Cambridge, MA, 2005.

[12] J. J. Hull. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell., 16(5):550–554, 1994.

[13] T. Joachims. Making large-scale support vector machine learning practical. In Advances in kernel methods: support vector learning. MIT Press, Cambridge, MA, USA, 1999.

[14] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt’s SMO Algorithm for SVM Classifier Design. Neural Comput., 13(3):637–649, 2001.

[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[16] Nvidia. Nvidia CUDA, 2007. http://nvidia.com/cuda.

[17] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. Neural Networks for Signal Processing [1997] VII. Proceedings of the 1997 IEEE Workshop, pages 276–285, 1997.

[18] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in kernel methods: support vector learning, pages 185–208. MIT Press, Cambridge, MA, USA, 1999.

[19] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, 1998.

[20] G. Wu, E. Chang, Y. K. Chen, and C. Hughes. Incremental approximate matrix factorization for speeding up support vector machines. In KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 760–766, New York, NY, USA, 2006. ACM Press.

[21] L. Zanni, T. Serafini, and G. Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. J. Mach. Learn. Res., 7:1467–1492, 2006.
