



A Hardware Implementation of Hierarchical Clustering for Text Documents

Dan Legorreta
[email protected]

Moshe Looks
[email protected]

Shobana Padmanabhan
[email protected]

Abstract— Clustering is a form of unsupervised learning where a collection of unlabeled data points is partitioned into groups (clusters) according to some similarity metric, maximizing similarity intracluster while minimizing similarity intercluster. In hierarchical clustering, the groups are clustered into subgroups, etc., organizing the data as a tree. Many clustering algorithms are superlinear, while pertinent datasets are growing in size exponentially. Since clustering algorithms also tend to be highly parallelizable, there is a strong motivation for speedup via hardware acceleration. We present a simple hierarchical clustering algorithm for text documents, adapt it for hardware acceleration, and implement it on the FPX platform. To our knowledge, this is the first work to present a hierarchical clustering algorithm designed for straightforward hardware implementation, and ours is the first hardware-accelerated implementation of a hierarchical clustering algorithm.

I. MOTIVATION AND OVERVIEW

As databases of interest grow exponentially larger, data-mining algorithms are struggling to keep up (cf. [10], [4]). This is a broad phenomenon, affecting at least the following areas:

• Gene sequences and other bioinformatics data sources
• Sales and other business transactions
• The World Wide Web
• Various data sources utilized within the intelligence community

Thus, there is compelling motivation to move data-mining algorithms into hardware in order to accelerate them (i.e., take advantage of specialized circuitry). In this paper, we focus on a particular data-mining problem, the hierarchical clustering of text documents based on document similarity. To our knowledge, this is the first work to present a hierarchical clustering algorithm designed for straightforward hardware implementation.1

Section II briefly reviews the clustering problem in general, with a particular focus on the clustering of text documents. Section III describes our clustering algorithm. The FPX platform and LEON2 processor, which we have chosen for our implementation, are introduced in Section IV, and their selection justified. In Section VI we then describe how the primary computation of the algorithm (scoring) may be offloaded onto a specialized circuit; we describe the speedup in Section VII, and conclude in Section VIII.

II. THE CLUSTERING PROBLEM

The clustering problem is to partition a given set of items into subsets, typically maximizing intracluster similarity while minimizing intercluster similarity, according to a given similarity metric. In hierarchical clustering, the clusters are themselves recursively clustered. Well-known “flat” (as opposed to hierarchical) clustering algorithms include k-means and its generalization, expectation maximization [7].

A well-studied special case is where the items correspond to text documents and the similarity metric is based on viewing a document as a “bag-of-words”, or “bag-of-word-disjunctions” [16], [17]. That is, documents are points in a high-dimensional space where each dimension corresponds to the frequency/occurrence of one or more words in the document.
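As a concrete illustration, a document can be reduced to a k-wide bit-vector by mapping each word to a dimension and setting the corresponding bit. The following is a minimal C sketch of our own, not part of the implementation: the hash-based word mapping, the vocabulary handling, and all names here are assumptions (a real system might use a curated word mapping table instead, cf. [10]).

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define K 4000                     /* number of dimensions (assumed) */
    #define WORDS ((K + 63) / 64)      /* 64-bit words per bit-vector */

    /* Map a word to a dimension in [0, K); djb2 hash used as a stand-in. */
    static size_t word_dim(const char *w) {
        size_t h = 5381;
        while (*w) h = h * 33 + (unsigned char)*w++;
        return h % K;
    }

    /* Build the bag-of-words bit-vector: one bit per dimension, set if
       any word mapped to that dimension occurs in the document. */
    void to_bitvector(const char *words[], size_t n, uint64_t vec[WORDS]) {
        memset(vec, 0, WORDS * sizeof(uint64_t));
        for (size_t i = 0; i < n; ++i) {
            size_t d = word_dim(words[i]);
            vec[d / 64] |= (uint64_t)1 << (d % 64);
        }
    }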

There are a number of important factors to consider in crafting an approach to text document clustering which have been identified in the literature:

1 A number of versions of the flat (non-hierarchical) k-means algorithm have been implemented in hardware, most frequently for image clustering (cf. [5]).


• The similarity measure used is as important as the algorithm chosen; normalized measures such as the cosine measure,

    s(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|_2 \, \|\vec{y}\|_2},    (1)

are better suited than unnormalized distance metrics (e.g., Euclidean or Manhattan) [17]. (A minimal C sketch of this measure appears after this list.)

• Hierarchical top-down approaches where the set of documents is recursively bisected outperform flat clustering algorithms such as k-means, as well as hierarchical bottom-up clustering methods based on agglomeration [16].

• Approaches based on the occurrence of phrases (n-grams) generally do not outperform simpler approaches based on single words [3].

• The dimensionality of the space utilized may be reduced by a number of means without adversely impacting the quality of the results [15].
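As referenced above, here is a minimal C sketch of the cosine measure of Eq. (1) for dense real-valued vectors (our own illustration only; the hardware design below operates on bit-vectors instead):

    #include <math.h>
    #include <stddef.h>

    /* Cosine measure of Eq. (1): dot product divided by the product of
       the Euclidean norms (assumes neither vector is all zeros). */
    double cosine(const double *x, const double *y, size_t k) {
        double dot = 0.0, nx = 0.0, ny = 0.0;
        for (size_t i = 0; i < k; ++i) {
            dot += x[i] * y[i];
            nx  += x[i] * x[i];
            ny  += y[i] * y[i];
        }
        return dot / (sqrt(nx) * sqrt(ny));
    }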

III. CLUSTERING VIA HIERARCHICAL PARTITIONING

Hierarchical Partitioning (HP) clustering may be defined as follows:

• Input: A set of N k-dimensional points (vectors) in a binary space.

• Output: A binary tree where each point corresponds to exactly one leaf.

Partition the points (from the root of the tree downward) into sets left and right, maximizing score(left) + score(right). The algorithm terminates when left and right are individual points. The function score maps sets of points to reals, and measures cohesiveness. Specifically, for a set of points C = \{\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_n\}, let

    \vec{sum}(C) = \sum_{i=1}^{n} \vec{a}_i    (2)

and

    score(C) = \frac{1}{n} \sum_{i=1}^{n} \frac{\vec{sum}(C) \cdot \vec{a}_i}{|\vec{a}_i|}.    (3)

Figure 1 shows the clustering results for a subset of 160 documents from the popular 20-newsgroup dataset [1] (the first 40 documents each from the groups rec.sport.hockey, rec.sport.baseball, talk.politics.mideast, and rec.motorcycles). The tree has been automatically pruned based on cluster quality at each level. Data dimensionality is reduced to 4000-wide vectors (from a dimensionality in the hundreds of thousands) via a word mapping table as described in [10].

A. Performance Considerations

To approximate the optimal partitioning, hill-climbing is used; the set of points is randomly partitioned by randomly assigning each point to either left or right. Moves of a single point between left and right which increase the sum of the scores are taken until no more such moves are possible (i.e., a local maximum is reached).
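In outline, the hill-climbing looks as follows. This is a C sketch under stated assumptions: point_t, score_t, and the helper score_of_side are hypothetical names standing in for the score code given below.

    #include <stdlib.h>

    /* Hill-climb a random bipartition: side[i] is 0 (left) or 1 (right).
       score_of_side(points, side, n, s) is assumed to compute score()
       over the points currently assigned to side s. */
    void hp_partition(const point_t *points, int n, int *side) {
        for (int i = 0; i < n; ++i)
            side[i] = rand() % 2;                  /* random initial split */
        int improved = 1;
        while (improved) {                         /* until a local maximum */
            improved = 0;
            for (int i = 0; i < n; ++i) {
                score_t before = score_of_side(points, side, n, 0)
                               + score_of_side(points, side, n, 1);
                side[i] ^= 1;                      /* tentatively move point i */
                score_t after  = score_of_side(points, side, n, 0)
                               + score_of_side(points, side, n, 1);
                if (after > before) improved = 1;  /* keep the improving move */
                else side[i] ^= 1;                 /* otherwise revert */
            }
        }
    }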

To avoid using floating-point numbers, score(C) is approximated by

    \left\lfloor \frac{1}{n} \sum_{i=1}^{n} \left\lfloor \frac{\vec{sum}(C) \cdot \vec{a}_i \cdot k}{|\vec{a}_i|} \right\rfloor \right\rfloor.    (4)

The magnitudes, |\vec{a}_i|, may be easily precomputed and cached in an array. For each computation of score(C), we only need to compute \vec{sum}(C) once. Each entry in the summation vector is stored in only eight bits. Thus, we obtain the following C code:

    score_t score(iterator from, iterator to) {
        score_t sc = 0;
        base_t sum[K];
        iterator it;

        /* Compute sum(C): accumulate every vector in the set. */
        set_all_to_zero(sum);
        for (it = from; it != to; ++it)
            add_to(*it, sum);

        /* Accumulate K * (sum(C) . a_i) / |a_i| over every vector. */
        for (it = from; it != to; ++it)
            sc += (K * dot_prod(sum, *it)) / magnitude(*it);
        sc /= distance(from, to);   /* divide by n, the number of vectors */
        return sc;
    }

The asymptotically dominant operations here are:
1) Calls to the add_to function, which adds a k-wide vector of bits into a k-wide vector of base_ts.
2) Calls to the dot_prod function, which computes the dot product of a k-wide vector of base_ts with a k-wide vector of bits.


[Figure 1 near here: the pruned clustering tree over the 160 documents, with leaves such as 100% motorcycles (3), 66% mideast (21), 63% hockey (11), and 66% baseball (12).]

Fig. 1. Clustering results for a subset of the 20-newsgroup dataset. In parentheses are the total number of documents in a node’s subtree. Leaves show the most frequent category present (e.g., motorcycles) and its frequency.

Experimentally, almost all of the algorithm’s execution time is evenly divided between add_to and dot_prod.
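Plausible software implementations of these two kernels are sketched below (our own; the 8-bit summation entries follow the text, while the packed bit-vector layout and exact signatures are assumptions):

    #include <stdint.h>

    #define K 4000                        /* vector dimensionality */
    #define WORDS ((K + 63) / 64)         /* 64-bit words per bit-vector */
    typedef uint8_t base_t;               /* 8-bit summation entries */

    /* add_to: add a k-wide bit-vector into the k-wide summation vector. */
    void add_to(const uint64_t bits[WORDS], base_t sum[K]) {
        for (int d = 0; d < K; ++d)
            sum[d] += (bits[d / 64] >> (d % 64)) & 1;
    }

    /* dot_prod: dot product of the summation vector with a bit-vector;
       since each bit is 0 or 1, this sums the selected entries of sum. */
    uint32_t dot_prod(const base_t sum[K], const uint64_t bits[WORDS]) {
        uint32_t acc = 0;
        for (int d = 0; d < K; ++d)
            acc += sum[d] * (uint32_t)((bits[d / 64] >> (d % 64)) & 1);
        return acc;
    }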

IV. THE LIQUID ARCHITECTURE PLATFORM AND THE LEON2 PROCESSOR

The Liquid architecture platform was developed by the Liquid architecture group [9] at Washington University. The following description is from [14].

The platform aims to measure and improve application performance, by providing an easily and efficiently reconfigurable architecture, along with software support to expedite its use. The Liquid architecture system has been implemented as an extensible hardware module on the Field-programmable Port Extender (FPX) platform [12], which is described in Section V.

The Liquid platform instantiates LEON V2, a soft-core processor used by the European Space Agency [6] for embedded systems development. LEON is open source and hence can be downloaded for free.

A. LEON Processor

Figure 2 shows the fairly sophisticated computer architecture of LEON: separate instruction and data caches, the full SPARC V8 instruction set, a 5-stage pipeline, a generic co-processor interface, and separate (standard AMBA) buses for high-speed memory access and low-speed peripheral control.

We use the co-processor interface to integrate our scoring circuit.

B. Co-processor Interface

The following description of the co-processor interface is from the LEON user manual [2].

The generic co-processor interface of LEON allows an execution unit to operate in parallel to increase performance. One coprocessor instruction can be started each cycle as long as there are no data dependencies. When finished, the result is written back to the co-processor register file.

The timing (waveform) diagram for the execution unit interface can be seen in Figure 3. The execution unit is started by asserting the start signal together with a valid opcode. The operands are driven on the following cycle together with the load signal. If the instruction will take more than one cycle to complete, the execution unit must drive busy from the cycle after the start signal was asserted, until the cycle before the result is valid. The result, condition codes and exception information are valid from the cycle after the de-assertion of busy, and until the next assertion of start. The opcode (cpi.opcode[9:0]) is the concatenation of bits [19,13:5] of the instruction. If execution of a co-processor instruction needs to be prematurely aborted (due to an IU trap), cpi.flush will be asserted for two clock cycles. The execution unit should then be reset to its idle condition.


Fig. 2. High-level architecture of LEON processor.

Fig. 3. Co-processor execution unit timing diagram.

C. Implementation of Scoring through Co-processor Interface


Implementing the integration of our scoring circuit through the co-processor interface involved the following.2

2 We thank Phillip Jones of the Liquid architecture group for his help with integrating our scoring circuit through the co-processor interface.

• device.vhd was modified to enable the co-processor interface.

• hclust1eu.vhd was created based on fp1eu.vhd provided with the LEON2 distribution. This component gets instantiated in proc.vhd. It contains the implementation of the co-processor interface integration, in particular the co-processor register file management, which is done in the opcode decode process.


Opcode encodings for the scoring circuit:

    Opcode               Encoding
    HCLUST_STG1          111100001
    HCLUST_STG2          111100010
    HCLUST_STG1_RESET    111100110
    HCLUST_STG3          111100011
    HCLUST_RESET         111111111
    HCLUST_NOP           111100000

• sparcv8.vhd was modified to include opcode definitions for the scoring circuit. Opcodes are defined for stage 1, stage 2, stage 1 reset, stage 3, reset, and nop. Their encodings are shown in the table above.

• hcluster_cp_core.vhd was created to instantiate the scoring circuit. The scoring circuit itself is described in Section VI.

D. Cross Compiler

The Liquid platform uses the sparc-elf cross-compiler suite distributed with LEON2. The following is a description from the distributed documentation.

The compiler is based on the GNU compiler tools and the Newlib standalone C library. The cross-compiler allows compilation of sequential (non-tasking) C and C++ applications. It supports both hard and soft floating-point operations, as well as both V7 and V8 multiply and divide instructions.

Packages distributed in the suite include:
• GNU GCC C/C++ compiler v3.2.3
• Newlib C-library v1.12.0
• Libio low-level I/O routines for LEON2
• GDB debugger v5.3

E. Liquid Platform User Interface

The Liquid platform provides a web interface as well as command line options for the following:

• LEON status - to check if LEON has started up.

• Load program - to load a program into memory at a specific address.

• Start LEON - to instruct LEON to execute the program that has been loaded into memory at a given address.

Fig. 4. FPX Platform used by Liquid architecture platform.

• Read memory - to read a specified number of bytes at a given address. This can be used to verify the program that was loaded, or to read the results of a program which writes results to memory.

• Get statistics - to get program execution time (number of hardware clock cycles), cache statistics (number of cache reads, writes, hits, misses), etc.

• Reset LEON - to reset LEON.

V. FPX PLATFORM

The following description is from [14].

The FPX platform, developed by Washington University’s Reconfigurable Network Group [13], provides an environment where a circuit implemented in FPGA hardware can be interfaced with SRAM, SDRAM, and high speed network interfaces. Hardware modules on the FPX can use some or all of a large Xilinx Virtex XCV2000E FPGA to implement a logic function [8]. By using the FPX platform, resulting hardware modules can be rapidly implemented, deployed and tested with live data [11]. The Liquid architecture system leverages the FPX platform to evaluate customizations of the soft core processor and to control experiments remotely over the Internet.


VI. A SCORING CIRCUIT

We would thus like to design a specialized scoring circuit to speed up hierarchical partitioning (i.e., parallelization). Consider a set of n documents, represented as vectors in a k-dimensional space. Assume that we can parallel-process words of m bits at a time, and that m < k. We can consider a k-dimensional vector \vec{a} as k/m m-dimensional subvectors \vec{a}^{\,1}, \vec{a}^{\,2}, \ldots, \vec{a}^{\,k/m}, where \vec{a}^{\,i} contains \vec{a}’s values along dimensions (i-1)m + 1 through im.

Our asymptotically dominant operations – computing the dot product of each vector with the sum of all vectors in a set – can occur in k/m independent m-bit-wide slices (if k is not a multiple of m, we can zero-pad). Specifically, for every \vec{a} in a set,

    \left( \sum_{j=1}^{n} \vec{a}_j \right) \cdot \vec{a}    (5)

will be computed as

    \sum_{i=1}^{k/m} \left( \sum_{j=1}^{n} \vec{a}_j^{\,i} \right) \cdot \vec{a}^{\,i},    (6)

where \vec{a}_j^{\,i} denotes the ith m-bit slice of \vec{a}_j.

In our current implementation for LEON2 (see above), m = 64. Slices are processed one at a time (i.e., left-to-right, from 1 to k/m). Since the coprocessor interface allows for two 64-bit values as simultaneous inputs, corresponding slices from two vectors are sent simultaneously, until a corresponding slice has been sent for every vector in the set (if there is an odd number of vectors, we simply send a vector of zeros along with the first vector). After the ith slice has been sent, the scoring circuit may compute \sum_{j=1}^{n} \vec{a}_j^{\,i}, and, for each vector \vec{a}, (\sum_{j=1}^{n} \vec{a}_j^{\,i}) \cdot \vec{a}^{\,i}. These partial dot products are accumulated as more slices are processed, such that when the entire set of vectors has been sent, the full dot product may be read back for every vector. These values are then normalized (divided by the magnitude of the corresponding vector) and summed to obtain the final score.
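As a software model of the slice-wise computation the circuit performs, consider the following C sketch (ours, under assumptions: the row-major data layout and all names are illustrative, not the hardware itself):

    #include <stdint.h>

    #define K 4000                       /* dimensions */
    #define M 64                         /* slice width in bits */
    #define SLICES ((K + M - 1) / M)     /* 64-bit slices per vector */

    /* vecs[v][s] is slice s of document v; dot[v] accumulates the full
       dot product of document v with the bitwise sum of all documents. */
    void score_slices(const uint64_t vecs[][SLICES], int n, uint32_t dot[]) {
        for (int v = 0; v < n; ++v)
            dot[v] = 0;
        for (int s = 0; s < SLICES; ++s) {            /* one slice at a time */
            uint8_t sum[M] = {0};                     /* 64 per-bit counters */
            for (int v = 0; v < n; ++v)               /* stage 1: bitwise sum */
                for (int b = 0; b < M; ++b)
                    sum[b] += (vecs[v][s] >> b) & 1;
            for (int v = 0; v < n; ++v)               /* stage 2: partial dots */
                for (int b = 0; b < M; ++b)
                    dot[v] += sum[b] * (uint32_t)((vecs[v][s] >> b) & 1);
        }
    }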

The circuit was implemented as a coprocessor to the LEON processor on the Liquid architecture. The coprocessor communicates with LEON via the coprocessor interface using commands comprised of two 64-bit inputs, a 9-bit opcode, and a 64-bit result. Additionally, the coprocessor is limited in that it cannot access main memory itself; it must go through LEON any time a memory access is necessary. The circuit is responsible for performing two major operations of the clustering algorithm: computing the bitwise sum of the document bit-vectors and calculating the dot product of each of the documents with the bitwise sums. These two operations are accomplished in three separate stages. The first stage computes the bitwise sum, the second stage computes the dot product, and the third stage returns the final score for each of the documents. The following circuit description uses as an example a cluster comprised of four documents (A, B, C, D), each of which is represented as a bit-vector of length 4000.

The first stage of the circuit begins by receiving the first 64 bits of the first two documents in the cluster (A & B). Each bit of the first document is then added to the corresponding bit in the second document and the result is stored to a register. This creates a total of 64 registers containing the bitwise sum of the first 64 bits of the first two documents. As the sums are being computed, the two 64-bit portions of the two documents are also stored in a FIFO local to the coprocessor (FIFO1). This allows the coprocessor to access this data in the future without having to go through LEON. This entire process is then repeated with the first 64 bits of the third and fourth documents (C & D). The registers containing the bitwise sums are updated to reflect not only the sums of the first two documents, but also of the third and fourth documents. It will eventually be necessary to repeat this stage (4000/64) * 4 = 250 times, each call reading in the next 64 bits of each document until the entire document has been processed.

After the first 64 bits of all of the documents in the cluster have passed through the first stage of the circuit, the second stage is ready for execution. This stage is responsible for computing the partial dot product of each of the documents with the bitwise sums. The stage begins by first copying the 64 bitwise sums computed in stage one. Next, it reads the first 64 bits of the first document from FIFO1 and computes the dot product of it and the 64 bitwise sums. This partial dot product of the document is then stored in an additional FIFO (FIFO2). This entire process continues until all of the documents entered in stage one have had their partial dot product computed.


After the first two stages of the circuit have been executed the required number of times, the final stage of the circuit is ready for execution. Stage three reads each of the partial dot products from FIFO2 (which at this point have become the full dot products) and returns them as the result to LEON. The stage that the scoring circuit executes is based on opcodes that are sent in the software developer’s code. Each of the aforementioned three stages has a corresponding opcode which the software developer sends in order to start that stage. In addition to the opcodes for the three main stages, there are three additional opcodes: a no-operation opcode, a stage-one reset opcode, and a full reset opcode. The following is a pseudocode example of how these opcodes would be used to score the previously mentioned example cluster.

    for index = 0 to 3999 step 64
        Stg1[ A(index to index+63), B(index to index+63) ]
        Stg1[ C(index to index+63), D(index to index+63) ]
        Stg2[ ]
        Stg1Reset[ ]

    // 1 nop needed for each document
    NOP[ ]
    NOP[ ]
    NOP[ ]
    NOP[ ]
    ScoreForDocA = Stg3[ ]
    ScoreForDocB = Stg3[ ]
    ScoreForDocC = Stg3[ ]
    ScoreForDocD = Stg3[ ]
    FullReset[ ]

VII. PERFORMANCE GAINS

To estimate the performance gain from the circuit, let us first consider the case of the unaccelerated algorithm; we need to first compute \vec{sum}, then compute \vec{sum} \cdot \vec{a} for every \vec{a} in our set. Given a set of n k-dimensional vectors, the summation requires k add operations for every vector, giving a total of nk operations (we are not counting operations that are common to both the unaccelerated and accelerated versions). Computing the dot products requires an add and a multiply for each element (thus 2nk operations), giving an overall computational overhead of 3nk operations.

For the accelerated circuit, we require (1.5nk)/64 operations to send the data to the circuit, simultaneously computing \vec{sum}. Computing the dot products is pipelined with this operation, so the additional overhead from this step is only 2k/64 operations. Finally, 4n operations are required to read back the results. The total is thus approximately 0.023nk + 0.031k + 4n operations.

The actual speedup for clustering N vectors will depend on k, as well as how deep a clustering hierarchy we wish to construct. Consider k = 4000; in this case, the unaccelerated algorithm will take 12000n operations per cluster of n vectors, while the accelerated version will take 0.023·4000·n + 0.031·4000 + 4n ≈ 99n + 125 operations. Thus, the speedup factor will be between roughly 54 (12000/224, at n = 1) and 121 (12000/99, as n grows large).
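Collecting the estimates above into one formula (a restatement of the arithmetic, using the paper’s rounded operation counts):

    speedup(n) = \frac{3nk}{\frac{1.5nk}{64} + \frac{2k}{64} + 4n}
               \;\approx\; \frac{12000n}{99n + 125} \quad (k = 4000),

which gives roughly 54 at n = 1 and approaches roughly 121 as n grows.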

VIII. CONCLUSIONS

We have presented novel work which addresses the demands of high-volume data mining by moving the bottleneck operations of a hierarchical clustering algorithm to specialized hardware. A significant speedup is achieved via this specialization. We expect that future implementations following this design methodology on more powerful hardware will be able to exploit an even greater level of parallelism, and to significantly outperform commodity PCs.

REFERENCES

[1] http://people.csail.mit.edu/jrennie/20newsgroups/.
[2] http://www.gaisler.com/products/leon2/leon.html/.
[3] Ron Bekkerman and James Allan. Using bigrams in text categorization. Technical Report CIIR IR-408, University of Massachusetts, 2004.

[4] Roger D. Chamberlain, Ron K. Cytron, Mark A. Franklin, and Ronald S. Indeck. The Mercury system: Exploiting truly fast hardware for data search. In Proc. of Int’l Workshop on Storage Network Architecture and Parallel I/Os, 2003.

[5] Mike Estlick, Miriam Leeser, James Theiler, and John J. Szymanski. Algorithmic transformations in the implementation of k-means clustering on reconfigurable hardware. In Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, 2001.

[6] Gaisler Research. http://www.gaisler.com.
[7] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 2000.
[8] Edson L. Horta, John W. Lockwood, David E. Taylor, and David Parlour. Dynamic hardware plugins in an FPGA with partial run-time reconfiguration. In Design Automation Conference (DAC), New Orleans, LA, June 2002.

[9] Liquid Architecture Group. http://liquid.arl.wustl.edu.


Fig. 5. The scoring circuit

[10] J. W. Lockwood, S. G. Eick, D. J. Weishar, R. Loui, J. Moscola, C. Kastner, A. Levine, and M. Attig. Transformation algorithms for data streams. In IEEE Aerospace Conference, 2005.

[11] John W. Lockwood. Evolvable Internet hardware platforms. In The Third NASA/DoD Workshop on Evolvable Hardware (EH’2001), pages 271–279, July 2001.

[12] John W. Lockwood. The Field-programmable Port Extender (FPX). http://www.arl.wustl.edu/arl/projects/fpx/, December 2003.

[13] John W. Lockwood. Reconfigurable network group. http://www.arl.wustl.edu/arl/projects/fpx/reconfig.htm, May 2004.

[14] Shobana Padmanabhan, Phillip Jones, David V. Schuehler, Scott J. Friedman, Praveen Krishnamurthy, Huakai Zhang, Roger Chamberlain, Ron K. Cytron, Jason Fritts, and John W. Lockwood. Extracting and improving microarchitecture performance on reconfigurable architectures. International Journal of Parallel Programming, 33(2–3):115–136, June 2005.

[15] H. Schutze and C. Silverstein. Projections for efficient document clustering. 1998.

[16] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. 2000.

[17] Alexander Strehl, Joydeep Ghosh, and Raymond Mooney. Impact of similarity measures on web-page clustering. In Proceedings of the 17th National Conference on Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI 2000). AAAI, 2000.