
Peer-to-Peer Distributed Text Classifier Learning in PADMINI†

Xianshu Zhu1∗, Tushar Mahule1, Haimonti Dutta2, Sugandha Arora1, Hillol Kargupta1‡ and Kirk Borne3

1Department of CSEE, University of Maryland, Baltimore County, Baltimore, MD, USA

2The Center for Computational Learning Systems, Columbia University, New York, NY, USA

3Department of Computational and Data Sciences, George Mason University, Fairfax, VA, USA

Received 30 March 2011; revised 22 March 2011; accepted 24 May 2012. DOI: 10.1002/sam.11155

Published online in Wiley Online Library (wileyonlinelibrary.com).

Abstract: Popular Internet document repositories, such as online newspapers, digital libraries, and blogs, store large amounts of text and image data that are frequently accessed by large numbers of users. Users' input through collaborative commenting or tagging can be very useful in organizing and classifying documents. Some web sites (e.g. Google Image Labeler) support a collection of tags and labels, but a large fraction of these sites do not currently support such activities. Moreover, relying upon centrally controlled web-service providers for such support is probably not a good idea if the objective is to make the collaborative inputs publicly available. Often, business entities offering such web-based tagging environments end up owning and monetizing the result of the collective effort. This paper takes a step toward addressing this problem—it proposes a peer-to-peer (P2P) system (PADMINI), powered by distributed data mining algorithms. In particular, it focuses on learning a P2P classifier from tagged text data. This paper describes the PADMINI system and the distributed text classifier learning components; text classification is posed as a linear program and an asynchronous distributed algorithm is used to solve it. It also presents extensive empirical results on text data obtained from the Hubble Space Telescope (HST) proposal abstract database. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012

Keywords: peer-to-peer system; distributed data mining; collaborative tagging; annotation; distributed linear programming

1. INTRODUCTION

Popular Internet document repositories store large amounts of unstructured text and image data that are frequently accessed by large numbers of users. Online news sites (e.g. news.yahoo.com), digital libraries (e.g. IEEE, ACM), astronomy text data repositories (e.g. adswww.harvard.edu), and blogs (e.g. www.blogger.com) are some examples. Some of these sites allow online users to post their tags and comments on the web documents. Most do not, for various reasons: a lack of business motivation and a lack of resources to support such features.

Correspondence to: Xianshu Zhu, [email protected]

† This paper is an extended version of our previous paper that was published in the ICDM workshop on Mining Multiple Information Sources, 2009.
‡ The author is also affiliated to Agnik LLC, Columbia, MD, USA.

Even if the web site offers support for user feedback, the data is under the control of the web site owner and can be modified, exploited, and censored to the benefit of the site owner. Therefore, a centralized solution to collect and analyze such user feedback may not be the best approach. Because viewing the information in the browser first requires downloading the data to visitors' machines at different locations, a distributed architecture may be a possibility.

This paper describes a web-based peer-to-peer (P2P) system—PADMINI1—powered by distributed data mining algorithms that run on a collection of compute nodes forming a P2P network. It focuses on P2P classifier learning from tagged text data. In this paper, we present an algorithm for distributed P2P text classification with bounds on convergence and communication cost. The architecture of the PADMINI system has four main components: (i) web server, (ii) distributed data mining server, (iii) data management system, and (iv) P2P network.

1 http://padmini.cs.umbc.edu/padmini


The system was put to the test using an astronomy application: helping researchers analyze data from astronomy scientific articles. After the algorithm converges, every node in the P2P network will have the same learned classifier. The collaboratively learned classifier will also be sent to and stored in the backend system. Online document repositories can then utilize this classifier model for categorizing their own text data by submitting the text data to the system. More specifically, the system will apply the learned classifier to the submitted text data to predict class labels for it. The predicted result is then sent back to the online document repositories. The main characteristics of our proposed system are as follows:

• The classifier learning algorithm is designed to operate on a P2P network, which provides the opportunity to take advantage of the processing power of web users.

• Enables collection and analysis of tags from web users.

• An extensible system, implying that any distributed data mining algorithm (such as outlier and novelty detection) can also be easily supported by PADMINI.

• The distributed infrastructure allows prevention of bottlenecks (such as traffic overload) and single points of failure, and helps save maintenance overhead.

• Effectively eliminates the need to manage tags; the result of such collaborative efforts can be utilized by the public as well.

This paper is organized as follows. Section 2 presents the motivation behind this work. Section 3 briefly describes the related work. Section 4 describes the architecture of the PADMINI system. Section 5 describes the P2P text classifier learning application. Section 6 introduces the distributed P2P classifier learning problem; Sections 7 and 8 show how the problem of classification can be posed as a linear programming problem and present the distributed algorithm. Section 9 provides experimental results and finally Section 10 concludes the paper.

2. MOTIVATION

In recent years, with the emergence of web 2.0 technologies, users are actively participating in collaborative commenting and tagging of web contents. Folksonomies [1], a system of classification constructed by the collaboration and interaction of users, became popular on the Web. However, the current collaborative tagging solutions have the following restrictions and drawbacks:

Mining of collaborative inputs not supported: Currently, more and more websites support web 2.0 technology, which allows users to add comments or tags to web documents. Retrieving or mining information from these collaborative inputs can help improve web search results and classification accuracy. However, most online text repositories do not offer any tools to perform data mining tasks on these collaborative inputs.

Results of collaborative inputs are not publicly accessible: Even though a few Internet repositories provide such data mining tools on collaborative inputs, they generally do not make the data mining results publicly accessible. For example, Google Image Labeler allows users to label random images and help to improve the quality of search results. However, the wealth of information collected by Google Image Labeler through collaborative tagging is not generally available to the public. Thus, the collaborative tagging effort benefits only Google Image Labeler, and it cannot help any third party reuse such collaborative inputs to refine their image search results.

This suggests the need to develop data mining algorithms and systems that support collaborative tagging for analyzing online data repositories, while making sure that all the participants can benefit from the collective effort. This paper develops a distributed data mining technology-based approach that relies upon a P2P infrastructure to offer a solution to this problem. The approach allows all participants to contribute collaborative feedback even if the data repository does not support such activities. It also ensures that the outcome of the data mining process is available to the participants, not just the owner of the data server.

Collaborative tagging for astronomy research: Data-intensive science and knowledge discovery from very large sky surveys is playing an increasingly important role in today's astronomy research. The astronomy community has access to huge multi-terabyte sky surveys (becoming petabytes within the next few years) which have a tremendous potential for new discoveries. The primary rationale for this is that mining of online research documents enables astronomers to gain insightful and actionable information from massive collections of unstructured data. As an illustration, we take the collection of science proposal abstracts of approved research projects for the Hubble Space Telescope (HST)2. The scientific abstracts of all approved HST observation proposals are public information and, as such, they provide unique opportunities to discover useful connections between astrophysical objects, emerging astronomical research applications, hidden links between different phenomena, associations among a variety of research programs, and a network of knowledge concepts across the astronomical domain.

2 http://archive.stsci.edu/hst/


To make ground-breaking discoveries in astronomy we need to engage end users in collaborative labeling and tagging of individual abstracts. This also motivates our work on developing a distributed P2P classifier learning system.

3. RELATED WORK

This section presents related literature on P2P data mining, collaborative tagging, and distributed linear programming.

P2P data mining: P2P systems utilize distributed resources to perform tasks collectively. They can help in performing complex data mining tasks in a decentralized and efficient fashion [2–4]. Various data mining algorithms have been developed to work effectively on P2P networks, such as multivariate regression [5], decision tree induction [6], eigen monitoring [7], and classification [8].

There is a lot of work on distributed classification algorithms [9]. However, most of these existing algorithms use an ensemble approach. Luo et al. [10] proposed an ensemble-based distributed classification algorithm for P2P networks, in which each peer builds its local classifier on a subset of the data and the results from all local classifiers are then combined by plurality voting. A distributed boosting algorithm [11] efficiently integrates local classifiers learned over large distributed homogeneous databases. Tsoumakas et al. [12] proposed to construct a global predictive model by stacking local classifiers.

The basic idea of distributed ensemble classification is that each node in the network learns a classifier model based on local training data and the final classification result is obtained by decision aggregation, such as voting or averaging. The distributed P2P classifier learning algorithm described in this paper differs from the above ensemble-based approaches. Nodes in the network communicate with each other to exchange intermediate algorithm metadata in order to learn a global classifier. When the algorithm converges, every node will have the same classifier model.

Data mining using mathematical programming: Mathematical programming has been used extensively in data mining for the purpose of feature selection and supervised and unsupervised learning [13,14]. The feature selection problem can be framed as a mathematical program with a parametric objective function and linear constraints [15]. The problem of clustering can be shown to be equivalent to a bilinear program [16]. Furthermore, the problem of extracting a minimal number of data points from a large data set to generate a support vector machine (SVM) classifier [17], adding knowledge to SVMs [18], and developing nonlinear kernel-based separating surfaces [19] can all be posed as mathematical programming problems.

Distributed linear programming: One of the most popular algorithms for linear programming is the simplex algorithm [20]. The main competitors of simplex are a group of methods known as interior point methods ([21] and the references therein). These algorithms have been inspired by Karmarkar's algorithm [22]. As opposed to the simplex method, interior point methods reach the optimal vertex by traversing the interior of the feasible region. Some interior point methods have polynomial worst-case running times, which are less than the exponential worst-case running time of the simplex method. On average, however, the simplex method is competitive with these methods. Several other parallel implementations of linear optimization algorithms exist—Stunkel and Reed [23] considered two different approaches to parallelization of the constraint matrices on the hypercube: (i) column partitioned simplex and (ii) row partitioned simplex. Column partitioned simplex has been studied further in work done by Yarmish [24]. Their parallel algorithm divides the columns of the constraint matrix among many processors. Ho and Sundaraj [25] compared the problem of distributed computation of the simplex method using two different methods: (i) distributed reinversion (DINV) and (ii) distributed pricing (DPRI).

Eckstein et al. [26] presented parallel implementations of the simplex algorithm using computational devices called 'stripe arrays', which resemble temporary data structures used in some routines of the Connection Machine Scientific Software Library (CMSSL).

In all of the abovementioned techniques, the stress is on splitting a given problem across different machines to reach the solution faster by utilizing techniques for parallelization. Ours is an inherently distributed problem. Furthermore, to the best of our knowledge, the above parallel optimization algorithms have not been adapted for use in data mining algorithms.

Several research papers have been published on distributed SVMs (DSVM). However, these DSVM algorithms require a centralized server, and some of them make assumptions on network topology. The first DSVM algorithm was proposed by Syed et al. in 1999 [27]. Their algorithm applies the standard SVM algorithm to each local data source, and the resulting support vectors (SVs) are sent to a central server where the algorithm is reapplied on the SVs. However, a global optimal solution cannot be guaranteed. Caragea et al. [28] improved this algorithm by sending the global SVs back to the distributed nodes and iteratively modifying the SVs until the global optimum is obtained. However, this algorithm incurs a large communication cost.


Cascade SVM [29] is a fast algorithm and can achieve the global optimum. The network topology is taken into consideration to accelerate the distributed training process. However, it is unknown whether the DSVM algorithm can converge in a more general network. GADGET SVM [30] solves for a linear DSVM using a gossip-based subgradient solver. DPSVM [31] works on a strongly connected network. It can converge to a globally optimal classifier by updating the local solutions iteratively. In contrast, our algorithm neither requires a centralized server nor depends on the network topology.

Collaborative tagging: Many web sites have started supporting collaborative tagging, such as labeling shared photographs (www.flickr.com), videos (www.youtube.com), and academic publications (www.citeulike.org). A collaborative tagging system allows users to add semantically meaningful information in the form of tags to the shared documents. This tagging information has been applied to improve web search results [32], recommendation system quality [33], and clustering accuracy [34,35]. Golder and Huberman [36] study user activity patterns regarding system utilization and tag usage in del.icio.us. Niwa et al. [37] propose a recommendation system in the context of social bookmarking. They utilize Folksonomy tags to classify web pages and to express users' preferences. Tso-Sutter et al. [33] proposed a tag-aware recommender system based on users' feedback. They incorporated tags into collaborative filtering algorithms to enhance recommendation accuracy. Sihem et al. [32] first attempted to incorporate social behavior into web search. They argued that search results can be improved on tagging sites. Tags together with features from the document could yield better classification accuracy than either of them alone. Brooks et al. [34] clustered blog articles that shared the same tags, and analyzed the effectiveness of tags for blog classification. Based on the co-occurrence of multiple tags, the shared interests of users can be retrieved and the clustered articles can yield higher accuracy [35].

However, collaborative tagging systems have some disadvantages, including the absence of standard keywords and errors in tagging due to spelling mistakes. As tagging communities grow, the added content and metadata become harder to manage due to increased content diversity. In this paper, the user can only choose from predefined tags to label the documents; however, more tags can be supported by the system on user demand.

In our previous work [8], we describe a P2P classifier learning system, 'TagLearner'. That work proposes a system prototype for the P2P classifier learning application composed of a service provider and a client-side browser plugin. This paper presents a significantly improved version of the PADMINI system architecture; it also describes the deployed system with the implementation of the P2P classifier learning algorithm. The PADMINI system that we propose and implement is highly extensible to other distributed data mining algorithms3.

4. THE PADMINI SYSTEM ARCHITECTURE OVERVIEW

The goal of the PADMINI system is to help amateur and experienced astronomy researchers do data analysis on large repositories available through the virtual observatories [39,40]. The system aims at being a rich source of algorithm implementations to aid analytical tasks on the archive. Hence, scalability and extensibility have been the focal points in the design of our system. It mainly contains four components: web server, distributed data mining server, database, and P2P network. The following subsections describe the role of these major system components in detail.

4.1. Web Server

The web server hosts the web-based interface for the PADMINI system. Figure 1 shows the home page of the web-based interface. Apache Tomcat4 is used as the web server for hosting the system.

4.2. Distributed Data Mining Server

The distributed data mining server (DDM server) forms an intermediate tier between the web server and the computation network. It accepts and fulfills job requests from the web server. Depending on the availability of resources in the computation network, a job is either submitted for execution or stored in a queue. The DDM server currently supports First-Come-First-Served scheduling.

The P2P computation network supports two disparate distributed programming frameworks, namely, Hadoop and the distributed data mining toolkit (DDMT). The DDMT provides a framework for implementing highly asynchronous distributed algorithms. In this paper, we focus on the DDMT framework and the P2P classifier learning algorithm implemented on that framework.

4.3. Database

The database component includes the server database and the job database, used to store the user and job information, respectively.

3 Recent work also extends the system to an outlier detectionalgorithm [38].

4 http://tomcat.apache.org


Fig. 1 Home page of the PADMINI system. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

The server database stores the information related to the users, the jobs submitted by them, and the results of the most recent jobs. The information related to the algorithms supported by the system also resides here. The job database stores the information related to the backend network and also maintains the queues of the jobs that are submitted and the status of those jobs. The results of completed jobs are tied to the user who submitted the job; hence, they are not stored in the job database but in the server database instead.

4.4. P2P Network

The P2P network forms the backbone of the computation network. This network supports two frameworks, namely, Hadoop5 and the DDMT. DDMT6 is a framework for writing event-driven distributed algorithms, written in Java and built on top of the Java Agent Development (JADE7) framework. Any node that has DDMT installed can join the P2P network and become a computation node.

The following sections describe the architecture of the PADMINI system in more detail by focusing on the design and implementation of the P2P text classifier system.

5 http://hadoop.apache.org/
6 http://www.umbc.edu/ddm/wiki/software/DDMT
7 http://jade.tilab.com/

Section 5 gives an overview of this system, while Sections 5.1 and 5.2 describe how the classifier learning system is integrated into the PADMINI system.

5. P2P TEXT CLASSIFIER LEARNING APPLICATION

The P2P text classifier learning system helps the user tag online documents using a Firefox plugin. A classifier is then learned collaboratively among the web users and computation nodes in the system, using the training data generated from the user-tagged text. The learned classifier is then made accessible to the user from the web interface for PADMINI.

The architecture of the classifier learning system is shown in Figure 3. There are two main steps involved in using the PADMINI system:

• Collaborative tagging and classifier learning: By following the steps below, web users can label online documents and collaboratively run the P2P classifier learning algorithm. Some of the steps will be discussed in more detail in Sections 5.1 and 5.2.

1. User joins the Google/Yahoo discussion groups to determine the feature sets and labels used for classification.


Fig. 2 Peer-to-Peer text classifier learning web interface. [Color figure can be viewed in the online issue, which is available atwileyonlinelibrary.com.]

The links to the discussion groups are provided on the web interface. The significance of the discussion groups will be further discussed in Section 5.1.

2. User downloads and installs the Firefox plugin provided on the website. The plugin is the main tool for the user to label text data and connect to the server. The detailed functioning of the Firefox plugin is described in Section 5.2.

3. User logs in through the plugin and chooses to join a tagging group.

4. User uses the tags that are currently supported by the tagging group to label the text. The user can also suggest new tags to the tagging group, which can be added to the group after approval.

5. User submits the labeled text to the web server.

6. If the user is willing, he can also become a part of the computation network by downloading another tool and offering his system resources for the classifier computation. This is explained in Section 5.2.

7. When sufficient feature vectors are submitted by users, the server will trigger the algorithm. The learned classifier will be stored in the server database of the system.

8. User can also submit any arbitrary text to the web site and check its category as predicted by the learned classifiers that are present in the database (see below).

• Using the learned classifiers to classify new documents: Any web user can send a request to the web server asking for classification service. The classification service can be provided immediately if the server database already has the learnt classifier. The web site currently provides an interface where a user can enter text and check what label the system assigns to that text using the learnt classifiers.

5.1. Web Interface

Figure 2 shows the P2P text classifier learning web interface. The interface allows users to join in the P2P text classifier learning task, check job status, and test the learnt classifiers. The four steps seen on the interface of the P2P text classification page are described below:


Fig. 3 Peer-to-Peer classifier learning system architecture. [Color figure can be viewed in the online issue, which is available atwileyonlinelibrary.com.]


1. Join online discussion group: Users can join an existing Google or Yahoo group of interest or suggest a new group if one does not exist. These groups primarily facilitate discussions to determine the features and labels used for classification. Moreover, the discussion group can effectively avoid restricting the users' choice of labels and feature sets. Any user can propose his/her suggestion. If a majority of users demand support for certain features and labels, they will be added by the system administrator to the server database.

The main reason for having such a discussion group is that the Vector Space Model of text representation is widely used in text classification algorithms, in which documents are represented as vectors in an m-dimensional space. Thus, document classification includes a step of text feature extraction to determine a set of words or terms that frequently occur in documents (keywords). In a distributed environment, it is not realistic to generate features based on collaboratively generated annotations. Thus, there needs to be a platform on which users can discuss the feature set and eventually reach an agreement. In this way, the text data representation is generated based on the users' domain knowledge. This feature set will be sent to the users when they log in, so that every user in the group will use the same feature set to convert documents into feature vectors.

Another reason is that the typical collaborative tagging system allows the user to assign self-defined tags to resources, which increases the difficulty of tag management, e.g. the absence of standard keywords and spelling errors. The discussion group provides a platform for the group users to discuss and determine the tags that will be used for labeling.

Take an astronomy discussion group as an example. A user wants to group some people together to work on learning a binary classifier model for astronomy abstracts. The classifier will be able to classify astronomy abstracts into 'Galaxy' and 'Hot Star'. He raises two questions in the group: (i) Please suggest suitable class labels. (ii) Please suggest possible features that can be used for classification. Anyone in the group can participate in the discussion.


They will suggest features based on their domain knowledge. Finally, they reach an agreement on using 'Galaxies' and 'Hot Stars' as class labels. Assuming m = 6, they agree on using three features for the 'Galaxies' category: clusters, galaxy, and formation; and three features for the 'Hot Stars' category: white, dwarf, and pulsar. The group creator will then send the class labels and the six selected features to the system. The features and labels will then be added by the system administrator to the server database. These class labels and features will be sent to the users when they install the Firefox plugin and log in. Thus, users can only select predefined tags to label the documents, which eliminates the tag management issue. Moreover, they will be using the same set of features to create the vector representation of the text.

2. Download and install Firefox plugin: With the Firefox plugin, which can be downloaded from the PADMINI website, tagging text and submitting a job for classification takes just a few clicks. The functionality of the Firefox plugin is described in Section 5.2.

3. Check job status: Users can easily check the status of the jobs that they have submitted. Users can continue to tag new text documents without waiting for the previous job to complete. However, they cannot submit new jobs before the previous job is completed. Once a job is completed, the participating users get a notification on their profile pages.

4. Test classification: This function provides the users with a way to check the accuracy or quality of the learned classifier. By submitting a text document to the system and selecting which tagging group it belongs to, the class label can be automatically predicted by the system. The user can verify the accuracy of the predicted class label by comparing it with a manually assigned class label. A batch-mode classification function will be provided in the future, so that multiple text documents can be submitted and classified at one time.

5.2. Firefox Plugin for Text Tagging and Classification

The Firefox plugin (Figure 4) is the main interface that connects users with the system. The aim of developing the Firefox plugin is twofold:

• Provide an easy way for users to label text and collaboratively run the classification algorithm.

• Enable users in the same tagging group to tag text documents using the same set of labels and to convert the text documents into feature vectors using the same feature set.

Traditionally, collaborative tagging systems allow the user to assign arbitrary tags to resources due to the lack of a centrally controlled vocabulary of tags. This leads to problems such as ambiguity in the meaning of tags, informational redundancy caused by synonyms, and spelling errors, all of which increase the complexity of tag management. In our system, once the user logs in through the Firefox plugin and joins a tagging group, the class labels and feature sets in the server database are sent to the plugin. The user can tag the documents using only those predefined labels, which avoids the tag management issue to some degree. When the user highlights some text and clicks on the Tag button, the plugin converts the text into a feature vector. This feature vector and the corresponding class label are stored on the user's local machine as training data.
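The exact conversion performed by the plugin is not spelled out here; the following is a minimal sketch of one plausible implementation, a simple bag-of-words count over the six features agreed upon in the discussion-group example of Section 5.1 (the function name and counting scheme are illustrative assumptions, not the plugin's actual code).

```python
# Sketch: turn highlighted text into a feature vector over the agreed
# feature set (m = 6). Illustrative only; not the exact PADMINI plugin logic.
import re

AGREED_FEATURES = ["clusters", "galaxy", "formation", "white", "dwarf", "pulsar"]

def text_to_feature_vector(text, features=AGREED_FEATURES):
    tokens = re.findall(r"[a-z]+", text.lower())
    # One count per agreed feature word; every user in the group uses the
    # same ordered feature list, so vectors are comparable across peers.
    return [tokens.count(f) for f in features]

example = "We study white dwarf cooling and the formation of star clusters."
print(text_to_feature_vector(example))  # [1, 0, 1, 1, 1, 0]
```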

Figure 5 diagrammatically explains how the plugin is used. When the user clicks on the Submit button, the classifier learning algorithm will be started by the DDM server once the total amount of training data submitted crosses a threshold. To be more specific, the web server maintains a threshold value, which is determined empirically. When the total amount of training data submitted reaches the threshold, the web server redirects the job request to the DDM server, which then triggers the algorithm.

Fig. 4 Firefox plugin—screen shot. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]


Fig. 5 Flow diagram for the Firefox plugin. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

A user who downloads the plugin also has the option of downloading the DDMT. During its installation, the DDMT downloaded on the user's machine registers with the DDMT server to become a part of the computation framework. If a user has downloaded and installed the DDMT, then only the count of feature vectors is sent to the DDM server, and the feature vectors are stored locally and accessed when the classifier learning algorithm runs. Support for nodes leaving and entering the network dynamically has been built into the DDMT. A user who chooses not to install the DDMT will only participate in the tagging part; when a job is submitted, the feature vectors in this case are sent to the server and redistributed onto the computation nodes.

In the following section, we show how the classification task can be posed as a linear programming problem and solved in the network without centralizing the constraints.

6. DISTRIBUTED P2P CLASSIFIER LEARNING

With the help of the Firefox plugin, the tagged text data can be converted into feature vectors along with the class labels assigned by the user. This information is stored on the user's local machine as the training data set. The goal of supervised learning for classification is to use the training data set to train a classifier model that can predict the class labels of unseen instances with high accuracy.

In the P2P network, the nodes (web users) have their own local training data sets. In order to learn a global classifier based on all the training data on every node, one possible, but not wise, way is to transfer all the training data to a central server and let the server learn the classifier. One reason is that the communication cost for transferring the data may be substantially high. More importantly, the system may become less scalable if we rely too much on the server to do all the computation. Moreover, if we create another central server, then we are relying upon another single entity that could behave the same way as the original data server owner. From the business structure perspective, we do not want to rely upon a single entity.

Hence, a distributed P2P classifier learning algorithm [8] is proposed to learn the classifier without centralizing the training data. This algorithm is a distributed linear programming-based classifier learning algorithm. The nodes in the network communicate and collaboratively learn the global classifier. When the algorithm converges, every node will have the same learned global classifier. In Section 7, we describe how the classification problem is framed as a linear programming problem, and in Section 8, we briefly describe the distributed algorithm.


7. CLASSIFIER DESIGN BY LINEAR PROGRAMMING

Let $T$ be a data set with $P$ instances. Each instance contains a vector of $N$ features (denoted by $x$) and a categorical or continuous valued target variable (denoted by $y$). It is assumed that there exists an underlying function $f$ such that $y = f(x)$ for each instance $(x, y)$ in the training set. The goal of a supervised learning algorithm is to find an approximation $H$ of $f$ that can be used to predict values of unseen instances in the test set.

In order to find an approximation, one possibility is to obtain a weighting of the feature vector which is positive for patterns in the positive class and negative for patterns in the other class (assuming a binary classification problem) [14]. Thus, if the $k$th instance is represented by $x_k = [x_{1k}, x_{2k}, \ldots, x_{Nk}]$, $k = 1, 2, \ldots, P$, and $W$ is the weight vector of $N$ weights represented by $W = [w_1, w_2, \ldots, w_N]^T$, then we are interested in a $W$ such that

$$x_k W \geq 0 \qquad (1)$$

for all instances in the positive category and negative for instances in the other class. Let $e_k$ represent the error associated with instance $x_k$ and $\pi_k$ represent the weighting coefficient. Then the total error over $P$ instances can be expressed as

$$e = \sum_{k=1}^{P} \pi_k e_k \qquad (2)$$

and the error function for a given instance is obtained by estimating

$$e_k = \begin{cases} -(x_k W - d), & \text{when } x_k W < d \\ 0, & \text{when } x_k W \geq d \end{cases} \qquad (3)$$

In order to find the desired $W$, the error function $e$ should be minimized. Smith [14] shows that the above can be formulated as a linear programming problem, with the constraint matrix as follows:

$$\chi W + E = D + S \qquad (4)$$

where $\chi = [x_1\, x_2 \ldots x_P]^T$ (with $T$ denoting the transpose), $E = (e_1, e_2, \ldots, e_P)^T$ is the error vector, $D = (d, d, \ldots, d)^T$, and $S = (s_1, s_2, \ldots, s_P)$ is the vector of slack variables which allow the inequalities to be made equalities. The objective function of the linear program can be written as

$$e = \Pi^T E \qquad (5)$$

where $\Pi = (\pi_1, \pi_2, \ldots, \pi_P)^T$. To minimize $e$, it is convenient to transform the minimization to an equivalent maximization problem: $z = \Pi^T D - \Pi^T E$, which can be further simplified as $z = \Pi^T \chi W - \Pi^T S$. The above linear program can be solved by using the Simplex Algorithm8 [20]. Note that in this paper we are primarily concerned with problems that are linearly separable. Extension of the distributed algorithm for quadratic programming problems is left for future research.

Fig 6 An illustrative example. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Example 1: Consider the network with four nodes shown in Figure 6. Each node contains three attributes denoted by $x_1, x_2, x_3$ and a class label, $L$. If all the data were centralized, for this network,

$$\chi = \begin{bmatrix} 2 & 1 & 7 \\ 1 & 3 & 3 \\ 1 & 4 & 2 \\ 1 & 1 & 3 \\ 2 & 7 & 6.5 \end{bmatrix},$$

$W = [w_1\ w_2\ w_3]^T$, $D = [0.5\ 0.5\ 0.5\ 0.5\ 0.5]$ considering a binary classification problem with thresholding at 0.5, $S = [0\ 0\ 0\ 0\ 0]$, $\Pi = [1\ 1\ 1\ 1\ 1]$, and $E = [0\ 0\ 0\ 0\ 0]$. Thus, the constraints in the network are $2w_1 + w_2 + 7w_3 = 0.5$; $w_1 + 3w_2 + 3w_3 = 0.5$; $w_1 + 4w_2 + 2w_3 = 0.5$; $w_1 + w_2 + 3w_3 = 0.5$; $2w_1 + 7w_2 + 6.5w_3 = 0.5$; and the objective function is $7w_1 + 16w_2 + 21.5w_3 = z$.
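For readers who want to check the formulation numerically, the following is a minimal centralized sketch (not part of PADMINI) that solves the linear program of Example 1 with scipy.optimize.linprog, treating $W$ as free variables and $E, S \geq 0$ as in Eq. (4).

```python
# Sketch: centralized check of the Example 1 linear program using SciPy.
# Variables are ordered [w1, w2, w3, e1..e5, s1..s5]; we minimize
# e = sum(e_k) subject to chi*W + E - S = D (Eq. 4 rearranged).
import numpy as np
from scipy.optimize import linprog

chi = np.array([[2, 1, 7],
                [1, 3, 3],
                [1, 4, 2],
                [1, 1, 3],
                [2, 7, 6.5]])
P, N = chi.shape
D = np.full(P, 0.5)

# Objective: minimize Pi^T E with Pi = 1 (W and S have zero cost).
c = np.concatenate([np.zeros(N), np.ones(P), np.zeros(P)])
# Equality constraints: chi*W + I*E - I*S = D.
A_eq = np.hstack([chi, np.eye(P), -np.eye(P)])
bounds = [(None, None)] * N + [(0, None)] * (2 * P)  # W free, E,S >= 0

res = linprog(c, A_eq=A_eq, b_eq=D, bounds=bounds)
W = res.x[:N]
print("total error:", res.fun)   # 0 for a separable/feasible system
print("weight vector W:", W)
print("chi @ W:", chi @ W)       # each entry >= 0.5 when the error is 0
```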

However, if this linear program needs to be solved in a distributed P2P environment, it is not worthwhile to transfer all the data to a central site and run the optimization, since the communication cost incurred may be substantially high. In the next section, we present a distributed simplex algorithm that can solve the optimization problem in the network.

8 We refrain from giving an extensive description of the simplex algorithm due to space restrictions.


8. THE DISTRIBUTED LINEAR PROGRAMMING ALGORITHM

In a P2P network, the nodes typically have their own local data sets comprising the feature vectors and class labels. From these, local constraint matrices can be constructed as described in Section 7. However, solving the objective function based on local constraints alone does not ensure that constraints at other nodes are satisfied. Thus, nodes in the network need to communicate with one another in order to ensure that the solution to the objective function takes into consideration all the constraints in the network. Our distributed linear classification algorithm has two main steps: (i) a preprocessing step for obtaining the canonical representation of a linear system and (ii) obtaining the solution to the objective function.

8.1. Distributed Canonical Representation of the Linear System

An important preprocessing step before solving a distributed linear program is the development of an algorithm for obtaining the canonical representation. In order to do so, each node in the network needs to have access to the number of basic variables [20] that it should add, which is equal to the total number of constraints in the system. We propose the following convergecast-based approach: let $s$ be an initiator node which builds a minimum spanning tree on all nodes in the network. Following this, a message is sent by $s$ to all its neighbors asking how many local constraints each node has. A neighbor, on receiving this message, either forwards it to its neighbors (if there are any) or sends back a reply. At the end of this procedure, node $s$ has the correct value of the total number of constraints in the system, say $T_c$. Next, node $s$ sets a variable count-constraint to the number of its local constraints. It traverses the minimum spanning tree and informs each node visited of the number of constraints seen so far. Let $T$ represent the value of count-constraint at node $i$. Then node $i$ must add $T_c$ basic variables to each of its constraints. At the end of this procedure, all nodes have added the relevant basic variables. Note that this procedure creates exactly the same canonical form as would have been obtained if all the constraints were centralized. It must be noted that the distributed canonical representation algorithm needs to be run only once, at the time of initialization. Thereafter, each node just updates its tableau [20] depending on the pivots chosen at that round of iteration.
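A minimal sketch of the constraint-counting step is given below; it assumes the spanning tree is already rooted at the initiator and uses a recursive convergecast purely for illustration (the message-passing details of the actual DDMT implementation are not shown).

```python
# Sketch: convergecast-style count of the total number of constraints,
# assuming the spanning tree is given as a child map rooted at node s.
# This illustrates the counting idea, not the DDMT protocol itself.

def total_constraints(node, children, local_counts):
    """Return the number of constraints in the subtree rooted at `node`."""
    return local_counts[node] + sum(
        total_constraints(child, children, local_counts)
        for child in children.get(node, [])
    )

# Toy spanning tree rooted at node 0 with per-node local constraint counts.
children = {0: [1, 2], 1: [3]}
local_counts = {0: 2, 1: 3, 2: 1, 3: 4}
Tc = total_constraints(0, children, local_counts)
print("total constraints Tc =", Tc)  # 10; every node then adds Tc basic variables
```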

Once each of the nodes has the canonical representation, we are ready to describe the distributed simplex optimization algorithm. We assume that nodes maintain only their local simplex tableau and the global objective function.

The goal is to obtain a solution to the global optimization problem.

8.2. Notation and Preliminaries

Let $P_1, P_2, \ldots, P_\eta$ be a set of nodes connected to one another via an underlying communication tree such that each node $P_i$ knows its neighbors $N_i$. Each node $P_i$ has its own local constraints, which may change from time to time depending on the resources available at that node. The constraints at node $i$ have the form $A_i X_i = b_i$, where $A_i$ is an $m \times n$ matrix, $X_i$ is an $n \times 1$ vector, and $b_i$ is an $m \times 1$ vector. Thus, at each node, we are interested in solving the following linear programming problem: find $X_i \geq 0$ and $\min z_i$ satisfying $c_1 x_1 + c_2 x_2 + \cdots + c_n x_n = z_i$ subject to the constraints $A_i X_i = b_i$. The global linear program (if all the constraint matrices could be centralized) can be written as follows: find $X \geq 0$ and $\min z$ satisfying $c_1 x_1 + c_2 x_2 + \cdots + c_n x_n = z$ subject to the constraints $AX = B$, where $A = \bigcup_{i=1}^{\eta} A_i$ and $B = \bigcup_{i=1}^{\eta} b_i$.

Next, we present an exact algorithm for solving the linear optimization using the simplex method. Our assumption is that each node contains a different set of constraints, but has knowledge of the global objective function. The algorithm consists of two independent components: each node estimates the column pivot and identifies the row pivot based on its own local tableau. It then participates in a distributed constraint sharing protocol. The node communicates with only its neighbors to obtain the row which has the minimum of the min-vals amongst its neighbors. On obtaining the desired row from its neighbor, a node updates its simplex tableau and performs Gauss-Jordan elimination.

8.3. The Algorithm

At the beginning of iteration $l$, a node $P_i$ has its own constraint matrix and the objective function. The column pivot, henceforth referred to as $col\text{-}pivot_i$, is that column of the tableau corresponding to the most negative indicator9 among $c_1, c_2, \ldots, c_n$. Following this, each node forms the row ratios $r^i_j$, $1 \leq j \leq m$, for each row, i.e. it divides $b^i_j$, $1 \leq j \leq m$, by the corresponding number in the pivot column of that row. Let the minimum of the $r^i_j$'s be denoted $rowpivot_i$. This is stored in the history table of node $P_i$ corresponding to iteration $l$. Now the node must participate in the distributed algorithm for determination of the minimum row ratio, i.e., $\min(rowpivot_i)$, $i \in N_i$.
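A local sketch of the per-node pivot selection is shown below: the column pivot is the most negative objective indicator and the row pivot is the minimum ratio $b_j / a_{j,col}$ over rows with a positive pivot-column entry (a standard simplex ratio test; the variable names and the toy tableau are illustrative).

```python
# Sketch: local column-pivot and row-pivot selection on one node's tableau.
# c holds the objective-row indicators; A and b are the local constraints.
import numpy as np

def local_pivots(c, A, b):
    col = int(np.argmin(c))
    if c[col] >= 0:
        return None, None, None          # no negative indicator: tableau is final
    ratios = [b[j] / A[j, col] for j in range(len(b)) if A[j, col] > 0]
    rowpivot_i = min(ratios) if ratios else np.inf
    return col, rowpivot_i, ratios

c = np.array([-2.0, -1.0, 0.0])          # objective-row indicators
A = np.array([[1.0, 2.0, 1.0],
              [3.0, 1.0, 0.0]])          # local constraint rows
b = np.array([4.0, 5.0])
col, rowpivot_i, ratios = local_pivots(c, A, b)
print(col, rowpivot_i, ratios)           # column 0; ratios 4/1 and 5/3 -> rowpivot 5/3
```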

9 Note that if no negative indicator is found, then this is the final simplex tableau.


We describe a simple protocol called Push-Min for computing this. At all times $t$, each node maintains a minimum $m_{t,i}$. At time $t = 0$, $m_{t,i} = rowpivot_i$. Thereafter, each node follows the protocol given in Algorithm 1. When the protocol Push-Min terminates, each node will have the exact value of the minimum $rowpivot_i$ in the network [41].

Algorithm 1 Protocol Push-Min
1. Let $\{m_r\}$ be all the values sent to $i$ at round $t - 1$.
2. Let $m_{t,i} = \min(\{m_r\}, rowpivot_i)$.
3. Send $m_{t,i}$ to all the neighbors.
4. $m_{t,i}$ is the estimate of the minimum in step $t$.
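A small round-based simulation of Push-Min on a static topology is sketched below; it is only meant to illustrate how the local minima spread, under the simplifying assumption of synchronous rounds (the DDMT implementation is asynchronous).

```python
# Sketch: synchronous-round simulation of Protocol Push-Min.
# Each node repeatedly takes the min of its own value and the values
# received from its neighbors; after at most diameter-many rounds every
# node holds the global minimum row pivot.

def push_min(neighbors, rowpivot):
    m = dict(rowpivot)                     # m at t = 0 is rowpivot_i
    while True:
        new_m = {
            i: min([m[i]] + [m[j] for j in neighbors[i]])
            for i in neighbors
        }
        if new_m == m:                     # no value changed: converged
            return new_m
        m = new_m

# Toy 4-node path topology 0-1-2-3 with local row pivots.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rowpivot = {0: 5.0, 1: 2.5, 2: 9.0, 3: 4.0}
print(push_min(neighbors, rowpivot))       # every node ends with 2.5
```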

Once the Push-Min protocol converges, the node containing the minimum $rowpivot_i$ (say $P_{min}$) will send its row in the simplex tableau to all other nodes in the network. Next, each node $P_i$ updates its local tableau with respect to the extra row it received from node $P_{min}$. The constraint sharing (CS) protocol is described in Algorithm 2. Completion of one round of the CS protocol ensures that one iteration of the distributed simplex algorithm is over.

Termination: In a termination state, two things should happen: (i) no more messages traverse the network and (ii) each local node has all its $c_i > 0$. Thus, the state of the network can be described by the information possessed by each node. In particular, each node will have a solution to the linear programming problem. Note that this solution converges exactly to the solution that would be obtained if all the constraints were centralized. If the distributed system were dynamic, with nodes joining and leaving on an ad hoc basis, the constraint matrix for the nodes currently in the network would change and so would the objective function to be solved. We are currently investigating approaches to solve this problem using dynamic linear programming approaches.

Algorithm 2 Constraint sharing protocol (CS protocol)
1. Node $P_i$ performs protocol Push-Min until there are no more messages passed.
2. On convergence to the exact minimum, the minimum $rowpivot_i$ is known to all nodes in the grid.
3. All the nodes use the row obtained in Step 2 to perform Gauss-Jordan elimination on the local tableau.
4. At the end of Step 3, each node locally has the updated tableau and completes the current iteration of the simplex algorithm.
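The per-node tableau update in Step 3 is a standard pivot operation; a minimal numpy sketch of that elimination step is shown below, where the pivot row received from $P_{min}$ and the agreed pivot column are taken as inputs (function and argument names are illustrative, and the bookkeeping of basic variables is omitted).

```python
# Sketch: local Gauss-Jordan pivot on a node's simplex tableau, using the
# globally agreed pivot column and the winning pivot row received from P_min.
import numpy as np

def apply_pivot(local_tableau, pivot_row, pivot_col):
    """Eliminate the pivot column from every local row (including the objective row)."""
    pivot_row = pivot_row / pivot_row[pivot_col]       # normalize pivot element to 1
    for r in range(local_tableau.shape[0]):
        local_tableau[r] -= local_tableau[r, pivot_col] * pivot_row
    return local_tableau

# Toy local tableau: two constraint rows plus an objective row (last row).
tableau = np.array([[1.0, 2.0, 4.0],
                    [3.0, 1.0, 5.0],
                    [-2.0, -1.0, 0.0]])
received_row = np.array([2.0, 1.0, 6.0])               # row sent by P_min
print(apply_pivot(tableau, received_row, pivot_col=0))
```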

8.4. Analysis of Protocol Push-Min

The Protocol Push-Min behaves similarly to the spread of an epidemic in a large population. Consequently, our analysis is based on statistical modeling of epidemics [42,43].

Definitions: A node is called susceptible if it does not have the exact minimum, but is capable of obtaining it by communication with its immediate neighbors. If a node receives a rowpivot value less than its current value, it becomes infected and willing to share this information with other neighbors. When a node unnecessarily contacts another node which already has the same information, there is no extra information gained by this communication. The node already having the information is called a dead or immune node. Let $x_t$ represent the number of susceptible nodes, $y_t$ the number of infected ones, and $z_t$ the number of dead or immune nodes. Then,

$$x_t + y_t + z_t = \eta. \qquad (6)$$

Let $\beta$ be the infection parameter, defined as the proportion of contacts between infective and susceptible nodes per unit time; $\gamma$ is the removal parameter, defined as the proportion of infective nodes per unit time removed from the population. We also define $\rho$ to be the relative removal rate, i.e. $\rho = \gamma / \beta$. The spread of the minimum value amongst the nodes can be represented by the following difference equations: (1) $x_{t+1} = x_t - \beta x_t y_t$, (2) $y_{t+1} = y_t + \beta x_t y_t - \gamma y_t$, and (3) $z_{t+1} = z_t + \gamma y_t$. Next, we illustrate the fact that under the Protocol Push-Min the entire network is infected exponentially fast. This also implies that on convergence, all nodes in the network have the same minimal value.
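To make the dynamics concrete, the following short sketch iterates the three difference equations for a small network; the parameter values are arbitrary illustrations, not values used in the paper.

```python
# Sketch: iterate the SIR-style difference equations used in the analysis.
# x: susceptible, y: infected, z: dead/immune; beta and gamma are
# illustrative values only.
eta = 100.0
beta, gamma = 0.005, 0.1
x, y, z = eta - 1.0, 1.0, 0.0

for t in range(10):
    x, y, z = (x - beta * x * y,
               y + beta * x * y - gamma * y,
               z + gamma * y)
    print(f"t={t+1:2d}  x={x:7.2f}  y={y:7.2f}  z={z:7.2f}")
# x shrinks roughly exponentially while the minimum value spreads (Lemma 1).
```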

LEMMA 1: Under the Protocol Push-Min, the number of susceptible nodes in the network decreases exponentially.

Proof: Let $x_t$ and $x_0$ represent the number of susceptible nodes in the network at time $t$ and $t = 0$, respectively. Then we have

$$\frac{x_{t+1}}{x_t} = 1 - \beta y_t = 1 - \frac{z_{t+1} - z_t}{\rho}.$$

Thus, the recursive equation for $x_t$ can be written as follows:

$$x_t = x_0 \prod_{j=0}^{t-1} \left(1 - \frac{z_{j+1} - z_j}{\rho}\right), \qquad \frac{x_t}{x_0} = \left(1 - \frac{z_1 - z_0}{\rho}\right)\left(1 - \frac{z_2 - z_1}{\rho}\right) \cdots \left(1 - \frac{z_t - z_{t-1}}{\rho}\right).$$

Taking logarithms on both sides, it can be shown that

$$\ln \frac{x_t}{x_0} \leq \left(-\frac{z_1 - z_0}{\rho}\right) + \left(-\frac{z_2 - z_1}{\rho}\right) + \cdots + \left(-\frac{z_t - z_{t-1}}{\rho}\right) \leq -\sum_{j=0}^{t-1} \frac{z_{j+1} - z_j}{\rho}.$$

Hence, $x_t \leq x_0 \exp\left(-\sum_{j=0}^{t-1} \frac{z_{j+1} - z_j}{\rho}\right)$. $\square$

8.5. Convergence of the Distributed Simplex Algorithm

In the simplex algorithm, the canonical form provides an immediate criterion for testing the optimality of a basic feasible solution. If the criterion is not satisfied, another iteration of the simplex algorithm is initiated. Formally, we can state the following theorem:

THEOREM 1: [20] Given a linear program presented in feasible canonical form, there exists a finite sequence of pivot operations, each yielding a basic feasible solution, such that the final canonical form yields an optimal basic feasible solution, or an infinite class of feasible solutions for which the values of $z$ have no lower bound.

The proof of Theorem 1 is given in Chapter 6 of ref. [20]. $\square$

In order to prove that the distributed algorithm indeed converges, we prove the following theorem:

THEOREM 2: Assume that the linear constraints at each node can be centralized and a feasible canonical form can be generated. If there exists a finite sequence of pivot operations, each yielding a basic feasible solution, such that the final canonical form yields an optimal basic feasible solution for the centralized scenario, then such a finite sequence of pivot operations also exists for the distributed algorithm.

Proof: First note that if the linear constraints at each node were centralized and the objective function was solved using these constraints, the Simplex Algorithm would terminate (Theorem 1). In the distributed scenario, the first step is to generate a canonical representation of the constraints at each node. Note that on completion of this initiation step, both the centralized and the distributed algorithms have exactly the same canonical representation. The distributed algorithm now obtains the column pivot and row pivot. The row pivot obtained after the Protocol Push-Min is executed yields the minimum in the entire network. Thus, this is identical to the row pivot that would be obtained in each iteration of the centralized simplex algorithm. This implies that in each iteration, the same row pivot and column pivot are seen in both the centralized and distributed algorithms. This completes the proof. $\square$

8.6. Communication Cost Analysis

The communication cost of the distributed algorithm comes from two parts: (i) the amount of communication required by the Protocol Push-Min until it converges and (ii) the number of iterations for the simplex algorithm to terminate. In the worst case, Protocol Push-Min may need to communicate with all the nodes in the network. This means that the worst-case communication cost for Push-Min in an iteration of simplex is $O(\eta)$ (where $\eta$ is the total number of nodes in the network). Also, in the worst case, the number of pivots needed by the simplex algorithm is $\binom{n}{m}$. Thus, in the worst case, the communication cost of this distributed algorithm can be exponential. However, in most practical cases, the simplex algorithm converges in $\lambda m$ [20] iterations, where typically $\lambda < 4$. This means that for most practical cases, the communication complexity is at most $O(\lambda m \eta)$. Note that centralization of data would require $O(mn)$ communication. Thus, if $\eta < n/\lambda$, significant benefits10 may be obtained from this distributed algorithm.

9. EXPERIMENTAL RESULTS

This section presents the experimental results for measuring the performance of the P2P classifier learning algorithm in the PADMINI system. The P2P classifier learning algorithm is implemented on the DDMT framework. We assume that a number of users have already installed the plugin, joined the same tagging group, tagged some text documents, and submitted the job. The tagged text documents have been converted to feature vectors by the plugin. There is enough training data (see Section 5.2) submitted for the DDM server to trigger the algorithm to start. Since, to the best of our knowledge, no prior work exists for solving linear programming using local, asynchronous, distributed algorithms in P2P environments, we are unable to compare this work with other distributed algorithms.

9.1. Experiment Setup

As discussed in Section 2, the PADMINI system can be used by astronomers to help them discover useful information from large amounts of unstructured text data. For this experiment, we use the Proposal Abstracts Catalog for the HST11. This catalog consists of over 8000 abstracts in different categories such as solar system, stars, quasars, and unresolved stellar populations. For each proposal, the catalog lists the proposal type, science category, ID, title, principal investigator (PI), and the abstract. In our experiment, we only consider abstracts that belong to the categories Hot Stars and Galaxies. Abstracts with science category 'Hot Stars and Stellar Corpses' or 'Hot Stars and Blue Stragglers' are classified into the category Hot Stars. Likewise, all abstracts related to galaxies are included in the category Galaxies.

The distributed environment is simulated using the DDMT. With three physical nodes in the network, the network overlay topologies are generated using BRITE,^12 and each node emulates an almost equal number of nodes from the topology. The Waxman model is used for generating the network structure, i.e. the probability that nodes u and v have an edge between them is given by the formula P(u, v) = αe^(−d/(βL)), where 0 < α, β ≤ 1, d is the Euclidean distance from node u to node v, and L is the maximum distance between any two nodes. Each peer in the network can tag a portion of the abstracts locally.

^10 Note that the value of n here takes into account all the basic variables per constraint. Thus, this is larger than the number of variables in each constraint equation.

^11 http://archive.stsci.edu/hst/proposal_abstracts.html

^12 http://www.cs.bu.edu/brite
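For reference, a minimal sketch of Waxman-style edge generation following the formula above is given below; the two-dimensional coordinate layout and the α and β values in the signature are assumptions made here and are not the BRITE defaults.

```python
# Sketch of Waxman-model edge generation: P(u, v) = alpha * exp(-d / (beta * L)).
import math
import random

def waxman_edges(coords, alpha=0.15, beta=0.2):
    """coords: list of (x, y) node positions (at least two distinct points assumed).
       Returns a list of undirected edges (i, j)."""
    n = len(coords)
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    # L is the maximum distance between any pair of nodes.
    L = max(dist(coords[i], coords[j]) for i in range(n) for j in range(i + 1, n))
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(coords[i], coords[j])
            if random.random() < alpha * math.exp(-d / (beta * L)):
                edges.append((i, j))
    return edges
```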

9.2. Results and Discussion

The performance is evaluated along the following aspects:

9.2.1. Communication cost compared to the centralized version

After tagging, every user has training data stored on his/her local machine. The distributed classifier learning algorithm allows users to learn the classifier collaboratively without centralizing the training data, whereas the centralized version of the algorithm requires all the training data to be transferred to the server. The amount of data transferred in bytes was calculated for both the centralized^13 and distributed scenarios. Let c be the total number of constraints, let v be the number of variables per constraint equation, and assume that each real number is represented by 4 bytes. The amount of data that needs to be transferred for centralization is then c ∗ (v + 1) ∗ 4. In the distributed case, both the messages transferred for estimating the pivot and the final pivot row at the end of an iteration of the simplex algorithm should be counted, i.e. if μ messages are exchanged to estimate the row pivot, then the total communication cost per simplex iteration is 4 ∗ (μ + v + 1). We first vary the number of nodes in the network between 10 and 120 for the HST data set, keeping the number of constraint equations and variables fixed at 10. Figure 7 shows that the distributed algorithm has lower communication cost than the centralized version.
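To make the byte counts concrete, here is a small Python sketch of the two formulas above; the specific values of c, v, and μ used in the example calls are hypothetical.

```python
# Byte-count formulas from the text (4 bytes per real number).
def centralized_bytes(c, v):
    return c * (v + 1) * 4               # every constraint row (v coefficients + RHS) shipped once

def distributed_bytes_per_iteration(mu, v):
    return 4 * (mu + v + 1)              # mu pivot-estimation messages plus the final pivot row

print(centralized_bytes(c=10, v=10))                  # 440 bytes for 10 constraints with 10 variables
print(distributed_bytes_per_iteration(mu=20, v=10))   # 124 bytes per simplex iteration; mu is hypothetical here
```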

9.2.2. Classification accuracy

For estimating the classification error, we divide the HST data set into a training set (T_train) and a testing set (T_test). We first use the training set to train the classifier model, which is essentially the global weight vector W. Then, we compute T_test ∗ W; values greater than the threshold of 0.5 get class label 1 and values less than 0.5 get class label 0. Table 1 reports the actual classification error obtained in the network for different network sizes, using different training and test set sizes. From Table 1, we find that the larger the training set size, the higher the classifier accuracy.

^13 Note that this is a hypothetical scenario.
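A minimal sketch of this thresholding step is shown below; the 0.5 cut-off and the role of W follow the description above, while the function and variable names are illustrative.

```python
# Sketch of the prediction step: score each test document and threshold at 0.5.
import numpy as np

def predict(T_test, W, threshold=0.5):
    scores = T_test @ W                        # T_test: (num_docs, num_features), W: learned weight vector
    return (scores > threshold).astype(int)    # label 1 above the threshold, 0 otherwise
```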

Fig. 7 Communication cost (bytes) versus the number of nodes in the network, for the centralized and distributed approaches. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Fig. 8 Performance affected by fragmenting the training data: communication cost (bytes) versus the number of nodes in the network, shown for the same total number of constraints and for the same number of constraints per node. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Table 1. Classification errors obtained in P2P networks of varying sizes for the HST data.

Network size | Train set size | Test set size | Error (%)
30           | 200            | 100           | 22
30           | 300            | 150           | 21
40           | 400            | 200           | 16
50           | 500            | 250           | 12.3

Note that this distributed classification algorithm is an exact algorithm, and hence the classification accuracies of the centralized and distributed versions are the same. Furthermore, our goal here is not to find the best classifier possible, so a comparison with standard centralized techniques such as SVMs and decision trees is not meaningful.

9.2.3. Effect of varying the number of variables and constraints per node in the network

We set the network size to 40 nodes and vary the number of edges between 40 and 160.


Fig. 9 Communication cost (bytes) of the P2P classifier learning algorithm versus (a) the number of constraints per node and (b) the number of variables per constraint, for networks with 40, 80, 120, and 160 edges. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

1. Effect of varying the number of constraints per node: Figure 9a shows the effect of varying the number of constraints per node. As the number of constraints at a node increases, only the local computation at each node to find the row pivot increases. This does not affect the number of messages being passed in the network, which is why the communication cost curve is relatively flat.

2. Effect of varying the number of variables per constraint: Figure 9b shows the effect of varying the number of variables per constraint. With an increasing number of variables, the slight increase in the curve can be attributed to the larger size of the row transmitted at the end of each simplex iteration.

9.2.4. Effect of varying the number of nodes in the network

In Figure 8, we vary the number of nodes in the network between 40 and 160. First, we keep the total amount of training data the same, so that the number of constraints on each node varies: as the number of nodes in the network increases, the amount of data distributed onto each node decreases. From Figure 8, the communication cost increases with the increasing number of nodes. Second, we maintain the same degree of data fragmentation, that is, we set the number of constraints per node to be the same (10 constraints per node). Figure 8 shows a similar communication cost in this case. Hence, the communication cost of the algorithm depends mainly on the number of nodes in the network, not on the number of constraints per node.

9.2.5. Scalability

To measure the scalability of the PADMINI system, we vary the number of nodes in the network, with each node having almost the same number of constraints, and measure the response time of the system. Figure 10a shows this variation: the response time of the system is linearly proportional to the size of the network. We also test the scalability of the system with respect to an increasing number of constraints on each peer. Figure 10b shows how the response time changes as the data to be processed at each peer varies. In this experiment, we vary the number of constraints at each node and check the response time on a 50-node network, once with 50 edges and then with 100 edges. As the number of edges increases, the message passing between neighbors increases; hence a slightly higher response time is observed for the network with 100 edges. Although the amount of data processed at a peer becomes larger, this only increases the local computation needed to find the row pivot. The amount of data does not affect the number of messages being passed in the network and hence does not change the response time drastically.

These experiments provide insight into the scalable nature of the distributed algorithm for solving the classification problem using a linear programming approach.

10. CONCLUSION AND FUTURE WORK

With the increasing number of web users and the emergence of Web 2.0, collaborative tagging is becoming increasingly useful for the organization and classification of online documents. In this paper, we describe a scalable web-based P2P system, PADMINI, which is powered by a distributed classifier learning algorithm based on linear programming. Using the PADMINI system, users can easily tag text documents in any online text repository and mine that data in a P2P environment. The classifier learning computation is distributed among the nodes of the P2P network, which makes the system very scalable. Moreover, the result of such collaborative input is made available to all the users of the system.


Fig. 10 Response time (sec) of the P2P classifier learning algorithm versus (a) the number of nodes in the network and (b) the number of constraints at each node, for networks with 50 and 100 edges. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

We presented extensive empirical results testing the accuracy and scalability of the algorithm on the HST abstracts database. Our results indicate that increasing the number of variables and constraints at a node does not affect the communication cost significantly; the communication cost depends mainly on the number of nodes in the network. We plan to develop additional distributed data mining algorithms to support the PADMINI system.

ACKNOWLEDGMENTS

We thank Codrina Lauth for her help with text feature extraction. This work is supported in part by NASA Grant NNX07AV70G, the AFOSR MURI Grant 2008-11, and the IBM Innovation Award.

REFERENCES

[1] A. Mathes, Folksonomies - cooperative classification and communication through shared metadata, 2004.

[2] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta, Distributed data mining in peer-to-peer networks, IEEE Internet Computing special issue on Distributed Data Mining 10 (2006), 18–26.

[3] H. Dutta and H. Kargupta, Distributed linear programming and resource management for data mining in distributed environments, In ICDM Workshops, 2008, 543–552.

[4] H. Dutta and A. Matthur, Distributed optimization strategies for mining on peer-to-peer networks, In Machine Learning and Applications, ICMLA '08, Seventh International Conference, Dec 2008, 350–355.

[5] K. Bhaduri and H. Kargupta, An efficient local algorithm for distributed multivariate regression in peer-to-peer networks, In SIAM International Conference on Data Mining (SIAM), Atlanta, Georgia, 2008, 153–164.

[6] K. Bhaduri, R. Wolff, C. Giannella, and H. Kargupta, Distributed identification of top-l inner product elements and its application in a peer-to-peer network, Stat Anal Data Mining 1(2) (2008), 85–103.

[7] R. Wolff, K. Bhaduri, and H. Kargupta, Local L2 thresholding based data mining in peer-to-peer systems, In Proceedings of the 2006 SIAM Conference on Data Mining, Bethesda, MD, April 2006.

[8] H. Dutta, X. Zhu, T. Mahule, H. Kargupta, K. Borne, C. Lauth, F. Holz, and G. Heyer, TagLearner: a P2P classifier learning system from collaboratively tagged text documents, ICDMW '09: Proceedings of the 2009 IEEE International Conference on Data Mining Workshops, IEEE Computer Society, Washington, DC, USA, 2009, 495–500.

[9] DDM bib. http://www.csee.umbc.edu/~hillol/DDMBIB/ddmbib_html/DistClass.html.

[10] P. Luo, H. Xiong, K. Lu, and Z. Shi, Distributed classification in peer-to-peer networks, KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, ACM, 2007, 968–976.

[11] A. Lazarevic and Z. Obradovic, The distributed boosting algorithm, KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, 2001, 311–316.

[12] G. Tsoumakas and I. Vlahavas, Effective stacking of distributed classifiers, 2002.

[13] O. L. Mangasarian, Linear and nonlinear separation of patterns by linear programming, Oper Res 13(3) (1965), 444–452.

[14] F. W. Smith, Pattern classifier design by linear programming, IEEE Trans Comput C-17(4) (1968), 367–372.

[15] P. S. Bradley, O. L. Mangasarian, and W. N. Street, Feature selection via mathematical programming, Technical Report 95-21, Computer Science Department, University of Wisconsin, 1995.

[16] P. S. Bradley, O. L. Mangasarian, and W. N. Street, Clustering via concave minimization, Technical Report 96-03, Computer Science Department, University of Wisconsin, 1997.

[17] G. Fung and O. L. Mangasarian, Data selection for support vector machine classification, Proceedings of Knowledge Discovery and Data Mining (KDD), Boston, MA, August 2000.

[18] O. L. Mangasarian, J. W. Shavlik, and E. W. Wild, Knowledge-based kernel approximation, J Mach Learn Res 5 (2004), 1127–1141.

[19] Y. J. Lee and O. L. Mangasarian, RSVM: reduced support vector machines, SIAM International Conference on Data Mining (SIAM), Chicago, IL, April 2001.


[20] G. B. Dantzig, Linear Programming and Extensions, Princeton, NJ, Princeton University Press, 1963.

[21] R. M. Freund and S. Mizuno, Interior point methods: current status and future directions, In High Performance Optimization, H. Frenk, ed., Springer, Kluwer Academic Press, 2000, 441–446.

[22] N. Karmarkar, A new polynomial-time algorithm for linear programming, Combinatorica 4 (1984), 373–395.

[23] S. Craig and D. Reed, Hypercube implementation of the simplex algorithm, In Association for Computing Machinery (ACM), 1988, 1473–1482.

[24] G. Yarmish, A Distributed Implementation of the Simplex Method, Ph.D. Thesis, Computer and Information Science, Polytechnic University, 2001.

[25] J. K. Ho and R. P. Sundarraj, On the efficacy of distributed simplex algorithms for linear programming, Comput Optim Appl 3(4) (1994), 349–363.

[26] E. Jonathan, I. Boduroglu, L. Polymenakos, and D. Goldfarb, Data-parallel implementations of dense simplex methods on the Connection Machine CM-2, ORSA J Comput (INFORMS) 7(4) (1995), 402–416.

[27] N. A. Syed, S. Huan, L. Kah, and K. Sung, Incremental learning with support vector machines, 1999.

[28] C. Caragea, D. Caragea, and V. Honavar, Learning support vector machines from distributed data sources, AAAI'05: Proceedings of the 20th National Conference on Artificial Intelligence, AAAI Press, 2005, 1602–1603.

[29] H. P. Graf, E. Cosatto, L. Bottou, I. Durdanovic, and V. Vapnik, Parallel support vector machines: the cascade SVM, Advances in Neural Information Processing Systems, MIT Press, 2005, 521–528.

[30] C. Hensel and H. Dutta, GADGET SVM: a Gossip-bAseD sub-GradiEnT SVM solver, International Conference on Machine Learning (ICML), Numerical Mathematics in Machine Learning Workshop, Montreal, Canada, 2009.

[31] Y. Lu, V. W. P. Roychowdhury, and L. Vandenberghe, Distributed parallel support vector machines in strongly connected networks, IEEE Trans Neural Networks 19(7) (2008), 1167–1178.

[32] L. V. S. Lakshmanan, S. A. Yahia, M. Benedikt, and J. Stoyanovich, Efficient network aware search in collaborative tagging sites, Proceedings of VLDB, 2008, 710–721.

[33] K. H. L. Tso-Sutter, L. B. Marinho, and L. Schmidt-Thieme, Tag-aware recommender systems by fusion of collaborative filtering algorithms, SAC '08: Proceedings of the 2008 ACM Symposium on Applied Computing, ACM, New York, NY, USA, 2008, 1995–1999.

[34] C. H. Brooks and N. Montanez, Improved annotation of the blogosphere via autotagging and hierarchical clustering, WWW '06: Proceedings of the 15th International Conference on World Wide Web, New York, NY, USA, ACM, 2006, 625–632.

[35] X. Li, L. Guo, and Y. E. Zhao, Tag-based social interest discovery, WWW '08: Proceedings of the 17th International Conference on World Wide Web, New York, NY, USA, ACM, 2008, 675–684.

[36] S. Golder and B. A. Huberman, The structure of collaborative tagging systems, 2005.

[37] S. Niwa, T. Doi, and S. Honiden, Web page recommender system based on folksonomy mining for ITNG '06 submissions, ITNG '06: Proceedings of the Third International Conference on Information Technology: New Generations, Washington, DC, USA, IEEE Computer Society, 2006, 388–393.

[38] T. Mahule, K. Borne, S. Dey, S. Arora, and H. Kargupta, PADMINI: a peer-to-peer distributed astronomy data mining system and a case study, NASA Conference on Intelligent Data Understanding (CIDU), Moffett Field, CA, 2010.

[39] International Virtual Observatory Alliance, http://www.ivoa.net/.

[40] US National Virtual Observatory, http://us-vo.org/.

[41] B. Mayank, G. M. Hector, G. Aristides, and R. Motwani, Estimating aggregates on a peer-to-peer network, Technical Report, Computer Science Department, Stanford University, 2004.

[42] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry, Epidemic algorithms for replicated database maintenance, PODC '87: Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing, ACM Press, New York, NY, USA, 1987, 1–12.

[43] B. Pittel, On spreading a rumor, SIAM J Appl Math 47(1) (1987), 213–223.
