EdChang - Parallel Algorithms For Mining Large Scale Data

Page 1: EdChang - Parallel Algorithms For Mining Large Scale Data

Parallel Algorithms for Mining Large-Scale Data

Edward Chang
Director, Google Research, Beijing

http://infolab.stanford.edu/~echang/

Ed Chang EMMDS 2009 1

Page 2: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 2

Page 3: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 3

Page 4: EdChang - Parallel Algorithms For Mining Large Scale Data

Story of the Photos

• The photos on the first pages are statues of two "bookkeepers" displayed at the British Museum. One bookkeeper keeps a list of good people, and the other a list of bad. (Who is who, can you tell? ☺)

• When I first visited the museum in 1998, I did not take a photo of them, to conserve film. During this trip (June 2009), capacity is no longer a concern or constraint. In fact, one can see kids and grandmas all taking photos, a lot of photos. Ten years apart, data volume explodes.

• Data complexity also grows.

• So, can these ancient "bookkeepers" still classify good from bad? Is their capacity scalable to the data dimension and data quantity?

Ed Chang EMMDS 2009 4

Page 5: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 5

Page 6: EdChang - Parallel Algorithms For Mining Large Scale Data

Outline

• Motivating Applications
  – Q&A System
  – Social Ads
• Key Subroutines
  – Frequent Itemset Mining [ACM RS 08]
  – Latent Dirichlet Allocation [WWW 09, AAIM 09]
  – Clustering [ECML 08]
  – UserRank [Google TR 09]
  – Support Vector Machines [NIPS 07]
• Distributed Computing Perspectives

Ed Chang EMMDS 2009 6


Page 7: EdChang - Parallel Algorithms For Mining Large Scale Data

Query: What are must‐see attractions at Yellowstone

Ed Chang EMMDS 2009 7

Page 8: EdChang - Parallel Algorithms For Mining Large Scale Data

Query: What are must‐see attractions at Yellowstone

Ed Chang EMMDS 2009 8

Page 9: EdChang - Parallel Algorithms For Mining Large Scale Data

Query: Must‐see attractions at Yosemite

Ed Chang EMMDS 2009 9

Page 10: EdChang - Parallel Algorithms For Mining Large Scale Data

Query: Must‐see attractions at Copenhagen

Ed Chang EMMDS 2009 10

Page 11: EdChang - Parallel Algorithms For Mining Large Scale Data

Query: Must‐see attractions at Beijing

Hotel ads

Ed Chang EMMDS 2009 11

Page 12: EdChang - Parallel Algorithms For Mining Large Scale Data

Who is Yao Ming

Ed Chang EMMDS 2009 12

Page 13: EdChang - Parallel Algorithms For Mining Large Scale Data

Q&A: Yao Ming

Ed Chang EMMDS 2009 13

Page 14: EdChang - Parallel Algorithms For Mining Large Scale Data

Who is Yao Ming: Yao Ming related Q&As

Ed Chang EMMDS 2009 14

Page 15: EdChang - Parallel Algorithms For Mining Large Scale Data

Application: Google Q&A (Confucius), launched in China, Thailand, and Russia

[Architecture figure: Search (S), Community (C), and Differentiated Content (DC) connect through users (U); CCS/Q&A triggers a discussion/question session during search.]

Q&A system functions:
• Provide labels to a question (semi-automatically)
• Given a question, find similar questions and their answers (automatically)
• Evaluate the quality of an answer, its relevance, and its originality
• Evaluate user credentials in a topic-sensitive way
• Route questions to experts
• Provide the most relevant, high-quality content for Search to index

Ed Chang EMMDS 2009 15

Page 16: EdChang - Parallel Algorithms For Mining Large Scale Data

Q&A Uses Machine Learning

• Label suggestion using ML algorithms
• Real-time topic-to-topic (T2T) recommendation using ML algorithms
• Gives out related high-quality links to previous questions before human answers appear

Ed Chang EMMDS 2009 16

Page 17: EdChang - Parallel Algorithms For Mining Large Scale Data

Collaborative Filtering

[Figure: a Users × Books/Photos membership matrix with scattered 1s.]

Based on the memberships observed so far, and the memberships of other users, predict further memberships.

Ed Chang EMMDS 2009 17

Page 18: EdChang - Parallel Algorithms For Mining Large Scale Data

Collaborative Filtering

[Figure: a partially observed Users × Books/Photos matrix, with '?' marking unobserved entries.]

Based on the partially observed matrix, predict the unobserved entries:
I. Will user i enjoy photo j?
II. Will user i be interesting to user j?
III. Will photo i be related to photo j?
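To make the prediction concrete, here is a minimal sketch (not from the talk) of completing such a partially observed membership matrix with a rank-k factorization; the toy matrix R, the rank k, and the learning-rate/regularization values are illustrative assumptions.

```python
import numpy as np

def factorize(R, k=2, steps=2000, lr=0.01, reg=0.05, seed=0):
    """Fit U (users x k) and V (items x k) to the observed entries of R
    (NaN = unobserved), then score every cell as U @ V.T."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    observed = [(i, j) for i in range(n_users) for j in range(n_items)
                if not np.isnan(R[i, j])]
    for _ in range(steps):
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * U[i] - reg * V[j])
    return U @ V.T   # scores for every (user, item) pair, including the '?' cells

# Toy Users x Photos membership matrix; NaN marks the unobserved '?' entries.
R = np.array([[1, 1, np.nan, 0],
              [1, np.nan, 1, 0],
              [0, 0, 1, np.nan]], dtype=float)
print(np.round(factorize(R), 2))
```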

Ed Chang EMMDS 2009 18

Page 19: EdChang - Parallel Algorithms For Mining Large Scale Data

FIM-based Recommendation

Ed Chang EMMDS 2009 19

Page 20: EdChang - Parallel Algorithms For Mining Large Scale Data

FIM Preliminaries

• Observation 1: If an item A is not frequent, any pattern containing A won't be frequent [R. Agrawal]
  – use a threshold to eliminate infrequent items (if {A} is infrequent, so is {A, B})
• Observation 2: Patterns containing A are subsets of (or found from) transactions containing A [J. Han]
  – divide-and-conquer: select the transactions containing A to form a conditional database (CDB), and find the patterns containing A from that conditional database
  – e.g., {A, B}, {A, C}, {A} → the CDB of A; {A, B}, {B, C} → the CDB of B
• Observation 3: To prevent the same pattern from being found in multiple CDBs, all itemsets are sorted in the same manner (e.g., by descending support)

Ed Chang EMMDS 2009 20

Page 21: EdChang - Parallel Algorithms For Mining Large Scale Data

Preprocessing

• According to Observation 1, we count the support of each item by scanning the database, and eliminate the infrequent items from the transactions.
  (Item supports in the example: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3; o: 2; d, e, g, h, i, k, l, n: 1)
• According to Observation 3, we sort the items in each transaction by the order of descending support value.
  (Example: f a c d g i m p → f c a m p; a b c f l m o → f c a b m; b f h j o → f b; b c k s p → c b p; a f c e l p m n → f c a m p)
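A minimal sketch of this preprocessing step, assuming a support threshold of 3 as in the example; the helper name `preprocess` and the tie-breaking order among equally frequent items are illustrative choices, not from the talk.

```python
from collections import Counter

def preprocess(transactions, min_support=3):
    """Count item supports, drop infrequent items (Observation 1), and sort each
    transaction by descending support (Observation 3).  Ties among equally
    frequent items only need a fixed order; the slides use f, c, a, b, m, p."""
    support = Counter(item for t in transactions for item in t)
    frequent = {item for item, count in support.items() if count >= min_support}
    order = lambda item: (-support[item], item)     # any fixed total order works
    return [sorted((i for i in t if i in frequent), key=order) for t in transactions]

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
print(preprocess(db))  # each transaction keeps only f, c, a, b, m, p, most frequent first
```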

Ed Chang EMMDS 2009 21


Page 22: EdChang - Parallel Algorithms For Mining Large Scale Data

Parallel Projection

• According to Observation 2, we construct the CDB of item A; then from this CDB, we find the patterns containing A.
• How to construct the CDB of A?
  – If a transaction contains A, this transaction should appear in the CDB of A
  – Given a transaction {B, A, C}, it should appear in the CDB of A, the CDB of B, and the CDB of C
• Dedup solution: using the order of items:
  – sort {B, A, C} by the order of items → <A, B, C>
  – Put <> into the CDB of A
  – Put <A> into the CDB of B
  – Put <A, B> into the CDB of C
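A small sketch of this dedup rule on sorted transactions (illustrative code, not the paper's implementation): each item's CDB receives only the prefix that precedes it.

```python
from collections import defaultdict

def project(sorted_transactions):
    """Build the conditional database (CDB) of each item.  For a sorted
    transaction, the item at position k receives only the prefix before k,
    so every pattern is discovered in exactly one CDB (the dedup rule above)."""
    cdbs = defaultdict(list)
    for t in sorted_transactions:
        for k, item in enumerate(t):
            cdbs[item].append(t[:k])  # for <A,B,C>: A gets <>, B gets <A>, C gets <A,B>
    return dict(cdbs)

print(project([["A", "B", "C"], ["A", "C"]]))
# {'A': [[], []], 'B': [['A']], 'C': [['A', 'B'], ['A']]}
```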

Ed Chang EMMDS 2009 22


Page 23: EdChang - Parallel Algorithms For Mining Large Scale Data

Example of Projection

Sorted transactions (in the order f, c, a, b, m, p):
  f c a m p
  f c a b m
  f b
  c b p
  f c a m p

Conditional databases of the frequent items:
  p: { f c a m / f c a m / c b }
  m: { f c a / f c a / f c a b }
  b: { f c a / f / c }
  a: { f c / f c / f c }
  c: { f / f / f }

Example of projection of a database into CDBs. Left: sorted transactions in the order f, c, a, b, m, p; Right: conditional databases of frequent items.

Ed Chang EMMDS 2009 23

Page 24: EdChang - Parallel Algorithms For Mining Large Scale Data

Example of Projection

(Same example as the previous slide: sorted transactions on the left; conditional databases of frequent items on the right.)

Ed Chang EMMDS 2009 24

Page 25: EdChang - Parallel Algorithms For Mining Large Scale Data

Example of Projection

(Same example as the previous slides: sorted transactions on the left; conditional databases of frequent items on the right.)

Ed Chang EMMDS 2009 25

Page 26: EdChang - Parallel Algorithms For Mining Large Scale Data

Recursive Projections [H. Li, et al., ACM RS 08]

• Recursive projection forms a search tree: D → D|a, D|b, D|c → D|ab, D|ac, D|bc → D|abc → ...
• Each node is a CDB
• Using the order of items prevents duplicated CDBs
• Each level of breadth-first search of the tree can be done by one MapReduce iteration (Iteration 1, 2, 3, ...)
• Once a CDB is small enough to fit in memory, we can invoke FP-growth to mine this CDB, and there is no more growth of the subtree

Ed Chang EMMDS 2009 26


Page 27: EdChang - Parallel Algorithms For Mining Large Scale Data

Projection using MapReduce

• Map inputs (transactions): f a c d g i m p; a b c f l m o; b f h j o; b c k s p; a f c e l p m n
• Sorted transactions (with infrequent items eliminated): f c a m p; f c a b m; f b; c b p; f c a m p
• Map outputs (key: item, value: conditional transaction), e.g. for "f c a m p": p → f c a m, m → f c a, a → f c, c → f, f → <>
• Reduce inputs (key: item, value: conditional database):
  p: { f c a m / f c a m / c b }
  m: { f c a / f c a / f c a b }
  b: { f c a / f / c }
  a: { f c / f c / f c }
  c: { f / f / f }
• Reduce outputs (patterns and supports):
  p: 3, pc: 3
  mf: 3, mc: 3, ma: 3, mfc: 3, mfa: 3, mca: 3, mfca: 3
  b: 3
  a: 3, af: 3, ac: 3, afc: 3
  c: 3, cf: 3
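Below is a sequential sketch of one such map/reduce iteration (illustrative only; the real system runs the map and reduce phases on Google's MapReduce, and would recurse or call FP-growth inside the reducer).

```python
from collections import defaultdict

def map_phase(sorted_transactions):
    """Map: for every item of every sorted transaction, emit
    (item, conditional transaction), i.e., the prefix before that item."""
    for t in sorted_transactions:
        for k, item in enumerate(t):
            yield item, t[:k]

def reduce_phase(pairs):
    """Reduce: group the conditional transactions by item into CDBs.
    A real reducer would then mine each CDB (FP-growth, or another
    projection round if the CDB is still too large to fit in memory)."""
    cdbs = defaultdict(list)
    for item, conditional in pairs:
        cdbs[item].append(conditional)
    return dict(cdbs)

db = [["f","c","a","m","p"], ["f","c","a","b","m"], ["f","b"],
      ["c","b","p"], ["f","c","a","m","p"]]
for item, cdb in reduce_phase(map_phase(db)).items():
    print(item, cdb)
```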

Ed Chang EMMDS 2009 27


Page 28: EdChang - Parallel Algorithms For Mining Large Scale Data

Collaborative Filtering

[Figure: an Individuals × Indicators/Diseases membership matrix.]

Based on the memberships observed so far, and the memberships of others, predict further memberships.

Ed Chang EMMDS 2009 28

Page 29: EdChang - Parallel Algorithms For Mining Large Scale Data

Latent Semantic Analysis (Users / Music / Ads / Questions / Answers)

• Search
  – Construct a latent topic layer for better semantic matching
  – Example: the query "iPhone crack" should match the document "How to install apps on Apple mobile phones?" rather than "1 recipe pastry for a 9 inch double crust, 9 apples, 1/2 cup brown sugar", which matches the query "Apple pie"
• Other Collaborative Filtering Apps
  – Recommend Users → Users
  – Recommend Music → Users
  – Recommend Ads → Users
  – Recommend Answers → Questions
• Predict the ? in the light-blue cells of the user–document matrix from the topic distributions of the user queries (e.g., "iPhone crack", "Apple pie") and of the documents
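A toy sketch of why the latent topic layer helps matching: compare the query and candidate documents by their topic distributions rather than raw word overlap. The topic vectors below are made up for illustration.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical topic distributions (e.g., inferred by LSA/LDA) over the topics
# [mobile-phones, cooking, sports]:
query_iphone_crack = np.array([0.90, 0.05, 0.05])
doc_install_apps   = np.array([0.85, 0.10, 0.05])  # "How to install apps on Apple mobile phones?"
doc_apple_pie      = np.array([0.05, 0.90, 0.05])  # the pie recipe

print(cosine(query_iphone_crack, doc_install_apps))  # high -> good semantic match
print(cosine(query_iphone_crack, doc_apple_pie))     # low  -> rejected despite the word "apple"
```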

Ed Chang EMMDS 2009 29


Page 30: EdChang - Parallel Algorithms For Mining Large Scale Data

Documents, Topics, Words

• A document consists of a number of topics
  – A document is a probabilistic mixture of topics
• Each topic generates a number of words
  – A topic is a distribution over words
  – The probability of the ith word in a document is a mixture over its topics: P(wi) = Σj P(wi | zi = j) P(zi = j)

Ed Chang EMMDS 2009 30

Page 31: EdChang - Parallel Algorithms For Mining Large Scale Data

Latent Dirichlet Allocation [M. Jordan 04]

• α: uniform Dirichlet prior for the per-document topic distribution θd (corpus-level parameter)
• β: uniform Dirichlet prior for the per-topic word distribution φ (corpus-level parameter)
• θd: the topic distribution of document d (document level)
• zdj: the topic of the jth word in document d; wdj: the specific word (word level)

[Plate diagram: α → θ → z → w ← φ ← β, with N words per document, M documents, and K topics]
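For reference, a toy sketch of the generative process this plate diagram encodes; the corpus sizes, K, α, β, and vocabulary size here are made-up values.

```python
import numpy as np

def generate_corpus(M=5, N=20, K=3, V=50, alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process:
    phi_k ~ Dirichlet(beta) per topic, theta_d ~ Dirichlet(alpha) per document,
    then for each word position: z ~ Multinomial(theta_d), w ~ Multinomial(phi_z)."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=K)       # per-topic word distributions
    docs = []
    for _ in range(M):
        theta = rng.dirichlet([alpha] * K)        # per-document topic distribution
        z = rng.choice(K, size=N, p=theta)        # topic of each word position
        w = [rng.choice(V, p=phi[k]) for k in z]  # the observed words (ids in [0, V))
        docs.append(w)
    return docs

print(generate_corpus()[0])
```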

Ed Chang EMMDS 2009 31


Page 32: EdChang - Parallel Algorithms For Mining Large Scale Data

LDA Gibbs Sampling: Inputs and Outputs

Inputs:
1. training data: documents as bags of words
2. parameter: the number of topics

Outputs:
1. by-product: a co-occurrence (count) matrix of topics and documents (docs × topics)
2. model parameters: a co-occurrence (count) matrix of topics and words (topics × words)
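A single-machine collapsed Gibbs sampler sketch showing how these two count matrices are built and resampled (illustrative; the parallel version discussed next partitions documents across machines, which is not shown here).

```python
import numpy as np

def gibbs_lda(docs, K, V, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.  docs: lists of word ids in [0, V).
    Returns the doc-topic and topic-word count matrices described above."""
    rng = np.random.default_rng(seed)
    nd = np.zeros((len(docs), K))    # docs x topics counts
    nw = np.zeros((K, V))            # topics x words counts
    nk = np.zeros(K)                 # total words assigned to each topic
    z = [rng.integers(K, size=len(d)) for d in docs]   # random initial topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            nd[d, z[d][i]] += 1; nw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                nd[d, k] -= 1; nw[k, w] -= 1; nk[k] -= 1       # remove current assignment
                p = (nd[d] + alpha) * (nw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())               # resample the topic
                z[d][i] = k
                nd[d, k] += 1; nw[k, w] += 1; nk[k] += 1
    return nd, nw
```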

Ed Chang EMMDS 2009 32

Page 33: EdChang - Parallel Algorithms For Mining Large Scale Data

Parallel Gibbs Sampling [AAIM 09]

Inputs:
1. training data: documents as bags of words
2. parameter: the number of topics

Outputs:
1. by-product: a co-occurrence (count) matrix of topics and documents (docs × topics)
2. model parameters: a co-occurrence (count) matrix of topics and words (topics × words)

Ed Chang EMMDS 2009 33

Page 34: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 34

Page 35: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 35

Page 36: EdChang - Parallel Algorithms For Mining Large Scale Data

Outline

• Motivating Applications
  – Q&A System
  – Social Ads
• Key Subroutines
  – Frequent Itemset Mining [ACM RS 08]
  – Latent Dirichlet Allocation [WWW 09, AAIM 09]
  – Clustering [ECML 08]
  – UserRank [Google TR 09]
  – Support Vector Machines [NIPS 07]
• Distributed Computing Perspectives

Ed Chang EMMDS 2009 36


Page 37: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 37

Page 38: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 38

Page 39: EdChang - Parallel Algorithms For Mining Large Scale Data

Open Social APIs

1. Profiles (who I am)
2. Friends (who I know)
3. Stuff (what I have)
4. Activities (what I do)

Ed Chang EMMDS 2009 39

Page 40: EdChang - Parallel Algorithms For Mining Large Scale Data

Activities

Recommendations

Applications

Ed Chang EMMDS 2009 40


Page 41: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 41

Page 42: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 42

Page 43: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 43

Page 44: EdChang - Parallel Algorithms For Mining Large Scale Data

Ed Chang EMMDS 2009 44

Page 45: EdChang - Parallel Algorithms For Mining Large Scale Data

Task: Targeting Ads at SNS Users

[Figure: a Users × Ads matrix.]

Ed Chang EMMDS 2009 45

Page 46: EdChang - Parallel Algorithms For Mining Large Scale Data

Mining Profiles, Friends & Activities for Relevance

Ed Chang EMMDS 2009 46

Page 47: EdChang - Parallel Algorithms For Mining Large Scale Data

Consider also User Influence

• Advertisers consider users who are
  – Relevant
  – Influential
• SNS Influence Analysis
  – Centrality
  – Credential
  – Activeness
  – etc.

Ed Chang EMMDS 2009 47

Page 48: EdChang - Parallel Algorithms For Mining Large Scale Data

Spectral Clustering [A. Ng, M. Jordan]

• Important subroutine in machine learning and data mining tasks
  – Exploits pairwise similarity of data instances
  – More effective than traditional methods, e.g., k-means
• Key steps
  – Construct the pairwise similarity matrix, e.g., using geodesic distance
  – Compute the Laplacian matrix
  – Apply eigendecomposition
  – Perform k-means
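A compact sketch of these steps (illustrative, not the parallel implementation); it uses a Gaussian similarity instead of the geodesic distance mentioned above, and scikit-learn's KMeans for the final step.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Similarity matrix -> normalized Laplacian -> eigendecomposition ->
    k-means on the leading eigenvectors (the key steps listed above)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-sq / (2 * sigma ** 2))                 # pairwise (Gaussian) similarity
    np.fill_diagonal(S, 0.0)
    d = S.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(X)) - D_inv_sqrt @ S @ D_inv_sqrt   # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                        # eigendecomposition
    V = vecs[:, :k]                                    # k smallest eigenvectors
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # row-normalize
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```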

Ed Chang EMMDS 2009 48

Page 49: EdChang - Parallel Algorithms For Mining Large Scale Data

Scalability Problem

• Quadratic computation of the n×n similarity matrix
• Approximation methods for the dense matrix:
  – Sparsification: t-NN, ξ-neighborhood, ...
  – Nystrom sampling: random, greedy, ...
  – Others

Ed Chang EMMDS 2009 49


Page 50: EdChang - Parallel Algorithms For Mining Large Scale Data

Sparsification vs. Sampling

Sparsification:
• Construct the dense similarity matrix S
• Sparsify S
• Compute the Laplacian matrix L
• Apply ARPACK on L
• Use k-means to cluster the rows of V into k groups

Sampling (Nystrom):
• Randomly sample l points, where l << n
• Construct the dense similarity matrix [A B] between the l sampled points and all n points
• Normalize A and B to be in Laplacian form
• R = A + A^{-1/2} B B^T A^{-1/2}; R = U Σ U^T
• k-means
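A sketch of the sampling (Nystrom) branch, assuming the l×n similarity matrix [A B] has already been built and normalized to Laplacian form; the symmetric A^{-1/2} is computed by eigendecomposition here for clarity.

```python
import numpy as np

def nystrom_embedding(S_ln, l):
    """Nystrom approximation.  S_ln = [A B] is the l x n similarity matrix
    between the l sampled points and all n points (A = S_ln[:, :l], SPD)."""
    A, B = S_ln[:, :l], S_ln[:, l:]
    w, Q = np.linalg.eigh(A)                                   # symmetric A^{-1/2}
    A_is = Q @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ Q.T
    R = A + A_is @ B @ B.T @ A_is                              # R = A + A^{-1/2} B B^T A^{-1/2}
    s, U = np.linalg.eigh(R)                                   # R = U Sigma U^T
    # Approximate eigenvectors for all n points (one row per point), to feed k-means;
    # columns are ordered by ascending eigenvalue of R.
    V = S_ln.T @ A_is @ U / np.sqrt(np.maximum(s, 1e-12))
    return V
```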

Ed Chang EMMDS 2009 50

Page 51: EdChang - Parallel Algorithms For Mining Large Scale Data

Empirical Study [Song, et al., ECML 08]

• Dataset: RCV1 (Reuters Corpus Volume I)
  – A filtered collection of 193,944 documents in 103 categories
• Photo set: PicasaWeb
  – 637,137 photos
• Experiments
  – Clustering quality vs. computational time
    • Measure the similarity between categories (CAT) and clusters (CLS)
    • Normalized Mutual Information (NMI)
  – Scalability

Ed Chang EMMDS 2009 51

Page 52: EdChang - Parallel Algorithms For Mining Large Scale Data

NMI Comparison (on RCV1)

Ed Chang EMMDS 2009 52

[Figure: NMI of the Nystrom method vs. sparse matrix approximation]

Page 53: EdChang - Parallel Algorithms For Mining Large Scale Data

Speedup Test on 637,137 Photos

• K = 1000 clusters
• Achieves linear speedup when using up to 32 machines; after that, sub-linear speedup because of increasing communication and synchronization time

Ed Chang EMMDS 2009 53

Page 54: EdChang - Parallel Algorithms For Mining Large Scale Data

Sparsification vs. Sampling

                                            Sparsification                     Nystrom (random sampling)
Information                                 Full n x n similarity scores       None
Pre-processing complexity (bottleneck)      O(n^2) worst case;                 O(nl), l << n
                                            easily parallelizable
Effectiveness                               Good                               Not bad (Jitendra M., PAMI)

Ed Chang EMMDS 2009 54

Page 55: EdChang - Parallel Algorithms For Mining Large Scale Data

Outline

• Motivating Applications
  – Q&A System
  – Social Ads
• Key Subroutines
  – Frequent Itemset Mining [ACM RS 08]
  – Latent Dirichlet Allocation [WWW 09, AAIM 09]
  – Clustering [ECML 08]
  – UserRank [Google TR 09]
  – Support Vector Machines [NIPS 07]
• Distributed Computing Perspectives

Ed Chang EMMDS 2009 55


Page 56: EdChang - Parallel Algorithms For Mining Large Scale Data

Matrix Factorization Alternatives

exact

approximate

Ed Chang EMMDS 2009 56


Page 57: EdChang - Parallel Algorithms For Mining Large Scale Data

PSVM [E. Chang, et al., NIPS 07]

• Column-based ICF (incomplete Cholesky factorization)
  – Slower than row-based on a single machine
  – Parallelizable on multiple machines
• Changing the IPM (interior-point method) computation order to achieve parallelization

Ed Chang EMMDS 2009 57

Page 58: EdChang - Parallel Algorithms For Mining Large Scale Data

Speedup

Ed Chang EMMDS 2009 58

Page 59: EdChang - Parallel Algorithms For Mining Large Scale Data

Overheads

Ed Chang EMMDS 2009 59

Page 60: EdChang - Parallel Algorithms For Mining Large Scale Data

Comparison between Parallel Computing Frameworks

                                                 MapReduce          Project B          MPI
GFS/IO and task-rescheduling overhead
  between iterations                             Yes                No (+1)            No (+1)
Flexibility of computation model                 AllReduce only     AllReduce only     Flexible
                                                 (+0.5)             (+0.5)             (+1)
Efficient AllReduce                              Yes (+1)           Yes (+1)           Yes (+1)
Recover from faults between iterations           Yes (+1)           Yes (+1)           Apps
Recover from faults within each iteration        Yes (+1)           Yes (+1)           Apps
Final score for scalable machine learning        3.5                4.5                5

Ed Chang EMMDS 2009 60


Page 61: EdChang - Parallel Algorithms For Mining Large Scale Data

Concluding Remarks

• Applications demand scalable solutions
• Have parallelized key subroutines for mining massive data sets
  – Spectral Clustering [ECML 08]
  – Frequent Itemset Mining [ACM RS 08]
  – PLSA [KDD 08]
  – LDA [WWW 09]
  – UserRank
  – Support Vector Machines [NIPS 07]
• Relevant papers
  – http://infolab.stanford.edu/~echang/
• Open Source PSVM, PLDA
  – http://code.google.com/p/psvm/
  – http://code.google.com/p/plda/

Ed Chang EMMDS 2009 61

Page 62: EdChang - Parallel Algorithms For Mining Large Scale Data

Collaborators

• Prof. Chih-Jen Lin (NTU)
• Hongjie Bai (Google)
• Wen-Yen Chen (UCSB)
• Jon Chu (MIT)
• Haoyuan Li (PKU)
• Yangqiu Song (Tsinghua)
• Matt Stanton (CMU)
• Yi Wang (Google)
• Dong Zhang (Google)
• Kaihua Zhu (Google)

Ed Chang EMMDS 2009 62

Page 63: EdChang - Parallel Algorithms For Mining Large Scale Data

References

[1] Alexa internet. http://www.alexa.com/.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proc. of the 21st International Conference on Machine Learning, pages 373-380, 2004.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the Seventeenth International Conference on Machine Learning, pages 167-174, 2000.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436, 2001.
[6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Uncertainty in Artificial Intelligence, pages 289-296, 1999.
[10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, 2004.
[11] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, Computer Science, University of Massachusetts Amherst, 2004.
[12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 20, 2007.
[13] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91-121, 2002.

Ed Chang EMMDS 2009 63

Page 64: EdChang - Parallel Algorithms For Mining Large Scale Data

References (cont.)

[14] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proc. of the 24th International Conference on Machine Learning, pages 791-798, 2007.
[15] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the Orkut social network. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 678-684, 2005.
[16] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2004.
[17] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research (JMLR), 3:583-617, 2002.
[18] T. Zhang and V. S. Iyengar. Recommender systems using linear classifiers. Journal of Machine Learning Research, 2:313-334, 2002.
[19] S. Zhong and J. Ghosh. Generative model-based clustering of documents: a comparative study. Knowledge and Information Systems (KAIS), 8:374-384, 2005.
[20] L. Adamic and E. Adar. How to search a social network. 2004.
[21] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pages 5228-5235, 2004.
[22] H. Kautz, B. Selman, and M. Shah. Referral Web: Combining social networks and collaborative filtering. Communications of the ACM, 3:63-65, 1997.
[23] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207-216, 1993.
[24] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.
[25] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-177, 2004.

Ed Chang EMMDS 2009 64

Page 65: EdChang - Parallel Algorithms For Mining Large Scale Data

References (cont.)

[26] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of the 10th International World Wide Web Conference, pages 285-295, 2001.
[27] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-177, 2004.
[28] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. of the 10th International World Wide Web Conference, pages 285-295, 2001.
[29] M. Brand. Fast online SVD revisions for lightweight recommender systems. In Proc. of the 3rd SIAM International Conference on Data Mining, 2003.
[30] D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61-70, 1992.
[31] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proc. of the ACM Conference on Computer Supported Cooperative Work, pages 175-186, 1994.
[32] J. Konstan, et al. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77-87, 1997.
[33] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating "word of mouth". In Proc. of ACM CHI, 1:210-217, 1995.
[34] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7:76-80, 2003.
[35] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal, 42(1):177-196, 2001.
[36] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. of the International Joint Conference on Artificial Intelligence, 1999.
[37] http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/collaborativefiltering.html
[38] E. Y. Chang, et al. Parallelizing Support Vector Machines on Distributed Machines. NIPS, 2007.
[39] W.-Y. Chen, D. Zhang, and E. Y. Chang. Combinational Collaborative Filtering for personalized community recommendation. ACM KDD, 2008.
[40] Y. Sun, W.-Y. Chen, H. Bai, C.-J. Lin, and E. Y. Chang. Parallel Spectral Clustering. ECML, 2008.

Ed Chang EMMDS 2009 65