
Clustering of Web Documents

Jinfeng Chen

Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation-based Document Clustering using Web Logs, 2001.

Hua-Jun Zeng, Qicai He, Zheng Chen, Wei-Ying Ma and Jinwen Ma, Learning to Cluster Web Search Results

Correlation-based Document Clustering using Web Logs

Introduction

Using web log data to construct clusters.

Frequent simultaneous visits to two seemingly unrelated documents should indicate that they are in fact closely related.

The basic algorithm is DBSCAN, an algorithm that groups neighboring objects of the database into clusters based on local distance information.

DBSCAN

Does not require the user to pre-specify the number of clusters.

Requires only one scan through the database.

Takes two parameters: a radius value ε and a value Mpts.

ε - distance measure (radius)

Mpts - minimum number of points that should occur around a dense object

DBSCAN algorithm (cont'd)

Algorithm DBSCAN(DB, ε, MinPts)
  for each o belonging to DB do
    if o is not yet assigned to a cluster
      if o is a core-object then
        collect all objects density-reachable from o
        according to ε and MinPts
        assign them to a new cluster;
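
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the objects are given as a precomputed symmetric distance matrix, that an object counts itself among its ε-neighbours, and that objects belonging to no cluster are labelled -1. The names `region_query` and `dbscan` are hypothetical.

```python
from collections import deque

def region_query(dist, o, eps):
    """Indices of all objects within distance eps of object o (o included)."""
    return [p for p in range(len(dist)) if dist[o][p] <= eps]

def dbscan(dist, eps, min_pts):
    """Label each object with a cluster id; -1 marks noise (assumed convention)."""
    n = len(dist)
    labels = [None] * n                  # None = not yet assigned
    cluster_id = 0
    for o in range(n):
        if labels[o] is not None:
            continue
        neighbours = region_query(dist, o, eps)
        if len(neighbours) < min_pts:
            labels[o] = -1               # not a core object: provisional noise
            continue
        # o is a core object: collect everything density-reachable from it
        labels[o] = cluster_id
        queue = deque(neighbours)
        while queue:
            p = queue.popleft()
            if labels[p] == -1:
                labels[p] = cluster_id   # border object earlier marked as noise
            if labels[p] is not None:
                continue
            labels[p] = cluster_id
            p_neighbours = region_query(dist, p, eps)
            if len(p_neighbours) >= min_pts:
                queue.extend(p_neighbours)   # p is itself a core object
        cluster_id += 1
    return labels

# Toy one-dimensional data: two dense groups and one isolated point.
points = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in points] for a in points]
labels = dbscan(dist, eps=1.5, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Note that the number of clusters is never passed in; it falls out of ε and MinPts, which is the property the previous slide highlights.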

Limitations of DBSCAN in clustering of web documents

DBSCAN performs clustering using a fixed threshold value to determine “dense” regions in the document space.

Thus the algorithm often cannot distinguish between dense and loose points, and often the entire document space is lumped into a single cluster.

[Figure: clusters connected through a “bridge”]

RDBC algorithm (recursive density-based clustering)

The key difference between RDBC and DBSCAN is that in RDBC, the identification of core points is performed separately from the clustering of each individual data point.

Different values of ε and Mpts are used in RDBC to identify this core point set, Cset.

RDBC algorithm (cont'd)

To avoid connecting too many clusters through a “bridge”:

Set initial values ε = ε1 and Mpts = Mpts1;
WebPageSet = web_log
RDBC(ε, Mpts, WebPageSet) {
  use ε, Mpts to get the core point set Cset
  if size(Cset) > size(WebPageSet)/2
    { DBSCAN(ε, Mpts, WebPageSet) }
  else
    { ε = ε/2; Mpts = Mpts/4;
      RDBC(ε, Mpts, WebPageSet);
      Collect all other points in (WebPageSet - Cset)
      around clusters found in last step according to ε2
    }
}

Construct WebPageSet from web logs

Step 1

Step 2: Delete visits to image files.

Step 3: Extract sessions from the data.
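
Steps 2 and 3 could look roughly like the sketch below. The slides do not say how sessions are delimited, so several things here are assumptions for illustration only: log records are taken to be `(visitor, timestamp_seconds, url)` tuples, image files are recognised by the suffix list `IMAGE_SUFFIXES`, and a new session starts after a gap longer than `timeout` seconds. The function name `extract_sessions` is hypothetical.

```python
# Assumed list of image suffixes; a real log may need more.
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")

def extract_sessions(records, timeout=1800):
    """Drop image requests, then split each visitor's requests into sessions.

    records: iterable of (visitor, timestamp_seconds, url) tuples.
    Returns a list of sessions, each a list of URLs in request order.
    """
    by_visitor = {}
    for visitor, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        if url.lower().endswith(IMAGE_SUFFIXES):
            continue                      # Step 2: delete visits to image files
        by_visitor.setdefault(visitor, []).append((ts, url))

    sessions = []
    for visits in by_visitor.values():    # Step 3: extract sessions
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:    # long gap: assume a new session
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions

records = [
    ("a", 0, "/x.html"), ("a", 10, "/logo.gif"), ("a", 20, "/y.html"),
    ("a", 5000, "/z.html"), ("b", 3, "/x.html"),
]
sessions = extract_sessions(records)
print(sessions)  # [['/x.html', '/y.html'], ['/z.html'], ['/x.html']]
```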


Construct WebPageSet (cont'd)

Step 4: Create a distance matrix

1) Determine the size of a moving window, within which URL requests will be regarded as co-occurring.

2) Calculate the co-occurrence times Ni,j, and Ni, Nj of this pair of URLs.

[Figure: a moving window over a session containing pages Pi and Pj, showing the co-occurrence time and Ni]

Construct WebPageSet (cont'd)

Step 4: Create a distance matrix

3) P(pi | pj) = Ni,j / Nj

4) Three distance functions
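
Parts 1) to 3) of Step 4 can be illustrated with the sketch below. The exact counting rule is not recoverable from the slides, so this assumes one plausible reading: Nj is the number of requests for page pj, and Ni,j is the number of those requests that have a request for pi fewer than `window` positions away in the same session. The three distance functions themselves are not reproduced, since they did not survive in the slides. The function names are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(sessions, window):
    """Return (n, n_pair) under the counting rule assumed above.

    n[p]           : number of requests for page p
    n_pair[(i, j)] : requests for j with a request for i inside the window
    """
    n = Counter()
    n_pair = Counter()
    for session in sessions:
        for pos, pj in enumerate(session):
            n[pj] += 1
            lo, hi = max(0, pos - window + 1), pos + window
            nearby = set(session[lo:pos] + session[pos + 1:hi])
            for pi in nearby:
                if pi != pj:
                    n_pair[(pi, pj)] += 1
    return n, n_pair

def conditional_probability(n, n_pair, pi, pj):
    """P(pi | pj) = Ni,j / Nj, or 0.0 if pj was never requested."""
    return n_pair[(pi, pj)] / n[pj] if n[pj] else 0.0

sessions = [["A", "B", "C"], ["A", "C"], ["B"]]
n, n_pair = cooccurrence_counts(sessions, window=2)
print(conditional_probability(n, n_pair, "A", "B"))  # 0.5
```

With this counting rule Ni,j never exceeds Nj, so the estimate stays between 0 and 1; a larger window raises the counts and therefore the estimated correlation.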

Experimental Validation

Conclusions

A new algorithm for clustering web documents based only on the log data.

By changing the parameters intelligently during the recursive process, RDBC can give clustering results superior to those of DBSCAN.

Learning to Cluster Web Search Results

Introduction

This algorithm is based on salient phrases extracted from the documents' contents.

Fast enough to be used in an online calculation engine.

Characteristics of clustering web search results

Existing search engines such as Google, Yahoo and MSN often return a long list of search results.

Clustering of similar search results helps users find relevant results.

Clustered search results

Conventional search results

Procedure of the algorithm

Step 1: Search result fetching

Step 2: Document parsing and phrase property calculation

Step 3: Salient phrase ranking

Search result fetching

Input a query to a conventional web search engine.

Get the webpage of results returned by the engine.

Extract the titles and snippets.

Document parsing

Step 1: Cleaning
• Stemming (using Porter's algorithm)
• Sentence boundary identification

Step 2: Post-processing
• Punctuation elimination
• Filter out stop-words, e.g. ‘too’, ‘are’
• Filter out query words
• Ex: Microsoft software is available to students.
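
The parsing steps above might be sketched as follows. This is a simplified illustration under stated assumptions: stemming is left out (a Porter stemmer would be applied to each token in a fuller version), sentence boundaries are approximated by splitting on `.`, `!` and `?`, and `STOP_WORDS` is a tiny hypothetical list rather than a real stop-word lexicon. The query "microsoft" used in the example is also an assumption.

```python
import re

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"too", "are", "is", "to", "the", "a", "of"}

def split_sentences(text):
    """Rough sentence boundary identification on ., ! and ?."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def clean(sentence, query):
    """Lower-case, drop punctuation, stop-words and query words."""
    query_words = set(query.lower().split())
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in query_words]

snippet = "Microsoft software is available to students."
tokens = clean(snippet, query="microsoft")
print(tokens)  # ['software', 'available', 'students']
```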

Phrase property calculation

Five properties

1. Phrase Frequency/Inverted Document Frequency (TFIDF)

2. Phrase Length (LEN)

LEN = n, e.g. LEN(“big”) = 1
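
The first two properties can be sketched as below. The TFIDF formula on the slide did not survive extraction, so the code assumes the standard form, phrase frequency times log(N / document frequency), which may differ in detail from the paper's definition. Documents are assumed to be whitespace-tokenised strings, and the function names are hypothetical.

```python
import math

def phrase_length(phrase):
    """LEN: the number of words in the phrase."""
    return len(phrase.split())

def count_occurrences(words, tokens):
    """How many times the word sequence `words` appears in `tokens`."""
    n = len(words)
    return sum(1 for k in range(len(tokens) - n + 1) if tokens[k:k + n] == words)

def tfidf(phrase, documents):
    """Assumed form: total phrase frequency * log(N / documents containing it)."""
    words = phrase.split()
    counts = [count_occurrences(words, doc.split()) for doc in documents]
    tf = sum(counts)
    df = sum(1 for c in counts if c > 0)
    return tf * math.log(len(documents) / df) if df else 0.0

docs = ["big apple pie", "big data", "apple pie recipe", "small data"]
print(phrase_length("big"))        # 1
print(phrase_length("apple pie"))  # 2
score = tfidf("apple pie", docs)   # 2 * log(4 / 2)
```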

Phrase property calculation (cont'd)

3. Intra-Cluster Similarity (ICS)

o: centroid

Here di = {TFIDF1, TFIDF2, …}; each component of the vectors represents the TFIDF of a phrase.
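
The ICS formula itself is missing from the extracted slide. One reading consistent with the definitions above, assumed here purely for illustration, is the average cosine similarity between each document vector di and the centroid o of those vectors. The function names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of the document vectors (the centroid o)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def intra_cluster_similarity(vectors):
    """Assumed ICS: mean cosine similarity of each di to the centroid o."""
    o = centroid(vectors)
    return sum(cosine(d, o) for d in vectors) / len(vectors)

# Documents with identical TFIDF vectors are maximally compact (ICS close to 1);
# orthogonal vectors give a lower value.
tight = intra_cluster_similarity([[1.0, 2.0], [1.0, 2.0]])
loose = intra_cluster_similarity([[1.0, 0.0], [0.0, 1.0]])
```

Under this reading, a phrase whose documents are all similar to one another scores high, which makes it a better candidate for a cluster name.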

Phrase property calculation (cont'd)

4. Cluster Entropy (CE)

5. Phrase Independence (IND)

Ex: three “vectors” has… with some “vectors” be…

Learning to rank key phrases

Use a regression model to combine the above five properties, calculating a single salience score for each phrase.

Regression is a technique that tries to determine the relationship between two random variables x = (x1, x2, …, xn) and y.

Here x = (TFIDF, LEN, ICS, CE, IND)

Learning to rank key phrases

Three regression models

• Linear Regression

• Logistic Regression

• Support Vector Regression
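
As an illustration of the first option, the sketch below fits a linear regression by ordinary least squares in plain Python and then scores a phrase. It is a generic least-squares fit, not the authors' training setup: the function names are hypothetical, and the toy data uses two features with made-up values, whereas in the slides x would be the five properties (TFIDF, LEN, ICS, CE, IND) and y a salience label.

```python
def solve(a, b):
    """Solve the linear system a * w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):                        # back substitution
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w

def fit_linear(xs, ys):
    """Least-squares weights [b0, b1, ..., bn] for y = b0 + sum(bk * xk)."""
    rows = [[1.0] + list(x) for x in xs]                # prepend intercept term
    n = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    return solve(xtx, xty)                              # normal equations

def salience(weights, x):
    """Salience score of a phrase with feature vector x."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))

# Made-up toy data generated from y = 1 + 2*x1 + 3*x2.
xs = [(0, 0), (1, 0), (0, 1), (1, 1)]
ys = [1, 3, 4, 6]
weights = fit_linear(xs, ys)
score = salience(weights, (2, 2))
```

Once the weights are learned, ranking candidate phrases only requires evaluating `salience` for each one, which is cheap enough for online use.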

Evaluation

Conclusions

Recasts the search result clustering problem as a supervised salient phrase ranking problem.

Generates correct clusters with short names, and thus could improve users' browsing efficiency through search results.

Thanks!