Upload
gv-anirudh
View
221
Download
0
Embed Size (px)
Citation preview
8/8/2019 DM-ch-10-5(web data)
1/31
8/8/2019 DM-ch-10-5(web data)
2/31
1/8/2011 Data Mining: Principles and Algorithms 2
Web data
Web is huge for effective data warehousing anddata mining: Hundreds of terabytes
Complexity of web pages is far greater than that ofany traditional text document: Web pages lack uniform
structure, no indexing by category nor by title, author, coverpage, table of contents, etc.,
Web is a highly dynamic information source:Information is constantly updated. Linkage information andaccess records are also updated frequently.
Web serves a broad diversity of users: users mayhave different background, interests, usage purposes
Only a small portion of the information on the webis relevant or useful: 99% of the web information isuseless to 99% of web users.
8/8/2019 DM-ch-10-5(web data)
3/31
1/8/2011 Data Mining: Principles and Algorithms 3
Web data
Issues related to web mining:
Mining Web page layout structure
Mining Webs link structures
Mining Multimedia data on the web Automatic classification of web documents
Weblog mining
8/8/2019 DM-ch-10-5(web data)
4/31
1/8/2011 Data Mining: Principles and Algorithms 4
Outline
Background on Web Search
VIP
S (VIsion-based
Page Segmentation)
Block-based Web Search
Block-based Link Analysis
Web Image Search & Clustering
8/8/2019 DM-ch-10-5(web data)
5/31
1/8/2011 Data Mining: Principles and Algorithms 5
Search Engine Two Rank Functions
WebPages
Meta Data ForwardIndex
Inverted
Index
ForwardLink
Backward Link(Anchor Text)
Web TopologyGraph
Web Page Parser
Indexer
Anchor TextGenerator
Web GraphConstructor
Importance Ranking(Link Analysis)
Rank Functions
URLDictioanry
Term Dictionary(Lexicon)
Search
Relevance Ranking
Ranking based on link
structure analysis
Similarity
based on
content or text
8/8/2019 DM-ch-10-5(web data)
6/31
Inverted index- A data structure for supporting text queries
- like index in a book
Relevance Ranking
inverted index
aalborg 3452, 11437, .......arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
.
.
.
.
.zz 602, 1189, 3209, ...
disks with
documents
indexing
8/8/2019 DM-ch-10-5(web data)
7/31
1/8/2011 Data Mining: Principles and Algorithms 7
The PageRank Algorithm
More precisely: Link graph: adjacency matrixA,
Constructs a probability transition matrix M by renormalizing each
row ofA to sum to 1
Treat the web graph as a markov chain (random surfer)
The vector ofPageRank scoresp is then defined to be the
stationary distribution of this Markov chain. Equivalently, p is the
principal right eigenvector of the transition matrix
1
0ij
if page i links to page j
A otherwise
!
(1 ) 1/ ,ijU M U n for all i jI I !
( (1 ) )TU MI I
( (1 ) )TU M p pI I !
Basic idea significance of a page is
determined by the significance of
the pages linking to it
8/8/2019 DM-ch-10-5(web data)
8/31
1/8/2011 Data Mining: Principles and Algorithms 8
Layout Structure
Compared to plain text, a web page is a 2D presentation Rich visual effects created by different term types, formats,
separators, blank areas, colors, pictures, etc
Different parts of a page are not equally important
Title: CNN.com International
H1: IAEA: Iran had secret nuke agendaH3: EXPLOSIONS ROCK BAGHDAD
TEXT BODY (with position and fonttype): The International Atomic Energy
Agency has concluded that Iran hassecretly produced small amounts of
nuclear materials including low enricheduranium and plutonium that could be used
to develop nuclear weapons according to a
confidential report obtained by CNN
Hyperlink:
URL: http://www.cnn.com/...
Anchor Text: AI oaeda
Image:
URL: http://www.cnn.com/image/...
Alt & Caption: Iran nuclear
AnchorText: CNN Homepage News
8/8/2019 DM-ch-10-5(web data)
9/31
1/8/2011 Data Mining: Principles and Algorithms 9
Web Page BlockBetter Information Unit
Importance = Med
Importance = Low
Importance = High
Web Page Blocks
8/8/2019 DM-ch-10-5(web data)
10/31
1/8/2011 Data Mining: Principles and Algorithms 10
Motivation for VIPS (VIsion-basedPage Segmentation)
Problems of treating a web page as an atomic unit
Web page usually contains not only pure content Noise: navigation, decoration, interaction,
Multiple topics
Different parts of a page are not equally important Web page has internal structure
Two-dimension logical structure & Visual layoutpresentation
> Free text document < Structured document
Layout the 3rd dimension ofWeb page
1st dimension: content
2nd dimension: hyperlink
8/8/2019 DM-ch-10-5(web data)
11/31
1/8/2011 Data Mining: Principles and Algorithms 11
Is DOM a Good Representation of PageStructure?
Page segmentation usingDOM
Extract structural tagssuch as P, TABLE, UL,TITLE, H1~H6, etc
DOM is morerelated contentdisplay, does notnecessarily reflectsemantic structure
How about XML?
A long way to go toreplace the HTML
8/8/2019 DM-ch-10-5(web data)
12/31
1/8/2011 Data Mining: Principles and Algorithms 12
VIPS Algorithm
Motivation: In many cases, topics can be distinguished with visual clues. Such
as position, distance, font, color, etc.
Goal:
Extract the semantic structure of a web page based on its visual
presentation. Procedure:
Top-down partition the web page based on the separators
Result
A tree structure, each node in the tree corresponds to a block inthe page.
Each node will be assigned a value (Degree of Coherence) toindicate how coherent of the content in the block based on visualperception.
Each block will be assigned an importance value
Hierarchy or flat
8/8/2019 DM-ch-10-5(web data)
13/31
1/8/2011 Data Mining: Principles and Algorithms 13
VIPS: An Example
A hierarchical structure of layout block
A Degree of Coherence (DOC) is definedfor each block
Show the intra coherence of the block
DoCof child block must be no lessthan its parents
The Permitted Degree of Coherence
(P
DOC) can be pre-defined to achievedifferent granularities for the contentstructure
The segmentation will stop only whenall the blocks DoCis no less thanPDoC
The smaller the PDoC, the coarserthe content structure would be
8/8/2019 DM-ch-10-5(web data)
14/31
1/8/2011 Data Mining: Principles and Algorithms 14
Example of Web Page Segmentation (1)
( DOM Structure ) ( VIPS Structure )
8/8/2019 DM-ch-10-5(web data)
15/31
1/8/2011 Data Mining: Principles and Algorithms 15
Example of Web Page Segmentation (2)
Can be applied on web image retrieval
Surrounding text extraction
( DOM Structure ) ( VIPS Structure )
8/8/2019 DM-ch-10-5(web data)
16/31
1/8/2011 Data Mining: Principles and Algorithms 16
Web Page BlockBetter Information Unit
Page Segmentation
Vision based approach
Block Importance Modeling
Statistical learning
Importance = Med
Importance = Low
Importance = High
Web Page Blocks
8/8/2019 DM-ch-10-5(web data)
17/31
1/8/2011 Data Mining: Principles and Algorithms 17
Block-basedWeb Search
Index block instead of whole page
Block retrieval
Combing DocRank and BlockRank Block query expansion
Select expansion term from relevant blocks
8/8/2019 DM-ch-10-5(web data)
18/31
1/8/2011 Data Mining: Principles and Algorithms 18
Experiments
Dataset TREC 2001 Web Track
WT10g corpus (1.69 million pages), crawled at 1997.
50 queries (topics 501-550)
TREC 2002 Web Track
.GOV corpus (1.25 million pages), crawled at 2002. 49 queries (topics 551-560)
Retrieval System
Okapi, with weighting function BM2500
Preprocessing Stop-word list (about 220)
Do not use stemming
Do not consider phrase information
Tune the b, k1 and k3 to achieve the best baseline
8/8/2019 DM-ch-10-5(web data)
19/31
1/8/2011 Data Mining: Principles and Algorithms 19
Block Retrieval on TREC 2001 and TREC 2002
TREC 2001 Result TREC 2002 Result
0 0.2 0.4 0.6 0.8 115
15.5
16
16.5
17
17.5
18
Combin ing Parameter E
Avera
gePrecision(%)
VIPS (Block Retr ieval)Baseline (Doc Retr ieval)
0 0.2 0.4 0.6 0.8 113
13.5
14
14.5
15
15.5
16
16.5
17
Combining Parameter E
Avera
gePrecision(%)
VIP S (Block Retr ieval )Basel ine (Doc Retr ieval)
8/8/2019 DM-ch-10-5(web data)
20/31
1/8/2011 Data Mining: Principles and Algorithms 20
Query Expansion on TREC 2001 and TREC 2002
TREC 2001 Result TREC 2002 Result
3 5 10 20 30 12
14
16
18
20
22
24
Num ber o f b lock s/docs
AveragePrecision(%
)
B l ock Q E (V I PS )Fu l lDoc Q EBase l ine
3 5 10 20 30 10
12
14
16
18
Number of b locks/docs
Average
Precision(%)
Block QE (VIPS )Ful lDoc QEBasel ine
8/8/2019 DM-ch-10-5(web data)
21/31
1/8/2011 Data Mining: Principles and Algorithms 21
Block-level Link Analysis
C
A B
8/8/2019 DM-ch-10-5(web data)
22/31
1/8/2011 Data Mining: Principles and Algorithms 22
A Sample of User Browsing Behavior
8/8/2019 DM-ch-10-5(web data)
23/31
1/8/2011 Data Mining: Principles and Algorithms 23
Improving PageRank using Layout Structure
Z: block-to-page matrix (link structure)
X: page-to-block matrix (layout structure)
Block-level PageRank: Compute PageRank on the page-to-page graph
BlockRank:
Compute PageRank on the block-to-block graph
XZWP !
ZXWB !
!
other ise
pagepthetoblockbthefromlinkaisthereifsZ
thth
b
bp0
/1
functioni portanceblocktheisf
otherwise
pageptheinisblockbtheifbfX
thth
p
pb
!0
)(
8/8/2019 DM-ch-10-5(web data)
24/31
1/8/2011 Data Mining: Principles and Algorithms 24
Using Block-level PageRank to Improve Search
0 8 0 82 0 84 0 86 0 88 0 9 0 92 0 94 0 96 0 98 10 11 5
0 12
0 12 5
0 13
0 13 5
0 14
0 14 5
0 15
0 15 5
0 16
0 16 5
C o
bining
aa
ee
E
ve
age
ecision
L
R-Co
bina
io n
R-Co
bina
ion
Block-level PageRank achieves 15-25%improvement over PageRank (SIGIR04)
PageRank
Block-level
PageRank
Search = EIR_Score + (1- EPageRank
E
8/8/2019 DM-ch-10-5(web data)
25/31
1/8/2011 Data Mining: Principles and Algorithms 25
MiningWeb Images Using Layout &Link Structure (ACMMM04)
8/8/2019 DM-ch-10-5(web data)
26/31
1/8/2011 Data Mining: Principles and Algorithms 26
Image Graph Model & Spectral Analysis
Block-to-block graph:
Block-to-image matrix (container relation): Y
Image-to-image graph:
ImageRank Compute PageRank on the image graph
Image clustering
Graphical partitioning on the image graph
! other ise
bif Is
Y
iji
ij 0
1
YWYWI !
ZXWB !
8/8/2019 DM-ch-10-5(web data)
27/31
1/8/2011 Data Mining: Principles and Algorithms 27
ImageRank
Relevance Ranking Importance Ranking Combined Ranking
8/8/2019 DM-ch-10-5(web data)
28/31
1/8/2011 Data Mining: Principles and Algorithms 28
ImageRank vs. PageRank
Dataset
26.5 millions web pages
11.6 millions images
Query set 45 hot queries in Google image search statistics
Ground truth
Five volunteers were chosen to evaluate the top 100
results re-turned by the system (iFind) Ranking method
( ) ( ) (1 ) ( )importance relevances rank rank E E! x x x
8/8/2019 DM-ch-10-5(web data)
29/31
1/8/2011 Data Mining: Principles and Algorithms 29
ImageRank vs PageRank
Image search accuracy using ImageRankand PageRank. Both of them achieved theirbest results at E=0.25.
Image search accuracy (ImageRank vs.P
ageRank)
0.58
0.6
0.62
0.64
0.66
0.68
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
alpha
P@
0
ImageRank
PageRank
8/8/2019 DM-ch-10-5(web data)
30/31
1/8/2011 Data Mining: Principles and Algorithms 30
Example on Image Clustering &Embedding
1710 JPG images in 1287 pages are crawled within the websitehttp://www.yahooligans.com/content/animals/
Six Categories
Fish
Bird
MammalReptile
Amphibian Insect
8/8/2019 DM-ch-10-5(web data)
31/31
1/8/2011 Data Mining: Principles and Algorithms 31