DM-ch-10-5(web data)

8/8/2019 DM-ch-10-5(web data)

1/31

8/8/2019 DM-ch-10-5(web data)

2/31

1/8/2011 Data Mining: Principles and Algorithms 2

Web data

Web is huge for effective data warehousing anddata mining: Hundreds of terabytes

Complexity of web pages is far greater than that ofany traditional text document: Web pages lack uniform

structure, no indexing by category nor by title, author, coverpage, table of contents, etc.,

Web is a highly dynamic information source:Information is constantly updated. Linkage information andaccess records are also updated frequently.

Web serves a broad diversity of users: users mayhave different background, interests, usage purposes

Only a small portion of the information on the webis relevant or useful: 99% of the web information isuseless to 99% of web users.

8/8/2019 DM-ch-10-5(web data)

3/31


Web data

Issues related to web mining:

Mining Web page layout structure

Mining Webs link structures

Mining Multimedia data on the web Automatic classification of web documents

Weblog mining

8/8/2019 DM-ch-10-5(web data)

4/31


Outline

Background on Web Search

VIP

S (VIsion-based

Page Segmentation)

Block-based Web Search

Block-based Link Analysis

Web Image Search & Clustering

8/8/2019 DM-ch-10-5(web data)

5/31


Search Engine Two Rank Functions

WebPages

Meta Data ForwardIndex

Inverted

Index

ForwardLink

Backward Link(Anchor Text)

Web TopologyGraph

Web Page Parser

Indexer

Anchor TextGenerator

Web GraphConstructor

Importance Ranking(Link Analysis)

Rank Functions

URLDictioanry

Term Dictionary(Lexicon)

Search

Relevance Ranking

Ranking based on link

structure analysis

Similarity

based on

content or text

8/8/2019 DM-ch-10-5(web data)

6/31

Inverted index- A data structure for supporting text queries

- like index in a book

Relevance Ranking

inverted index

aalborg 3452, 11437, .......arm 4, 19, 29, 98, 143, ...

armada 145, 457, 789, ...

armadillo 678, 2134, 3970, ...

armani 90, 256, 372, 511, ...

.

.

.

.

.zz 602, 1189, 3209, ...

disks with

documents

indexing

8/8/2019 DM-ch-10-5(web data)

7/31


The PageRank Algorithm

More precisely: Link graph: adjacency matrixA,

Constructs a probability transition matrix M by renormalizing each

row ofA to sum to 1

Treat the web graph as a markov chain (random surfer)

The vector ofPageRank scoresp is then defined to be the

stationary distribution of this Markov chain. Equivalently, p is the

principal right eigenvector of the transition matrix

1

0ij

if page i links to page j

A otherwise

!

(1 ) 1/ ,ijU M U n for all i jI I !

( (1 ) )TU MI I

( (1 ) )TU M p pI I !

Basic idea significance of a page is

determined by the significance of

the pages linking to it

8/8/2019 DM-ch-10-5(web data)

8/31


Layout Structure

Compared to plain text, a web page is a 2D presentation Rich visual effects created by different term types, formats,

separators, blank areas, colors, pictures, etc

Different parts of a page are not equally important

Title: CNN.com International

H1: IAEA: Iran had secret nuke agendaH3: EXPLOSIONS ROCK BAGHDAD

TEXT BODY (with position and fonttype): The International Atomic Energy

Agency has concluded that Iran hassecretly produced small amounts of

nuclear materials including low enricheduranium and plutonium that could be used

to develop nuclear weapons according to a

confidential report obtained by CNN

Hyperlink:

URL: http://www.cnn.com/...

Anchor Text: AI oaeda

Image:

URL: http://www.cnn.com/image/...

Alt & Caption: Iran nuclear

AnchorText: CNN Homepage News

8/8/2019 DM-ch-10-5(web data)

9/31


Web Page BlockBetter Information Unit

Importance = Med

Importance = Low

Importance = High

Web Page Blocks

8/8/2019 DM-ch-10-5(web data)

10/31


Motivation for VIPS (VIsion-basedPage Segmentation)

Problems of treating a web page as an atomic unit

Web page usually contains not only pure content Noise: navigation, decoration, interaction,

Multiple topics

Different parts of a page are not equally important Web page has internal structure

Two-dimension logical structure & Visual layoutpresentation

> Free text document < Structured document

Layout the 3rd dimension ofWeb page

1st dimension: content

2nd dimension: hyperlink

8/8/2019 DM-ch-10-5(web data)

11/31


Is DOM a Good Representation of PageStructure?

Page segmentation usingDOM

Extract structural tagssuch as P, TABLE, UL,TITLE, H1~H6, etc

DOM is morerelated contentdisplay, does notnecessarily reflectsemantic structure

How about XML?

A long way to go toreplace the HTML

8/8/2019 DM-ch-10-5(web data)

12/31


VIPS Algorithm

Motivation: In many cases, topics can be distinguished with visual clues. Such

as position, distance, font, color, etc.

Goal:

Extract the semantic structure of a web page based on its visual

presentation. Procedure:

Top-down partition the web page based on the separators

Result

A tree structure, each node in the tree corresponds to a block inthe page.

Each node will be assigned a value (Degree of Coherence) toindicate how coherent of the content in the block based on visualperception.

Each block will be assigned an importance value

Hierarchy or flat

8/8/2019 DM-ch-10-5(web data)

13/31


VIPS: An Example

A hierarchical structure of layout block

A Degree of Coherence (DOC) is definedfor each block

Show the intra coherence of the block

DoCof child block must be no lessthan its parents

The Permitted Degree of Coherence

(P

DOC) can be pre-defined to achievedifferent granularities for the contentstructure

The segmentation will stop only whenall the blocks DoCis no less thanPDoC

The smaller the PDoC, the coarserthe content structure would be

8/8/2019 DM-ch-10-5(web data)

14/31


Example of Web Page Segmentation (1)

( DOM Structure ) ( VIPS Structure )

8/8/2019 DM-ch-10-5(web data)

15/31


Example of Web Page Segmentation (2)

Can be applied on web image retrieval

Surrounding text extraction

( DOM Structure ) ( VIPS Structure )

8/8/2019 DM-ch-10-5(web data)

16/31


Web Page BlockBetter Information Unit

Page Segmentation

Vision based approach

Block Importance Modeling

Statistical learning

Importance = Med

Importance = Low

Importance = High

Web Page Blocks

8/8/2019 DM-ch-10-5(web data)

17/31


Block-basedWeb Search

Index block instead of whole page

Block retrieval

Combing DocRank and BlockRank Block query expansion

Select expansion term from relevant blocks

8/8/2019 DM-ch-10-5(web data)

18/31


Experiments

Dataset TREC 2001 Web Track

WT10g corpus (1.69 million pages), crawled at 1997.

50 queries (topics 501-550)

TREC 2002 Web Track

.GOV corpus (1.25 million pages), crawled at 2002. 49 queries (topics 551-560)

Retrieval System

Okapi, with weighting function BM2500

Preprocessing Stop-word list (about 220)

Do not use stemming

Do not consider phrase information

Tune the b, k1 and k3 to achieve the best baseline

8/8/2019 DM-ch-10-5(web data)

19/31


Block Retrieval on TREC 2001 and TREC 2002

TREC 2001 Result TREC 2002 Result

0 0.2 0.4 0.6 0.8 115

15.5

16

16.5

17

17.5

18

Combin ing Parameter E

Avera

gePrecision(%)

VIPS (Block Retr ieval)Baseline (Doc Retr ieval)

0 0.2 0.4 0.6 0.8 113

13.5

14

14.5

15

15.5

16

16.5

17

Combining Parameter E

Avera

gePrecision(%)

VIP S (Block Retr ieval )Basel ine (Doc Retr ieval)

8/8/2019 DM-ch-10-5(web data)

20/31


Query Expansion on TREC 2001 and TREC 2002

TREC 2001 Result TREC 2002 Result

3 5 10 20 30 12

14

16

18

20

22

24

Num ber o f b lock s/docs

AveragePrecision(%

)

B l ock Q E (V I PS )Fu l lDoc Q EBase l ine

3 5 10 20 30 10

12

14

16

18

Number of b locks/docs

Average

Precision(%)

Block QE (VIPS )Ful lDoc QEBasel ine

8/8/2019 DM-ch-10-5(web data)

21/31


Block-level Link Analysis

C

A B

8/8/2019 DM-ch-10-5(web data)

22/31


A Sample of User Browsing Behavior

8/8/2019 DM-ch-10-5(web data)

23/31


Improving PageRank using Layout Structure

Z: block-to-page matrix (link structure)

X: page-to-block matrix (layout structure)

Block-level PageRank: Compute PageRank on the page-to-page graph

BlockRank:

Compute PageRank on the block-to-block graph

XZWP !

ZXWB !

!

other ise

pagepthetoblockbthefromlinkaisthereifsZ

thth

b

bp0

/1

functioni portanceblocktheisf

otherwise

pageptheinisblockbtheifbfX

thth

p

pb

!0

)(

8/8/2019 DM-ch-10-5(web data)

24/31


Using Block-level PageRank to Improve Search

0 8 0 82 0 84 0 86 0 88 0 9 0 92 0 94 0 96 0 98 10 11 5

0 12

0 12 5

0 13

0 13 5

0 14

0 14 5

0 15

0 15 5

0 16

0 16 5

C o

bining

aa

ee

E

ve

age

ecision

L

R-Co

bina

io n

R-Co

bina

ion

Block-level PageRank achieves 15-25%improvement over PageRank (SIGIR04)

PageRank

Block-level

PageRank

Search = EIR_Score + (1- EPageRank

E

8/8/2019 DM-ch-10-5(web data)

25/31


MiningWeb Images Using Layout &Link Structure (ACMMM04)

8/8/2019 DM-ch-10-5(web data)

26/31


Image Graph Model & Spectral Analysis

Block-to-block graph:

Block-to-image matrix (container relation): Y

Image-to-image graph:

ImageRank Compute PageRank on the image graph

Image clustering

Graphical partitioning on the image graph

! other ise

bif Is

Y

iji

ij 0

1

YWYWI !

ZXWB !

8/8/2019 DM-ch-10-5(web data)

27/31


ImageRank

Relevance Ranking Importance Ranking Combined Ranking

8/8/2019 DM-ch-10-5(web data)

28/31


ImageRank vs. PageRank

Dataset

26.5 millions web pages

11.6 millions images

Query set 45 hot queries in Google image search statistics

Ground truth

Five volunteers were chosen to evaluate the top 100

results re-turned by the system (iFind) Ranking method

( ) ( ) (1 ) ( )importance relevances rank rank E E! x x x

8/8/2019 DM-ch-10-5(web data)

29/31


ImageRank vs PageRank

Image search accuracy using ImageRankand PageRank. Both of them achieved theirbest results at E=0.25.

Image search accuracy (ImageRank vs.P

ageRank)

0.58

0.6

0.62

0.64

0.66

0.68

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

alpha

P@

0

ImageRank

PageRank

8/8/2019 DM-ch-10-5(web data)

30/31


Example on Image Clustering &Embedding

1710 JPG images in 1287 pages are crawled within the websitehttp://www.yahooligans.com/content/animals/

Six Categories

Fish

Bird

MammalReptile

Amphibian Insect

8/8/2019 DM-ch-10-5(web data)

31/31


Documents

DM-ch-10-5(web data)