DM-ch-10-5(web data)

Embed Size (px)

Citation preview

  • 8/8/2019 DM-ch-10-5(web data)

    1/31

  • 8/8/2019 DM-ch-10-5(web data)

    2/31

    1/8/2011 Data Mining: Principles and Algorithms 2

    Web data

    Web is huge for effective data warehousing anddata mining: Hundreds of terabytes

    Complexity of web pages is far greater than that ofany traditional text document: Web pages lack uniform

    structure, no indexing by category nor by title, author, coverpage, table of contents, etc.,

    Web is a highly dynamic information source:Information is constantly updated. Linkage information andaccess records are also updated frequently.

    Web serves a broad diversity of users: users mayhave different background, interests, usage purposes

    Only a small portion of the information on the webis relevant or useful: 99% of the web information isuseless to 99% of web users.

  • 8/8/2019 DM-ch-10-5(web data)

    3/31

    1/8/2011 Data Mining: Principles and Algorithms 3

    Web data

    Issues related to web mining:

    Mining Web page layout structure

    Mining Webs link structures

    Mining Multimedia data on the web Automatic classification of web documents

    Weblog mining

  • 8/8/2019 DM-ch-10-5(web data)

    4/31

    1/8/2011 Data Mining: Principles and Algorithms 4

    Outline

    Background on Web Search

    VIP

    S (VIsion-based

    Page Segmentation)

    Block-based Web Search

    Block-based Link Analysis

    Web Image Search & Clustering

  • 8/8/2019 DM-ch-10-5(web data)

    5/31

    1/8/2011 Data Mining: Principles and Algorithms 5

    Search Engine Two Rank Functions

    WebPages

    Meta Data ForwardIndex

    Inverted

    Index

    ForwardLink

    Backward Link(Anchor Text)

    Web TopologyGraph

    Web Page Parser

    Indexer

    Anchor TextGenerator

    Web GraphConstructor

    Importance Ranking(Link Analysis)

    Rank Functions

    URLDictioanry

    Term Dictionary(Lexicon)

    Search

    Relevance Ranking

    Ranking based on link

    structure analysis

    Similarity

    based on

    content or text

  • 8/8/2019 DM-ch-10-5(web data)

    6/31

    Inverted index- A data structure for supporting text queries

    - like index in a book

    Relevance Ranking

    inverted index

    aalborg 3452, 11437, .......arm 4, 19, 29, 98, 143, ...

    armada 145, 457, 789, ...

    armadillo 678, 2134, 3970, ...

    armani 90, 256, 372, 511, ...

    .

    .

    .

    .

    .zz 602, 1189, 3209, ...

    disks with

    documents

    indexing

  • 8/8/2019 DM-ch-10-5(web data)

    7/31

    1/8/2011 Data Mining: Principles and Algorithms 7

    The PageRank Algorithm

    More precisely: Link graph: adjacency matrixA,

    Constructs a probability transition matrix M by renormalizing each

    row ofA to sum to 1

    Treat the web graph as a markov chain (random surfer)

    The vector ofPageRank scoresp is then defined to be the

    stationary distribution of this Markov chain. Equivalently, p is the

    principal right eigenvector of the transition matrix

    1

    0ij

    if page i links to page j

    A otherwise

    !

    (1 ) 1/ ,ijU M U n for all i jI I !

    ( (1 ) )TU MI I

    ( (1 ) )TU M p pI I !

    Basic idea significance of a page is

    determined by the significance of

    the pages linking to it

  • 8/8/2019 DM-ch-10-5(web data)

    8/31

    1/8/2011 Data Mining: Principles and Algorithms 8

    Layout Structure

    Compared to plain text, a web page is a 2D presentation Rich visual effects created by different term types, formats,

    separators, blank areas, colors, pictures, etc

    Different parts of a page are not equally important

    Title: CNN.com International

    H1: IAEA: Iran had secret nuke agendaH3: EXPLOSIONS ROCK BAGHDAD

    TEXT BODY (with position and fonttype): The International Atomic Energy

    Agency has concluded that Iran hassecretly produced small amounts of

    nuclear materials including low enricheduranium and plutonium that could be used

    to develop nuclear weapons according to a

    confidential report obtained by CNN

    Hyperlink:

    URL: http://www.cnn.com/...

    Anchor Text: AI oaeda

    Image:

    URL: http://www.cnn.com/image/...

    Alt & Caption: Iran nuclear

    AnchorText: CNN Homepage News

  • 8/8/2019 DM-ch-10-5(web data)

    9/31

    1/8/2011 Data Mining: Principles and Algorithms 9

    Web Page BlockBetter Information Unit

    Importance = Med

    Importance = Low

    Importance = High

    Web Page Blocks

  • 8/8/2019 DM-ch-10-5(web data)

    10/31

    1/8/2011 Data Mining: Principles and Algorithms 10

    Motivation for VIPS (VIsion-basedPage Segmentation)

    Problems of treating a web page as an atomic unit

    Web page usually contains not only pure content Noise: navigation, decoration, interaction,

    Multiple topics

    Different parts of a page are not equally important Web page has internal structure

    Two-dimension logical structure & Visual layoutpresentation

    > Free text document < Structured document

    Layout the 3rd dimension ofWeb page

    1st dimension: content

    2nd dimension: hyperlink

  • 8/8/2019 DM-ch-10-5(web data)

    11/31

    1/8/2011 Data Mining: Principles and Algorithms 11

    Is DOM a Good Representation of PageStructure?

    Page segmentation usingDOM

    Extract structural tagssuch as P, TABLE, UL,TITLE, H1~H6, etc

    DOM is morerelated contentdisplay, does notnecessarily reflectsemantic structure

    How about XML?

    A long way to go toreplace the HTML

  • 8/8/2019 DM-ch-10-5(web data)

    12/31

    1/8/2011 Data Mining: Principles and Algorithms 12

    VIPS Algorithm

    Motivation: In many cases, topics can be distinguished with visual clues. Such

    as position, distance, font, color, etc.

    Goal:

    Extract the semantic structure of a web page based on its visual

    presentation. Procedure:

    Top-down partition the web page based on the separators

    Result

    A tree structure, each node in the tree corresponds to a block inthe page.

    Each node will be assigned a value (Degree of Coherence) toindicate how coherent of the content in the block based on visualperception.

    Each block will be assigned an importance value

    Hierarchy or flat

  • 8/8/2019 DM-ch-10-5(web data)

    13/31

    1/8/2011 Data Mining: Principles and Algorithms 13

    VIPS: An Example

    A hierarchical structure of layout block

    A Degree of Coherence (DOC) is definedfor each block

    Show the intra coherence of the block

    DoCof child block must be no lessthan its parents

    The Permitted Degree of Coherence

    (P

    DOC) can be pre-defined to achievedifferent granularities for the contentstructure

    The segmentation will stop only whenall the blocks DoCis no less thanPDoC

    The smaller the PDoC, the coarserthe content structure would be

  • 8/8/2019 DM-ch-10-5(web data)

    14/31

    1/8/2011 Data Mining: Principles and Algorithms 14

    Example of Web Page Segmentation (1)

    ( DOM Structure ) ( VIPS Structure )

  • 8/8/2019 DM-ch-10-5(web data)

    15/31

    1/8/2011 Data Mining: Principles and Algorithms 15

    Example of Web Page Segmentation (2)

    Can be applied on web image retrieval

    Surrounding text extraction

    ( DOM Structure ) ( VIPS Structure )

  • 8/8/2019 DM-ch-10-5(web data)

    16/31

    1/8/2011 Data Mining: Principles and Algorithms 16

    Web Page BlockBetter Information Unit

    Page Segmentation

    Vision based approach

    Block Importance Modeling

    Statistical learning

    Importance = Med

    Importance = Low

    Importance = High

    Web Page Blocks

  • 8/8/2019 DM-ch-10-5(web data)

    17/31

    1/8/2011 Data Mining: Principles and Algorithms 17

    Block-basedWeb Search

    Index block instead of whole page

    Block retrieval

    Combing DocRank and BlockRank Block query expansion

    Select expansion term from relevant blocks

  • 8/8/2019 DM-ch-10-5(web data)

    18/31

    1/8/2011 Data Mining: Principles and Algorithms 18

    Experiments

    Dataset TREC 2001 Web Track

    WT10g corpus (1.69 million pages), crawled at 1997.

    50 queries (topics 501-550)

    TREC 2002 Web Track

    .GOV corpus (1.25 million pages), crawled at 2002. 49 queries (topics 551-560)

    Retrieval System

    Okapi, with weighting function BM2500

    Preprocessing Stop-word list (about 220)

    Do not use stemming

    Do not consider phrase information

    Tune the b, k1 and k3 to achieve the best baseline

  • 8/8/2019 DM-ch-10-5(web data)

    19/31

    1/8/2011 Data Mining: Principles and Algorithms 19

    Block Retrieval on TREC 2001 and TREC 2002

    TREC 2001 Result TREC 2002 Result

    0 0.2 0.4 0.6 0.8 115

    15.5

    16

    16.5

    17

    17.5

    18

    Combin ing Parameter E

    Avera

    gePrecision(%)

    VIPS (Block Retr ieval)Baseline (Doc Retr ieval)

    0 0.2 0.4 0.6 0.8 113

    13.5

    14

    14.5

    15

    15.5

    16

    16.5

    17

    Combining Parameter E

    Avera

    gePrecision(%)

    VIP S (Block Retr ieval )Basel ine (Doc Retr ieval)

  • 8/8/2019 DM-ch-10-5(web data)

    20/31

    1/8/2011 Data Mining: Principles and Algorithms 20

    Query Expansion on TREC 2001 and TREC 2002

    TREC 2001 Result TREC 2002 Result

    3 5 10 20 30 12

    14

    16

    18

    20

    22

    24

    Num ber o f b lock s/docs

    AveragePrecision(%

    )

    B l ock Q E (V I PS )Fu l lDoc Q EBase l ine

    3 5 10 20 30 10

    12

    14

    16

    18

    Number of b locks/docs

    Average

    Precision(%)

    Block QE (VIPS )Ful lDoc QEBasel ine

  • 8/8/2019 DM-ch-10-5(web data)

    21/31

    1/8/2011 Data Mining: Principles and Algorithms 21

    Block-level Link Analysis

    C

    A B

  • 8/8/2019 DM-ch-10-5(web data)

    22/31

    1/8/2011 Data Mining: Principles and Algorithms 22

    A Sample of User Browsing Behavior

  • 8/8/2019 DM-ch-10-5(web data)

    23/31

    1/8/2011 Data Mining: Principles and Algorithms 23

    Improving PageRank using Layout Structure

    Z: block-to-page matrix (link structure)

    X: page-to-block matrix (layout structure)

    Block-level PageRank: Compute PageRank on the page-to-page graph

    BlockRank:

    Compute PageRank on the block-to-block graph

    XZWP !

    ZXWB !

    !

    other ise

    pagepthetoblockbthefromlinkaisthereifsZ

    thth

    b

    bp0

    /1

    functioni portanceblocktheisf

    otherwise

    pageptheinisblockbtheifbfX

    thth

    p

    pb

    !0

    )(

  • 8/8/2019 DM-ch-10-5(web data)

    24/31

    1/8/2011 Data Mining: Principles and Algorithms 24

    Using Block-level PageRank to Improve Search

    0 8 0 82 0 84 0 86 0 88 0 9 0 92 0 94 0 96 0 98 10 11 5

    0 12

    0 12 5

    0 13

    0 13 5

    0 14

    0 14 5

    0 15

    0 15 5

    0 16

    0 16 5

    C o

    bining

    aa

    ee

    E

    ve

    age

    ecision

    L

    R-Co

    bina

    io n

    R-Co

    bina

    ion

    Block-level PageRank achieves 15-25%improvement over PageRank (SIGIR04)

    PageRank

    Block-level

    PageRank

    Search = EIR_Score + (1- EPageRank

    E

  • 8/8/2019 DM-ch-10-5(web data)

    25/31

    1/8/2011 Data Mining: Principles and Algorithms 25

    MiningWeb Images Using Layout &Link Structure (ACMMM04)

  • 8/8/2019 DM-ch-10-5(web data)

    26/31

    1/8/2011 Data Mining: Principles and Algorithms 26

    Image Graph Model & Spectral Analysis

    Block-to-block graph:

    Block-to-image matrix (container relation): Y

    Image-to-image graph:

    ImageRank Compute PageRank on the image graph

    Image clustering

    Graphical partitioning on the image graph

    ! other ise

    bif Is

    Y

    iji

    ij 0

    1

    YWYWI !

    ZXWB !

  • 8/8/2019 DM-ch-10-5(web data)

    27/31

    1/8/2011 Data Mining: Principles and Algorithms 27

    ImageRank

    Relevance Ranking Importance Ranking Combined Ranking

  • 8/8/2019 DM-ch-10-5(web data)

    28/31

    1/8/2011 Data Mining: Principles and Algorithms 28

    ImageRank vs. PageRank

    Dataset

    26.5 millions web pages

    11.6 millions images

    Query set 45 hot queries in Google image search statistics

    Ground truth

    Five volunteers were chosen to evaluate the top 100

    results re-turned by the system (iFind) Ranking method

    ( ) ( ) (1 ) ( )importance relevances rank rank E E! x x x

  • 8/8/2019 DM-ch-10-5(web data)

    29/31

    1/8/2011 Data Mining: Principles and Algorithms 29

    ImageRank vs PageRank

    Image search accuracy using ImageRankand PageRank. Both of them achieved theirbest results at E=0.25.

    Image search accuracy (ImageRank vs.P

    ageRank)

    0.58

    0.6

    0.62

    0.64

    0.66

    0.68

    0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

    alpha

    P@

    0

    ImageRank

    PageRank

  • 8/8/2019 DM-ch-10-5(web data)

    30/31

    1/8/2011 Data Mining: Principles and Algorithms 30

    Example on Image Clustering &Embedding

    1710 JPG images in 1287 pages are crawled within the websitehttp://www.yahooligans.com/content/animals/

    Six Categories

    Fish

    Bird

    MammalReptile

    Amphibian Insect

  • 8/8/2019 DM-ch-10-5(web data)

    31/31

    1/8/2011 Data Mining: Principles and Algorithms 31