Data.Mining.C.8(Ii).Web Mining 570802461

1

Chapter 8.

Mining Complex Types of Data (II)

--Web Mining--

2

Chapter 8. Mining Complex Types of Data (II)

• Introduction to Web mining

• Web Structure Analysis

• PageRank

• HITS Approach

• Summary

3

Mining the World-Wide Web

• The WWW is a huge, widely distributed, global information service centre for
– Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
– Rich and dynamic hyperlink information
– Access and usage information for Web pages
• The WWW provides rich sources for data/text mining
• Challenges
– Too huge for effective data/text warehousing and mining
– Too complex and heterogeneous: no standards and structure

4

Web Mining: A Challenging Task

• Research on
– Web access patterns
– Web structures and regularity
– Web contents
• Problems
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources, majority of data in DBMS
– Limited query interface based on keyword-oriented search
– Etc.

5

Web Mining Taxonomy

• Web Mining
– Web Structure Mining
– Web Content Mining: Web Page Content Mining, Search Result Mining
– Web Usage Mining: General Access Pattern Tracking, Customized Usage Tracking

6

Mining the World-Wide Web

Web Content Mining: Web Page Content Mining
• WebLog (Lakshmanan et al., 1996), WebOQL (Mendelzon et al., 1998), …: can identify information within given Web pages
• Ahoy! (Etzioni et al., 1997): uses heuristics to distinguish personal home pages from other Web pages
• ShopBot (Etzioni et al., 1997): looks for product prices within Web pages

7

Web Content Mining: Search Result Mining
• Search engine result summarization
– Clustering search results (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles

8

Web Usage Mining: General Access Pattern Tracking
• Web log mining (Zaïane, Xin and Han, 1998)
– Uses data mining techniques to understand general access patterns and trends
– Can shed light on better structure and grouping of resource providers



9

Web Usage Mining: Customized Usage Tracking
• Adaptive sites (Perkowitz and Etzioni, 1997)
– Analyse the access patterns of one user at a time
– The Web site restructures itself automatically by learning from user access patterns

10

Web Structure Mining
• Web structure mining using links
– HITS (Kleinberg, 1998)
– PageRank (Sergey Brin and Larry Page, 1998)
• The large amount of Web linkage information provides rich information about the relevance, quality and structure of the Web’s content
• Uses the interconnections between Web pages to give weight to pages

11




• PageRank

• HITS Approach

• Summary

12

Introduction

• Early search engines mainly compared the similarity of the query and the indexed pages
– i.e., they used information retrieval methods such as cosine similarity
• From 1996, it became clear that similarity alone was no longer sufficient
– The number of pages grew rapidly in the mid-to-late 1990s
– For a single query, Google may estimate 10 million relevant pages
– How can only 30-40 pages be chosen and ranked suitably for presentation to the user?

13

Web Structure Analysis

• Starting around 1996, researchers began to work on the problem. They resorted to hyperlinks.
• Web pages are connected through hyperlinks, which carry important information
– Some hyperlinks organize information within the same site
– Other hyperlinks point to pages on other Web sites. Such out-going hyperlinks often indicate an implicit conveyance of authority to the pages being pointed to
• Pages that are pointed to by many other pages are likely to contain authoritative information

14

Web Structure Analysis

• During 1997-1998, the two most influential hyperlink-based search algorithms, PageRank and HITS, were reported
• Both algorithms exploit the hyperlinks of the Web to rank pages
– PageRank: Sergey Brin and Larry Page, PhD students at Stanford University, at the Seventh International World Wide Web Conference (WWW) in April 1998
– HITS: Jon Kleinberg (Cornell University), at the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998

15


• Introduction Web mining


• PageRank

• HITS Approach

• Summary

16

Background: Social Network Analysis

• Social network analysis: the study of social entities (people in an organization), called actors, and their interactions and relationships
• Interactions/relationships are represented by a network or graph
– each vertex (or node): an actor
– each link: a relationship
• From the network, we can study
– properties of its structure
– the role, position and prestige of each actor
• Communities: various kinds of sub-graphs, formed by groups of actors

17

Social Network and the Web

• Web: viewed as a virtual social network

– Each page: actor

– each hyperlink: relationship.

• Results from social network can be adapted and extended for use in the Web context.

• Two types of social network analysis, centrality and prestige, are closely related to hyperlink analysis and search on the Web.

18

Centrality

• An actor with extensive contacts (links) or communications with many other actors in the organization is considered more important than an actor with relatively fewer contacts.

• Central actor: one involved in many links.

19

Measure of Centrality

• Network: viewed as a directed graph
• In-links of actor i: links pointing to i
• Out-links of actor i: links pointing out from i
• The simple degree centrality of actor i:

$$C(i) = \frac{d_{out}(i)}{n-1}$$

where $d_{out}(i)$ is the number of out-links of actor i and n is the total number of actors in the network
• Dividing by n-1 standardizes the centrality value into the range [0, 1]

20

Prestige

• Prestige: a more refined measure of the prominence of an actor than centrality
• Prestigious actor: one who is the recipient of extensive ties; only in-links are used
• Difference between centrality and prestige:
– centrality focuses on out-links
– prestige focuses on in-links

21

Measure of Prestige

• In-links of actor i: links pointing to i

• The simple degree prestige of actor i:

$$P(i) = \frac{d_{in}(i)}{n-1}$$

where $d_{in}(i)$ is the number of in-links of actor i and n is the total number of actors in the network
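The two degree-based measures can be computed side by side. The following is a minimal sketch, assuming actors are numbered 0 to n-1 and links are given as (source, target) pairs; the function name and the four-actor example network are hypothetical and not part of the slides.

```python
def degree_centrality_and_prestige(n, links):
    """Return (centrality, prestige) lists for actors 0..n-1.

    links: iterable of (source, target) pairs; duplicate links are ignored.
    """
    d_out = [0] * n
    d_in = [0] * n
    for source, target in set(links):
        d_out[source] += 1   # out-link counts towards centrality
        d_in[target] += 1    # in-link counts towards prestige
    centrality = [d / (n - 1) for d in d_out]
    prestige = [d / (n - 1) for d in d_in]
    return centrality, prestige


# Hypothetical network: actor 0 points to everyone, actor 3 is pointed to by everyone.
links = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
centrality, prestige = degree_centrality_and_prestige(4, links)
```

In this example actor 0 has the highest centrality and actor 3 the highest prestige, which illustrates the out-link versus in-link distinction.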

22

Rank Prestige

• Rank prestige forms the basis of most Web page link analysis algorithms, including PageRank
• In the real world, a person i chosen by an important person is more prestigious than one chosen by a less important person
– For example, a vote from a company CEO counts for much more than a vote from a worker
• If one’s circle of influence is full of prestigious actors, then one’s own prestige is also high
– Thus one’s prestige is affected by the ranks or statuses of the involved actors

23

Measure of Rank Prestige

• Rank prestige $P_{Rank}(i)$: a linear combination of the links that point to i:

$$P_{Rank}(i) = A_{1i} P_{Rank}(1) + A_{2i} P_{Rank}(2) + \dots + A_{ni} P_{Rank}(n)$$

where $A_{ji} = 1$ if j points to i, and 0 otherwise
• We have n equations for n actors. Writing the rank prestige values as a column vector P, the equations become

$$P = A^T P$$

• A: the adjacency matrix of the network (graph), where $A_{ij} = 1$ if i points to j, and 0 otherwise

24

Intuitive Idea for Rank Prestige

• A hyperlink from one page to another is an implicit conveyance of authority to the target page
– The more in-links a page i receives, the more prestige page i has
• Pages that point to page i also have their own prestige scores
– A page of higher prestige pointing to i is more important than a page of lower prestige pointing to i
– In other words, a page is important if it is pointed to by other important pages
• This is exactly the idea of rank prestige in social networks

25

PageRank Algorithm

• According to rank prestige, the importance of page i (i’s PageRank score) is the sum of the PageRank scores of all pages that point to i, each divided by the number of out-links of the pointing page
• The Web is treated as a directed graph G = (V, E). Let the total number of pages be n. The PageRank score of page i (denoted by P(i)) is defined by:

$$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$

where $O_j$ is the number of out-links of page j

26

Matrix Notation

• Let P be an n-dimensional column vector of PageRank values, i.e., $P = (P(1), P(2), \dots, P(n))^T$
• Let A be the adjacency matrix of our graph with

$$A_{ij} = \begin{cases} \dfrac{1}{O_i} & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}$$

Here we use $O_i$ to denote the number of out-links of node i
• Each transition probability is $1/O_i$ if we assume the Web surfer clicks the hyperlinks in page i uniformly at random
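The definition of A can be turned into a small routine that builds the matrix from an edge list. This is a sketch under stated assumptions: pages are numbered 0 to n-1, exact fractions are used for readability, and the function name and three-page example graph are hypothetical.

```python
from fractions import Fraction


def transition_matrix(n, edges):
    """Build A with A[i][j] = 1/O_i if (i, j) is an edge, and 0 otherwise."""
    out_links = [set() for _ in range(n)]
    for i, j in edges:
        out_links[i].add(j)
    A = [[Fraction(0)] * n for _ in range(n)]
    for i, targets in enumerate(out_links):
        for j in targets:
            A[i][j] = Fraction(1, len(targets))  # uniform choice among out-links
    return A


# Hypothetical graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = transition_matrix(3, edges)
```

Every page in this example has at least one out-link, so each row of A sums to 1. A page without out-links would produce an all-zero row.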

27

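Transition Probability Matrix

• Let A be the state transition probability matrix

$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{pmatrix}$$

• $A_{ij}$: the transition probability that the surfer in state i (page i) will move to state j (page j)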

28

Let us start…

• Given an initial probability distribution vector that a surfer is at each state (or page)
– $p_0 = (p_0(1), p_0(2), \dots, p_0(n))^T$ (a column vector) and
– an n × n transition probability matrix A,
we have

$$\sum_{i=1}^{n} p_0(i) = 1$$

$$\sum_{j=1}^{n} A_{ij} = 1$$

29

Random Surfer

• State transition:

$$p_1(j) = \sum_{i=1}^{n} A_{ij}(1)\, p_0(i)$$

• where $A_{ij}(1)$ is the probability of going from i to j after 1 transition. In matrix form we can write

$$P_1 = A^T P_0$$

• In general, the probability distribution after k steps/transitions:

$$P_k = A^T P_{k-1}$$
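The state transition can be traced by hand on a tiny chain. The sketch below assumes a hypothetical three-page transition matrix whose rows each sum to 1, and a surfer who starts at page 0; the function names are illustrative only.

```python
from fractions import Fraction as F


def step(A, p):
    """One transition: returns A^T p, i.e. new_p[j] = sum_i A[i][j] * p[i]."""
    n = len(A)
    return [sum(A[i][j] * p[i] for i in range(n)) for j in range(n)]


def distribution_after(A, p0, k):
    """Probability distribution after k transitions, starting from p0."""
    p = list(p0)
    for _ in range(k):
        p = step(A, p)
    return p


# Hypothetical chain: page 0 -> {1, 2}, page 1 -> 2, page 2 -> 0.
A = [
    [F(0), F(1, 2), F(1, 2)],
    [F(0), F(0), F(1)],
    [F(1), F(0), F(0)],
]
p0 = [F(1), F(0), F(0)]  # the surfer starts at page 0
p1 = distribution_after(A, p0, 1)
p2 = distribution_after(A, p0, 2)
```

Because every row of A sums to 1, each resulting vector is again a probability distribution.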

30

An Example Web Hyperlink Graph

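$$A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{pmatrix}$$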

31

Improved PageRank

• At a page, the random surfer has two options
– With probability d, he randomly chooses an out-link to follow
– With probability 1-d, he jumps to a random page
• Improved model:

$$P = \left( (1-d)\frac{E}{n} + d A^T \right) P$$

where $E = ee^T$ (e is a column vector of all 1’s), so E is an n × n square matrix of all 1’s

32

Follow the Above Example

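$$(1-d)\frac{E}{n} + d A^T = \begin{pmatrix} 1/60 & 7/15 & 1/60 & 1/60 & 1/6 & 1/60 \\ 7/15 & 1/60 & 11/12 & 1/60 & 1/6 & 1/60 \\ 7/15 & 7/15 & 1/60 & 19/60 & 1/6 & 1/60 \\ 1/60 & 1/60 & 1/60 & 1/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 1/60 \end{pmatrix}$$

These entries correspond to d = 0.9, with the all-zero row of A (page 5, which has no out-links) replaced by 1/6 in every column.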
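The entries of this example matrix can be checked with exact fractions. The sketch below assumes d = 0.9 and assumes that the row of the page without out-links (page 5) is replaced by the uniform value 1/6; both assumptions are inferred from the numbers in the example, and the function name is hypothetical.

```python
from fractions import Fraction as F

# Transition matrix of the six-page example. Row index 4 (page 5) has no
# out-links, so it is assumed here to be replaced by the uniform value 1/6.
A = [
    [F(0), F(1, 2), F(1, 2), F(0), F(0), F(0)],
    [F(1, 2), F(0), F(1, 2), F(0), F(0), F(0)],
    [F(0), F(1), F(0), F(0), F(0), F(0)],
    [F(0), F(0), F(1, 3), F(0), F(1, 3), F(1, 3)],
    [F(1, 6)] * 6,
    [F(0), F(0), F(0), F(1, 2), F(1, 2), F(0)],
]


def improved_matrix(A, d):
    """Return (1 - d) * E / n + d * A^T as a list of rows."""
    n = len(A)
    return [[(1 - d) / n + d * A[j][i] for j in range(n)] for i in range(n)]


G = improved_matrix(A, F(9, 10))  # assumed damping factor d = 0.9
```

Each column of G sums to 1, so multiplying a probability vector by G yields another probability vector.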

33

Final PageRank Algorithm

$$P = (1-d)e + d A^T P$$

• PageRank for each page i is

$$P(i) = (1-d) + d \sum_{j=1}^{n} A_{ji} P(j)$$

34

Final PageRank Algorithm

• This is equivalent to the formula given in the PageRank algorithm:

$$P(i) = (1-d) + d \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$

• The parameter d is called the damping factor and can be set to a value between 0 and 1. d = 0.85 was used in the PageRank algorithm.

35

Compute PageRank

• Use the power iteration method

PageRank-Iterate(G)
    P_0 ← e/n
    k ← 0
    repeat
        P_{k+1} ← (1 - d)e + d A^T P_k
        k ← k + 1
    until ||P_k - P_{k-1}||_1 < ε
    return P_k

where ε is a pre-specified convergence threshold
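The iteration can be sketched in a few lines of plain Python. This is an illustrative version, not a production implementation: it assumes the graph is given as a list of out-link lists, uses the update P(i) = (1 - d) + d Σ P(j)/O_j with an L1 stopping test, and lets pages without out-links pass on no score. The names `pagerank_iterate`, `eps` and `max_iter` are assumptions.

```python
def pagerank_iterate(out_links, d=0.85, eps=1e-8, max_iter=1000):
    """Iterate P <- (1 - d) e + d A^T P until the L1 change is below eps.

    out_links[j] is the list of pages that page j links to.
    """
    n = len(out_links)
    p = [1.0 / n] * n                      # P_0 = e / n
    for _ in range(max_iter):
        new_p = [1.0 - d] * n              # the (1 - d) e term
        for j, targets in enumerate(out_links):
            if targets:                    # pages without out-links pass on nothing
                share = d * p[j] / len(targets)
                for i in targets:
                    new_p[i] += share      # d * P(j) / O_j for each edge (j, i)
        delta = sum(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < eps:
            break
    return p


# Hypothetical graphs: a 3-cycle, and a graph where pages 0 and 1 both point to page 2.
cycle_scores = pagerank_iterate([[1], [2], [0]])
scores = pagerank_iterate([[2], [2], [0]])
```

In the second graph page 2 receives links from two pages and ends up with the highest score, while page 1, which has no in-links, stays at 1 - d.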

36

Advantages of PageRank

• PageRank is a global measure and is query independent
– PageRank values of all the pages are computed and saved off-line rather than at query time
• Criticism: query independence. PageRank cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic
– See Nie et al., Topical Link Analysis for Web Search, SIGIR 2006

37




• PageRank

• HITS Approach

• Summary

38

Another Aim: Web Structure Analysis

• Hyperlinks are also useful for finding Web communities
– A Web community is a cluster of densely linked pages representing a group of people with a special interest
• Beyond explicit hyperlinks on the Web, links in other contexts are useful too, e.g.,
– for discovering communities of named entities (e.g., people and organizations)
– for analyzing social phenomena in emails

39

Background: Co-citation and Bibliographic Coupling

• A typical area of research concerned with links is the citation analysis of scholarly publications
– A scholarly publication cites related prior work to acknowledge the origins of some ideas and to compare the new proposal with existing work
• When a paper cites another paper, a relationship is established between the publications
• We discuss two types of citation analysis, co-citation and bibliographic coupling. The HITS algorithm is related to these two types of analysis.

40

Co-citation

• If papers i and j are both cited by paper k, then they may be related in some sense to one another.

• The more papers they are co-cited by, the stronger their relationship is.

Fig. Paper i and paper j are co-cited by paper k

41

Co-citation

• Let L be the citation matrix. Each cell of the matrix is defined as follows:
– $L_{ij} = 1$ if paper i cites paper j, and 0 otherwise
• Co-citation (denoted by $C_{ij}$) is a similarity measure defined as the number of papers that co-cite i and j:

$$C_{ij} = \sum_{k=1}^{n} L_{ki} L_{kj}$$

• A square matrix C can be formed with $C_{ij}$; it is called the co-citation matrix
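The co-citation count follows directly from the citation matrix. The sketch below assumes a hypothetical four-paper citation matrix in which papers 0 and 1 both cite papers 2 and 3; the function name is illustrative.

```python
def cocitation_matrix(L):
    """C[i][j] = sum over k of L[k][i] * L[k][j]: papers citing both i and j."""
    n = len(L)
    return [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical citation matrix: L[i][j] = 1 if paper i cites paper j.
L = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
C = cocitation_matrix(L)
```

Papers 2 and 3 are co-cited by two papers, so C[2][3] is 2; the diagonal entry C[i][i] is simply the number of papers that cite paper i.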

42

Bibliographic Coupling

• Bibliographic coupling operates on a similar principle
• Bibliographic coupling links papers that cite the same articles
– if papers i and j both cite paper k, they may be related
• The more papers they both cite, the stronger their similarity is

Fig. Both paper i and paper j cite paper k

43

Bibliographic Coupling

• $B_{ij}$ represents the number of papers that are cited by both paper i and paper j:

$$B_{ij} = \sum_{k=1}^{n} L_{ik} L_{jk}$$

• A bibliographic coupling matrix B can be formed with $B_{ij}$. It is symmetric and is regarded as a similarity measure of two papers in clustering
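Bibliographic coupling uses the rows of the citation matrix instead of its columns. The sketch below assumes a hypothetical four-paper citation matrix in which papers 0 and 1 both cite papers 2 and 3; the function name is illustrative.

```python
def coupling_matrix(L):
    """B[i][j] = sum over k of L[i][k] * L[j][k]: papers cited by both i and j."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical citation matrix: L[i][j] = 1 if paper i cites paper j.
L = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
B = coupling_matrix(L)
```

Papers 0 and 1 share two references, so B[0][1] is 2, while papers 2 and 3 cite nothing and have no coupling.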

44

HITS

• HITS stands for Hypertext Induced Topic Search
• HITS is search-query dependent; it can also be used for finding Web communities
• When the user issues a search query,
– HITS first expands the list of relevant pages returned by a search engine and
– then produces two rankings of the expanded set of pages: an authority ranking and a hub ranking

45

Authorities and Hubs

• Authority: roughly, an authority is a page with many in-links
– The idea is that the page may have good or authoritative content on some topic and
– thus many people trust it and link to it
• Hub: a hub is a page with many out-links
– The page serves as an organizer of the information on a particular topic and
– points to many good authority pages on the topic

46

Mining the Web's Link Structures

• Finding authoritative Web pages
– Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
• Hyperlinks can be used to infer the notion of authority
– A hyperlink pointing to another Web page can be considered the author’s endorsement of that page
• Hub pages: Web pages that provide collections of links to authorities

47

Mining the Web's Link Structures

• Mutually reinforcing relationship:
– a good hub is a page that points to many good authorities
– a good authority is a page that is pointed to by many good hubs

Fig. Hub pages (yellow) pointing to authority pages (red)

48

Define Authority and Hub Weight for Each Page

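• For the page p: authority weight $a_p$; hub weight $h_p$

$$a_p = \sum_{q \to p} h_q$$

a[p] := sum of h[q], for all q with q → p (Fig. pages q1, q2, q3 point to page p)

$$h_p = \sum_{p \to q} a_q$$

h[p] := sum of a[q], for all q with p → q (Fig. page p points to pages q1, q2, q3)

• Better authority (hub) pages have larger a (h) values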

49

The HITS Algorithm

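• HITS works on the pages in S (the Web space) and assigns every page in S an authority score and a hub score
• Let the number of pages in S be n
• We again use G = (V, E) to denote the hyperlink graph of S
• We use L to denote the adjacency matrix of the graph:

$$L_{ij} = \begin{cases} 1 & \text{if } (d_i, d_j) \in E \\ 0 & \text{otherwise} \end{cases}$$

• Example with four pages d1, d2, d3, d4:

$$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$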

50

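The HITS Algorithm

• Let the authority score of page i be $a(d_i)$, and the hub score of page i be $h(d_i)$
• The mutually reinforcing relationship of the two scores is represented as follows:

$$a(d_i) = \sum_{d_j \in IN(d_i)} h(d_j)$$

$$h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j)$$

where $IN(d_i)$ is the set of pages that point to $d_i$ and $OUT(d_i)$ is the set of pages that $d_i$ points to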

51

HITS in Matrix Form

• We use a to denote the column vector with all the authority scores, $a = (a(d_1), a(d_2), \dots, a(d_n))^T$, and
• use h to denote the column vector with all the hub scores, $h = (h(d_1), h(d_2), \dots, h(d_n))^T$
• Then,

$$a = L^T h$$

$$h = L a$$

52

Computation of HITS

• The authority scores and hub scores are computed using power iteration
• If we use $a_k$ and $h_k$ to denote the authority and hub vectors at the kth iteration, the iterations for generating the final solutions are

$$a_k = L^T L\, a_{k-1}$$

$$h_k = L L^T h_{k-1}$$

starting with

$$a_0 = h_0 = (1, 1, \dots, 1)$$

53

Relationships with Co-citation and Bibliographic Coupling

• Recall that the co-citation of pages i and j, denoted by $C_{ij}$, is

$$C_{ij} = \sum_{k=1}^{n} L_{ki} L_{kj} = (L^T L)_{ij}$$

– the authority matrix ($L^T L$) of HITS is the co-citation matrix C
• The bibliographic coupling of two pages i and j, denoted by $B_{ij}$, is

$$B_{ij} = \sum_{k=1}^{n} L_{ik} L_{jk} = (L L^T)_{ij}$$

– the hub matrix ($L L^T$) of HITS is the bibliographic coupling matrix B

54

HITS (Hyperlink-Induced Topic Search)

• Explore interactions between hubs and authoritative pages
• Use a term-index search engine to form the root set
– Many of these pages are presumably relevant to the search topic (query)
– Some of them should contain links to most of the prominent authorities
• Expand the root set into a base set
– all of the pages that the root-set pages link to, and
– all of the pages that link to a page in the root set, up to a designated size cutoff

55

Root Set and Base Set

• Properties of the base set (ideally)
– Relatively small
– Rich in relevant pages
– Contains most (or many) of the strongest authorities

Fig. The root set is contained in the base set

56

Step 1 of HITS: Create Base Set from Root Set

Subgraph(σ, ℰ, t, d)
    σ: a query string
    ℰ: a text-based search engine
    t, d: natural numbers    // e.g. t = 200, d = 50
    Let R denote the top t results of ℰ on σ    // R: root set
    Set S := R
    For each page p ∈ R    // html_content ← get_url(url)
        Let W(p) denote the set of all pages p points to
        Let V(p) denote the set of all pages pointing to p
        Add all pages in W(p) to S
        If |V(p)| ≤ d, then add all pages in V(p) to S
        Else add an arbitrary set of d pages from V(p) to S
    End
    Return S    // S: base set, ca. 1000 to 5000 pages
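The Subgraph procedure can be sketched in Python as follows. The callables `search`, `out_links` and `in_links` are hypothetical stand-ins for a text-based search engine and a link index; they are assumptions for illustration, not a real API, and the toy Web below is invented. Taking the first d in-linking pages is one possible way to choose "an arbitrary set of d pages".

```python
def build_base_set(query, search, out_links, in_links, t=200, d=50):
    """Expand the root set (top t search results) into the base set S."""
    root = list(search(query))[:t]        # R: root set
    base = set(root)                      # S := R
    for p in root:
        base.update(out_links(p))         # all pages that p points to
        pointing = list(in_links(p))      # pages pointing to p
        base.update(pointing[:d])         # all of them if there are at most d, else the first d
    return base


# Toy Web for illustration: page -> list of pages it links to.
web_out = {
    "r1": ["x", "y"], "r2": ["y"],
    "a": ["r1"], "b": ["r1"], "c": ["r1"],
    "x": [], "y": [],
}


def toy_search(query):
    return ["r1", "r2"]


def toy_out_links(page):
    return web_out.get(page, [])


def toy_in_links(page):
    return sorted(q for q, targets in web_out.items() if page in targets)


base = build_base_set("data mining", toy_search, toy_out_links, toy_in_links, t=2, d=2)
```

With d = 2, only two of the three pages pointing to "r1" are added, which shows how the cutoff keeps the base set relatively small.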

57

Step 1 of HITS: Create Base Set from Root Set

For instance,

http://search.yahoo.com/bin/search?p=Data+Mining&ei=UTF-8

http://search.yahoo.com/search?p=Data+Mining&ei=UTF-8&b=21

http://search.yahoo.com/search?p=Data+Mining&ei=UTF-8&b=41

… …

• Two types of links in S:
– transverse: between pages with different domain names
– intrinsic: between pages with the same domain name
(domain name: the first level of the URL string of a page)
• G: the graph obtained by deleting all intrinsic links from S

58

The HITS Algorithm

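• Adjacency matrix L of the example graph with pages d1, d2, d3, d4:

$$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$

• Initial values: a = h = 1
• Iterate:

$$a(d_i) = \sum_{d_j \in IN(d_i)} h(d_j), \qquad h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j)$$

$$a = L^T h, \qquad h = L a$$

which is equivalent to

$$a = L^T L\, a, \qquad h = L L^T h$$

• Normalize:

$$\sum_i a(d_i)^2 = \sum_i h(d_i)^2 = 1$$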

59

Step 2 of HITS: Calculate Authority and Hub Weight for Each Page

Iterate(G)
    G: a collection of n linked pages
    a_0 := h_0 := (1, 1, …, 1)
    k := 1
    Repeat
        a_k := L^T L a_{k-1}
        h_k := L L^T h_{k-1}
        normalize a_k, h_k
        k := k + 1
    Until a_k and h_k do not change significantly
    Return (a_k, h_k)
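The Iterate procedure can be sketched in plain Python. This is a minimal illustration, assuming the graph is given as a dense 0/1 adjacency matrix, that "normalize" means scaling to unit sum of squares, and that "do not change significantly" means a total absolute change below a threshold `eps`. The function names and the four-page example graph are hypothetical.

```python
import math


def normalize(v):
    """Scale v so that the sum of squared entries is 1 (zero vectors are left as is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def hits_iterate(L, eps=1e-10, max_iter=1000):
    """Power iteration a_k = L^T L a_{k-1}, h_k = L L^T h_{k-1} with normalization."""
    n = len(L)
    Lt = [list(col) for col in zip(*L)]   # transpose of L
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(max_iter):
        new_a = normalize(mat_vec(Lt, mat_vec(L, a)))
        new_h = normalize(mat_vec(L, mat_vec(Lt, h)))
        change = (sum(abs(x - y) for x, y in zip(new_a, a))
                  + sum(abs(x - y) for x, y in zip(new_h, h)))
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h


# Hypothetical graph: page 0 points to pages 2 and 3, page 1 points to page 2.
L = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits_iterate(L)
```

In this example page 2 is the strongest authority because both hubs point to it, and page 0 is the strongest hub because it points to both authorities.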

60

Step 3 of HITS: Filter out the top authorities and hubs

Filter(G, c)
    G: a collection of n linked pages
    c: a natural number
    (a_k, h_k) := Iterate(G)
    Report the pages with the c largest coordinates in a_k as authorities
    Report the pages with the c largest coordinates in h_k as hubs
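The filtering step amounts to picking the indices of the c largest scores. The sketch below assumes the scores are already available as lists; the score values and the function name `top_c` are made up for illustration.

```python
def top_c(scores, c):
    """Return the indices of the c largest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:c]


# Made-up authority and hub scores for four pages.
authority = [0.0, 0.1, 0.85, 0.52]
hub = [0.85, 0.52, 0.0, 0.0]
top_authorities = top_c(authority, 2)
top_hubs = top_c(hub, 2)
```

Sorting all n scores is simple but costs O(n log n); for a small c and a large base set, a partial selection such as `heapq.nlargest` would be a reasonable alternative.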

61

Strengths and Weaknesses of HITS

• Strength: its ability to rank pages according to the query topic, which may provide more relevant authority and hub pages
• Weaknesses:
– It is easily spammed: adding out-links to one’s own page is easy, and this influences its hub score
– Inefficiency at query time: collecting the root set, expanding it and performing the eigenvector computation are all expensive operations
• Reference
Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, Vol. 46(5), 1999, pp. 604-632 http://www.cs.cornell.edu/home/kleinber/kleinber.html

62




• PageRank

• HITS Approach

• Summary

63

Summary

• Web mining includes mining Web link structures to identify authoritative Web pages, Web content mining and Web usage mining
• We introduced
– PageRank, with social network analysis (centrality and prestige)
– HITS, with co-citation and bibliographic coupling

64

Summary

• Important to note: hyperlink-based ranking is not the only algorithm used in search engines. In fact, it is combined with many content-based factors to produce the final ranking presented to the user.

• Links can also be used to find communities, which are groups of content-creators or people sharing some common interests.

– Web communities

– Email communities

– Named entity communities, etc.

65
