Chapter 8. Mining Complex Types of Data (II) --Web Mining--

Data.Mining.C.8(Ii).Web Mining 570802461

Chapter 8.

Mining Complex Types of Data (II)

--Web Mining--

Chapter 8. Mining Complex Types of Data (II)

• Introduction to Web mining

• Web Structure Analysis

• PageRank

• HITS Approach

• Summary

Mining the World-Wide Web

• The WWW is a huge, widely distributed, global information service centre for
– Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
– Rich and dynamic hyperlink information
– Web page access and usage information
• The WWW provides rich sources for data/text mining
• Challenges
– Too huge for effective data/text warehousing and mining
– Too complex and heterogeneous: no standards and structure

Web Mining: A Challenging Task

• Research areas
– Web access patterns
– Web structures and regularity
– Web contents
• Problems
– The "abundance" problem
– Limited coverage of the Web: hidden Web sources, majority of data in DBMS
– Limited query interface based on keyword-oriented search
– Etc.

Web Mining Taxonomy

Web Mining
– Web Structure Mining
– Web Content Mining
   · Web Page Content Mining
   · Search Result Mining
– Web Usage Mining
   · General Access Pattern Tracking
   · Customized Usage Tracking

Mining the World-Wide Web

Web Mining taxonomy (figure), expanding Web Page Content Mining under Web Content Mining:

• WebLog (Lakshmanan et al. 1996), WebOQL (Mendelzon et al. 1998), …: can identify information within given Web pages
• Ahoy! (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other Web pages
• ShopBot (Etzioni et al. 1997): looks for product prices within Web pages

Mining the World-Wide Web

Web Mining taxonomy (figure), expanding Search Result Mining under Web Content Mining:

• Search Engine Result Summarization
• Clustering Search Results (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles

Mining the World-Wide Web

Web Mining taxonomy (figure), expanding General Access Pattern Tracking under Web Usage Mining:

• Web Log Mining (Zaïane, Xin and Han, 1998)
– Uses data mining techniques to understand general access patterns and trends.
– Can shed light on better structure and grouping of resource providers.

Mining the World-Wide Web

Web Mining taxonomy (figure), expanding Customized Usage Tracking under Web Usage Mining:

• Adaptive Sites (Perkowitz and Etzioni, 1997)
– Analyses the access patterns of one user at a time.
– The Web site restructures itself automatically by learning from user access patterns.

Mining the World-Wide Web

Web Mining taxonomy (figure), expanding Web Structure Mining:

• Web Structure Mining Using Links
– HITS (Kleinberg, 1998)
– PageRank (Sergey Brin and Larry Page, 1998)
• The large amount of Web linkage information provides rich information about the relevance, quality and structure of the Web's content.
• Use interconnections between Web pages to give weight to pages.

Chapter 8. Mining Complex Types of Data (II)

• Introduction to Web mining

• Web Structure Analysis

• PageRank

• HITS Approach

• Summary

Introduction

• Early search engines mainly compared the similarity between the query and the indexed pages, i.e.,
– they used information retrieval methods, cosine, ...
• From 1996, it became clear that similarity alone was no longer sufficient.
– The number of pages grew rapidly in the mid-to-late 1990s.
• Google estimates: 10 million relevant pages.
• How to choose only 30-40 pages and rank them suitably to present to the user?

Web Structure Analysis

• Starting around 1996, researchers began to work on the problem. They resorted to hyperlinks.
• Web pages are connected through hyperlinks, which carry important information.
– Some hyperlinks organize information at the same site.
– Other hyperlinks point to pages on other Web sites. Such out-going hyperlinks often indicate an implicit conveyance of authority to the pages being pointed to.
• Pages that are pointed to by many other pages are likely to contain authoritative information.

Web Structure Analysis

• During 1997-1998, the two most influential hyperlink-based search algorithms, PageRank and HITS, were reported.
• Both algorithms exploit the hyperlinks of the Web to rank pages.
– PageRank: Sergey Brin and Larry Page, PhD students at Stanford University, at the Seventh International World Wide Web Conference (WWW) in April 1998.
– HITS: Jon Kleinberg (Cornell University), at the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998.

Chapter 8. Mining Complex Types of Data (II)

• Introduction to Web mining

• Web Structure Analysis

• PageRank

• HITS Approach

• Summary

Background: Social Network Analysis

• Social network analysis: the study of social entities (people in an organization), called actors, and their interactions/relationships.
• Interactions/relationships: represented by a network or graph,
– each vertex (or node): an actor
– each link: a relationship.
• From the network, we can study
– properties of its structure
– each actor: its role, position and prestige
• Communities: various kinds of sub-graphs, formed by groups of actors.

Social Network and the Web

• Web: viewed as a virtual social network

– Each page: actor

– each hyperlink: relationship.

• Results from social network can be adapted and extended for use in the Web context.

• Two types of social network analysis, centrality and prestige, are closely related to hyperlink analysis and search on the Web.

Centrality

• An actor with extensive contacts (links) or communications with many other actors in the organization is considered more important than an actor with relatively fewer contacts.

• Central actor: one involved in many links.

Measure of Centrality

• Network: viewed as a directed graph
• In-links of actor i: links pointing to i
• Out-links of actor i: links pointing out from i
• The simple degree centrality of actor i:

$$C(i) = \frac{d_{out}(i)}{n-1}$$

where $d_{out}(i)$ is the number of out-links of actor i and n is the total number of actors in the network.

• Dividing by n-1 standardizes the centrality value into the range [0, 1].

Prestige

• Prestige: a more refined measure of the prominence of an actor than centrality.
• Prestigious actor: one who receives extensive ties as a recipient; only in-links are used.
• Difference between centrality and prestige:
– centrality focuses on out-links
– prestige focuses on in-links

Measure of Prestige

• In-links of actor i: links pointing to i

• The simple degree prestige of actor i:

$$P(i) = \frac{d_{in}(i)}{n-1}$$

where $d_{in}(i)$ is the number of in-links of actor i and n is the total number of actors in the network.
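As a minimal sketch of the two degree-based measures, the following Python function computes degree centrality (out-links) and degree prestige (in-links) from a 0/1 adjacency matrix. The function name `degree_scores` and the four-actor network are hypothetical and chosen only for illustration.

```python
def degree_scores(adj):
    """Return (centrality, prestige) for a 0/1 adjacency matrix.

    adj[i][j] == 1 means actor i links to actor j.
    """
    n = len(adj)
    # Degree centrality: out-links of i divided by n - 1.
    centrality = [sum(adj[i]) / (n - 1) for i in range(n)]
    # Degree prestige: in-links of i divided by n - 1.
    prestige = [sum(adj[j][i] for j in range(n)) / (n - 1) for i in range(n)]
    return centrality, prestige


# Hypothetical network: actor 0 links to everyone, everyone links to actor 3.
example = [
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
centrality, prestige = degree_scores(example)
```

In this toy network actor 0 is the most central (it only sends links) while actor 3 is the most prestigious (it only receives them), which is the contrast between the two measures.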

Rank Prestige

• Rank prestige forms the basis of most Web page link analysis algorithms, including PageRank.
• In the real world, a person i chosen by an important person is more prestigious than one chosen by a less important person.
– For example, a vote from a company CEO for a person is much more important than a vote from a worker.
• If one's circle of influence is full of prestigious actors, then one's own prestige is also high.
– Thus one's prestige is affected by the ranks or statuses of the involved actors.

Measure of Rank Prestige

• Rank prestige PRank(i): a linear combination of links that point to i:

$$PRank(i) = A_{1i}\,PRank(1) + A_{2i}\,PRank(2) + \dots + A_{ni}\,PRank(n)$$

where $A_{ji} = 1$ if j points to i and 0 otherwise.

• We have n equations for n actors. Writing the rank prestige values as the column vector P, they become

$$P = A^{T} P$$

• A: the adjacency matrix of the network (graph), where $A_{ij} = 1$ if i points to j and 0 otherwise.

Intuitive Idea of Rank Prestige

• A hyperlink from a page to another page is an implicit conveyance of authority to the target page.
– The more in-links a page i receives, the more prestige the page i has.
• Pages that point to page i also have their own prestige scores.
– A page of higher prestige pointing to i is more important than a page of lower prestige pointing to i.
– In other words, a page is important if it is pointed to by other important pages.
• This is exactly the idea of rank prestige in social networks.

PageRank Algorithm

• According to rank prestige, the importance of page i (i's PageRank score) is the sum of the PageRank scores of all pages that point to i, where each page j shares its score equally among its out-links.
• The Web is treated as a directed graph G = (V, E). Let the total number of pages be n. The PageRank score of page i (denoted by P(i)) is defined by:

$$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$

where $O_j$ is the number of out-links of page j.

Matrix Notation

• Let P be an n-dimensional column vector of PageRank values, i.e., $P = (P(1), P(2), \dots, P(n))^{T}$.
• Let A be the adjacency matrix of our graph with

$$A_{ij} = \begin{cases} \dfrac{1}{O_i} & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}$$

Here we use $O_i$ to denote the number of out-links of a node i.
• Each transition probability is $1/O_i$ if we assume the Web surfer clicks the hyperlinks on page i uniformly at random.

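Transition Probability Matrix

• Let A be the state transition probability matrix

$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{pmatrix}$$

• $A_{ij}$: the transition probability that the surfer in state i (page i) will move to state j (page j).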

Let us start…

• Given an initial probability distribution vector that a surfer is at each state (or page)
– $p_0 = (p_0(1), p_0(2), \dots, p_0(n))^{T}$ (a column vector) and
– an n × n transition probability matrix A,

we have

$$\sum_{i=1}^{n} p_0(i) = 1, \qquad \sum_{j=1}^{n} A_{ij} = 1$$

Random Surfer

• State transition:

$$p_1(j) = \sum_{i=1}^{n} A_{ij}(1)\, p_0(i)$$

• where $A_{ij}(1)$ is the probability of going from i to j after 1 transition. In matrix form we can write

$$P_1 = A^{T} P_0$$

• In general, the probability distribution after k steps/transitions:

$$P_k = A^{T} P_{k-1}$$
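The random-surfer update can be sketched in a few lines of Python. This is only an illustration: the helper names `step` and `distribution_after` and the two-page transition matrix are hypothetical.

```python
def step(A, p):
    """One transition: returns A^T p, i.e. p_next(j) = sum_i A[i][j] * p(i)."""
    n = len(A)
    return [sum(A[i][j] * p[i] for i in range(n)) for j in range(n)]


def distribution_after(A, p0, k):
    """Probability distribution of the surfer after k transitions."""
    p = list(p0)
    for _ in range(k):
        p = step(A, p)
    return p


# Hypothetical two-page chain: page 0 always moves to page 1;
# page 1 moves to page 0 or stays put with equal probability.
A = [[0.0, 1.0],
     [0.5, 0.5]]
p1 = distribution_after(A, [1.0, 0.0], 1)
p2 = distribution_after(A, [1.0, 0.0], 2)
```

Because every row of A sums to 1, each step keeps the entries of the distribution summing to 1.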

An Example Web Hyperlink Graph

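$$A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{pmatrix}$$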

Improved PageRank

• At a page, the random surfer has two options
– With probability d, he randomly chooses an out-link to follow.
– With probability 1-d, he jumps to a random page.
• Improved model:

$$P = \left( (1-d)\frac{E}{n} + d A^{T} \right) P$$

where $E = e e^{T}$ (e is a column vector of all 1's) and thus E is an n × n square matrix of all 1's.

Follow the Above Example

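$$(1-d)\frac{E}{n} + d A^{T} = \begin{pmatrix} 1/60 & 7/15 & 1/60 & 1/60 & 1/6 & 1/60 \\ 7/15 & 1/60 & 11/12 & 1/60 & 1/6 & 1/60 \\ 7/15 & 7/15 & 1/60 & 19/60 & 1/6 & 1/60 \\ 1/60 & 1/60 & 1/60 & 1/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 1/60 \end{pmatrix}$$

These entries correspond to d = 0.9 and n = 6 (so (1-d)/n = 1/60), with the all-zero row of page 5 in A replaced by 1/6 in every column.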
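A hedged sketch of how a matrix of the form (1-d)E/n + dA^T can be built with exact fractions. The assumptions here are d = 9/10 and a six-page example graph in which page 5 has no out-links, so its row is replaced by 1/6 in every column; the function name `damped_matrix` is hypothetical.

```python
from fractions import Fraction


def damped_matrix(A, d):
    """Return (1 - d) E / n + d A^T for a row-stochastic matrix A."""
    n = len(A)
    jump = (1 - d) / n  # probability of a random jump to any given page
    return [[jump + d * A[j][i] for j in range(n)] for i in range(n)]


half, third, sixth = Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)
# Assumed example: row i holds the transition probabilities out of page i + 1.
A = [
    [0, half, half, 0, 0, 0],
    [half, 0, half, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, third, 0, third, third],
    [sixth] * 6,  # page with no out-links: uniform row (assumption)
    [0, 0, 0, half, half, 0],
]
G = damped_matrix(A, Fraction(9, 10))
```

Every column of G sums to 1, so multiplying a probability vector by G yields another probability vector.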

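Final PageRank Algorithm

• Scaling P so that $e^{T} P = n$, the improved model becomes

$$P = (1-d)\,e + d A^{T} P$$

• PageRank for each page i is

$$P(i) = (1-d) + d \sum_{j=1}^{n} A_{ji} P(j)$$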

Final PageRank Algorithm

• This is equivalent to the formula given in the PageRank algorithm:

$$P(i) = (1-d) + d \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$

• The parameter d is called the damping factor, which can be set to a value between 0 and 1. d = 0.85 was used in the PageRank algorithm.

Compute PageRank

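• Use the iteration method PageRank-Iterate(G):

PageRank-Iterate(G)
  $P_0 \leftarrow e/n$
  $k \leftarrow 1$
  repeat
    $P_k \leftarrow (1-d)\,e + d A^{T} P_{k-1}$
    $k \leftarrow k + 1$
  until the two most recent vectors differ by less than a small threshold in 1-norm
  return the most recent vector $P_{k-1}$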
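The iteration can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: the function name `pagerank`, the threshold `eps`, the iteration cap `max_iter`, and the treatment of pages without out-links (assumed to link to every page) are all assumptions.

```python
def pagerank(out_links, d=0.85, eps=1e-10, max_iter=1000):
    """Power iteration for P = (1 - d) e + d A^T P.

    out_links[i] lists the pages that page i links to. A page without
    out-links is treated as linking to every page (an assumption made
    here so that every row of A sums to 1).
    """
    n = len(out_links)
    p = [1.0 / n] * n  # P_0 = e / n
    for _ in range(max_iter):
        new_p = [1.0 - d] * n  # the (1 - d) e term
        for j, targets in enumerate(out_links):
            targets = targets or range(n)  # dangling page: uniform jump
            share = d * p[j] / len(targets)
            for i in targets:
                new_p[i] += share  # adds d * A_ji * P(j)
        diff = sum(abs(a - b) for a, b in zip(new_p, p))  # 1-norm of the change
        p = new_p
        if diff < eps:
            break
    return p


# Six-page example graph, pages numbered from 0; page 4 has no out-links.
six_pages = [[1, 2], [0, 2], [1], [2, 4, 5], [], [3, 4]]
ranks = pagerank(six_pages)
```

With this form of the equation the scores sum to n rather than to 1, so a score above 1 marks a page that is more important than average.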

Advantages of PageRank

• PageRank is a global measure and is query independent.
– PageRank values of all the pages are computed and saved off-line rather than at query time.
• Criticism: query independence. It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
• Nie et al., Topical Link Analysis for Web Search, SIGIR 2006

Chapter 8. Mining Complex Types of Data (II)

• Introduction to Web mining

• Web Structure Analysis

• PageRank

• HITS Approach

• Summary

Another Aim: Web Structure Analysis

• Hyperlinks are also useful for finding Web communities.
– A Web community is a cluster of densely linked pages representing a group of people with a special interest.
• Beyond explicit hyperlinks on the Web, links in other contexts are useful too, e.g.,
– for discovering communities of named entities (e.g., people and organizations)
– for analyzing social phenomena in emails.

Background: Co-citation and Bibliographic Coupling

• A typical area of research concerned with links is citation analysis of scholarly publications.
– A scholarly publication cites related prior work to acknowledge the origins of some ideas and to compare the new proposal with existing work.
• When a paper cites another paper, a relationship is established between the publications.
• We discuss two types of citation analysis, co-citation and bibliographic coupling. The HITS algorithm is related to these two types of analysis.

Co-citation

• If papers i and j are both cited by paper k, then they may be related in some sense to one another.

• The more papers they are co-cited by, the stronger their relationship is.

Fig. Paper i and paper j are co-cited by paper k

Co-citation

• Let L be the citation matrix. Each cell of the matrix is defined as follows:
– $L_{ij} = 1$ if paper i cites paper j, and 0 otherwise.
• Co-citation (denoted by $C_{ij}$) is a similarity measure defined as the number of papers that co-cite i and j,

$$C_{ij} = \sum_{k=1}^{n} L_{ki} L_{kj}$$

• A square matrix C can be formed with $C_{ij}$, and it is called the co-citation matrix.

Bibliographic Coupling

• Bibliographic coupling operates on a similar principle.
• Bibliographic coupling links papers that cite the same articles
– if papers i and j both cite paper k, they may be related.
• The more papers they both cite, the stronger their similarity is.

Fig. Both paper i and paper j cite paper k

Bibliographic Coupling

• $B_{ij}$ represents the number of papers that are cited by both paper i and paper j:

$$B_{ij} = \sum_{k=1}^{n} L_{ik} L_{jk}$$

• A bibliographic coupling matrix B can be formed with $B_{ij}$. It is symmetric and is regarded as a similarity measure of two papers in clustering.
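The two citation measures differ only in which index of L is summed over, which a small Python sketch makes concrete. The function names and the four-paper citation matrix are hypothetical.

```python
def cocitation(L):
    """C[i][j] = number of papers k that cite both i and j."""
    n = len(L)
    return [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bibliographic_coupling(L):
    """B[i][j] = number of papers k cited by both i and j."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical citations: papers 0 and 1 each cite papers 2 and 3.
L = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
C = cocitation(L)
B = bibliographic_coupling(L)
```

Here papers 2 and 3 are co-cited twice, while papers 0 and 1 are coupled through two shared references.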

HITS

• HITS stands for Hypertext Induced Topic Search.
• HITS is search-query dependent and is also used for finding Web communities.
• When the user issues a search query,
– HITS first expands the list of relevant pages returned by a search engine and
– then produces two rankings of the expanded set of pages: an authority ranking and a hub ranking.

Authorities and Hubs

Authority: Roughly, an authority is a page with many in-links.
– The idea is that the page may have good or authoritative content on some topic and
– thus many people trust it and link to it.

Hub: A hub is a page with many out-links.
– The page serves as an organizer of the information on a particular topic and
– points to many good authority pages on the topic.

Mining the Web's Link Structures

• Finding authoritative Web pages
– Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
• Hyperlinks can be used to infer the notion of authority
– A hyperlink pointing to another Web page can be considered the author's endorsement of that page
• Hub pages: Web pages that provide collections of links to authorities

Mining the Web's Link Structures

• Mutually reinforcing relationship:
– a good hub is a page that points to many good authorities;
– a good authority is a page that is pointed to by many good hubs.

Fig. Hub pages (yellow) pointing to authority pages (red)

Define Authority and Hub Weight for Each Page

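For the page p: authority weight $a_p$; hub weight $h_p$

$$a_p = \sum_{q \to p} h_q, \qquad h_p = \sum_{p \to q} a_q$$

Fig. Pages q1, q2, q3 point to page p: a[p] := sum of h[q], for all q with q → p

Fig. Page p points to pages q1, q2, q3: h[p] := sum of a[q], for all q with p → q

Better authority (hub) pages have larger a (h) values.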

The HITS Algorithm

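• HITS works on the pages in S (the Web space under consideration), and assigns every page in S an authority score and a hub score.
• Let the number of pages in S be n.
• We again use G = (V, E) to denote the hyperlink graph of S.
• We use L to denote the adjacency matrix of the graph:

$$L_{ij} = \begin{cases} 1 & \text{if } (d_i, d_j) \in E \\ 0 & \text{otherwise} \end{cases}$$

Example (figure: pages d1, d2, d3, d4):

$$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$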

The HITS Algorithm

• Let the authority score of page i be $a(d_i)$, and the hub score of page i be $h(d_i)$.
• The mutually reinforcing relationship of the two scores is represented as follows:

$$a(d_i) = \sum_{d_j \in IN(d_i)} h(d_j)$$

$$h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j)$$

HITS in Matrix Form

• We use a to denote the column vector with all the authority scores, $a = (a(d_1), a(d_2), \dots, a(d_n))^{T}$, and
• use h to denote the column vector with all the hub scores, $h = (h(d_1), h(d_2), \dots, h(d_n))^{T}$.
• Then,

$$a = L^{T} h$$

$$h = L a$$

Computation of HITS

• The authority scores and hub scores are computed using power iteration.
• If we use $a_k$ and $h_k$ to denote the authority and hub vectors at the kth iteration, the iterations for generating the final solutions are

$$a_k = L^{T} L\, a_{k-1}$$

$$h_k = L L^{T} h_{k-1}$$

starting with

$$a_0 = h_0 = (1, 1, \dots, 1)$$

Relationships with Co-citation and Bibliographic Coupling

• Recall that the co-citation of pages i and j, denoted by $C_{ij}$, is

$$C_{ij} = \sum_{k=1}^{n} L_{ki} L_{kj} = (L^{T} L)_{ij}$$

– the authority matrix ($L^{T} L$) of HITS is the co-citation matrix C
• The bibliographic coupling of two pages i and j, denoted by $B_{ij}$, is

$$B_{ij} = \sum_{k=1}^{n} L_{ik} L_{jk} = (L L^{T})_{ij}$$

– the hub matrix ($L L^{T}$) of HITS is the bibliographic coupling matrix B

HITS (Hyperlink-Induced Topic Search)

• Explore interactions between hubs and authoritative pages
• Use a term-index search engine to form the root set
– Many of these pages are presumably relevant to the search topic (query)
– Some of them should contain links to most of the prominent authorities
• Expand the root set into a base set
– all of the pages that the root-set pages link to, and
– all of the pages that link to a page in the root set, up to a designated size cutoff

Root Set and Base Set

• Properties of the base set (ideally)
– Relatively small
– Rich in relevant pages
– Contains most (or many) of the strongest authorities

Fig. The root set contained within the base set

Step 1 of HITS: Create Base Set from Root Set

Subgraph(σ, ℰ, t, d)
  σ: a query string
  ℰ: a text-based search engine
  t, d: natural numbers                       // t = 200; d = 50
  Let R denote the top t results of ℰ on σ    // R: root set
  Set S := R
  For each page p ∈ R                         // html_content ← get_url(url)
    Let W(p) denote the set of all pages p points to
    Let V(p) denote the set of all pages pointing to p
    Add all pages in W(p) to S
    If |V(p)| ≤ d, then add all pages in V(p) to S
    Else add an arbitrary set of d pages from V(p) to S
  End
  Return S                                    // S: base set, ca. 1000 – 5000 pages
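The expansion step can be sketched in Python as below. The function name `build_base_set` and the two callables `out_links` and `in_links` are hypothetical stand-ins for whatever link index or crawler is available; the search engine call that produces the root set is left outside the sketch.

```python
def build_base_set(root_set, out_links, in_links, d=50):
    """Expand a root set into a base set.

    out_links(p) and in_links(p) are hypothetical callables that return
    the pages p points to and the pages pointing to p, respectively.
    """
    base = set(root_set)
    for p in root_set:
        base.update(out_links(p))  # add every page that p points to
        pointing = list(in_links(p))
        base.update(pointing[:d])  # add at most d pages that point to p
    return base
```

Taking the first d in-linking pages is one way to pick "an arbitrary set of d pages"; any other selection rule would fit the procedure equally well.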

Step 1 of HITS: Create Base Set from Root Set

For instance,

http://search.yahoo.com/bin/search?p=Data+Mining&ei=UTF-8

http://search.yahoo.com/search?p=Data+Mining&ei=UTF-8&b=21

http://search.yahoo.com/search?p=Data+Mining&ei=UTF-8&b=41

… …

• Two types of links in S:
– transverse: between pages with different domain names;
– intrinsic: between pages with the same domain name
(domain name: the first level of the URL string of a page)
• G: the graph obtained by deleting all intrinsic links from S
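A minimal sketch of removing intrinsic links, assuming links are given as (source URL, target URL) pairs and that "domain name" can be approximated by the host part of the URL. The function name `transverse_links` is hypothetical.

```python
from urllib.parse import urlparse


def transverse_links(links):
    """Keep only links whose source and target have different host names."""
    return [
        (src, dst)
        for src, dst in links
        if urlparse(src).netloc != urlparse(dst).netloc
    ]
```

Links inside one site mostly serve navigation, so dropping them leaves the links that are more likely to express an endorsement.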

The HITS Algorithm

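Example graph on pages d1, d2, d3, d4 (figure) with "adjacency matrix"

$$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$

Initial values: a = h = 1

Iterate:

$$a(d_i) = \sum_{d_j \in IN(d_i)} h(d_j), \qquad h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j)$$

$$a = L^{T} h, \qquad h = L a$$

$$a = L^{T} L\, a, \qquad h = L L^{T} h$$

Normalize:

$$\sum_i a(d_i)^2 = \sum_i h(d_i)^2 = 1$$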

Step 2 of HITS: Calculate Authority and Hub Weight for Each Page

Iterate(G)
  G: a collection of n linked pages
  $a_0 = h_0 = (1, 1, 1, \dots, 1)$
  k = 1
  Repeat
    $a_k = L^{T} L\, a_{k-1}$
    $h_k = L L^{T} h_{k-1}$
    normalize $a_k$, $h_k$
    k = k + 1
  Until $a_k$ and $h_k$ do not change significantly
  Return ($a_k$, $h_k$).
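A hedged Python sketch of the iteration. Instead of forming the products of L with its transpose explicitly, it alternates the two updates a = L^T h and h = L a, which compose to the same iterations. The function names, the threshold `eps`, the iteration cap and the four-page graph are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale v so that its squared entries sum to 1 (unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def hits(L, eps=1e-9, max_iter=1000):
    """Return (authority, hub) score lists for a 0/1 adjacency matrix L."""
    n = len(L)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(max_iter):
        # a = L^T h: authority of i sums the hub scores of pages pointing to i.
        new_a = [sum(L[j][i] * h[j] for j in range(n)) for i in range(n)]
        # h = L a: hub of i sums the authority scores of pages i points to.
        new_h = [sum(L[i][j] * new_a[j] for j in range(n)) for i in range(n)]
        new_a, new_h = normalize(new_a), normalize(new_h)
        change = max(abs(x - y) for x, y in zip(new_a + new_h, a + h))
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h


# Hypothetical graph: page 0 points to pages 2 and 3, page 1 points to page 2.
example = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(example)
```

In this toy graph page 2 ends up as the strongest authority and page 0 as the strongest hub, since page 0 points to both authorities.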

Step 3 of HITS: Filter out the top authorities and hubs

Filter(G, c)
  G: a collection of n linked pages
  c: a natural number
  ($a_k$, $h_k$) := Iterate(G)
  Report the pages with the c largest coordinates in $a_k$ as authorities.
  Report the pages with the c largest coordinates in $h_k$ as hubs.
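Selecting the c largest coordinates is a one-line sort. The sketch below, with the hypothetical helper name `top_c`, returns page indices rather than the pages themselves.

```python
def top_c(scores, c):
    """Indices of the c largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:c]
```

Calling it once on the authority vector and once on the hub vector yields the two reported lists.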

Strengths and Weaknesses of HITS

• Strength: its ability to rank pages according to the query topic, which may provide more relevant authority and hub pages.
• Weaknesses:
– It is quite easy to influence HITS, since adding out-links to one's own page is so easy.
– Inefficiency at query time: the query-time evaluation is slow. Collecting the root set, expanding it and performing the eigenvector computation are all expensive operations.
• Reference

Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, Vol. 46(5), 1999, pp. 604-632 http://www.cs.cornell.edu/home/kleinber/kleinber.html

Chapter 8. Mining Complex Types of Data (II)

• Introduction to Web mining

• Web Structure Analysis

• PageRank

• HITS Approach

• Summary

Summary

• Web mining includes mining Web link structures to identify authoritative Web pages, Web content mining and Web usage mining
• We introduced
– PageRank, together with social network analysis (centrality and prestige)
– HITS, together with co-citation and bibliographic coupling

Summary

• Important to note: hyperlink-based ranking is not the only algorithm used in search engines. In fact, it is combined with many content-based factors to produce the final ranking presented to the user.

• Links can also be used to find communities, which are groups of content-creators or people sharing some common interests.

– Web communities

– Email communities

– Named entity communities, etc.
