
Knowledge Management Institute

Link Analysis and Decentralized Search

Markus Strohmaier, Denis Helic
Multimediale Informationssysteme II

Markus Strohmaier 2011

The Memex (1945)

A mechanized private library for individual use.

The Memex [Bush 1945] mimics associative memory, where users can
• insert documents
• navigate documents
• retrieve documents
• build trails through documents

It is operated and maintained individually, but trails can be shared socially, e.g.
(i) user A sends a trail to user B
(ii) user B modifies it and shares it with user C
(iii) user C uses the trail for navigation

C's interaction with documents is mediated by users A and B.

[Bush 1945] V. Bush. As We May Think. Atlantic Monthly, 1945.

Web based Retrieval: Challenges

Working with an enormous amount of data
• 10 billion pages at 500 kB each (estimated in 01-2004)
  – 2 pages / person on the globe
• 20 times larger than the Library of Congress (LoC) print collection (estimated in 2003)
• Furthermore, there is a Deep Web
  – 550 billion pages (estimated in 2004)

Web based Retrieval: Challenges

Example for the amount of web pages:
– Searching for 'Star Trek' yielded about 11 million results on Google [Nov 2007]
– Ordinary users investigate 20-30 result list entries.

Which web page is the most interesting?
How to store an index (inverted file) of this size?

Web based Retrieval: Challenges

The Web is highly dynamic
Study by Cho & Garcia-Molina (2002):
– 40% of the web pages in their dataset changed within a week
– 23% of the .com pages changed on a daily basis
Study by Fetterly et al. (2003):
– 35% of the pages changed during investigations
– Larger web pages change more often

Web based Retrieval: Challenges

The Web is self-organized
• No central authority (for the WWW) or main index
• Everyone can add (even edit) pages
• Pages disappear on a regular basis
  – A US study claimed that in 2 investigated tech. journals 50% of the cited links were inaccessible after four years.
• Lots of errors and falsehoods, no quality control

Web based Retrieval: Challenges

The Web is hyperlinked
• Based on HTML markup tags and URIs
• Pages are interconnected
  – Unidirectional links (in-link, out-link, self-link)
• Network structures emerge from the links
  – Link analysis is possible


Common Architecture


The World Wide Web (1990-2000)

A user's interaction with the web is mediated by (a few) editors and publishers.

The World Wide Web Today (2010)

Interaction between individuals and computational systems is mediated by the aggregate behavior of massive numbers (millions) of users.

Social Computation influences system properties (X)

X = Utility, X = Findability, X = Navigability, X = Relevance

It is through the process of social computation, i.e. the combination of social behavior and algorithmic computation, that desired and undesired system properties and functions emerge.

Emergent system properties are beyond the direct control of engineers.

New methods and algorithms for designing and shaping social-computational systems are needed.

Example: X = Connectivity (of the web graph)

Questions:
• What is X like? The bow-tie architecture of the web [Broder et al. 2000]
• What causes X?

Example: X = Connectivity (of the web graph)

Questions:
• What is X like? The bow-tie architecture of the web [Broder et al. 2000]
• What causes X? Social mechanisms, such as preferential attachment [Barabasi and Albert 1999]
• How can we improve X? An open problem

Preferential attachment:

P(k_i) = k_i / Σ_j k_j

where
– P(k_i) is the probability of a new vertex attaching to a vertex i with degree k_i
– k_i is the degree of vertex i
– Σ_j k_j is the sum of all vertices' degrees

[Barabasi and Albert 1999] A.-L. Barabási, R. Albert. Emergence of Scaling in Random Networks. Science, Vol. 286, no. 5439, pp. 509-512, 15 October 1999.
[Broder et al. 2000] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In 9th International WWW Conference, 2000.
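The preferential attachment rule can be illustrated with a small growth simulation. This is a minimal sketch, not the model from [Barabasi and Albert 1999] in full: it assumes each new vertex adds exactly one edge and that the graph starts from a single edge. The function names are hypothetical.

```python
import random

def preferential_attachment(n_vertices, seed=0):
    """Grow a graph vertex by vertex (assumption: one new edge per vertex).

    Each new vertex attaches to an existing vertex i chosen with
    probability k_i / sum_j k_j, i.e. proportional to its degree.
    """
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}      # assumption: start from the single edge 0-1
    edges = [(0, 1)]
    for new in range(2, n_vertices):
        vertices = list(degree)
        weights = [degree[v] for v in vertices]   # proportional to k_i
        target = rng.choices(vertices, weights=weights, k=1)[0]
        edges.append((new, target))
        degree[target] += 1
        degree[new] = 1
    return degree, edges

def attachment_probability(degree, i):
    """Probability that the next vertex attaches to vertex i."""
    return degree[i] / sum(degree.values())

degree, edges = preferential_attachment(200)
```

Because high-degree vertices are more likely to receive further links, a few vertices typically end up with far more edges than the rest.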

Analysis of Dynamic Links in Social Tagging Systems

How can navigability in social tagging systems be described and improved?

D. Helic, C. Trattner, M. Strohmaier and K. Andrews, On the Navigability of Social Tagging Systems, The 2nd IEEE International Conference on Social Computing (SocialCom 2010), Minneapolis, Minnesota, USA, 2010. (acceptance rate 33/245, 13.47% quota, nominated for Best Paper).

Structure of Social Tagging Systems: Definition

A folksonomy is a tuple F := (U, T, R, Y) where
• the three disjoint, finite sets U, T, R correspond to
  – a set of persons or users u ∈ U
  – a set of tags t ∈ T and
  – a set of resources or objects r ∈ R
• Y ⊆ U × T × R, called the set of tag assignments

[Figure: users, tags and resources; example assignment (user 1, tag 1, res. 1); navigation]

[Hotho et al 2006]
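The folksonomy definition maps directly onto sets of triples. The sketch below uses made-up users, tags and resources (all names are hypothetical) and shows how dropping the user dimension yields the tag-resource relation Y ⊆ T × R used for navigation on later slides.

```python
from collections import defaultdict

# Hypothetical tag assignments Y ⊆ U × T × R as (user, tag, resource) triples.
Y = {
    ("user1", "python", "res1"),
    ("user1", "web", "res2"),
    ("user2", "python", "res2"),
    ("user2", "python", "res1"),
}

# The three sets U, T, R are recovered from the assignments.
U = {u for u, _, _ in Y}
T = {t for _, t, _ in Y}
R = {r for _, _, r in Y}

def project_tag_resource(assignments):
    """Drop the user dimension: return the tag-resource relation ⊆ T × R."""
    return {(t, r) for _, t, r in assignments}

def resources_by_tag(assignments):
    """Index the projected relation: tag -> set of resources carrying it."""
    index = defaultdict(set)
    for t, r in project_tag_resource(assignments):
        index[t].add(r)
    return dict(index)
```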

Tag Clouds are Assumed to be Efficient Tools for Navigation

The Navigability Assumption:
• An implicit assumption among designers of social tagging systems that tag clouds are specifically useful to support navigation.
• This has hardly been tested or critically reflected in the past.

Navigating tagging systems via tag clouds (navigating Y ⊆ T × R):
1. The system presents a tag cloud to the user.
2. The user selects a tag from the tag cloud.
3. The system presents a list of resources tagged with the selected tag.
4. The user selects a resource from the list of resources.
5. The system transfers the user to the selected resource, and the process potentially starts anew.

Navigability

Informal description: whether and how quickly one can get from document A to document B in a hypertext system (a more precise definition follows later).

Designing for navigability: in traditional hypertext systems, this property used to be within the control of system designers.

Defining Navigability

A network is navigable iff there is a path between all or almost all pairs of nodes in the network. [Kleinberg 1999]

Formally:
1. There exists a giant component: size(GC) > 0.9 * n
   (a single connected component that accounts for a significant fraction of all nodes)
2. The effective diameter deff is low: deff < log n
   (deff = distance at which 90% of pairs of nodes are reachable)

n … number of nodes in the network

[Kleinberg 1999] J. Kleinberg. The small-world phenomenon: An algorithmic perspective. Proc. 32nd ACM Symposium on Theory of Computing, 2000.
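The two conditions can be checked on a small undirected graph with breadth-first search. This is a sketch under assumptions: the graph is a dict mapping each node to its set of neighbours, the logarithm is taken to base 2 (as in the log2(9) examples on the following slides), and the effective diameter is computed over reachable ordered pairs. All function names are hypothetical.

```python
import math
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def giant_component_size(adj):
    """Size of the largest connected component."""
    seen, best = set(), 0
    for node in adj:
        if node not in seen:
            component = bfs_distances(adj, node)
            seen.update(component)
            best = max(best, len(component))
    return best

def effective_diameter(adj, fraction=0.9):
    """Smallest distance within which `fraction` of reachable pairs lie."""
    lengths = sorted(
        d for s in adj for t, d in bfs_distances(adj, s).items() if t != s
    )
    if not lengths:
        return math.inf
    return lengths[math.ceil(fraction * len(lengths)) - 1]

def is_navigable(adj):
    """Condition 1 (giant component) and condition 2 (low deff)."""
    n = len(adj)
    return (giant_component_size(adj) > 0.9 * n
            and effective_diameter(adj) < math.log2(n))

# Two 9-node examples: a star (hub 0) and a simple path 0-1-...-8.
star = {0: set(range(1, 9))}
star.update({i: {0} for i in range(1, 9)})
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 9} for i in range(9)}
```

In this sketch the star passes both conditions, while the path has a giant component but an effective diameter above log2(9).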

Navigability: Examples

Example 1:
Not navigable: no giant component

Example 2:
Not navigable: giant component, BUT avg. shortest path > log2(9)

Navigability: Examples

Example 3:
Navigable: giant component AND avg. shortest path ≤ 2 < log2(9)

Is this efficiently navigable? There are short paths between all nodes, but can an agent or algorithm find them with local knowledge only?

Efficiently navigable

A network is efficiently navigable iff there is an algorithm that can find a short path with only local knowledge, and the delivery time of the algorithm is bounded polynomially by log^k(n).

Example 4:
[Figure: a path through nodes A, B, C]
Efficiently navigable, if the algorithm knows it needs to go through A, B, C

J. Kleinberg. The small-world phenomenon: An algorithmic perspective. Proc. 32nd ACM Symposium on Theory of Computing, 2000. Also appears as Cornell Computer Science Technical Report 99-1776 (October 1999)

Navigability of Social Tagging Systems

Datasets        Annotations   Resources
Austria Forum        32,245      12,837
Bibsonomy           916,495     235,339
CiteULike         6,328,021   1,697,365

The datasets include established systems with many users and a new system with few users.

Navigable in theory (for Y ⊆ T × R): GC exists, low eff. diameter
Shrinking diameter over time, cf. [Leskovec et al. 2005]

But: how do
(i) the size of tag clouds and
(ii) the number of resources / tag
influence the navigability (X1) of social tagging systems?

[Leskovec et al. 2005] J. Leskovec, J.M. Kleinberg, C. Faloutsos: Graphs over time: densification laws, shrinking diameters and possible explanations. KDD 2005: 177-187

Modeling UI constraints

Tag Cloud Size n
number n of tags displayed per resource (with a topN algorithm)

Pagination of resources / tag
number k of resources displayed per tag (with reverse chronological ordering)
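The two UI constraints can be sketched as two small selection functions. The slides do not spell out the topN algorithm, so this sketch assumes it ranks a resource's tags by their overall frequency; the annotation data and function names are hypothetical.

```python
from collections import Counter, defaultdict

def tag_cloud(resource_tags, tag_frequency, n):
    """Top-n tags of one resource, most frequent first (ties broken by name).

    Assumption: "topN" means ranking by global tag frequency.
    """
    return sorted(resource_tags, key=lambda t: (-tag_frequency[t], t))[:n]

def first_page(tagged, k):
    """First k resources of a tag in reverse chronological order.

    `tagged` is a list of (timestamp, resource) pairs.
    """
    return [r for _, r in sorted(tagged, reverse=True)[:k]]

# Hypothetical annotations: (timestamp, tag, resource)
annotations = [
    (1, "web", "r1"), (2, "web", "r2"), (3, "web", "r3"),
    (4, "search", "r3"), (5, "graph", "r3"), (6, "web", "r4"),
]
tag_frequency = Counter(tag for _, tag, _ in annotations)
tags_of = defaultdict(set)        # resource -> tags
resources_of = defaultdict(list)  # tag -> [(timestamp, resource)]
for ts, tag, res in annotations:
    tags_of[res].add(tag)
    resources_of[tag].append((ts, res))

cloud_r3 = tag_cloud(tags_of["r3"], tag_frequency, n=2)
page_web = first_page(resources_of["web"], k=2)
```

With k = 2, the tag "web" links only to its two newest resources, so the older ones are no longer reachable through this tag.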

How UI constraints affect Navigability

Tag Cloud Size
Limiting the tag cloud size n to practically feasible sizes (e.g. 5, 10, or more) does not influence navigability (this is not very surprising).

Pagination
BUT: Limiting the out-degree of high frequency tags k (e.g. through pagination with resources sorted in reverse-chronological order) leaves the network vulnerable to fragmentation. This destroys navigability of prevalent approaches to tag clouds.

Findings

1. For certain specific, but popular, tag cloud scenarios, the so-called Navigability Assumption does not hold.
2. While we could confirm that tag-resource networks have efficient navigational properties in theory, we found that popular user interface decisions significantly impair navigability.

These results make a theoretical and an empirical argument against existing approaches to tag cloud construction.

How can we recover navigability of social tagging systems?

Recovering Navigability in Social Tagging Systems

Instead of reverse-chronological ordering of resources, we apply a naive random ordering.

Based on this observation, we have developed ordering algorithms that balance semantic and navigational aspects, e.g. [Trattner et al. 2010]

[Trattner et al. 2010] C. Trattner, M. Strohmaier, and D. Helic. Improving navigability of hierarchically-structured encyclopedias through effective tag cloud construction. In 10th International Conference on Knowledge Management and Knowledge Technologies I-KNOW 2010, Graz, Austria, 2010.
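The difference between the two orderings can be illustrated by counting which resources of one tag are ever shown. This is only one possible reading of "naive random ordering": the sketch assumes a fresh random sample of k resources on every page view. The data and function names are hypothetical.

```python
import random

def reverse_chronological_page(tagged, k):
    """Always the k newest resources; `tagged` holds (timestamp, resource)."""
    return [r for _, r in sorted(tagged, reverse=True)[:k]]

def random_page(tagged, k, rng):
    """k resources drawn uniformly at random, without replacement."""
    resources = [r for _, r in tagged]
    return rng.sample(resources, min(k, len(resources)))

def reachable(page_fn, views):
    """Resources shown at least once over repeated views of the tag page."""
    seen = set()
    for _ in range(views):
        seen.update(page_fn())
    return seen

# Hypothetical tag with 10 resources, r1 (oldest) to r10 (newest).
tagged = [(ts, f"r{ts}") for ts in range(1, 11)]
rng = random.Random(42)
chrono = reachable(lambda: reverse_chronological_page(tagged, 3), views=200)
rand = reachable(lambda: random_page(tagged, 3, rng), views=200)
```

Under reverse-chronological pagination only the three newest resources are ever linked from the tag, whereas random pages cover the whole set over repeated views.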

Navigating Networks

How can we model user navigation on networks?

Experiment [Milgram] (1933-1984)

Goal
• Define a single target person and a group of starting persons
• Generate an acquaintance chain from each starter to the target

Experimental Set Up
• Each starter receives a document
• and was asked to begin moving it by mail toward the target
• Information about the target: name, address, occupation, company, college, year of graduation, wife's name and hometown
• Information about relationship (friend/acquaintance) [Granovetter 1973]

Constraints
• The starter group was only allowed to send the document to people they know and
• was urged to choose the next recipient in a way as to advance the progress of the document toward the target

Introduction

The simplest way of formulating the small-world problem is: Starting with any two people in the world, what is the likelihood that they will know each other?

A somewhat more sophisticated formulation, however, takes account of the fact that while person X and Z may not know each other directly, they may share a mutual acquaintance - that is, a person who knows both of them. One can then think of an acquaintance chain with X knowing Y and Y knowing Z. Moreover, one can imagine circumstances in which X is linked to Z not by a single link, but by a series of links, X-A-B-C-D…Y-Z. That is to say, person X knows person A who in turn knows person B, who knows C… who knows Y, who knows Z.

[Milgram 1967, according to http://www.ils.unc.edu/dpr/port/socialnetworking/theory_paper.html#2]

An Experimental Study of the Small World Problem [Travers and Milgram 1969]

A social network experiment tailored towards
• demonstrating
• defining
• and measuring
inter-connectedness in a large society (USA)

A test of the modern idea of "six degrees of separation", which states that every person on earth is connected to any other person through a chain of acquaintances not longer than 6.

Results I

• How many of the starters would be able to establish contact with the target?
  – 64 out of 296 reached the target
• How many intermediaries would be required to link starters with the target?
  – Well, that depends: the overall mean is 5.2 links
  – Through hometown: 6.1 links
  – Through business: 4.6 links
  – Boston group faster than Nebraska groups
  – Nebraska stockholders not faster than Nebraska random
• What form would the distribution of chain lengths take?

Results III

• Common paths
• Also see: Gladwell's "Law of the few"

Follow up work (2008)

http://arxiv.org/PS_cache/arxiv/pdf/0803/0803.0939v1.pdf

– Horvitz and Leskovec study 2008
– 30 billion conversations among 240 million people of Microsoft Messenger
– Communication graph with 180 million nodes and 1.3 billion undirected edges
– Largest social network constructed and analyzed to date (2008)

Decentralized Search

Goal: Navigate from START to TARGET using local and background knowledge only.

Background knowledge: a tag hierarchy.

Idea: use folksonomies (Folksonomy 1, Folksonomy …, Folksonomy n) as background knowledge. Then, the performance of decentralized search depends on the suitability of folksonomies. In other words, we can evaluate the suitability of folksonomies for decentralized search through simulations.

[Figure: a (tag-tag) network with start and target nodes and candidate next hops, next to the background knowledge (a tag hierarchy) showing the shortest path to the target]
• shortest path found with local knowledge: pLK = 4
• shortest path with global knowledge: pGK = 3
• Δ = pLK - pGK

J. Kleinberg. The small-world phenomenon: An algorithmic perspective. Proc. 32nd ACM Symposium on Theory of Computing, 2000. Also appears as Cornell Computer Science Technical Report 99-1776 (October 1999)
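A greedy variant of decentralized search can be sketched as follows. This is an illustration under assumptions, not the simulator used for the results: at each step the agent moves to the unvisited neighbour that is closest to the target in the tag hierarchy. The network, the hierarchy and all function names are hypothetical, chosen so that the numbers match pLK = 4 and pGK = 3.

```python
from collections import deque

def hierarchy_distance(parent, a, b):
    """Tree distance between tags a and b; `parent` maps child -> parent."""
    def path_to_root(tag):
        path = [tag]
        while tag in parent:
            tag = parent[tag]
            path.append(tag)
        return path
    up_b = {tag: i for i, tag in enumerate(path_to_root(b))}
    for i, tag in enumerate(path_to_root(a)):
        if tag in up_b:                 # lowest common ancestor
            return i + up_b[tag]
    return float("inf")

def greedy_search(adj, parent, start, target, max_hops):
    """Hops needed with local + background knowledge, or None on failure."""
    current, visited = start, {start}
    for hops in range(max_hops + 1):
        if current == target:
            return hops
        candidates = [v for v in adj[current] if v not in visited]
        if not candidates:
            return None
        current = min(candidates,
                      key=lambda v: (hierarchy_distance(parent, v, target), v))
        visited.add(current)
    return None

def shortest_path_length(adj, start, target):
    """Shortest path with global knowledge (breadth-first search)."""
    dist, queue = {start: 0}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist.get(target)

# Hypothetical tag hierarchy (background knowledge) and tag-tag network.
parent = {"a": "root", "b": "root", "c": "root",
          "a1": "a", "b1": "b", "c1": "c", "c2": "c"}
adj = {"a1": {"a", "c1"}, "a": {"a1", "root"}, "root": {"a", "b"},
       "b": {"root", "b1"}, "b1": {"b", "c2"},
       "c1": {"a1", "c2"}, "c2": {"c1", "b1"}}

p_lk = greedy_search(adj, parent, "a1", "b1", max_hops=10)
p_gk = shortest_path_length(adj, "a1", "b1")
delta = p_lk - p_gk
```

Here the hierarchy steers the agent along a1, a, root, b, b1, while the globally shortest route uses the shortcut through c1 and c2 that the hierarchy does not reflect.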

Evaluating Hierarchical Structures in Networks

How can we measure the efficiency of hierarchical structures for navigation?


The World Wide Web (1990-2000)

How efficient is this as a navigational aid?


Construction of hierarchies from unstructured tagging data

From tag centrality to tag generality:
• high tag centrality: more abstract
• low tag centrality: more specific

Other existing folksonomy algorithms: k-means, affinity propagation, …

[Heymann and Garcia-Molina 2006]
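The idea of turning centrality into generality can be sketched in a few lines. This is a simplified illustration, not the algorithm from the cited work: it assumes degree centrality in the tag co-occurrence graph, processes tags from most to least central, and attaches each tag to the already placed tag it co-occurs with most often. The data and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(tags_per_resource):
    """Count how often two tags annotate the same resource."""
    counts = defaultdict(lambda: defaultdict(int))
    for tags in tags_per_resource:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def build_hierarchy(tags_per_resource):
    """Return child -> parent; more central tags end up closer to ROOT."""
    counts = cooccurrence(tags_per_resource)
    all_tags = {t for tags in tags_per_resource for t in tags}
    # Degree centrality: number of distinct co-occurring tags (ties by name).
    order = sorted(all_tags, key=lambda t: (-len(counts[t]), t))
    parent, placed = {}, []
    for tag in order:
        linked = [p for p in placed if counts[tag].get(p, 0) > 0]
        if linked:
            # Most similar placed tag; ties go to the more central one.
            parent[tag] = max(
                linked, key=lambda p: (counts[tag][p], -placed.index(p))
            )
        else:
            parent[tag] = "ROOT"
        placed.append(tag)
    return parent

# Hypothetical tagging data: one list of tags per resource.
tags_per_resource = [
    ["web", "search"],
    ["web", "graph"],
    ["web", "search", "ranking"],
    ["graph", "network"],
]
hierarchy = build_hierarchy(tags_per_resource)
```

In this toy data "web" is the most central tag and becomes the most abstract node, while "network" is attached below "graph" as a more specific tag.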

Evaluation Framework

[Figure: Folksonomy 1, Folksonomy …, Folksonomy n are fed into two evaluations]
• Performance evaluation (decentralized search simulation): which folksonomy performs best on a given navigational task
• Explanatory evaluation (click data): which folksonomy explains actual user behavior best

Success Rates Across Different Folksonomies (flickr dataset)

[Figure: success rate over max hops n for tag generality approaches, k-means / affinity propagation, and a random folksonomy]

Success rate: the number of times an agent is successful in finding a path using a particular folksonomy as background knowledge.

max hops n: the maximal number of steps an agent is allowed to perform before stopping (a tunable parameter, e.g., an agent only follows n links).

• All approaches outperform a random folksonomy
• Tag generality approaches outperform k-means / Aff. Propagation

Success Rates Across Different Datasets

Holds for all datasets (to diff. extents)

But how efficient are those folksonomies during search?

Efficiency: how often does an agent not find the global shortest path, but some other path that is longer?

Stretch Δ = pLK - pGK

Shortest Paths found with Local Knowledge

[Figure: stretch distributions per folksonomy, e.g. Bibsonomy, K-Means]
• Finds no path: Δ = infinite
• Finds a path that is +1 longer: Δ = 1
• Finds shortest possible path: Δ = 0

Tag generality approaches (d+e) find much shorter paths!

Holds for all datasets (to diff. extents)
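Success rate and stretch can both be computed from the outcome of simulated searches. The sketch below assumes each run is recorded as a pair (pLK, pGK), with pLK set to None when the agent finds no path, and it reports the success rate as a fraction of runs rather than a raw count. The run data and function names are hypothetical.

```python
import math
from collections import Counter

def stretch(p_lk, p_gk):
    """Δ = pLK - pGK, or infinity if no path was found with local knowledge."""
    return math.inf if p_lk is None else p_lk - p_gk

def summarize(runs):
    """Return (success rate, histogram of Δ) for a list of (pLK, pGK) runs."""
    deltas = [stretch(p_lk, p_gk) for p_lk, p_gk in runs]
    success_rate = sum(d != math.inf for d in deltas) / len(deltas)
    return success_rate, Counter(deltas)

# Hypothetical simulation results: (pLK, pGK) per start/target pair.
runs = [(4, 3), (3, 3), (None, 2), (5, 3)]
success_rate, histogram = summarize(runs)
```

A folksonomy that is better suited as background knowledge shifts the histogram toward Δ = 0 and raises the success rate.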

Conclusions

• Decentralized search (Dsearch) as a natural model of user navigation on the web
• Emergence of dynamic, user-generated links reduces control
• Empirical studies and new algorithms are needed to recover important system properties


End of Presentation

Acknowledgements
