Knowledge Management Institute
Link Analysis and Decentralized Search
Markus Strohmaier, Denis Helic
Multimediale Informationssysteme II
Markus Strohmaier 2011
The Memex (1945)
The Memex [Bush 1945]:
A mechanized private library for individual use
Mimics associative memory, where users can
– insert documents
– navigate documents
– retrieve documents
– build trails through documents
Operated and maintained individually
But trails can be shared socially, e.g.
(i) a user A can send a trail to user B
(ii) user B modifies and shares it with user C
(iii) user C uses the trail for navigation
C's interaction with documents is mediated by users A and B
[Bush 1945] V. Bush. As We May Think. Atlantic Monthly, 1945.
Web based Retrieval: Challenges
Working with an enormous amount of data
10 billion pages at 500 kB each, estimated in 01-2004
– 2 pages / person on the globe
20 times larger than the Library of Congress (LoC) print collection
– estimated in 2003
Furthermore there is a Deep Web
– 550 billion pages estimated in 2004
Web based Retrieval: Challenges
Example for the amount of web pages:
– Searching for 'Star Trek' yielded about 11 million results on Google [Nov 2007]
– Ordinary users investigate 20-30 result list entries.
What web page is the most interesting?
How to store an index (inverted file) with this size?
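To make the term "inverted file" concrete: it maps each term to the documents that contain it, so a query only touches the postings of its own terms. The following is a minimal in-memory sketch, assuming a toy corpus and whitespace tokenization; the names `build_inverted_index` and `search` are illustrative and do not come from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Toy corpus (hypothetical data, for illustration only).
docs = {
    1: "Star Trek episode guide",
    2: "Star Wars fan page",
    3: "Trek through the Alps",
}
index = build_inverted_index(docs)
```

At web scale the postings would not fit in memory and would need compression and distribution across machines, which is exactly the storage challenge the slide raises.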
Web based Retrieval: Challenges
The Web is highly dynamic
Study by Cho & Garcia-Molina (2002):
– 40% of the web pages changed their dataset within a week
– 23% of the .com pages changed on a daily basis
Study by Fetterly et al. (2003):
– 35% of the pages changed during investigations
– Larger web pages change more often
Web based Retrieval: Challenges
The Web is self-organized
No central authority (for the WWW) or main index
Everyone can add (even edit) pages
Pages disappear on a regular basis
– A US study claimed that in 2 investigated tech. journals 50% of the cited links were inaccessible after four years.
Lots of errors and falsehoods, no quality control
Web based Retrieval: Challenges
The Web is hyperlinked
Based on HTML markup tags and URIs
Pages are interconnected
– Unidirectional links (in-link, out-link, self-link)
Network structures emerge from the links
– Link analysis is possible
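As a first step of link analysis, the three link types named above can be counted directly from a list of hyperlinks. This is a minimal sketch, assuming links are given as (source, target) pairs; the function name `degree_counts` and the choice to tally self-links separately are assumptions made for illustration.

```python
from collections import Counter

def degree_counts(links):
    """Count in-links and out-links per page from (source, target) pairs.

    Self-links are tallied separately and excluded from both degree counts.
    """
    in_deg, out_deg, self_links = Counter(), Counter(), Counter()
    for src, dst in links:
        if src == dst:
            self_links[src] += 1
        else:
            out_deg[src] += 1
            in_deg[dst] += 1
    return in_deg, out_deg, self_links

# Hypothetical four-link web: a -> b, a -> c, b -> c, and c linking to itself.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "c")]
in_deg, out_deg, self_links = degree_counts(links)
```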
The World Wide Web (1990-2000)
A user's interaction with the web is mediated by (a few) editors and publishers
The World Wide Web Today (2010)
Interaction between individuals and computational systems is mediated by the aggregate behavior of massive numbers (millions) of users.
Social Computation influences system properties (X)
X = Utility, X = Findability, X = Navigability, X = Relevance
Emergent system properties are beyond the direct control of engineers.
New methods and algorithms for designing and shaping social-computational systems are needed.
It is through the process of social computation, i.e. the combination of social behavior and algorithmic computation, that desired and undesired system properties and functions emerge.
Example: X = Connectivity (of the web graph)
Questions:
• What is X like?
• What causes X?
[Figure: bow-tie architecture of the web [Broder et al. 2000]]
Example: X = Connectivity (of the web graph)
Questions:
• What is X like? Bow-tie architecture of the web [Broder et al. 2000]
• What causes X? Social mechanisms, such as preferential attachment [Barabasi and Albert 1999]
• How can we improve X? An open problem
Preferential attachment: the probability of a new vertex attaching to a vertex i with degree $k_i$ is
$\Pi(k_i) = \frac{k_i}{\sum_j k_j}$
where $k_i$ is the degree of vertex i and $\sum_j k_j$ is the sum of all vertices' degrees.
[Barabasi and Albert 1999] A.-L. Barabási, R. Albert. Emergence of Scaling in Random Networks. Science, Vol. 286, no. 5439, pp. 509-512, 15 October 1999.
[Broder et al. 2000] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In 9th International WWW Conference, 2000.
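Preferential attachment means that a new vertex picks an existing vertex i with probability proportional to the degree of i. The sketch below simulates this under simplifying assumptions that are not from the slides: the graph starts from a single edge and every new vertex adds exactly one edge. The names `attachment_probabilities` and `grow_network` are illustrative.

```python
import random

def attachment_probabilities(degrees):
    """Return k_i / sum_j k_j for every vertex i."""
    total = sum(degrees.values())
    return {v: k / total for v, k in degrees.items()}

def grow_network(n_new, seed=0):
    """Grow a graph from one edge; each new vertex attaches to one existing
    vertex chosen with probability proportional to that vertex's degree."""
    rng = random.Random(seed)
    degrees = {0: 1, 1: 1}   # seed graph: a single edge 0 - 1 (assumption)
    edges = [(0, 1)]
    for new in range(2, 2 + n_new):
        vertices = list(degrees)
        weights = [degrees[v] for v in vertices]
        target = rng.choices(vertices, weights=weights, k=1)[0]
        edges.append((new, target))
        degrees[target] += 1
        degrees[new] = 1
    return degrees, edges

probs = attachment_probabilities({"a": 3, "b": 1})
degrees, edges = grow_network(100)
```

Running `grow_network` with a larger `n_new` and plotting the degree histogram is one way to observe the "rich get richer" effect that the model is meant to capture.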
Analysis of Dynamic Links in Social Tagging Systems
How can navigability in social tagging systems be described and improved?
D. Helic, C. Trattner, M. Strohmaier and K. Andrews, On the Navigability of Social Tagging Systems, The 2nd IEEE International Conference on Social Computing (SocialCom 2010), Minneapolis, Minnesota, USA, 2010. (acceptance rate 33/245, 13.47% quota, nominated for Best Paper).
Structure of Social Tagging Systems: Definition
A folksonomy is a tuple F := (U, T, R, Y) where
• the three disjoint, finite sets U, T, R correspond to
– a set of persons or users u ∈ U
– a set of tags t ∈ T and
– a set of resources or objects r ∈ R
• Y ⊆ U × T × R, called set of tag assignments
[Figure: users, tags and resources connected by tag assignments; navigation via user 1, tag 1, res. 1]
[Hotho et al. 2006]
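The tuple definition translates almost directly into a data structure. The sketch below is one possible encoding, assuming U, T and R are derived from the tag assignments Y (so elements that never appear in an assignment are not represented); `Folksonomy`, `make_folksonomy` and `tag_resource_pairs` are hypothetical names.

```python
from typing import NamedTuple

class Folksonomy(NamedTuple):
    users: frozenset        # U
    tags: frozenset         # T
    resources: frozenset    # R
    assignments: frozenset  # Y: set of (user, tag, resource) triples

def make_folksonomy(assignments):
    """Derive U, T and R from a collection of (user, tag, resource) triples."""
    y = frozenset(assignments)
    return Folksonomy(
        users=frozenset(u for u, _, _ in y),
        tags=frozenset(t for _, t, _ in y),
        resources=frozenset(r for _, _, r in y),
        assignments=y,
    )

def tag_resource_pairs(f):
    """Project Y onto T x R by dropping the user component."""
    return {(t, r) for _, t, r in f.assignments}

f = make_folksonomy([
    ("user1", "tag1", "res1"),
    ("user2", "tag1", "res1"),
    ("user2", "tag2", "res2"),
])
```

The projection in `tag_resource_pairs` yields the tag-resource network on which tag cloud navigation operates.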
Tag Clouds are Assumed to be Efficient Tools for Navigation
The Navigability Assumption:
• An implicit assumption among designers of social tagging systems that tag clouds are specifically useful to support navigation.
• This has hardly been tested or critically reflected in the past.
Navigating tagging systems via tag clouds:
1. The system presents a tag cloud to the user.
2. The user selects a tag from the tag cloud.
3. The system presents a list of resources tagged with the selected tag.
4. The user selects a resource from the list of resources.
5. The system transfers the user to the selected resource, and the process potentially starts anew.
Navigating Y ⊆ T × R
Navigability
Informal Description: If / how quickly one can get from document A to document B in a hypertext system (more precise definition follows later)
Designing for Navigability: In traditional hypertext systems, this property used to be within the control of system designers
Defining Navigability
A network is navigable iff:
There is a path between all or almost all pairs of nodes in the network. [Kleinberg 1999]
Formally:
1. There exists a giant component: size(GC) > 0.9 * n
– a single connected component that accounts for a significant fraction of all nodes
2. The effective diameter d_eff is low: d_eff < log n
– d_eff = distance at which 90% of pairs of nodes are reachable
n … number of nodes in the network
[Kleinberg 1999] J. Kleinberg. The small-world phenomenon: An algorithmic perspective. Proc. 32nd ACM Symposium on Theory of Computing, 2000.
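The two conditions can be checked mechanically on a small undirected graph. The sketch below makes several assumptions that the slides leave open: the logarithm is taken to base 2, the effective diameter is computed as an integer 90th percentile over all reachable ordered pairs (without interpolation), and the graph is given as an adjacency dict. `bfs_distances` and `is_navigable` are illustrative names.

```python
import math
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def is_navigable(adj, gc_fraction=0.9, pair_fraction=0.9):
    """Check size(GC) > 0.9 * n and d_eff < log2(n) on an undirected graph."""
    n = len(adj)
    largest = 0
    distances = []
    for node in adj:
        dist = bfs_distances(adj, node)
        largest = max(largest, len(dist))      # size of this node's component
        distances.extend(d for d in dist.values() if d > 0)
    if largest <= gc_fraction * n or not distances:
        return False
    distances.sort()
    # Smallest distance within which at least 90% of reachable pairs lie.
    d_eff = distances[math.ceil(pair_fraction * len(distances)) - 1]
    return d_eff < math.log2(n)

# Three hypothetical 9-node graphs: a star, a path, and two separate paths.
star = {0: list(range(1, 9)), **{i: [0] for i in range(1, 9)}}
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 9] for i in range(9)}
split = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3],
         5: [6], 6: [5, 7], 7: [6, 8], 8: [7]}
```

With n = 9, the star passes both conditions, the path has a giant component but too large an effective diameter, and the split graph has no giant component.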
Navigability: Examples
Example 1:
Not navigable: No giant component
Example 2:
Not navigable: giant component, BUT avg. shortest path > log2(9)
Navigability: Examples
Example 3:
Navigable: Giant component AND avg. shortest path ≤ 2 < log2(9)
Is this efficiently navigable?
There are short paths between all nodes, but can an agent or algorithm find them with local knowledge only?
Efficiently navigable
A network is efficiently navigable iff:
There is an algorithm that can find a short path with only local knowledge, and the delivery time of the algorithm is bounded by a polynomial in log n, i.e. by log^k(n) for some constant k.
Example 4:
[Figure: network with nodes A, B and C]
Efficiently navigable, if the algorithm knows it needs to go through A, B, C
J. Kleinberg. The small-world phenomenon: An algorithmic perspective. Proc. 32nd ACM Symposium on Theory of Computing, 2000. Also appears as Cornell Computer Science Technical Report 99-1776 (October 1999)
Navigability of Social Tagging Systems (for Y ⊆ T × R)

Dataset          Annotations    Resources
Austria Forum         32,245       12,837
Bibsonomy            916,495      235,339
CiteULike          6,328,021    1,697,365

[Figure annotations: established systems, many users; new system, few users]
Navigable in theory: GC exists, low eff. diameter
Shrinking diameter over time, cf. [Leskovec et al. 2005]
But: how does
(i) the size of tag clouds and
(ii) number of resources / tag
influence the navigability (X1) of social tagging systems?
[Leskovec et al. 2005] J. Leskovec, J.M. Kleinberg, C. Faloutsos: Graphs over time: densification laws, shrinking diameters and possible explanations. KDD 2005: 177-187
Modeling UI constraints
Tag Cloud Size n
number n of tags displayed per resource (with a topN algorithm)
Pagination of resources / tag
number k of resources displayed per tag (with reverse chronological ordering)
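Both constraints can be modelled as filters on the tag-resource network. The sketch below is one possible reading, not the authors' implementation: it assumes annotations come as (tag, resource, timestamp) triples with numeric timestamps, that "topN" means the n globally most frequent tags of a resource, and that only the first result page (the k most recently annotated resources) is reachable from a tag. `apply_ui_constraints` is a hypothetical name.

```python
from collections import Counter, defaultdict

def apply_ui_constraints(annotations, n, k):
    """Return (cloud, pages): top-n tags per resource, k newest resources per tag."""
    annotations = list(annotations)
    tag_freq = Counter(tag for tag, _, _ in annotations)
    tags_of = defaultdict(set)
    newest = defaultdict(dict)
    for tag, res, ts in annotations:
        tags_of[res].add(tag)
        newest[tag][res] = max(ts, newest[tag].get(res, ts))
    # Tag cloud: the n most frequent tags of each resource (ties broken by name).
    cloud = {res: sorted(tags, key=lambda t: (-tag_freq[t], t))[:n]
             for res, tags in tags_of.items()}
    # Pagination: first page only, newest resources first (ties broken by name).
    pages = {tag: sorted(stamps, key=lambda r: (-stamps[r], r))[:k]
             for tag, stamps in newest.items()}
    return cloud, pages

# Hypothetical annotations: "web" is a high-frequency tag on three resources.
annotations = [
    ("web", "r1", 1), ("web", "r2", 2), ("web", "r3", 3),
    ("python", "r1", 4), ("rare", "r1", 5),
]
cloud, pages = apply_ui_constraints(annotations, n=2, k=2)
```

In this toy data the oldest resource `r1` no longer appears on the first page of the frequent tag `web`, which illustrates how pagination removes links from the navigable network.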
How UI constraints affect Navigability
[Figure panels: Tag Cloud Size; Pagination]
Limiting the tag cloud size n to practically feasible sizes (e.g. 5, 10, or more) does not influence navigability (this is not very surprising).
BUT: Limiting the out-degree of high frequency tags k (e.g. through pagination with resources sorted in reverse-chronological order) leaves the network vulnerable to fragmentation. This destroys navigability of prevalent approaches to tag clouds.
Findings
1. For certain specific, but popular, tag cloud scenarios, the so-called Navigability Assumption does not hold.
2. While we could confirm that tag-resource networks have efficient navigational properties in theory, we found that popular user interface decisions significantly impair navigability.
These results make a theoretical and an empirical argument against existing approaches to tag cloud construction.
How can we recover navigability of social tagging systems?
Recovering Navigability in Social Tagging Systems
Instead of reverse-chronological ordering of resources, we apply a naive random ordering.
Based on this observation, we have developed ordering algorithms that balance semantic and navigational aspects, e.g. [Trattner et al. 2010]
[Trattner et al. 2010] C. Trattner, M. Strohmaier, and D. Helic. Improving navigability of hierarchically-structured encyclopedias through effective tag cloud construction. In 10th International Conference on Knowledge Management and Knowledge Technologies I-KNOW 2010, Graz, Austria, 2010.
Navigating Networks
How can we model user navigation on networks?
Experiment [Milgram] (Stanley Milgram, 1933-1984)
Goal
• Define a single target person and a group of starting persons
• Generate an acquaintance chain from each starter to the target
Experimental Set Up
• Each starter received a document and was asked to begin moving it by mail toward the target
• Information about the target: name, address, occupation, company, college, year of graduation, wife's name and hometown
• Information about relationship (friend/acquaintance) [Granovetter 1973]
Constraints
• The starter group was only allowed to send the document to people they know and
• was urged to choose the next recipient in such a way as to advance the progress of the document toward the target
Introduction
The simplest way of formulating the small-world problem is: Starting with any two people in the world, what is the likelihood that they will know each other?
A somewhat more sophisticated formulation, however, takes account of the fact that while person X and Z may not know each other directly, they may share a mutual acquaintance - that is, a person who knows both of them. One can then think of an acquaintance chain with X knowing Y and Y knowing Z. Moreover, one can imagine circumstances in which X is linked to Z not by a single link, but by a series of links, X-A-B-C-D…Y-Z. That is to say, person X knows person A who in turn knows person B, who knows C… who knows Y, who knows Z.
[Milgram 1967, according to http://www.ils.unc.edu/dpr/port/socialnetworking/theory_paper.html#2]
An Experimental Study of the Small World Problem [Travers and Milgram 1969]
A Social Network Experiment tailored towards
• Demonstrating
• Defining
• And measuring
inter-connectedness in a large society (USA)
A test of the modern idea of "six degrees of separation", which states that every person on earth is connected to any other person through a chain of acquaintances not longer than 6
Results I
• How many of the starters would be able to establish contact with the target?
– 64 out of 296 reached the target
• How many intermediaries would be required to link starters with the target?
– Well, that depends: the overall mean was 5.2 links
– Through hometown: 6.1 links
– Through business: 4.6 links
– Boston group faster than Nebraska groups
– Nebraska stockholders not faster than Nebraska random
• What form would the distribution of chain lengths take?
Results III
• Common paths
• Also see: Gladwell's "Law of the few"
Follow up work (2008)
http://arxiv.org/PS_cache/arxiv/pdf/0803/0803.0939v1.pdf
– Horvitz and Leskovec study 2008
– 30 billion conversations among 240 million people of Microsoft Messenger
– Communication graph with 180 million nodes and 1.3 billion undirected edges
– Largest social network constructed and analyzed to date (2008)
Decentralized Search
Idea: use folksonomies as background knowledge. Then, the performance of decentralized search depends on the suitability of folksonomies.
In other words, we can evaluate the suitability of folksonomies for decentralized search through simulations.
Goal: Navigate from START to TARGET using local and background knowledge only
[Figure: a (tag-tag) network with start and target nodes and candidate next hops; background knowledge (a tag hierarchy) taken from Folksonomy 1 … Folksonomy n; shortest path found with local knowledge pLK = 4; shortest path with global knowledge pGK = 3; Δ = pLK-pGK]
J. Kleinberg. The small-world phenomenon: An algorithmic perspective. Proc. 32nd ACM Symposium on Theory of Computing, 2000. Also appears as Cornell Computer Science Technical Report 99-1776 (October 1999)
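A decentralized search simulation of this kind can be sketched in a few lines. The slides do not spell out the agent's decision rule, so the sketch assumes a greedy one: from the current tag, the agent moves to the unvisited neighbour that is closest to the target in the tag hierarchy, with ties broken by name. The hierarchy is assumed to be a single rooted tree given as a child-to-parent mapping, and all function names and the toy network are hypothetical.

```python
from collections import deque

def shortest_path_length(adj, start, target):
    """Global knowledge (pGK): BFS hop count, or None if target is unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return None

def tree_distance(parent, a, b):
    """Hops between a and b in a single rooted tree (child -> parent, root -> None)."""
    depth_a, d = {}, 0
    while a is not None:
        depth_a[a] = d
        a, d = parent[a], d + 1
    d = 0
    while b not in depth_a:       # climb from b until a common ancestor is hit
        b, d = parent[b], d + 1
    return depth_a[b] + d

def decentralized_search(adj, parent, start, target, max_hops):
    """Local knowledge (pLK): greedy walk guided by the hierarchy; None on failure."""
    current, visited, hops = start, {start}, 0
    while current != target:
        candidates = [nb for nb in adj[current] if nb not in visited]
        if not candidates or hops == max_hops:
            return None
        current = min(candidates,
                      key=lambda nb: (tree_distance(parent, nb, target), nb))
        visited.add(current)
        hops += 1
    return hops

# Toy tag-tag network: a 3-hop route s-a-b-t and a 4-hop route s-c-d-e-t.
adj = {"s": ["a", "c"], "a": ["s", "b"], "b": ["a", "t"], "c": ["s", "d"],
       "d": ["c", "e"], "e": ["d", "t"], "t": ["b", "e"]}
# Toy hierarchy that places c, d and t under e, so c looks closer to t than a does.
parent = {"root": None, "s": "root", "a": "root", "b": "a",
          "e": "root", "c": "e", "d": "e", "t": "e"}

p_gk = shortest_path_length(adj, "s", "t")
p_lk = decentralized_search(adj, parent, "s", "t", max_hops=10)
delta = p_lk - p_gk
```

Here the hierarchy steers the agent onto the longer route, so the path found with local knowledge is one hop longer than the globally shortest path; lowering `max_hops` to 3 makes the same search fail.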
Evaluating Hierarchical Structures in Networks
How can we measure the efficiency of hierarchical structures for navigation?
The World Wide Web (1990-2000)
How efficient is this as a navigational aid?
Construction of hierarchies from unstructured tagging data
From tag centrality to tag generality:
– high tag centrality: more abstract
– low tag centrality: more specific
Other existing folksonomy algorithms: k-means, affinity propagation, …
[Heymann and Garcia-Molina 2006]
Evaluation Framework
[Figure: Folksonomy 1 … Folksonomy n are fed into two evaluations]
• Decentralized Search, Simulation: Performance Evaluation, i.e. which folksonomy performs best on a given navigational task
• Click-Data: Explanatory Evaluation, i.e. which folksonomy explains actual user behavior best
Success Rates Across Different Folksonomies
[Figure: flickr dataset; success rate over max hops n for tag generality approaches, k-means / affinity propagation, and a random folksonomy]
Success rate: The number of times an agent is successful in finding a path using a particular folksonomy as background knowledge
max hops n: the maximal number of steps an agent is allowed to perform before stopping (a tunable parameter, e.g., an agent only follows n links).
• All approaches outperform a random folksonomy
• Tag generality approaches outperform k-means / Aff. Propagation
Success Rates Across Different Datasets
Holds for all datasets (to diff. extents)
But how efficient are those folksonomies during search?
Efficiency: how often does an agent not find the global shortest path, but some other path that is longer?
Stretch Δ = pLK-pGK
Shortest Paths found with Local Knowledge
[Figure: distribution of Δ, e.g. Bibsonomy, K-Means]
• Finds no path: Δ = infinite
• Finds a path that is +1 longer: Δ = 1
• Finds shortest possible path: Δ = 0
Tag generality approaches (d+e) find much shorter paths!
Holds for all datasets (to diff. extents)
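The stretch values from many simulated searches can be summarized as a distribution over Δ. The following is a minimal sketch, assuming each run is recorded as a pair (pLK, pGK) with pLK set to `None` when the agent found no path; `stretch` and `stretch_distribution` are illustrative names and the run data is made up.

```python
import math
from collections import Counter

def stretch(p_lk, p_gk):
    """Delta = pLK - pGK; infinite when local search found no path (p_lk is None)."""
    return math.inf if p_lk is None else p_lk - p_gk

def stretch_distribution(runs):
    """Fraction of runs per Delta value, for a list of (p_lk, p_gk) pairs."""
    counts = Counter(stretch(p_lk, p_gk) for p_lk, p_gk in runs)
    return {delta: c / len(runs) for delta, c in counts.items()}

# Hypothetical results of four searches.
runs = [(3, 3), (4, 3), (None, 2), (2, 2)]
dist = stretch_distribution(runs)
```

A folksonomy whose distribution concentrates on Δ = 0 lets the agent find globally shortest paths most of the time, while mass at infinity corresponds to failed searches.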
Conclusions
• Decentralized search as a natural model of user navigation on the web
• Emergence of dynamic, user-generated links reduces control
• Empirical studies and new algorithms are needed to recover important system properties