Innovation through Understanding of the Data and
the Human Behaviour
June 12, 2008
Natasa Milic-Frayling
Microsoft Research Cambridge
Presentation at Jozef Stefan Institute, Ljubljana, Slovenia – June 12’08
Web Site Structure Analysis: Concepts, Algorithms and Evaluation Issues
Eduarda Mendes Rodrigues† Natasa Milic-Frayling†
Martin Hicks Blaz Fortuna‡
† Microsoft Research, Cambridge, UK
‡ Institute Jožef Stefan, Slovenia
Outline
Part I
– Navigation support: research in Web navigation; objectives and overview of the LSG approach
– Site structure model: concepts supported by the user study; definition and application of the LSG model for Web site structure analysis
Part II
– Detection of subsites: LSG method for partitioning Web sites into subsites; identification of subsite entry pages
– Evaluation issues: challenges in the evaluation of subsites; evaluation methodology, issues and guidelines
Supporting Search and Navigation
Users often use navigation as a complement to search:
preference for navigation over search
information need is clear to the user, but queries are not formulated appropriately (short and ambiguous queries, user’s inexperience with search, wrong terminology)
information need is vague or ambiguous – navigation is used for exploration of content and refining the need
Site structure representation:
navigation aid
context for search results
Objectives
Represent and analyse the navigational and content structure of individual Web sites
Identify fine-grained site boundaries and define the scope of sub-sites for a particular application
Characterize the relationship between site structure, content and usage of the Web sites
Web Link Graph
Nodes represent Web pages
Types of links and association of links are not represented
[Figure: page p1 and the Web link graph connecting it to its target pages p2, p3, ..., pk, pk+1, pk+2, ..., pn]
LSG – Link Structure Graph
Nodes of the LSG are link blocks and the edges represent a containment relationship
LSG captures page-level organization of links and the overall link structure
[Figure: page p1 with a structural block g1 and a content block g2; the LSG shows the containers of g1 and g2 (p1), the targets of g1 (p2, p3, ..., pk) and the targets of g2 (pk+1, pk+2, ..., pn)]
Concept Validation: Exploratory User Study
Objective:
– to identify notions that Web users may have about sites, organization of pages and functions of hyperlinks
Motivation:
– Related work on Web page analysis has included user evaluations of algorithms, but very few studies have explored how users perceive Web pages and the hyperlinks that connect them
We focused on three aspects:
– do Web users perceive and understand different types of links?
– are Web users able to detect associations between links present on the page?
– do Web users consider some links to be more important than others?
Study design
We recruited 14 participants (9 male, 5 female); all confirmed that they regularly use the Internet, with an average reported usage of 25 hours/week
[Figure: number of sites per size range (<1000, 1000-10000, 10001-100000, >1000000), with series for MSN and Yahoo!; x-axis: Size, y-axis: Number of Sites]
TOPICS: SUB-TOPICS
Arts: directory, literature, television
Computers: internet, software, graphics
Health: conditions, diseases, occupational & safety
News: newspapers, weather
Reference: libraries, education
Science: institution, math, earth sciences
Society: issues, government, law
A sample of 21 sites was selected from 7 top-level topic categories of the ODP directory (http://dmoz.org) - the sample included sites from four different domains (.com, .edu, .gov, .org) and of heterogeneous sizes
Study design (contd.)
Session 1
– Participants freely navigated 3 Web sites (approx. 5 minutes each)
– For each site, participants were prompted to elaborate on site organization and the importance of links shown on the page, and to estimate the size of the site
Session 2 (this session was video recorded)
– Participants were shown 2 printed pages from each of the same sites and were asked to consider if the links on the pages could be grouped, e.g. by content, functionality, etc.
– They were encouraged to discuss their impressions of the pages and to group and label links on the printed pages
Session 3
– Participants were shown the same pages as in session 2 via a computer-based application, which detects links and highlights link blocks on the page, and were prompted for feedback on the detected link blocks and their prominence on the page
Study design (contd.)
Page analyser application used in Session 3 of the study
• For each page, the program prompted users to:
– judge if links formed a coherent group, a menu
– indicate if some of the links had been missed out
– rank the prominence of links on the page
Session 1 observations
Based on their initial impressions and limited browsing, participants:
Rated 24 out of 42 visited sites as providing well organised content
Considered 30 out of 42 visited sites as having links of varying importance in terms of content and navigation:
an analysis of participants’ written comments revealed their opinions to be influenced by page layout (presentation of information, presence or absence of sidebars, screen clutter)
Correctly estimated size ranges for 23 of 42 sites (based on the size estimates obtained from the three search engines: Google, MSN and Yahoo!)
Session 2 observations
[Figure: number of pages per link type named by participants: content or topic; navigation; administrative; other (general purpose, housekeeping, internal/external); y-axis: Number of Pages, 0 to 70]
Analysis of the users' comments about the types of links on the pages revealed that participants characterised them as:
– content or topic (relating to the content of the site)
– navigation (for moving to other parts of the site or to external sites)
– administrative (referring to company information, privacy policy, sitemaps)
– general purpose, ‘housekeeping’ links, internal/external
Issues: Variability in Terminology and Mental Models
While grouping links, participants showed individual differences in how they were influenced by the layout and information presented to them, e.g.
– some participants used different terms for describing the same type of links: links at the bottom of a page (company information, privacy policy, site maps) were independently referred to as ‘administration’, ‘bureaucracy’, and ‘footnotes’
– two participants revealed that they ignored some content on the right side of the page; one elaborated: ‘I expect the most important links to be shown on the left as I naturally read left to right’
• This participant grouped the links on the page according to particular categories
• While the labelling suggests the links are content links, the participant regarded these as navigation links
• This participant referred to links at the top of the page as a ‘main menu’; commenting whether the PBS logo represented a link to a home page
• He further commented that the sidebar panes ‘were not very good’; he was not able to tell what function they served or whether they were related
• Note, he also highlighted links at the bottom of the page as ‘technology’ related links, which included links to news RSS feeds & podcasts, and also ‘smallprint & legalise’ links
• The same participant did not immediately acknowledge that this page was from the same site due to its different appearance from the previous page (home page)
• He noted that the only correlation between the two pages is the menu; commented on how the main content links on the page appeared very similar with no main headings
Issues: Consistency Across Sessions
Cross-session analysis revealed both correlation and discrepancy in participants’ comments across sessions, e.g.:
– One participant noted the importance of links on a page for one site during sessions 1 and 2 but did not rate them as prominent on the page analyzer application in session 3
– Conversely, another participant referred to the importance of ‘content’ and ‘navigation’ links for 2 sites during session 2 and rated these links as prominent when using the page analyzer in session 3
– Five participants ranked certain links on the page as more important than others during sessions 1 and 2. Page analyzer logs revealed that they had also rated these links as prominent or very prominent, and in some cases, assigned the same attributes to the groups of links
Summary of Findings
Perceived importance of links: users could articulate and elaborate on different functions and importance of links on pages
Structure of the page: layout, organization of content, and location of links influence users’ perception of the function of links and of site usability
Structure of links: participants could outline groupings of links and refer to their functions, and in some instances, consistently assigned importance rating to the groups of links
Link categories and differences in terminology: some common links were referred to using different terminology, although findings also show commonality in the links identified
Web site size estimates: although only briefly exposed to site content, users were able to provide estimates of size based on various visual and organizational aspects of the pages they visited
Site Structure Representation
Basic concepts we use for link structure analysis:
– Structural link blocks: organizational and navigational link blocks typically repeated across pages with the same layout and underpinning the organization of the site
– Content link blocks: expected to be grouped by content associations, unlikely to be repeated across pages and point to information resources
– Isolated links: links that are not part of a link group and may be only loosely related with each other
Link Structure Graph (LSG)
We defined a Link Structure Graph (LSG) to capture:
– the organization of links at the page level
– the overall hyperlink structure of the site
The graph includes 3 types of nodes:
– s-nodes (structural link blocks)
– c-nodes (content link blocks)
– i-nodes (bag of isolated links on the page)
LSG Algorithm
Step 1
– Analyse the layout of individual pages by parsing the HTML Document Object Model (DOM) structure
– Based on the DOM paths, identify candidate link blocks and the remaining isolated nodes
Step 2
– Classify the link blocks into s-nodes and c-nodes based on their re-usability across pages
Step 3
– Connect the LSG nodes with directed edges, each of which represents a containment relationship between target pages of the source node and the destination link block
– The edges are weighted to completely preserve the information about in- and out-links of individual pages
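Steps 1 and 2 can be illustrated with a minimal sketch. It assumes that a candidate link block is any group of two or more links sharing the same DOM path, and that a block counts as structural when the same set of targets appears on at least two pages. Both rules, the `min_pages` threshold and all names below are hypothetical simplifications, not the authors' implementation; Step 3 (weighted edges) is not covered.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags without a closing tag


class LinkPathParser(HTMLParser):
    """Records, for every <a href>, the DOM path of its parent element."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags
        self.links = defaultdict(list)   # DOM path -> list of hrefs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links["/".join(self.stack)].append(href)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def candidate_blocks(html):
    """Step 1 (assumed rule): links sharing a DOM path form a candidate block."""
    parser = LinkPathParser()
    parser.feed(html)
    blocks = [frozenset(h) for h in parser.links.values() if len(h) >= 2]
    isolated = [h[0] for h in parser.links.values() if len(h) == 1]
    return blocks, isolated


def classify_blocks(pages, min_pages=2):
    """Step 2 (assumed rule): blocks reused on >= min_pages pages are s-nodes."""
    containers = defaultdict(set)        # block -> pages that contain it
    for url, html in pages.items():
        blocks, _ = candidate_blocks(html)
        for block in blocks:
            containers[block].add(url)
    s_nodes = {b: p for b, p in containers.items() if len(p) >= min_pages}
    c_nodes = {b: p for b, p in containers.items() if len(p) < min_pages}
    return s_nodes, c_nodes


# Hypothetical two-page site sharing one menu.
MENU = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
pages = {
    "/a": "<html><body>" + MENU + '<p><a href="/x">x</a> <a href="/y">y</a></p></body></html>',
    "/b": "<html><body>" + MENU + '<p><a href="/z">z</a></p></body></html>',
}
s_nodes, c_nodes = classify_blocks(pages)
```

In this toy site the shared menu ends up as an s-node contained in both pages, the pair of links on the first page becomes a c-node, and the single link on the second page is left as an isolated link.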
LSG Applications: Site Structure Analysis
• Analyse the structural properties of the sample sites selected for the user study
– variability with size of site, topic and domain
– possible correlation with users’ comments
• Analyse the incremental generation of LSGs using different crawling strategies
Characterization of the Sample Sites
Despite differences in size, topic and domain, for most sites:
– more than 50% of the pages reside up to 3 directory levels down from the root directory
[Figure: ratio of sites with at least p% pages below fixed level, for p = 25%, 50%, 75%; x-axis: Directory Level (0 to 9+), y-axis from 0 to 1]
– more than 50% of the pages are 3 to 5 clicks away from the home page
[Figure: ratio of sites with at least p% pages at a fixed depth, for p = 25%, 50%, 75%; x-axis: Depth Level (0 to 9+), y-axis from 0 to 1]
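The two measures above, directory level and click depth, could be computed roughly as follows. This is a sketch under assumptions: the directory level is taken as the number of path directories in the URL, and the click depth as the breadth-first distance from the home page over a hypothetical `links` dictionary mapping each page to the pages it links to.

```python
from collections import deque
from urllib.parse import urlparse


def directory_level(url):
    """Number of directories between the site root and the page file."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # The last segment is treated as the file name unless the path ends with "/".
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)


def click_depths(links, home):
    """Breadth-first search: minimum number of clicks from the home page."""
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical link structure of a four-page site.
links = {"/": ["/a", "/b"], "/a": ["/c"], "/c": ["/"]}
depths = click_depths(links, "/")
```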
Analysis of LSG Properties
[Figure: s-nodes and c-nodes plotted by ratio of pages that contain link blocks (x-axis) against ratio of pages accessible from link blocks (y-axis), both from 0 to 1]
Template covers a larger structure
Content link blocks found in many pages
Few content blocks point to many pages (e.g. sitemap)
Sites with very little structure
Same template spread across many pages, but touching few pages
Analysis of LSG Properties (contd.)
• Strongly Connected Components (SCCs) of the LSG:
– sub-graphs of the LSG
– there is a path between every pair of nodes
– e.g. navigation menus which are part of the same template
[Figure: directed LSG with nodes A to G; SCC = {A,B,C,D}]
Analysis of LSG Properties (contd.)
• Strongly Connected Components (SCCs) of the LSG, ignoring edge direction:
– there is a path between every pair of nodes
[Figure: undirected LSG with nodes A to G; SCC = {A,B,C,D,E,F,G}]
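The difference between the directed and undirected cases can be reproduced on a small example. The sketch below uses Kosaraju's algorithm, a standard SCC method that the slides do not name, and a hypothetical edge list chosen only so that the components match the two figures; the actual edges of the example graph are not recoverable from the slides.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; `graph` maps each node to a list of successors."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}

    def dfs(start, adjacency, visited, out):
        # Iterative depth-first search appending nodes in post-order.
        stack = [(start, iter(adjacency.get(start, [])))]
        visited.add(start)
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(adjacency.get(nxt, []))))
                    break
            else:
                stack.pop()
                out.append(node)

    # First pass: finishing order on the original graph.
    order, visited = [], set()
    for node in sorted(nodes):
        if node not in visited:
            dfs(node, graph, visited, order)

    # Second pass: explore the reversed graph in decreasing finishing order.
    reverse = {}
    for u, targets in graph.items():
        for v in targets:
            reverse.setdefault(v, []).append(u)

    components, visited = [], set()
    for node in reversed(order):
        if node not in visited:
            component = []
            dfs(node, reverse, visited, component)
            components.append(set(component))
    return components


# Hypothetical edges: a cycle A-B-C-D plus a chain D-E-F-G.
directed = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A", "E"], "E": ["F"], "F": ["G"]}

# Undirected version: every edge is traversable both ways.
undirected = {}
for u, targets in directed.items():
    for v in targets:
        undirected.setdefault(u, []).append(v)
        undirected.setdefault(v, []).append(u)

directed_sccs = strongly_connected_components(directed)
undirected_sccs = strongly_connected_components(undirected)
```

With these edges the directed graph yields {A,B,C,D} plus three single-node components, while the undirected graph collapses into a single component.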
User Comments vs. LSG properties
• Comments:
‘It is not obvious how I can get at the content I want by hierarchically navigating menus’
‘I think this is a website where you need to know what you are looking for. There is a lot of work in reading the text to find out how you need to make the next step in finding what you want’
• LSG properties: very few pages of this site contain and are targeted by s-node and c-node link blocks
User Comments vs. LSG properties
• Comments:
‘I got very confused in which part I was in. The breadcrumbs said I was in ‘about us’ and I was looking at project information. I think I would use the search bar to get the information that I want.’
• LSG properties: this site has many s-node disconnected components. There are only 10 SCCs with more than 5 s-nodes each and those components only touch 4% of the pages
User Comments vs. LSG properties
• Comments:
‘Easy to navigate the required information, again through the use of grouping and in this case dual menu structure which effects the links available within the side bar which is useful’
‘The high-level topics and search bar are always available. More specific subtopics can be navigated with a panel that changes, suitably to the context’
• LSG properties: 80% of pages of this site contain s-nodes and about 20% of pages of the site are accessible through s-nodes. There are 17 SCCs with more than 5 s-nodes each and collectively touching around 5% of the pages (>1300 pages) of the site.
Identification of Subsites
Sites are often organized into several units of content, referring to a particular topic or function
Structure can be presented in terms of subsites
Connected structural link blocks expose the intrinsic organization of subsite content
Identification of Subsites
1. How to define the scope of a subsite?
Set of pages with a shared navigation mechanism, that are likely to present a consistent page style
LSG Strongly Connected Components (SCC):
Navigation of a subsite involving a sequence of clicks on distinct link blocks implies a path of connected s-nodes in the LSG.
Identifying SCCs isolates nodes that are contained in pages of the same subsite
[Figure: subsite pages]
Identification of Subsites
2. How to identify entry pages for a subsite?
Web page(s) that facilitate navigation around the subsite and are representative of the subsite content
Define appropriate subsite page scores
Select pages with the highest score
[Figure: entry page and subsite pages]
Page and Block Rank Scores
PageRank:
$$PR(p_i) = \frac{1-k}{|V(G)|} + k \sum_{p_j \in N(p_i)} \frac{PR(p_j)}{d(p_j)}$$
Probability that a user will navigate to a given page when randomly surfing the Web.
LSG block rank:
$$BR(g_i) = \frac{1-k}{|V(LSG)|} + k \sum_{g_j \in N(g_i)} \frac{BR(g_j)}{D(g_j)}$$
Probability that a user will see a link block on a page if randomly navigating the pages using only LSG link blocks.
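Both scores follow the usual random-surfer recurrence and can be approximated by power iteration. The sketch below assumes a damping value k = 0.85 (the slides do not state k), treats the denominator as the plain out-degree, and spreads the score of nodes without out-links evenly. For the block rank the same function would be applied to LSG nodes, which ignores the edge weights of the LSG and is therefore only a simplification.

```python
def rank(out_links, k=0.85, iterations=50):
    """Iterates R(i) = (1 - k)/|V| + k * sum over in-neighbours j of R(j)/d(j)."""
    nodes = sorted(set(out_links) | {v for t in out_links.values() for v in t})
    n = len(nodes)
    scores = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - k) / n for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if not targets:
                # Assumption: a node without out-links spreads its score evenly.
                for v in nodes:
                    new[v] += k * scores[u] / n
            else:
                for v in targets:
                    new[v] += k * scores[u] / len(targets)
        scores = new
    return scores


# Hypothetical three-page site: the home page links to a and b, which link back.
scores = rank({"home": ["a", "b"], "a": ["home"], "b": ["home"]})
```

The scores sum to one, and the home page, which receives links from both other pages, obtains the highest score.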
Entry Page Score
$$EPR(p_i) = \alpha \, PR_{site}(p_i) + \beta \, PR_{subsite}(p_i) + \gamma \, BR_{site}(g_i)$$
g_i is the s-node with the highest BR that is contained in page p_i
Experiments suggested α=3, β=2 and γ=1
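As an illustration, the entry page selection can be sketched as follows, assuming the entry page score is a weighted sum of the site-level PageRank, the subsite-level PageRank and the block rank of the best s-node on the page, with weights 3, 2 and 1. The score values and page names are hypothetical.

```python
def entry_page_scores(pr_site, pr_subsite, br_site, best_block,
                      alpha=3.0, beta=2.0, gamma=1.0):
    """Weighted sum of the three scores for every page of one subsite.

    best_block maps a page to the s-node with the highest block rank on it.
    """
    return {
        page: alpha * pr_site[page]
        + beta * pr_subsite[page]
        + gamma * br_site[best_block[page]]
        for page in pr_subsite
    }


# Hypothetical scores for a two-page subsite.
pr_site = {"p1": 0.10, "p2": 0.05}
pr_subsite = {"p1": 0.40, "p2": 0.60}
br_site = {"g1": 0.20, "g2": 0.10}
best_block = {"p1": "g1", "p2": "g2"}

epr = entry_page_scores(pr_site, pr_subsite, br_site, best_block)
entry_page = max(epr, key=epr.get)   # page with the highest score
```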
Using LSG for Site Structure Analysis
Data set: 20 Web sites from DMOZ*, heterogeneous in
– topic: covering 7 top-level DMOZ topic categories
– size: ranging from ~250 pages to ~40000 pages
Link block reach and spread:
– s-node reach reveals the coverage of content through structural links
– s-node spread reveals how widespread the use of a particular template is across site pages
In-link degree distribution
* Open directory: http://dmoz.org
High variability across sites
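Reach and spread can be read as two ratios over the pages of a site. The sketch below assumes spread is the share of site pages that contain a block and reach is the share of site pages the block links to; the function name and the example data are hypothetical.

```python
def spread_and_reach(containers, targets, site_pages):
    """Spread: share of site pages containing the block; reach: share linked from it."""
    total = len(site_pages)
    spread = len(containers & site_pages) / total
    reach = len(targets & site_pages) / total
    return spread, reach


# Hypothetical 10-page site with a menu shown on 8 pages and linking to 2 pages.
site_pages = {f"p{i}" for i in range(1, 11)}
menu_containers = {f"p{i}" for i in range(1, 9)}
menu_targets = {"p1", "p2"}

spread, reach = spread_and_reach(menu_containers, menu_targets, site_pages)
```

A block with high spread and low reach, as here, corresponds to a template that is repeated across many pages but touches few of them.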
Evaluation Issues
Main issues with the evaluation of the detected subsites:
Evaluation of subsites and entry pages requires manual inspection of all the pages of the site, which is impractical
Representative set of web sites should be used to evaluate the algorithms
Pilot study with 2 of the sites from our sample to gain further insight into the complexity of the evaluation task
Proposed evaluation methodology:
Pooling method to obtain candidate entry pages from multiple systems
Engage human assessors to browse each site and decide if the pages from the pool were entry pages of a subsite or not
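The pooling step amounts to taking the union of the candidates proposed by each system, so that every page is assessed once. A minimal sketch, with hypothetical source labels and URLs:

```python
def build_pool(candidates_by_source):
    """Union of candidate entry pages, remembering which sources proposed each page."""
    pool = {}
    for source, pages in candidates_by_source.items():
        for page in pages:
            pool.setdefault(page, set()).add(source)
    return pool


# Hypothetical candidates from three sources.
candidates = {
    "manual": ["/index.html", "/products/"],
    "index": ["/index.html", "/docs/index.html"],
    "epr": ["/products/"],
}
pool = build_pool(candidates)
```

Keeping the proposing sources alongside each page lets the assessments be mapped back to each system afterwards.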
Pool of Entry Pages for Assessment
Site A: www.artifice.com
       A     B     C     D     E
A      5     1     0     3     1
B            2     0     0     0
C                  4     1     1
D                       24     6
E                             10

Site B: www.sigmaxi.org
       A     B     C     D     E
A      6     0     6     2     0
B            2     2     0     0
C                114    10    15
D                      125    41
E                             70
A. Entry pages manually selected by experts,
B. Pages from the Web site included in the DMOZ directory
C. Index pages such as ‘index.*’ or ‘default.*’
D. First target page of all s-node link blocks
E. Top ranked page, according to the EPR score, for each subsite detected by the LSG decomposition into strongly connected components
Entry Page Assessments
Site A: www.artifice.com
                      Assessor J1
                      Yes    No     Total
Assessor J2   Yes     1      4      5
              No      5      24     29
              Total   6      28     34

Site B: www.sigmaxi.org
                      Assessor J1
                      Yes    No     Total
Assessor J2   Yes     7      17     24
              No      43     179    222
              Total   50     196    246
Pilot study included 2 human assessors (J1 and J2) who evaluated all entry pages from the pool
Simple GUI to display entry pages and input assessments (yes/no and confidence level in the assessment)
Good agreement on negative assessments, but not so good on the positive ones
Confidence levels on the assessments generally higher on site B
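The contrast between agreement on negative and positive assessments can be quantified with positive and negative specific agreement. This measure is not named in the slides and is used here only as one reasonable choice; the counts are read from the assessment table (rows: J2, columns: J1).

```python
def specific_agreement(both_yes, only_j1_yes, only_j2_yes, both_no):
    """Positive and negative specific agreement for two assessors."""
    disagreements = only_j1_yes + only_j2_yes
    positive = 2 * both_yes / (2 * both_yes + disagreements)
    negative = 2 * both_no / (2 * both_no + disagreements)
    return positive, negative


# Counts from the table: Site A (34 pages) and Site B (246 pages).
site_a = specific_agreement(both_yes=1, only_j1_yes=5, only_j2_yes=4, both_no=24)
site_b = specific_agreement(both_yes=7, only_j1_yes=43, only_j2_yes=17, both_no=179)
```

For both sites the negative agreement comes out near 0.85 and the positive agreement below 0.2.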
Results of the Entry Page Assessment
Site A: www.artifice.com
              A (manual)    B (DMOZ)      C (index*)    D (s-node)    E (EPR)
Assessor J1   20% / 17%     100% / 33%    25% / 17%     4% / 17%      20% / 33%
Assessor J2   20% / 11%     100% / 22%    25% / 11%     21% / 56%     20% / 22%
Total pages   5             2             4             24            10

Site B: www.sigmaxi.org
              A (manual)    B (DMOZ)      C (index*)    D (s-node)    E (EPR)
Assessor J1   83% / 8%      100% / 3%     49% / 93%     10% / 20%     20% / 22%
Assessor J2   67% / 17%     100% / 8%     19% / 92%     4% / 21%      13% / 38%
Total pages   6             12            114           125           70

Cells show P / R (P: set precision, R: set recall, relative to each assessor's own individual assessments)
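Set precision and recall for one candidate source against one assessor's positive judgements can be computed as below. The page identifiers are hypothetical; only the set sizes (24 candidates, 9 positives, 5 in common) are chosen for illustration.

```python
def set_precision_recall(candidates, positives):
    """Precision and recall of a candidate set against an assessor's 'yes' pages."""
    hits = len(candidates & positives)
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(positives) if positives else 0.0
    return precision, recall


# Hypothetical sets: 24 candidate pages, 9 pages judged as entry pages, 5 shared.
candidates = {f"page{i}" for i in range(24)}
positives = {f"page{i}" for i in range(5)} | {f"other{i}" for i in range(4)}

precision, recall = set_precision_recall(candidates, positives)
```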
Guidelines for Evaluation Support
Provide quick access to the pages in the vicinity of a given page (i.e., the parent, child and sibling nodes)
Provide visual cues such as page thumbnails of flexible size
Make the relationship between the URL and the links on the parent page explicit
Provide easy access to pages that have already been visited during evaluation (e.g. present a navigation trail)
Enable the assessors to customize presentation of candidate entry pages, i.e., as a sorted list, graph, etc.
Concluding Remarks
LSG model – enables in-depth analysis of Web sites and identification of subsites
Proposed a pooling method for gathering relevance judgements
Defined an evaluation methodology and presented guidelines to assist in the creation of a test data set
Future work: large scale evaluation of algorithms for subsite and entry page detection
Research Desktop Activity Based Computing
Eduarda Mendes Rodrigues Natasa Milic-Frayling†
Gabriella Kazai Gavin Smyth
Rachel Jones
Gerard Oleksik
Background
• Scholars, researchers
• Range of activities in order to accomplish tasks
– Gathering relevant sources of information
– Reading through the material
– Annotating and note taking
– Analyzing the material
– Communicating findings to colleagues
– Authoring publications
• Workflows of different styles
– Structured and unstructured
– Short-lived projects and life-long work
Research Desktop
• Research Desktop augments the standard desktop environment with concepts and designs that enable new ways of working and managing resources
• It provides support in four key areas:
– Activities
– Tools
– Library
– Notes
Research Desktop Activities
• Activity-centric content access
• Label (tag) related resources
• Activate a task or switch between multiple tasks
• Resume work
• Preserved state
• Activity monitor
• Toolbar plug-in
Tools
• Tools and services used in various contexts
• Brings tools to the user
• Examples:
– Document analysis
– Co-author network
– Trends discovery