Managing Distributed Collections: Evaluating Web Page Changes, Movement, and Replacement
Zubin Dalal, Suvendu Dash, Pratik Dave, Luis Francisco-Revilla, Richard Furuta, Unmil Karadkar, Frank Shipman
Center for the Study of Digital Libraries and the Department of Computer Science
Texas A&M University
Distributed Collections
• The Web is eternally changing
  – .gov and .edu pages change less frequently than .com pages (1999)
• Collection managers cannot control changes
  – Bookmark lists
  – Yahoo! directories
  – Web portals (NSDL)
  – Walden’s Paths
Walden’s Paths
The Walden’s Paths Project is developing tools to help you organize, annotate, and maintain collections of different locations chosen from the World-Wide Web.
[Screenshot of the Walden’s Paths interface. Callouts: controls; annotation; original Web page; information about the path; paths in the system and in the subcollection holding the current path; currently on stop 3 of 6; scroll stop list; current path’s name; source for content; switch to presentation mode]
Management of Distributed Collections
• Detection of change is easy
• Determination of
  – Quantity of change is relatively easy
  – Relevance of change is less easy
  – Quality of change is difficult
• Approaches
  – Human validation (Yahoo! surfers)
  – Automatic detection of change (Path Manager)
Path Manager
• Features
  – Java-based
  – HTML pages
  – Paths or bookmark lists
  – Signatures
    • Paragraphs
    • Headings
    • Images
    • Links
    • Checksum
  – Web page state
    • Original
    • Last validated
    • Most recent
• Kinds of change
  – Content changes (what)
  – Presentation changes (how)
  – Structural changes (linking)
  – Behavioral changes
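The signature parts named above (paragraphs, headings, images, links, checksum) could be computed roughly as follows. This is a minimal Python sketch, not Path Manager's actual Java implementation: the class and function names, the choice of SHA-256, and the whitespace normalization are all assumptions made for illustration.

```python
import hashlib
from html.parser import HTMLParser


class SignatureParser(HTMLParser):
    """Collects paragraph and heading text plus image and link targets."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = {"paragraphs": [], "headings": [], "images": [], "links": []}
        self._current = None  # text bucket we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._current = "paragraphs"
            self.parts["paragraphs"].append("")
        elif tag in self.HEADINGS:
            self._current = "headings"
            self.parts["headings"].append("")
        elif tag == "img" and attrs.get("src"):
            self.parts["images"].append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.parts["links"].append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "p" or tag in self.HEADINGS:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.parts[self._current][-1] += data


def digest(items):
    # Normalize whitespace so that pure re-wrapping does not count as a change.
    joined = "\n".join(" ".join(item.split()) for item in items)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def page_signature(html):
    """Return one digest per signature part, plus a whole-page checksum."""
    parser = SignatureParser()
    parser.feed(html)
    signature = {name: digest(items) for name, items in parser.parts.items()}
    signature["checksum"] = hashlib.sha256(html.encode("utf-8")).hexdigest()
    return signature


def changed_parts(old_signature, new_signature):
    """Names of the signature parts that differ between two page states."""
    return sorted(k for k in old_signature if old_signature[k] != new_signature[k])
```

Comparing the signature of the most recent page state against the original or last validated state then shows which kind of element changed, not only that the page changed.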
[Screenshots of Path Manager. Path Status Overview and Page Status Overview, with status indicators: no change, little change, drastic change, 404 error, server unreachable. Page Information view, showing modification details and quantitative information.]
Is ALL Change Bad?
• Quite unrelated to the philosophical question
• Items in a collection
  – Play specific roles
  – Are semantically related
    • To each other
    • To the collection
• Change to an item may
  – Change its relationship to the collection
    • Less coherent with other items (default assumption)
    • More coherent or no change in relationship
  – Affect the role it plays in the collection
    • Less suitable (default assumption)
    • More suitable or no effect on the role
Context-based Change Detection
• Augments our content-based approach
• Context consists of
  – Content from other pages in the path
  – Annotations created by the author
  – Additional metadata provided by the author
• Distinguishes between edited and replaced pages
Approach
• Context Generation Phase
  – Create weighted page term vectors
    • W = log(tf) + constant scaling factor
    • Known nouns are allocated higher weights
  – Create weighted context term vectors
    • Exclude the page for which context vector is being generated
• Change Detection Phase
  – Calculate page term vector for changed page
  – Calculate the angle between new page term vector and context term vector
  – Difference between initial and current angle is a measure of the change
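The two phases above can be sketched in Python as follows. The slides do not give the value of the scaling constant, how much extra weight known nouns receive, or how page vectors are combined into a context vector, so `SCALE`, `NOUN_BOOST`, the summing of the other pages' vectors, and the whitespace tokenizer are all assumptions. The sign convention (initial angle minus current angle, so a negative value means the page moved away from the context) is chosen to match the reported results.

```python
import math
from collections import Counter

SCALE = 1.0       # assumed value of the "constant scaling factor"
NOUN_BOOST = 2.0  # assumed multiplier for known nouns


def term_vector(text, known_nouns=frozenset()):
    """Weighted term vector with W = log(tf) + SCALE, boosted for known nouns."""
    counts = Counter(text.lower().split())
    vector = {}
    for term, tf in counts.items():
        weight = math.log(tf) + SCALE
        if term in known_nouns:
            weight *= NOUN_BOOST
        vector[term] = weight
    return vector


def context_vector(page_vectors, exclude_index):
    """Sum the vectors of every page in the path except the excluded one."""
    context = Counter()
    for i, vector in enumerate(page_vectors):
        if i != exclude_index:
            context.update(vector)  # adds weights term by term
    return dict(context)


def angle_degrees(u, v):
    """Angle between two sparse vectors, in degrees."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 90.0  # treat an empty vector as orthogonal
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def context_change(path_texts, index, new_text, known_nouns=frozenset()):
    """Initial angle minus current angle for the page at `index`."""
    vectors = [term_vector(text, known_nouns) for text in path_texts]
    context = context_vector(vectors, index)
    initial = angle_degrees(vectors[index], context)
    current = angle_degrees(term_vector(new_text, known_nouns), context)
    return initial - current
```

With this convention, replacing a page by one that shares no terms with the rest of the path yields a negative change, and leaving the page untouched yields zero.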
Evaluation
• 20 paths, pages selected from Yahoo! Directories
• Each path consisted of 10 to 12 pages
• Pages were randomly selected
  – no flash presentations or images
• A page in each path was randomly selected for replacement
• Each selected page was replaced by 3 pages
  – CNN Financials (large change)
  – Elephants (large change)
  – A page from the same Yahoo! Directory (small change)
• Intuitive expectation
Results – Content-based Metrics
Replaced with        Page about elephants   CNN Financials page   Page in same Yahoo! directory
Average              78.1                   81.9                  75.1
Range                30.8 to 88.1           77.0 to 87.7          35.1 to 84.5
Standard deviation   15.65                  2.89                  10.76
Angle between original and replacing pages (in degrees)
High angle of change for all cases (even for similar pages)
Results – Context-based Metrics
Replaced with        Page about elephants   CNN Financials page   Page in same Yahoo! directory
Average              -7.8                   -9.1                  1.9
Range                -23.2 to 1.6           -45.0 to 0.9          -15.2 to 14.3
Standard deviation   6.95                   10.57                 6.80
Difference in angle to Yahoo directory between original and replacing pages (in degrees)
In line with intuitive expectation
Results – Distribution of Context-based changes
Change in angle (degrees)                          Below -4      Between -4 and 2   More than 2
Replacement by a member of the Yahoo! Directory    1 (5%)        10 (50%)           9 (45%)
Replacement by non-member                          25 (62.5%)    15 (37.5%)         0 (0%)
Replacements resulting in moving towards and away from the context vector
• Preliminary experimental thresholds
  – A change of 2 degrees or more: the page converges to the path
  – A change between 2 and -4 degrees is treated as no change
  – A change below -4 degrees: the page diverges from the path
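The preliminary thresholds can be written as a small classifier. This is a sketch only: the function name and labels are hypothetical, and treating a change of exactly -4 degrees as "no change" is an assumption, since the slide does not say which side that boundary belongs to.

```python
def classify_change(delta_degrees, converge_at=2.0, diverge_below=-4.0):
    """Map a context-based angle change (in degrees) to a coarse label."""
    if delta_degrees >= converge_at:
        return "converges"   # page moved toward the path's context
    if delta_degrees < diverge_below:
        return "diverges"    # page moved away from the path's context
    return "no change"
```

Applied to the averages reported above, the elephant page (-7.8) and the CNN Financials page (-9.1) fall under "diverges", while the page from the same Yahoo! directory (1.9) falls under "no change".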
Dealing with Missing Pages
• Pages may disappear due to a variety of reasons
  – Reconfiguration of Web sites
  – Change or expiration of domain names
• Temporary or permanent
• Threaten integrity of paths
  – Continuity of narrative structure
  – Completeness of collection
• Strategies
  – Find new locations for pages that have moved (exact replacements)
  – Find acceptable replacements for pages that have vanished (similar pages)
Approach – Information Extraction Phase
• Extract keyphrases from page
  – Extends the “Robust Hyperlinks” approach
  – Tag text in pages with part-of-speech tagger
  – Extract 1, 2 or 3 word keyphrases
  – n, n-n, a-n, n-n-n, a-n-n
  – Use HTML formatting for additional guidance
    • Only <LINK> or <A> tags may separate terms in a phrase
• Store separate lists of keyphrases to use for locating exact replacements and similar pages
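A minimal sketch of the pattern-matching step, assuming that "n" stands for noun and "a" for adjective (the slides do not spell this out) and that the text has already been tagged by some part-of-speech tagger. The function name and the `(word, tag)` input format are hypothetical, and the HTML-formatting rule is left out.

```python
# Tag sequences accepted as keyphrases: n, n-n, a-n, n-n-n, a-n-n.
PATTERNS = {
    ("n",),
    ("n", "n"),
    ("a", "n"),
    ("n", "n", "n"),
    ("a", "n", "n"),
}


def extract_keyphrases(tagged_tokens):
    """Return every 1-, 2-, or 3-word window whose tags match a pattern.

    `tagged_tokens` is a list of (word, tag) pairs in reading order.
    """
    phrases = []
    for start in range(len(tagged_tokens)):
        for length in (1, 2, 3):
            window = tagged_tokens[start:start + length]
            if len(window) < length:
                break  # ran off the end of the token list
            tags = tuple(tag for _, tag in window)
            if tags in PATTERNS:
                phrases.append(" ".join(word.lower() for word, _ in window))
    return phrases
```

For the tagged input `[("digital", "a"), ("library", "n"), ("collection", "n")]`, the windows "digital library", "digital library collection", "library", "library collection", and "collection" all match one of the patterns.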
Approach – Locating Exact Replacements
• Keyphrases help discriminate this page from others on the Web
• TF-IDF-based measure
• Spelling mistakes and unusually uncommon words are most valuable
• Order keyphrase list by decreasing value of TF-IDF measure
• While locating pages
  – Begin with a (user-specified) keyphrase set
  – Search for pages that match these terms
  – Add a term and retry until the result set is as desired
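The search loop above might look like the following sketch. The `search` callable is a placeholder for whatever search engine is queried, `start_size` and `max_results` are hypothetical parameters, and reading "as desired" as "at most `max_results` hits" is an assumption.

```python
def rank_by_tfidf(scores):
    """Order phrases by decreasing TF-IDF score; `scores` maps phrase -> score."""
    return [p for p, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]


def locate_exact_replacement(ranked_phrases, search, start_size=2, max_results=5):
    """Grow the query one phrase at a time until the result set is small enough.

    `search(phrases)` must return a list of candidate URLs matching all phrases.
    Returns the final query and its results.
    """
    if not ranked_phrases:
        return [], []
    size = min(max(1, start_size), len(ranked_phrases))
    while True:
        query = ranked_phrases[:size]
        results = search(query)
        # Stop when the result set is narrow enough or no phrases are left to add.
        if len(results) <= max_results or size == len(ranked_phrases):
            return query, results
        size += 1
```

Each added phrase narrows the result set, so the loop moves from a broad query toward one specific enough to single out the moved page.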
Approach – Finding Similar Pages
• Rare phrases hinder search for similar pages
• Weed out phrases that have occurred less frequently than a certain threshold value
• Remaining phrases are then ordered by decreasing value of their TF-IDF measure
• While locating pages
  – Start with the most restrictive set of phrases
  – Reduce one phrase at a time until the desired result size is achieved
• Similarity is contextual
• Varies from person to person
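A sketch of the similar-page search, which runs in the opposite direction from the exact-replacement search: it starts from the full phrase set and relaxes it. The slides do not say what the phrase frequency is counted over or which phrase is dropped first, so the `frequencies` mapping, dropping the lowest-ranked phrase, and the `min_frequency` and `min_results` parameters are assumptions; `search` is again a placeholder for a real search engine.

```python
def find_similar_pages(scores, frequencies, search, min_frequency=2, min_results=3):
    """Relax the query one phrase at a time until enough results come back.

    `scores` maps phrase -> TF-IDF score, `frequencies` maps phrase -> count,
    and `search(phrases)` returns a list of URLs matching all phrases.
    """
    # Weed out phrases that occur less often than the threshold.
    kept = [p for p in scores if frequencies.get(p, 0) >= min_frequency]
    ranked = sorted(kept, key=lambda p: scores[p], reverse=True)
    while ranked:
        results = search(ranked)
        if len(results) >= min_results:
            return ranked, results
        ranked = ranked[:-1]  # drop the lowest-ranked phrase and retry
    return [], []
```

Because similarity varies from person to person, `min_results` would in practice be set high enough to give the path author several candidates to choose from.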
Future Work
• Integrate context-based algorithms with Path Manager
• Integrate the algorithms for locating page replacements
• Improve algorithms for keyphrase extraction
• Etc. etc.
For more information on Walden’s Paths: http://www.csdl.tamu.edu/walden/
Principal Investigators:
Richard Furuta ([email protected])
Frank Shipman ([email protected])