Managing Distributed Collections: Evaluating Web Page Changes, Movement, and Replacement

Zubin Dalal, Suvendu Dash, Pratik Dave, Luis Francisco-Revilla, Richard Furuta, Unmil Karadkar, Frank Shipman

Center for the Study of Digital Libraries and the Department of Computer Science

Texas A&M University



Distributed Collections

• The Web is eternally changing
  – .gov and .edu pages change less frequently than .com pages (1999)
• Collection managers cannot control changes
  – Bookmark lists
  – Yahoo! directories
  – Web portals (NSDL)
  – Walden’s Paths

Walden’s Paths

The Walden’s Paths Project is developing tools to help you organize, annotate, and maintain collections of different locations chosen from the World-Wide Web.

[Screenshot: Walden’s Paths viewer. Labeled elements: controls; annotation; original Web page; information about the path; paths in the system and in the subcollection holding the current path; current position (stop 3 of 6); scrollable stop list; current path’s name; source for content; switch to presentation mode.]

Management of Distributed Collections

• Detection of change is easy

• Determination of
  – Quantity of change is relatively easy
  – Relevance of change is less easy
  – Quality of change is difficult
• Approaches
  – Human validation (Yahoo! surfers)
  – Automatic detection of change (Path Manager)

Path Manager

• Features
  – Java-based
  – HTML pages
  – Paths or bookmark lists
  – Signatures
    • Paragraphs
    • Headings
    • Images
    • Links
    • Checksum
  – Web page state
    • Original
    • Last validated
    • Most recent
• Kinds of change
  – Content changes (what)
  – Presentation changes (how)
  – Structural changes (linking)
  – Behavioral changes
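The signature idea above can be sketched as follows. This is an illustrative sketch only, not Path Manager's actual (Java-based) implementation: the class and function names, the use of SHA-256, and the whitespace normalization are all assumptions. It hashes the paragraphs, headings, images, and links of a page separately, plus a whole-page checksum, so that two snapshots can be compared part by part.

```python
import hashlib
from html.parser import HTMLParser


class SignatureParser(HTMLParser):
    """Collects paragraph and heading text plus image and link targets."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = {"paragraphs": [], "headings": [], "images": [], "links": []}
        self._current = None  # text bucket we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._current = "paragraphs"
            self.parts["paragraphs"].append("")
        elif tag in self.HEADINGS:
            self._current = "headings"
            self.parts["headings"].append("")
        elif tag == "img" and attrs.get("src"):
            self.parts["images"].append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.parts["links"].append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "p" or tag in self.HEADINGS:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.parts[self._current][-1] += data


def digest(items):
    """Hash a list of strings after collapsing whitespace (assumed normalization)."""
    normalized = "\n".join(" ".join(item.split()) for item in items)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def page_signature(html):
    """Return one hash per signature part, plus a checksum of the raw page."""
    parser = SignatureParser()
    parser.feed(html)
    parser.close()
    signature = {name: digest(items) for name, items in parser.parts.items()}
    signature["checksum"] = hashlib.sha256(html.encode("utf-8")).hexdigest()
    return signature


def changed_parts(old, new):
    """List the signature parts that differ between two snapshots."""
    return sorted(name for name in old if old[name] != new[name])
```

Comparing the signature of the last validated snapshot with that of the most recent one then indicates which kind of element changed, which is one plausible way to separate content changes from structural (link) changes.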

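Path Manager

[Screenshots of the Path Manager interface. Recoverable labels: Path Status Overview; Page Status Overview, with page states "No change", "Little change", "Drastic change", "404 error", and "Server unreachable"; Page Information; Modification details; Quantitative information.]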

Is ALL Change Bad?

• Quite unrelated to the philosophical question
• Items in a collection
  – Play specific roles
  – Are semantically related
    • To each other
    • To the collection
• Change to an item may
  – Change its relationship to the collection
    • Less coherent with other items (default assumption)
    • More coherent or no change in relationship
  – Affect the role it plays in the collection
    • Less suitable (default assumption)
    • More suitable or no effect on the role

Context-based Change Detection

• Augments our content-based approach

• Context consists of
  – Content from other pages in the path
  – Annotations created by the author
  – Additional metadata provided by the author

• Distinguishes between edited and replaced pages

Approach

• Context Generation Phase
  – Create weighted page term vectors
    • W = log(tf) + constant scaling factor
    • Known nouns are allocated higher weights
  – Create weighted context term vectors
    • Exclude the page for which context vector is being generated
• Change Detection Phase
  – Calculate page term vector for changed page
  – Calculate the angle between new page term vector and context term vector
  – Difference between initial and current angle is a measure of the change
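The two phases above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the tokenizer, the value of the scaling constant (`SCALE`), the noun multiplier (`NOUN_BOOST`), building the context vector by summing the other pages' vectors, and the sign convention (initial angle minus current angle, so a positive value means the page moved toward its context) are all illustrative choices. Annotations and author metadata are left out of the context here.

```python
import math
import re
from collections import Counter

SCALE = 1.0       # assumed value of the "constant scaling factor"
NOUN_BOOST = 2.0  # assumed multiplier giving known nouns a higher weight


def term_vector(text, known_nouns=frozenset()):
    """Weighted term vector with W = log(tf) + SCALE, boosted for known nouns."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vector = {}
    for term, tf in counts.items():
        weight = math.log(tf) + SCALE
        if term in known_nouns:
            weight *= NOUN_BOOST
        vector[term] = weight
    return vector


def context_vector(page_vectors, exclude_index):
    """Sum the vectors of every page in the path except the excluded one."""
    context = Counter()
    for i, vec in enumerate(page_vectors):
        if i != exclude_index:
            context.update(vec)  # adds weights term by term
    return dict(context)


def angle_degrees(u, v):
    """Angle between two sparse vectors; 90 degrees if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 90.0
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def context_change(path_texts, index, new_text, known_nouns=frozenset()):
    """Initial angle minus current angle for the page at `index`."""
    vectors = [term_vector(t, known_nouns) for t in path_texts]
    context = context_vector(vectors, index)
    initial = angle_degrees(vectors[index], context)
    current = angle_degrees(term_vector(new_text, known_nouns), context)
    return initial - current
```

For example, with a three-page path about elephants, replacing the first page with unrelated financial text yields a negative value (the page moved away from the context), while replacing it with another elephant-related text yields a positive one.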

Evaluation

• 20 paths, pages selected from Yahoo! Directories
• Each path consisted of 10 to 12 pages
• Pages were randomly selected
  – No Flash presentations or images
• A page in each path was randomly selected for replacement
• Each selected page was replaced by 3 pages
  – CNN Financials (large change)
  – Elephants (large change)
  – A page from the same Yahoo! Directory (small change)

• Intuitive expectation

Results – Content-based Metrics

Angle between original and replacing pages (in degrees)

Replaced with        Page about elephants   CNN Financials page   Page in same Yahoo! directory
Average              78.1                   81.9                  75.1
Range                30.8 to 88.1           77.0 to 87.7          35.1 to 84.5
Standard deviation   15.65                  2.89                  10.76

High angle of change for all cases (even for similar pages)

Results – Context-based Metrics

Difference in angle to Yahoo! directory between original and replacing pages (in degrees)

Replaced with        Page about elephants   CNN Financials page   Page in same Yahoo! directory
Average              -7.8                   -9.1                  1.9
Range                -23.2 to 1.6           -45.0 to 0.9          -15.2 to 14.3
Standard deviation   6.95                   10.57                 6.80

In line with intuitive expectation

Results – Distribution of Context-based changes

Replacements resulting in moving towards and away from the context vector (change in angle, in degrees)

Replacement by                     Below -4     Between -4 and 2   Above 2
A member of the Yahoo! Directory   1 (5%)       10 (50%)           9 (45%)
A non-member                       25 (62.5%)   15 (37.5%)         0 (0%)

• Preliminary experimental thresholds
  – A change of 2 degrees or more means the page converges toward the path
  – A change between -4 and 2 degrees is treated as no change
  – A change below -4 degrees means the page diverges from the path
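A small sketch of how these preliminary thresholds could be applied to a measured change in angle. The function name and the handling of the exact boundary values (2 counts as converging, -4 counts as no change) are assumptions made for illustration.

```python
def classify_change(delta_degrees, converge_at=2.0, diverge_below=-4.0):
    """Map a context-based change in angle (degrees) to a coarse category."""
    if delta_degrees >= converge_at:
        return "converges"
    if delta_degrees < diverge_below:
        return "diverges"
    return "no change"
```

Applied to the average values in the context-based results, -7.8 (elephants) and -9.1 (CNN Financials) are classified as diverging, while 1.9 (page from the same Yahoo! directory) falls in the no-change band.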

Dealing with Missing Pages

• Pages may disappear for a variety of reasons
  – Reconfiguration of Web sites
  – Change or expiration of domain names
• Temporary or permanent
• Threaten integrity of paths
  – Continuity of narrative structure
  – Completeness of collection
• Strategies
  – Find new locations for pages that have moved (exact replacements)
  – Find acceptable replacements for pages that have vanished (similar pages)

Approach – Information Extraction Phase

• Extract keyphrases from page
  – Extends the “Robust Hyperlinks” approach
  – Tag text in pages with part-of-speech tagger
  – Extract 1, 2 or 3 word keyphrases
  – n, n-n, a-n, n-n-n, a-n-n
  – Use HTML formatting for additional guidance
    • Only <LINK> or <A> tags may separate terms in a phrase

• Store separate lists of keyphrases to use for locating exact replacements and similar pages
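The pattern-matching step can be sketched as follows, assuming the text has already been run through a part-of-speech tagger. The sketch reads "n" as noun and "a" as adjective, which is an interpretation of the pattern list rather than something the slides spell out. The input format (a list of `(word, tag)` pairs, with `None` marking an HTML tag other than `<A>` or `<LINK>` that breaks a phrase) and the function name are hypothetical.

```python
# Tag patterns for 1-, 2-, and 3-word keyphrases: n, n-n, a-n, n-n-n, a-n-n
PATTERNS = {("n",), ("n", "n"), ("a", "n"), ("n", "n", "n"), ("a", "n", "n")}
BREAK = None  # phrase boundary, e.g. an HTML tag other than <A> or <LINK>


def extract_keyphrases(tagged_tokens):
    """Return every 1- to 3-word window whose tag sequence matches a pattern."""
    phrases = []
    for start in range(len(tagged_tokens)):
        for length in (1, 2, 3):
            window = tagged_tokens[start:start + length]
            # Stop growing the window at the end of input or at a boundary.
            if len(window) < length or any(tok is BREAK for tok in window):
                break
            tags = tuple(tag for _, tag in window)
            if tags in PATTERNS:
                phrases.append(" ".join(word.lower() for word, _ in window))
    return phrases
```

Because `<A>` and `<LINK>` tags are allowed inside a phrase, a caller would simply omit them from the token list, while other tags would be passed through as `BREAK`.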

Approach – Locating Exact Replacements

• Keyphrases help discriminate this page from others on the Web

• TF-IDF-based measure
• Spelling mistakes and unusually uncommon words are most valuable
• Order keyphrase list by decreasing value of TF-IDF measure
• While locating pages
  – Begin with a (user-specified) keyphrase set
  – Search for pages that match these terms
  – Add a term and retry until the result set is as desired
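The search loop above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the smoothed IDF formula, the function names, the stopping rule ("as desired" is taken to mean at most `max_results` hits), and the in-memory `make_toy_search` stand-in for a real Web search engine are all assumptions.

```python
import math


def tfidf_ranked(phrase_counts, doc_freq, num_docs):
    """Order keyphrases by decreasing TF-IDF (assumed smoothed IDF)."""
    def score(phrase):
        idf = math.log(num_docs / (1 + doc_freq.get(phrase, 0)))
        return phrase_counts[phrase] * idf
    return sorted(phrase_counts, key=score, reverse=True)


def make_toy_search(corpus):
    """Hypothetical stand-in for a search engine over {url: text}."""
    def search(phrases):
        return [url for url, text in corpus.items()
                if all(p in text for p in phrases)]
    return search


def locate_exact(ranked_phrases, search, start_size=1, max_results=1):
    """Add one phrase at a time until the result set is small enough."""
    size = start_size
    results = search(ranked_phrases[:size])
    while len(results) > max_results and size < len(ranked_phrases):
        size += 1
        results = search(ranked_phrases[:size])
    return ranked_phrases[:size], results
```

Starting from the most discriminating phrases keeps queries short; each added phrase narrows the candidate set, so the loop stops as soon as the moved page is isolated or the phrase list is exhausted.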

Approach – Finding Similar Pages

• Rare phrases hinder search for similar pages
• Weed out phrases that have occurred less frequently than a certain threshold value
• Remaining phrases are then ordered by decreasing value of their TF-IDF measure
• While locating pages
  – Start with the most restrictive set of phrases
  – Remove one phrase at a time until the desired result size is achieved
• Similarity is contextual
• Varies from person to person
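The similar-page search relaxes the query instead of tightening it. The sketch below makes several assumptions that the slides do not specify: the frequency threshold is applied to occurrence counts within the page (`min_count`), the phrase dropped at each step is the lowest-ranked one, and `toy_search` is a hypothetical stand-in for a real search engine.

```python
def toy_search(corpus, phrases):
    """Hypothetical search: URLs whose text contains every phrase."""
    return [url for url, text in corpus.items()
            if all(p in text for p in phrases)]


def locate_similar(phrase_counts, ranked_phrases, search,
                   min_count=2, min_results=5):
    """Drop rare phrases, then relax the query until enough pages match."""
    # Weed out phrases that occur less often than the threshold.
    candidates = [p for p in ranked_phrases
                  if phrase_counts.get(p, 0) >= min_count]
    while candidates:
        results = search(candidates)
        if len(results) >= min_results:
            return candidates, results
        candidates = candidates[:-1]  # assumed: drop the lowest-ranked phrase
    return [], []
```

A usage example: `locate_similar(counts, ranked, lambda ps: toy_search(corpus, ps), min_results=3)` returns the phrases finally used together with the matching URLs, or two empty lists if no query yields enough results.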

Future Work

• Integrate context-based algorithms with Path Manager

• Integrate the algorithms for locating page replacements

• Improve algorithms for keyphrase extraction

• Etc. etc.

For more information on Walden’s Paths

http://www.csdl.tamu.edu/walden/

[email protected]

Principal Investigators:

Richard Furuta ([email protected])

Frank Shipman ([email protected])