Modeling and Managing Content Changes in Text Databases
Panos Ipeirotis, New York University
Alexandros Ntoulas, UCLA
Junghoo Cho, UCLA
Luis Gravano, Columbia University
Metasearchers Provide Access to Text Databases
[Figure: a metasearcher forwards the query “thrombopenia” to hidden-web databases such as NYTimes Archives, PubMed, and USPTO]
• Large number of hidden-web databases available
• Contents not accessible through Google
• Need to query each database separately
• Broadcasting queries to all databases is not feasible (~100,000 DBs)
Metasearchers Provide Access to Text Databases
[Figure: the databases report very different numbers of matches for “thrombopenia” (26,887 in PubMed vs. 0 and 42 in the others); the metasearcher must decide which databases to contact]
Database selection relies on simple content summaries: vocabulary, word frequencies
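As a rough illustration of how such summaries support database selection, here is a minimal Python sketch that ranks databases by the document frequency of the query word. It is not the selection algorithm from the paper (real metasearchers use more elaborate scoring functions, e.g., CORI); the summaries dictionary reuses the counts from the figure, and the assignment of the 0 and 42 counts to specific databases is an assumption.

```python
def select_databases(query_word, summaries, k=2):
    """Rank databases by the document frequency of the query word
    in their content summaries and return the top k non-empty ones."""
    scored = [(freqs.get(query_word, 0), name) for name, freqs in summaries.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:k] if score > 0]

# Counts from the slide; which non-PubMed database reported 0 vs. 42 is assumed here.
summaries = {
    "PubMed": {"thrombopenia": 26887},
    "NYTimes Archives": {"thrombopenia": 42},
    "USPTO": {"thrombopenia": 0},
}
print(select_databases("thrombopenia", summaries))  # ['PubMed', 'NYTimes Archives']
```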
Extracting Content Summaries from Text Databases
For hidden-web databases (query-only access):
• Send queries to the database
• Retrieve the top matching documents
• Use the document sample as the database representative
For “crawlable” databases:
• Retrieve documents by following links (crawling)
• Stop when all documents have been retrieved
The content summary contains:
• The words in the sample (or crawl)
• The document frequency of each word in the sample (or crawl)
PubMed (11,868,552 documents)
Word           #Documents
aids              123,826
cancer          1,598,896
heart             706,537
hepatitis         124,320
thrombopenia       26,887
…
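A minimal Python sketch of this process, assuming a hypothetical query_db(q) function that returns the documents matching a probe query q; the probe-query selection strategy of the actual sampling method is not shown, and probe_queries and top_k are illustrative placeholders.

```python
from collections import Counter

def content_summary(documents):
    """Map each word to the number of documents in the sample (or crawl)
    that contain it, i.e., its document frequency."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    return df

def sample_database(query_db, probe_queries, top_k=4):
    """Query-based sampling for a hidden-web database: send probe queries,
    keep the top matches, and summarize the resulting document sample.
    query_db and probe_queries are placeholders, not part of the original slides."""
    sample = []
    for q in probe_queries:
        sample.extend(query_db(q)[:top_k])
    return content_summary(sample)
```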
Never-update Policy
Current practice: construct the summary once, never update it
The extracted (old) summary may:
• Miss new words (from new documents)
• Contain obsolete words (from deleted documents)
• Provide inaccurate frequency estimates
NY Times (Oct 29, 2004)
Word       #Docs
tsunami    (0)
recount    2,302
grokster   2
…

NY Times (Mar 29, 2005)
Word       #Docs
tsunami    250
recount    (0)
grokster   78
…
Research Challenge
Updating summaries is costly!
Challenge:
• Maintain good quality of summaries, and
• Minimize the number of updates
If summaries do not change → problem solved!
If summaries change → estimate the rate of change and schedule updates
Outline
Do content summaries change over time?
Which database properties affect the rate of change?
How to schedule updates with constrained resources?
Data for our Study: 152 Web Databases
• Randomly picked from the Open Directory
• Multiple domains
• Multiple topics
• Searchable (to construct summaries by querying) and crawlable (to retrieve the full contents)
• Examples: www.wsj.com, www.intellihealth.com, www.fda.gov, www.si.edu, …
Data for our Study: 152 Web Databases
• Study period: Oct 2002 – Oct 2003
• 52 weekly snapshots for each database
• ~5 million pages in each snapshot
• 65 GB per snapshot (3.3 TB total)
For each week and each database, we built:
• A complete summary (by scanning all pages)
• An approximate summary (by query-based sampling)
Measuring Changes over Time
Recall: how many words in the current summary are also in the old (extracted) summary?
• Shows how well old summaries cover the current (unknown) vocabulary
• Higher values are better
Precision: how many words in the old (extracted) summary are still in the current summary?
• Shows how many obsolete words exist in the old summaries
• Higher values are better
Results are for complete summaries (similar for approximate summaries)
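A minimal sketch of these two measures over summary vocabularies; the paper's exact metrics may additionally weight words by frequency, so treat this as an unweighted illustration.

```python
def summary_recall(old_summary, current_summary):
    """Fraction of the current vocabulary that the old summary still covers."""
    current_words = set(current_summary)
    if not current_words:
        return 1.0
    return len(current_words & set(old_summary)) / len(current_words)

def summary_precision(old_summary, current_summary):
    """Fraction of the old summary's vocabulary that is still in the current summary."""
    old_words = set(old_summary)
    if not old_words:
        return 1.0
    return len(old_words & set(current_summary)) / len(old_words)
```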
Summaries over Time: Conclusions
Databases (and their summaries) are not static
Quality of old summaries deteriorates over time
Quality decreases for both complete and approximate content summaries (see paper for details)
How often should we refresh the summaries?
Outline
Do content summaries change over time?
Which database properties affect the rate of change?
How to schedule updates with constrained resources?
Survival Analysis
Initially used to measure length of survival of patients under different treatments (hence the name)
Used to measure effect of different parameters (e.g., weight, race) on survival time
We want to predict “time until next update” and find database properties that affect this time
Survival Analysis: A collection of statistical techniques for predicting “the time until an event occurs”
Survival Analysis for Summary Updates
“Survival time of summary”: time until the current database summary is “sufficiently different” from the old one (i.e., an update is required)
The old summary changes at time t if:
KL divergence(current, old) > τ
where τ is the change-sensitivity threshold
Survival analysis estimates the probability that a database summary changes within time t
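A minimal sketch of this change test, assuming word distributions derived from document frequencies with simple add-epsilon smoothing; the smoothing scheme and the example threshold value are illustrative assumptions, not the paper's exact choices.

```python
import math

def kl_divergence(current, old, epsilon=1e-6):
    """KL(current || old) between the word distributions implied by two summaries
    (word -> document frequency). Missing words get a small smoothed probability."""
    vocab = set(current) | set(old)
    def to_dist(freqs):
        total = sum(freqs.get(w, 0) + epsilon for w in vocab)
        return {w: (freqs.get(w, 0) + epsilon) / total for w in vocab}
    p, q = to_dist(current), to_dist(old)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

def needs_update(current, old, tau=0.5):
    """Flag the summary as changed when the divergence exceeds the
    change-sensitivity threshold tau (0.5 is an arbitrary example value)."""
    return kl_divergence(current, old) > tau
```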
Modeling Goals
Goal: Estimate database-specific survival time distribution
• The exponential distribution S(t) = exp(-λt) is common for survival times
• λ captures the rate of change
• We need to estimate λ for each database
• Preferably, infer λ from database properties (with no “training”)
Intuitive (and wrong) approach: data + multiple regression
• The study contains a large number of “incomplete” observations
• The target variable S(t) is typically not Gaussian
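To make the exponential model concrete, a small sketch of the survival function and the implied change probability; the λ value used below is simply the Tom's Hardware rate from the scheduling slide, taken as an example.

```python
import math

def survival(lam, t):
    """S(t) = exp(-lam * t): probability the summary has NOT changed by time t."""
    return math.exp(-lam * t)

def change_probability(lam, t):
    """Probability that the summary changes within time t."""
    return 1.0 - survival(lam, t)

# A database with lam = 0.088 changes/week (the Tom's Hardware rate shown later)
print(round(change_probability(0.088, 5), 2))  # 0.36
```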
Survival Times and “Incomplete” Data
[Figure: timeline of “survival times” for one database over the 52-week study; intervals still unchanged at week 52 (end of study) are “censored” cases]
Many observations are “incomplete” (aka “censored”)
Censored data give partial information (the database did not change before the end of the study)
Using “Censored” Data
[Figure: three fitted survival curves S(t): best fit ignoring censored data, best fit using censored data “as-is”, and best fit properly using censored data]
• By ignoring censored cases we get underestimates → we perform more update operations than needed
• By using censored cases “as-is” we get (again) underestimates
• Survival analysis “extends” the lifetime of “censored” cases
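For the exponential model, the effect of censoring on the rate estimate is easy to see: the maximum-likelihood estimate of λ divides the number of observed changes by the total observed time, so censored intervals contribute time without events. The data below are hypothetical, purely for illustration.

```python
def exponential_rate(durations, changed):
    """MLE of lam for exponential change times with right censoring:
    lam = (number of observed changes) / (total observed time).
    Censored intervals add time but no event, which keeps lam from
    being inflated (and survival times from being underestimated)."""
    return sum(changed) / sum(durations)

# Hypothetical observations (weeks per interval; changed = 0 means censored at end of study)
durations = [3, 7, 12, 20, 20]
changed   = [1, 1, 1, 0, 0]

lam_with_censoring    = exponential_rate(durations, changed)          # 3 / 62 ≈ 0.048
lam_ignoring_censored = exponential_rate(durations[:3], changed[:3])  # 3 / 22 ≈ 0.136
# Dropping the censored cases nearly triples lam, i.e., it underestimates survival times.
```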
Database Properties and Survival Times
For our analysis, we use Cox Proportional Hazards regression
• It uses “censored” data effectively (i.e., cases where the database did not change within time T)
• It derives the effect of database properties on the rate of change, e.g., “if you double the size of a database, it changes twice as fast”
• It makes no assumptions about the form of the survival function
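A minimal sketch of fitting such a model in Python, assuming the lifelines package; the covariates, the tiny hypothetical dataset, and the penalizer (added only to stabilize the fit on so few rows) are all illustrative, not the study's actual setup.

```python
import pandas as pd
from lifelines import CoxPHFitter  # assuming the lifelines package is available

# Hypothetical data: one row per observed inter-update interval of a database
df = pd.DataFrame({
    "weeks":    [3, 7, 12, 20, 9, 20],            # length of the observed interval
    "changed":  [1, 1, 1, 0, 1, 0],               # 0 = censored at the end of the study
    "log_size": [11.5, 13.1, 9.2, 8.7, 12.0, 10.0],
    "is_gov":   [0, 0, 1, 1, 0, 0],
})

cph = CoxPHFitter(penalizer=0.1)  # penalizer only to stabilize this tiny example
cph.fit(df, duration_col="weeks", event_col="changed")
cph.print_summary()  # hazard ratios show how each property scales the rate of change
```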
Cox PH Regression Results
[Figure: factors grouped by whether they increase or decrease the rate of change]
Examined the effect of:
• Change-sensitivity threshold τ (higher τ → longer survival)
• Domain (details on the next slide)
• Topic (does not matter, except for health-related sites)
• Size (larger databases change faster!)
• Number of words (does not matter)
• Differences between summaries extracted in consecutive weeks (sites that changed frequently in the past change frequently in the future)
Baseline Survival Functions by Domain
Effect of domain:
GOV changes slower than any other domain
EDU changes fast in the short term, but slower in the long term
COM and other commercial sites change faster than the rest
Results of Cox PH Analysis
• Cox PH analysis gives a formula for predicting the time between updates for any database
• The rate of change depends on: domain, database size, history of change, threshold τ
• By knowing the time between updates, we can schedule update operations better!
Outline
Do content summaries change over time?
Which database properties affect the rate of change?
How to schedule updates with constrained resources?
Deriving an Update Policy
Naïve policy:
• Updates all databases at the same time (i.e., assumes identical change rates)
• Suboptimal use of resources
Our policy:
• Use the change rate as predicted by survival analysis
• Exploit database-specific estimates of the rate of change
Scheduling Updates
                                     average time between updates
Database          Rate of change λ   10 weeks        40 weeks
Tom’s Hardware    0.088              5 weeks         46 weeks
USPS              0.023              12 weeks        34 weeks
With plentiful resources, we update sites according to their rate of change
When resources are constrained, we update sites that change “too frequently” less often
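A simplified sketch of such a schedule in Python: it starts from each database's “natural” interval under the exponential model and stretches all intervals uniformly when the total load exceeds the budget. The paper's actual policy is more refined (as the table shows, under tight resources it penalizes the fastest-changing sites much more heavily), and the budget value below is illustrative.

```python
import math

def natural_interval(lam, p_change=0.5):
    """Interval after which a summary has changed with probability p_change,
    under S(t) = exp(-lam * t)."""
    return -math.log(1.0 - p_change) / lam

def schedule(rates, updates_per_week):
    """Simplified sketch (not the paper's exact optimization): compute each
    database's natural interval and, if the implied update load exceeds the
    available resources, stretch every interval by the same factor."""
    intervals = {db: natural_interval(lam) for db, lam in rates.items()}
    load = sum(1.0 / t for t in intervals.values())  # updates per week implied
    if load > updates_per_week:
        scale = load / updates_per_week
        intervals = {db: t * scale for db, t in intervals.items()}
    return intervals

rates = {"Tom's Hardware": 0.088, "USPS": 0.023}  # changes per week, from the table
print(schedule(rates, updates_per_week=0.15))     # budget value is hypothetical
```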
Scheduling Results
Clever scheduling improves the quality of summaries (according to KL divergence, precision, and recall)
Our policy allows users to optimally select change thresholds according to the available resources, or vice versa (see paper)
Updating Content Summaries: Contributions
Extensive experimental study (1 year, 152 databases): established the need to periodically update the statistics (summaries) of text databases
Change frequency model: showed that database characteristics can predict the time between updates
Scheduling algorithms: devised update policies that exploit the “survival model” and use the available resources efficiently
Current and Future Work
Current:
• Compared with machine learning techniques
• Applied the technique to web crawling
Future:
• Apply survival analysis to refreshing database statistics (materialized views, index statistics, …)
• Examine the efficiency of survival analysis models
• Create generative models for modeling database changes
Related Work
Brewington & Cybenko, WWW9 2000, IEEE Computer 2000
Cho & Garcia-Molina, VLDB 2000, SIGMOD 2000, TOIT 2003
Coffman et al., Journal of Scheduling, 1998
Olston & Widom, SIGMOD 2002