Freshness Policy

Binoy Dharia, K. Rohan Gandhi, Madhura Kolwadkar

Department of Computer Science, University of Southern California

Los Angeles, CA

Freshness Policy

• Freshness policy, also known as revisit policy, is the process by which a crawler determines the order and time at which to re-crawl web pages.

• By the time a web crawler has finished its crawl, many events may have occurred, including page creations, updates and deletions, any of which makes the crawled data out of date.

• To display the latest results to the user, a search engine must have an efficient revisit policy.

• An efficient revisit policy not only saves time and bandwidth but also keeps the search engine's data up to date.

Metrics for evaluation of Freshness Policy

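• Two metrics describe how up to date a site is:

  • Freshness: a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:

$$F_p(t) = \begin{cases} 1 & \text{if } p \text{ is up to date at time } t \\ 0 & \text{otherwise} \end{cases}$$

  • Age: a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as:

$$A_p(t) = \begin{cases} 0 & \text{if } p \text{ is not modified at time } t \\ t - \text{modification time of } p & \text{otherwise} \end{cases}$$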
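The two metrics can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function names and parameters (`copy_time`, `modified_time`, `now`) are hypothetical, and `modified_time` is assumed to be the time of the change that the local copy missed.

```python
from datetime import datetime, timedelta

def freshness(copy_time, modified_time):
    """Return 1 if the local copy was taken at or after the page's
    modification, else 0 (binary freshness)."""
    return 1 if copy_time >= modified_time else 0

def age(now, copy_time, modified_time):
    """Return how long the local copy has been out of date.

    The age is zero while the copy is still fresh; otherwise it is the
    time elapsed since the modification the copy missed.
    """
    if copy_time >= modified_time:
        return timedelta(0)
    return now - modified_time

# Hypothetical timestamps, for illustration only.
cached = datetime(2010, 11, 1, 12, 0)
changed = cached + timedelta(days=1)
checked = cached + timedelta(days=2)
print(freshness(cached, changed), age(checked, cached, changed))
```

In this example the page changed one day after it was cached and is checked one day later, so the copy is stale (freshness 0) with an age of one day.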

Methodology

• Tracked over 90 sites over a period of 2 weeks.

• We divided them into 4 categories:
  • Movies
  • Technology
  • Education
  • News

• Sites selected based on Alexa traffic rankings - Rohan and Binoy

• Developed a crawler in Java to download the original page as well as the Google and Bing cached versions of each web page twice a day - Binoy and Rohan

• Implemented our own code to extract the date and time from the cached copy of each web page - Rohan

• Implemented our own diff functionality to detect changes in a web page over time; it ignores HTML tags and scripts and compares only the text between the tags - Madhura

• Data integration - Madhura

• Data analysis - Binoy, Rohan and Madhura

• Study of the Nutch adaptive fetch policy - Binoy
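The tag-insensitive diff described above could look roughly like the following. This is a minimal sketch in Python, not the project's own implementation: the names `TextExtractor`, `visible_text` and `page_changed` are hypothetical, and skipping `<style>` content in addition to `<script>` is an extra assumption.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags, skipping <script> and <style> bodies."""
    SKIPPED = {"script", "style"}  # <style> is an assumption beyond the slides

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside skipped elements, with
        # whitespace normalized so layout changes do not count as edits.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))

def visible_text(html):
    """Return the list of text fragments found between the tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

def page_changed(old_html, new_html):
    """Report a change only when the text between tags differs."""
    return visible_text(old_html) != visible_text(new_html)
```

With this approach, two snapshots that differ only in markup, attributes or script bodies compare as unchanged, while any edit to the displayed text is reported as a change.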

NUTCH 1.2 Setup

• Installed Nutch with Lucene on local machine for crawling

• Settings used for Nutch crawling:

<name>db.fetch.interval.default</name><value>172800</value>

<name>db.fetch.schedule.class</name><value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>

<name>db.fetch.schedule.adaptive.inc_rate</name><value>0.4</value>

<name>db.fetch.schedule.adaptive.dec_rate</name><value>0.2</value>
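To make the effect of these settings concrete, here is a simplified Python model of an adaptive schedule of this kind: the interval shrinks by `dec_rate` when a page is found modified and grows by `inc_rate` when it is not. It is a sketch of the general idea rather than Nutch's actual code; the function name is hypothetical, the `min_interval` and `max_interval` bounds are assumed values, and any other adjustments the real scheduler makes are left out.

```python
def next_fetch_interval(interval, modified, inc_rate=0.4, dec_rate=0.2,
                        min_interval=60.0, max_interval=31536000.0):
    """Return the next re-fetch interval in seconds.

    interval     -- current interval in seconds
    modified     -- True if the page changed since the last fetch
    inc_rate     -- growth factor applied when the page did not change
    dec_rate     -- shrink factor applied when the page changed
    min_interval, max_interval -- assumed clamping bounds (hypothetical here)
    """
    if modified:
        interval *= (1.0 - dec_rate)   # changed page: revisit sooner
    else:
        interval *= (1.0 + inc_rate)   # unchanged page: revisit later
    return max(min_interval, min(interval, max_interval))

# Starting from the default interval configured above (172800 s = 2 days).
print(next_fetch_interval(172800, modified=True))
print(next_fetch_interval(172800, modified=False))
```

Under this model, a page found modified moves from a 2-day interval to about 138240 seconds (1.6 days), and an unmodified page moves to about 241920 seconds (2.8 days).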

Nutch Crawling Snapshot

• Average freshness achieved with the Nutch fetch policy: 0.5

Data Integration and Calculations (Excel)

• Data Snippet after Integration

• Age and freshness calculations

  • Per site
    Average Age per Site = (Sum of Ages) / (Number of Crawls)
    Average Freshness per Site = (Sum of Freshness) / (Number of Crawls)

  • Per category
    Average Age per Category = (Sum of Average Site Ages) / (Site Count)
    Average Freshness per Category = (Sum of Average Site Freshness) / (Site Count)
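The same per-site and per-category averages can be expressed in Python instead of a spreadsheet. This is an illustrative sketch only: the record layout `(site, age_days, freshness)`, the function names and the sample sites are hypothetical and are not data from the study.

```python
from collections import defaultdict

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def site_averages(crawls):
    """Map each site to (average age, average freshness) over its crawls.

    crawls -- iterable of (site, age_days, freshness) records
    """
    ages, fresh = defaultdict(list), defaultdict(list)
    for site, age_days, freshness in crawls:
        ages[site].append(age_days)
        fresh[site].append(freshness)
    return {site: (mean(ages[site]), mean(fresh[site])) for site in ages}

def category_averages(site_avgs, site_category):
    """Average the per-site averages within each category."""
    by_category = defaultdict(list)
    for site, averages in site_avgs.items():
        by_category[site_category[site]].append(averages)
    return {
        category: (mean(a for a, _ in rows), mean(f for _, f in rows))
        for category, rows in by_category.items()
    }

# Made-up records for two hypothetical news sites.
crawls = [("a.example", 1.0, 1), ("a.example", 3.0, 0),
          ("b.example", 4.0, 0), ("b.example", 4.0, 0)]
per_site = site_averages(crawls)
print(category_averages(per_site, {"a.example": "News", "b.example": "News"}))
```

Note that the category figure is an average of site averages, as in the formulas above, so every site carries equal weight regardless of how many times it was crawled.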

Standard Deviation

• Standard Deviation in Age for a Category (Days) = sqrt [ (sum of squares of age difference) / Site Count ]

Category      Bing Std Dev (days)   Google Std Dev (days)
Education     4.019501547           1.115546433
Movies        8.308207969           1.137753946
News          2.30323971            1.335194148
Technology    4.482674018           1.08891161
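A short Python sketch of the standard deviation formula follows. It assumes that "age difference" means the difference between each site's average age and the category's mean age, which is the usual population standard deviation; the function name and the sample values are hypothetical.

```python
import math

def category_std_dev(site_avg_ages):
    """Population standard deviation of per-site average ages (in days)."""
    n = len(site_avg_ages)
    mean_age = sum(site_avg_ages) / n
    squared_diffs = sum((a - mean_age) ** 2 for a in site_avg_ages)
    return math.sqrt(squared_diffs / n)

# Made-up per-site average ages for one category.
print(category_std_dev([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Dividing by the site count rather than by the site count minus one matches the formula on the slide; a sample standard deviation would give slightly larger values.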

Data Analysis

• Age Comparison between Google and Bing

• Conclusions:
  • Google's database is much more up to date than Bing's
  • Google crawls news sites more than once a day
  • Google's crawling cycle is mostly consistent across categories
  • Google's average crawling cycle is 0.8 days
  • Bing's average crawling cycle is 4.6 days

Data Analysis

• Freshness Comparison between Google and Bing

• Conclusions:
  • News sites change frequently, so even though the age for news sites is low, the cached page is usually not fresh
  • Google average freshness is 0.65
  • Bing average freshness is 0.28

Data Analysis

• Comparison of Standard Deviation across Domains

• Conclusions:
  • Google's standard deviation is low, which indicates that the category of a site is not a major factor in deciding crawl frequency
  • The same inference does not apply to Bing

Data Analysis

• Alexa Rank (x-axis) vs Google Cache Age (y-axis)

• Conclusion:
  • Google: sites with high traffic are crawled more frequently

Data Analysis

• Alexa Rank (x-axis) vs Bing Cache Age (y-axis)

• Conclusion:
  • Bing crawling is uniform across sites with varying traffic volume
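The rank-versus-age comparisons for Google and Bing were read off plots. One way to put a number on them would be a correlation coefficient between Alexa rank and cache age. The Python sketch below computes a plain Pearson correlation; it is an assumption about how such a check might be done, not part of the original analysis, and the function name is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Both sequences must vary; a constant sequence would make the
    denominator zero.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

Because a lower Alexa rank number means higher traffic, a clearly positive coefficient between rank and cache age would agree with high-traffic sites being crawled more often, while a coefficient near zero would agree with uniform crawling.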

Data Analysis

• Date Modified vs Crawl Date

• Conclusion:
  • For high-ranking sites, Google crawling seems to be more adaptive to changes on the original site, while Bing crawling is uniform

Data Analysis

• Date Modified vs Crawl Date

• Conclusion:
  • Google as well as Bing crawling seems to be uniform for low-ranking sites

Conclusions

• Google freshness policy: factors identified
  • Popularity/traffic volume
  • Category not considered
  • Frequency of change of a page affects the crawling cycle - adaptive!

• Bing freshness policy: factors identified
  • Site popularity is not considered
  • Category is considered
  • Frequency of change of a page affects the crawling cycle - adaptive!

Limitations and Future Work

• Limitations
  • Conclusions are drawn from a limited random data sample because of:
    • Crawling restrictions on Google cached data
    • Changes in Bing cached links every time Bing's cached repository is updated
  • A larger time frame is required to identify the crawling behavior of each search engine
  • High freshness was observed for Nutch because the crawling interval was low

• Future work
  • Additional factors such as the number of incoming and outgoing links can be recorded and their correlation with crawling observed
  • Factors such as ranking, popularity and number of outgoing links can be incorporated into the Nutch adaptive fetch policy
