This presentation on using the Heritrix crawler is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
Adaptive Heritrix
ATHENA – Research and Innovation Center in Information, Communication and Knowledge Technologies
ARCOMEM Requirements for crawling
• ARCOMEM aims to guide crawling based on
  – Advanced semantic link extraction
  – Use of social media
  – Analysis of crawled content in a large-scale distributed environment
• These aims require a crawler to
  – Update priorities adaptively
  – Operate as a service
Adaptive Prioritization
• New Heritrix frontier class
  – Plug & Play with open source Heritrix
  – Minimal configuration
• Adding forward index for URLs
  – Locates a link already scheduled for crawling
• Moves scheduled link to the place corresponding to the updated priority
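
The forward-index idea above can be illustrated with a small sketch. This is not the Heritrix frontier class itself: `AdaptiveFrontier`, `schedule` and `next_url` are hypothetical names, and the "move" is approximated here by invalidating the old heap entry and pushing a new one (lazy deletion), which is only one possible way to reposition a scheduled link.

```python
import heapq
import itertools


class AdaptiveFrontier:
    """Toy frontier: a max-priority queue plus a forward index keyed by URL."""

    def __init__(self):
        self._heap = []                # entries: [-priority, seq, url, alive]
        self._index = {}               # forward index: url -> live heap entry
        self._seq = itertools.count()  # unique tie-breaker, keeps insertion order

    def schedule(self, url, priority):
        """Schedule a URL, or reposition it if it is already scheduled."""
        old = self._index.get(url)
        if old is not None:
            old[3] = False             # mark the stale position as dead
        entry = [-priority, next(self._seq), url, True]
        self._index[url] = entry
        heapq.heappush(self._heap, entry)

    def next_url(self):
        """Return (url, priority) for the best live URL, or None if empty."""
        while self._heap:
            neg_priority, _, url, alive = heapq.heappop(self._heap)
            if alive:
                del self._index[url]
                return url, -neg_priority
        return None

    def __len__(self):
        return len(self._index)


frontier = AdaptiveFrontier()
frontier.schedule("http://example.org/a", 0.2)
frontier.schedule("http://example.org/b", 0.5)
frontier.schedule("http://example.org/a", 0.9)  # already scheduled: priority updated
```

Without the forward index, finding the already-scheduled link would require scanning the queue; with it, the lookup is a single dictionary access.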
Heritrix as a crawling service
• Decoupled fetching and link prioritization
• Writing crawled data to modified WARC files
  – WARCs are loaded into HBase by a separate process
• Efficient URL injection end-point
  – Receives scored links from online analysis and API crawler
  – ARCOMEM-specific JSON format of outlinks
  – External-memory queue to handle large volumes of links
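
A rough sketch of how an injection end-point might buffer scored links on disk is given below. The slides do not show the ARCOMEM JSON format, so the `outlinks`, `url` and `score` field names are purely hypothetical, as are the `DiskQueue` class and the `inject` function; the file-backed FIFO only stands in for the external-memory queue.

```python
import json
import os
import tempfile


class DiskQueue:
    """FIFO queue that keeps its items in a file instead of in memory."""

    def __init__(self, path):
        self._path = path
        open(path, "a", encoding="utf-8").close()  # create the file if missing
        self._read_offset = 0                      # position of the next unread line

    def put(self, item):
        with open(self._path, "a", encoding="utf-8") as f:
            f.write(json.dumps(item) + "\n")

    def get(self):
        """Return the oldest unread item, or None if nothing is pending."""
        with open(self._path, "r", encoding="utf-8") as f:
            f.seek(self._read_offset)
            line = f.readline()
            if not line:
                return None
            self._read_offset = f.tell()
        return json.loads(line)


def inject(message, queue):
    """Parse one JSON message (hypothetical format) and enqueue its scored links."""
    doc = json.loads(message)
    count = 0
    for link in doc["outlinks"]:
        queue.put({"url": link["url"], "score": float(link["score"])})
        count += 1
    return count


if __name__ == "__main__":
    queue = DiskQueue(os.path.join(tempfile.mkdtemp(), "links.jsonl"))
    inject('{"outlinks": [{"url": "http://example.org/a", "score": 0.7}]}', queue)
    print(queue.get())
```

Because fetching and prioritization are decoupled, the crawler can drain such a queue at its own pace while the analysis side keeps appending.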
Assessing the impact of adaptive prioritization
• Simulations to evaluate how adaptive prioritization affects performance of a focused crawler
  – Simulation on 3 DMOZ topics: Genetics, Recycling, Oceanography
• Running simulated crawl
  – Start from set of 20 randomly selected seeds (repeated 3 times)
  – Topic vector is the sum of the seed vectors
  – Crawl 10,000 web pages
• Compare the effectiveness of a best-first crawler to
  – Adaptive prioritization: priorities are updated using MAX, MIN, AVG, SUM, FIRST, LAST functions
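
The six update functions can be read as different ways of combining the scores a URL receives each time it is discovered. The sketch below assumes exactly that reading; the `UPDATE_FUNCTIONS` table and the `updated_priority` helper are illustrative names, not part of Heritrix.

```python
# Each function maps the list of scores seen so far for one URL
# (in order of arrival) to the priority used by the frontier.
UPDATE_FUNCTIONS = {
    "MAX": max,
    "MIN": min,
    "AVG": lambda scores: sum(scores) / len(scores),
    "SUM": sum,
    "FIRST": lambda scores: scores[0],   # never updated after first discovery
    "LAST": lambda scores: scores[-1],   # most recent score wins
}


def updated_priority(scores, function):
    """Combine all scores observed for a URL using the named update function."""
    if not scores:
        raise ValueError("a scheduled URL must have at least one score")
    return UPDATE_FUNCTIONS[function](scores)


# A URL discovered three times, with a different score each time.
observed = [0.2, 0.8, 0.5]
for name in UPDATE_FUNCTIONS:
    print(name, updated_priority(observed, name))
```

Under this reading, FIRST keeps the priority assigned at first discovery and never changes it, which is why it serves as the best-first baseline.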
Adaptive Prioritization results
Update function   Harvest Ratio   Average Similarity   DMOZ topics
FIRST             0.3317          0.2945               0.4979
AVG               0.3609          0.3024               0.5779
MAX               0.3388          0.2967               0.5270
SUM               0.2679          0.2759               0.4650
LAST              0.3404          0.2961               0.5985
• AVG and LAST have the highest harvest ratios and find the most pages from DMOZ topics
• Adaptive prioritization is more effective than FIRST, i.e. the Best-First crawler
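
The slides do not define the reported metrics, so the sketch below rests on assumptions: similarity is taken to be cosine similarity between a page's term vector and the topic vector (the sum of the seed vectors), harvest ratio is taken to be the fraction of crawled pages whose similarity reaches a threshold, and the threshold value of 0.5 is arbitrary. All function names are hypothetical.

```python
import math


def add_vectors(vectors):
    """Sum sparse term vectors (dicts of term -> weight) into one vector."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return total


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def evaluate(pages, topic, threshold=0.5):
    """Return (harvest ratio, average similarity) for the crawled pages."""
    sims = [cosine(page, topic) for page in pages]
    harvest_ratio = sum(1 for s in sims if s >= threshold) / len(sims)
    average_similarity = sum(sims) / len(sims)
    return harvest_ratio, average_similarity


seeds = [{"gene": 1.0}, {"dna": 1.0}]
topic = add_vectors(seeds)              # topic vector = sum of seed vectors
crawled = [{"gene": 1.0, "dna": 1.0}, {"ocean": 1.0}]
print(evaluate(crawled, topic))
```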