Arcomem training heritrix_advanced

Uploaded by: arcomem

DESCRIPTION

This presentation on using the Heritrix crawler is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.


Page 1: Arcomem training heritrix_advanced

Adaptive Heritrix

ATHENA – Research and Innovation Center in Information, Communication and Knowledge Technologies


ARCOMEM Requirements for crawling

• ARCOMEM aims to guide crawling based on
  – Advanced semantic link extraction
  – Use of social media
  – Analysis of crawled content in a large-scale distributed environment

• These aims require a crawler to
  – Update priorities adaptively
  – Operate as a service


Adaptive Prioritization

• New Heritrix frontier class
  – Plug & Play with open source Heritrix
  – Minimal configuration

• Adding a forward index for URLs
  – Locates a link already scheduled for crawling

• Moves the scheduled link to the place corresponding to the updated priority
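
The idea of a forward index over scheduled URLs can be sketched in a few lines of Python. This is only an illustration, not the ARCOMEM frontier class (which is a Heritrix/Java component): the class name `AdaptiveFrontier` and its methods are hypothetical, and instead of physically moving an entry it marks the old entry as stale and pushes a new one, which gives the same crawl order.

```python
import heapq
import itertools


class AdaptiveFrontier:
    """Toy frontier: max-priority queue plus a forward index keyed by URL.

    Hypothetical sketch; not the Heritrix frontier API.
    """

    def __init__(self):
        self._heap = []                # entries: [-priority, seq, url, valid]
        self._index = {}               # forward index: url -> live heap entry
        self._seq = itertools.count()  # unique tie-breaker, keeps insertion order

    def schedule(self, url, priority):
        """Schedule a URL, or re-prioritize it if it is already scheduled."""
        old = self._index.get(url)
        if old is not None:
            old[3] = False             # invalidate the stale entry in place
        entry = [-priority, next(self._seq), url, True]
        self._index[url] = entry
        heapq.heappush(self._heap, entry)

    def next_url(self):
        """Pop the highest-priority URL that is still valid, or None if empty."""
        while self._heap:
            neg_priority, _, url, valid = heapq.heappop(self._heap)
            if valid:
                del self._index[url]
                return url, -neg_priority
        return None
```

Calling `schedule` a second time for the same URL plays the role of the priority update: the index finds the existing entry in constant time, so no scan of the queue is needed.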


Heritrix as a crawling service

• Decoupled fetching and link prioritization

• Writing crawled data to modified WARC files
  – WARC files are loaded into HBase by a separate process

• Efficient URL injection end-point
  – Receives scored links from online analysis and the API crawler
  – ARCOMEM-specific JSON format of outlinks
  – External-memory queue to handle large volumes of links
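
A rough sketch of what an injection end-point backed by an external-memory queue might look like. The slides do not show the ARCOMEM JSON format, so the payload shape used here (`{"outlinks": [{"url": ..., "score": ...}]}`) is an assumption made up for illustration, as are the class and method names. SQLite stands in for the external-memory queue; the real implementation may use something quite different.

```python
import json
import sqlite3


class DiskBackedLinkQueue:
    """FIFO queue of scored links stored in SQLite (hypothetical sketch)."""

    def __init__(self, path=":memory:"):
        # Pass a file path to keep the queue on disk; ":memory:" is for demos only.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS links ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT, score REAL)"
        )

    def inject(self, payload):
        """Append every scored outlink in a JSON payload; return how many were added."""
        # Assumed payload shape, not the actual ARCOMEM format.
        outlinks = json.loads(payload)["outlinks"]
        rows = [(link["url"], float(link["score"])) for link in outlinks]
        with self._db:  # commits on success, rolls back on error
            self._db.executemany(
                "INSERT INTO links (url, score) VALUES (?, ?)", rows
            )
        return len(rows)

    def drain(self, batch_size=1000):
        """Remove and return up to batch_size links in arrival order."""
        with self._db:
            rows = self._db.execute(
                "SELECT id, url, score FROM links ORDER BY id LIMIT ?",
                (batch_size,),
            ).fetchall()
            if rows:
                self._db.execute(
                    "DELETE FROM links WHERE id <= ?", (rows[-1][0],)
                )
        return [(url, score) for _, url, score in rows]
```

The crawler side would call `drain` in batches and hand each `(url, score)` pair to the frontier, so the volume of injected links is bounded by disk rather than by memory.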


Assessing the impact of adaptive prioritization

• Simulations to evaluate how adaptive prioritization affects the performance of a focused crawler
  – Simulation on 3 DMOZ topics: Genetics, Recycling, Oceanography

• Running a simulated crawl
  – Start from a set of 20 randomly selected seeds (repeated 3 times)
  – Topic vector is the sum of the seed vectors
  – Crawl 10,000 web pages

• Compare the effectiveness of a best-first crawler to
  – Adaptive prioritization: priorities are updated using MAX, MIN, AVG, SUM, FIRST, LAST functions
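
The simulation setup can be illustrated with a small Python sketch. Several things here are assumptions rather than details from the slides: pages are represented as sparse term-weight dictionaries, relevance is measured with cosine similarity, and each link carries the list of scores of the pages that pointed to it, in discovery order. All function names are hypothetical.

```python
import math
from collections import Counter


def topic_vector(seed_vectors):
    """Sum the seed term vectors into one topic vector."""
    total = Counter()
    for vec in seed_vectors:
        total.update(vec)  # Counter.update adds weights term by term
    return total


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (assumed measure)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One entry per update function named on the slide. Each takes the scores
# of all pages that linked to a URL so far, in order of discovery.
UPDATE_FUNCTIONS = {
    "MAX": max,
    "MIN": min,
    "AVG": lambda scores: sum(scores) / len(scores),
    "SUM": sum,
    "FIRST": lambda scores: scores[0],   # never updated: plain best-first
    "LAST": lambda scores: scores[-1],
}


def link_priority(scores, function):
    """Combine the scores seen for a link into a single crawl priority."""
    return UPDATE_FUNCTIONS[function](scores)
```

For a link discovered from three pages with scores `[0.2, 0.8, 0.5]`, FIRST keeps 0.2 and LAST uses 0.5, while MAX uses 0.8; these differences are what change the crawl order.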


Adaptive Prioritization results


Update function   Harvest Ratio   Average Similarity   DMOZ topics
FIRST             0.3317          0.2945               0.4979
AVG               0.3609          0.3024               0.5779
MAX               0.3388          0.2967               0.5270
SUM               0.2679          0.2759               0.4650
LAST              0.3404          0.2961               0.5985

• AVG and LAST have the highest harvest ratios and find the most pages from DMOZ topics

• Adaptive prioritization with AVG, MAX or LAST is more effective than FIRST, i.e. the best-first crawler; SUM scores lower than FIRST on all three measures
