BlogForever Crawler: Techniques and Algorithms to Harvest Modern Weblogs

Olivier Blanvillain (1), Nikos Kasioumis (2), Vangelis Banos (3)

(1) Ecole Polytechnique Federale de Lausanne (EPFL), 1015 Lausanne, Switzerland
(2) European Organization for Nuclear Research (CERN), 1211 Geneva 23, Switzerland
(3) Department of Informatics, Aristotle University, Thessaloniki 54124, Greece

BlogForever Crawler: Techniques and algorithms to harvest modern weblogs Presentation at WIMS'14


Blogs are a dynamic communication medium which has been widely established on the web. The BlogForever project has developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents a key component of the BlogForever platform, the web crawler. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple and robust algorithm to generate extraction rules based on string matching using the blog’s web feed in conjunction with blog hypertext. This approach leads to a scalable blog data extraction process. Furthermore, we show how we integrate a web browser into the web harvesting process in order to support the data extraction from blogs with JavaScript generated content.



Contents

• Introduction:
  – The disappearing web
  – Blog archiving
• Our Contributions
• Algorithms:
  – Motivation
  – Blog content extraction
  – Extraction rules
  – Variations for authors, dates and comments
• System Architecture
• Evaluation:
  – Comparison with 3 web article extraction systems
• Issues and Future Work


The disappearing web

Source: http://gigaom.com/2012/09/19/the-disappearing-web-information-decay-is-eating-away-our-history/


Blog archiving

1. Why archive the web?
   – Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public.
2. Blog archiving is a special case of web archiving.
3. The blogosphere is a live record of contemporary society, culture, science and economy.
4. Some blogs contain unique data and valuable information.
   – Users take action and make important decisions based on this information.
5. We have a responsibility to preserve the web.


[Figure: BlogForever platform overview]

• Harvesting: blog crawlers (real-time monitoring, HTML data extraction engine, spam filtering). Input: unstructured information. Output: original data and XML metadata.
• Preserving, managing and reusing: blog digital repository (digital preservation, quality assurance, collections curation, public access APIs, personalised services, information retrieval, public web interface to browse, search and export).
• Access: web services and web interface.

FP7 EC Funded Project: http://blogforever.eu/


Our Contributions

• A web crawler capable of extracting blog articles, authors, publication dates and comments.
• A new algorithm to build extraction rules from blog web feeds with linear time complexity.
• Applications of the algorithm to extract authors, publication dates and comments.
• A new web crawler architecture, including how we use a complete web browser to render JavaScript web pages before processing them.
• Extensive evaluation of the content extraction and execution time of our algorithm against three state-of-the-art web article extraction algorithms.


Motivation

• Extracting metadata and content from HTML documents is a challenging task.
  – Web standards usage is low (<0.5% of websites).
  – More than 95% of websites do not pass HTML validation.
• Having blogs as our target websites, we made the following observations, which play a central role in the extraction process:
  a) Blogs provide web feeds: structured and standardized XML views of the latest posts of a blog.
  b) Posts of the same blog share a similar HTML structure.
  c) Web feeds usually have 10-20 posts whereas blogs contain a lot more. We have to access more posts than the ones referenced in web feeds.


Content Extraction Overview

1. Use blog web feeds and referenced HTML pages as training data to build extraction rules.
2. The extraction rules must be capable of locating, in an HTML page, all RSS-referenced elements such as:
   1. Title
   2. Author
   3. Description
   4. Publication date
3. Use the defined extraction rules to process all blog pages.


Locate in HTML page all RSS referenced elements


Generic procedure to build extraction rules


Extraction rules and string similarity

• Rules are XPath queries.
• For each rule, we compute the score based on string similarity.
• The choice of ScoreFunction greatly influences the running time and precision of the extraction process.
• Why we chose Sorensen–Dice coefficient similarity:
  1. Low sensitivity to word ordering and length variations
  2. Runs in linear time
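To make the similarity measure concrete, here is a minimal sketch of a Sorensen–Dice score over character bigrams. The function names and the choice of character (rather than word) bigrams are assumptions for illustration, not necessarily what the crawler itself uses.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Multiset of adjacent character pairs in `text` (assumed tokenisation)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_similarity(a: str, b: str) -> float:
    """Sorensen-Dice coefficient: 2 * |X & Y| / (|X| + |Y|) over bigram multisets."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Neither string has a bigram: fall back to plain equality.
        return 1.0 if a == b else 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total
```

Because the score depends only on which bigrams occur, swapping the two words in "hello world" still leaves most bigrams shared, which is the low sensitivity to word ordering mentioned above. Building and intersecting the two multisets takes time proportional to the combined length of the strings.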


Example: blog post title best extraction rule

• RSS feed: http://vbanos.gr/en/feed/
• Find RSS blog post title “volumelaser.eim.gr” in HTML page: http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/

XPath                                                 HTML element value                     Similarity score
/body/div[@id="page"]/header/h1                       volumelaser.eim.gr                     100%
/body/div[@id="page"]/div[@class="entry-code"]/p/a    http://volumelaser.eim.gr/             80%
/head/title                                           volumelaser.eim.gr | Βαγγέλης Μπάνος   66%
...                                                   ...                                    ...

• The best extraction rule for the blog post title is: /body/div[@id="page"]/header/h1
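The selection step in this example can be sketched as an argmax over candidate nodes. The snippet below is an illustration only: `dice` and `best_rule` are hypothetical names, the candidates are the three rows shown on the slide, and the scores it computes are not guaranteed to reproduce the slide's percentages exactly.

```python
from collections import Counter


def dice(a: str, b: str) -> float:
    """Sorensen-Dice coefficient over character bigrams (assumed tokenisation)."""
    x = Counter(a[i:i + 2] for i in range(len(a) - 1))
    y = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(x.values()) + sum(y.values())
    return 2.0 * sum((x & y).values()) / total if total else float(a == b)


def best_rule(target: str, candidates: dict[str, str]) -> tuple[str, float]:
    """Return the (xpath, score) pair whose node text best matches `target`."""
    scored = ((xpath, dice(target, text)) for xpath, text in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Candidate nodes taken from the example table: XPath -> text of the node.
candidates = {
    '/body/div[@id="page"]/header/h1': "volumelaser.eim.gr",
    '/body/div[@id="page"]/div[@class="entry-code"]/p/a': "http://volumelaser.eim.gr/",
    "/head/title": "volumelaser.eim.gr | Βαγγέλης Μπάνος",
}

# The target is the post title as given in the RSS feed.
xpath, score = best_rule("volumelaser.eim.gr", candidates)
print(xpath, score)
```

In a real run the candidates would be every node of the parsed HTML page rather than a hand-written dictionary.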


Time complexity and linear reformulation

• Post-order traversal of the HTML tree.
• Compute node bigrams from their children bigrams.
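The bottom-up idea can be sketched with the standard library's ElementTree. This is a simplified illustration under stated assumptions: it uses character bigrams, parses well-formed XML rather than real-world HTML, ignores bigrams that straddle node boundaries, and merges counters in a way that is simple rather than tuned for strict linear time.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def bigrams(text: str) -> Counter:
    """Multiset of adjacent character pairs in `text`."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def node_bigrams(root: ET.Element) -> dict[ET.Element, Counter]:
    """Post-order traversal: a node's bigrams are built from its children's."""
    result: dict[ET.Element, Counter] = {}

    def visit(node: ET.Element) -> Counter:
        total = bigrams(node.text or "")
        for child in node:
            total += visit(child)                # child is finished before its parent
            total += bigrams(child.tail or "")   # text after the child belongs to the parent
        result[node] = total
        return total

    visit(root)
    return result
```

Each node's text is tokenised once, and every ancestor reuses the counts already computed for its descendants instead of re-reading their text.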


Variations for authors, dates, comments

• Authors, dates and comments are special cases as they appear many times throughout a post.

• To resolve this issue, we implement an extra component in the Score function:
  – For authors: an HTML tree distance between the evaluated node and the post content node.
  – For dates: we check the alternative formats of each date in addition to the HTML tree distance between the evaluated node and the post content node.
    • Example: “1970-01-01” == “January 1 1970”
  – For comments: we use the special comment RSS feed.
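The date variation can be illustrated by expanding one feed date into several textual forms and checking whether any of them occurs in a node's text. The list of formats below is an assumption chosen for the example; the slides only state that alternative formats are checked.

```python
from datetime import date

# Hard-coded English month names, to avoid depending on the system locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def date_variants(d: date) -> set[str]:
    """A few textual renderings of the same date (illustrative, not exhaustive)."""
    month = MONTHS[d.month - 1]
    return {
        d.isoformat(),                          # 1970-01-01
        f"{month} {d.day} {d.year}",            # January 1 1970
        f"{month} {d.day}, {d.year}",           # January 1, 1970
        f"{d.day} {month} {d.year}",            # 1 January 1970
        f"{d.day:02d}/{d.month:02d}/{d.year}",  # 01/01/1970
    }


def matches_date(node_text: str, d: date) -> bool:
    """True if any rendering of `d` appears in the node's text."""
    return any(variant in node_text for variant in date_variants(d))
```

For instance, `matches_date("Posted on January 1, 1970 by admin", date(1970, 1, 1))` holds even though the feed would typically carry the date in a different format.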


System Architecture

• Our crawler is built on top of Scrapy (http://www.scrapy.org/)


System Architecture

• Pipeline of operations:
  1. Render HTML and JavaScript
  2. Extract content
  3. Extract comments
  4. Download multimedia files
  5. Propagate resulting records to the back-end
• Interesting areas:
  – Blog post page identification
  – Handle blogs with a large number of pages
  – JavaScript rendering
  – Scalability


Blog post identification

• The crawler visits all blog pages.
• For each URL, it needs to identify whether it is a blog post or not.
• We construct a regular expression from the post URLs listed in the blog's RSS feed to identify blog posts.
• We assume that all posts from the same blog use the same URL pattern.
• This assumption is valid for all blog platforms we have encountered.
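One possible way to derive such a regular expression is to compare the feed's post URLs segment by segment. The sketch below is an assumption about how this could be done, not the crawler's actual procedure; `post_url_pattern` is a hypothetical name and the example.org URLs are made up.

```python
import re
from urllib.parse import urlparse


def post_url_pattern(feed_urls: list[str]) -> re.Pattern:
    """Generalise the post URLs found in a feed into one regular expression."""
    paths = [urlparse(u).path.strip("/").split("/") for u in feed_urls]
    if len({len(p) for p in paths}) != 1:
        raise ValueError("feed URLs do not share a common path depth")
    parts = []
    for segments in zip(*paths):
        if len(set(segments)) == 1:
            parts.append(re.escape(segments[0]))  # constant segment, e.g. "blog"
        elif all(s.isdigit() for s in segments):
            parts.append(r"\d+")                  # varying numeric segment, e.g. a year
        else:
            parts.append(r"[^/]+")                # varying slug
    host = re.escape(urlparse(feed_urls[0]).netloc)
    return re.compile(r"^https?://" + host + "/" + "/".join(parts) + "/?$")


# Hypothetical feed entries, for illustration only.
feed = [
    "http://example.org/blog/2014/03/09/first-post/",
    "http://example.org/blog/2013/11/20/second-post/",
]
pattern = post_url_pattern(feed)
print(bool(pattern.match("http://example.org/blog/2012/01/05/hello/")))
print(bool(pattern.match("http://example.org/en/feed/")))
```

A known weakness of this simple approach: if every post in the feed shares a segment by coincidence (say, the same year), that segment is kept as a literal and older posts would be missed.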


Handle blogs with a large number of pages

• Avoid random walk of pages, depth first search or breadth first search.

• Use a priority queue with machine learning defined priorities.

• Pages with a lot of blog post URLs have a higher priority.
• Use a Distance-Weighted kNN classifier to predict priorities:

– Whenever a new page is downloaded, it is given to the machine learning system as training data.

– When the crawler encounters a new URL, it will ask the machine learning system for the potential number of blog posts and use the value as the download priority of the URL.
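The scheduling loop described above can be sketched with a heap-based frontier and a small distance-weighted kNN predictor. Everything specific here is an assumption for illustration: the two URL features, the value of k, the inverse-distance weighting and the class names are not taken from the slides.

```python
import heapq
import math


def features(url: str) -> tuple[float, float]:
    """Hypothetical URL features: number of slashes and number of digits."""
    rest = url.split("://", 1)[-1]
    return (float(rest.count("/")), float(sum(c.isdigit() for c in rest)))


class PostCountPredictor:
    """Distance-weighted kNN over previously downloaded pages."""

    def __init__(self, k: int = 3) -> None:
        self.k = k
        self.samples: list[tuple[tuple[float, float], float]] = []

    def train(self, url: str, posts_found: int) -> None:
        self.samples.append((features(url), float(posts_found)))

    def predict(self, url: str) -> float:
        if not self.samples:
            return 0.0
        f = features(url)
        nearest = sorted((math.dist(f, g), y) for g, y in self.samples)[: self.k]
        weights = [1.0 / (d + 1e-9) for d, _ in nearest]  # closer samples weigh more
        return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


class Frontier:
    """Priority queue of URLs; highest predicted post count is popped first."""

    def __init__(self, predictor: PostCountPredictor) -> None:
        self.predictor = predictor
        self.heap: list[tuple[float, str]] = []

    def push(self, url: str) -> None:
        # heapq is a min-heap, so the prediction is negated.
        heapq.heappush(self.heap, (-self.predictor.predict(url), url))

    def pop(self) -> str:
        return heapq.heappop(self.heap)[1]
```

In this sketch a URL's priority is fixed when it is pushed; a fuller design might re-score queued URLs as more training data arrives.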


JavaScript rendering

• JavaScript is a widely used client-side language.
• Traditional HTML-based crawlers do not see content that web pages generate with JavaScript.
• We embed PhantomJS, a headless web browser with great performance and scripting capabilities.
• We instruct the PhantomJS browser to click dynamic JavaScript pagination buttons on pages to retrieve more content (e.g. the Disqus "Show More" button to show comments).

• This crawler functionality is non-generic and requires human intervention to maintain and extend to other cases.


Scalability

• When aiming to work with a large amount of input, it is crucial to build every system layer with scalability in mind.

• The two core crawler procedures NewCrawl and UpdateCrawl are Stateless and Purely Functional.

• All shared mutable state is delegated to the back-end.


Evaluation

• Task:
  – Extract articles and titles from web pages
• Comparison against three open-source projects:
  – Readability (JavaScript)
  – Boilerpipe (Java)
  – Goose (Scala)
• Criteria:
  – Extraction success rate
  – Running time
• Dataset:
  – 2300 blog posts from 230 blogs obtained from the Spinn3r dataset.
• System:
  – Debian Linux 7.2, Intel Core i7-3770 3.4 GHz.
• All data, scripts and instructions to reproduce available at:
  – https://github.com/OlivierBlanvillain/blogforever-crawler-publication


Evaluation: Extraction success rates

• BlogForever Crawler competitors are generic:
  – They do not use RSS feeds.
  – They do not use structural similarities between web pages.
  – They can be used with any HTML page.


Evaluation: Running time

• Our approach spends the majority of its total running time between the initialisation and the processing of the first blog post.


Issues & Future Work

• Our main causes of failure were:
  – The insufficient quality of web feeds,
  – The high structural variation of blog pages in the same blog.
• Future Work
  – Investigate hybrid extraction algorithms. Combine with other techniques such as word density or spatial reasoning.
  – Large scale deployment of the software in a distributed architecture.


Thank you!


• Contact email: [email protected]
• Project code available at:
  – https://github.com/BlogForever/crawler