Specifying crawls
France Lasfargues, Internet Memory Foundation, Paris, France, [email protected]

Arcomem training Specifying Crawls Beginners

DESCRIPTION

This presentation on Specifying Crawls is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.


Page 1: Arcomem training Specifying Crawls Beginners

Specifying crawls

France Lasfargues Internet Memory Foundation

Paris, France, [email protected]


Page 2: Arcomem training Specifying Crawls Beginners

Training Goals

➔ Help users specify a campaign properly

➔ Make users understand what goes on in the back end of the ARCOMEM platform

➔ Set up a campaign in the crawler cockpit


Page 3: Arcomem training Specifying Crawls Beginners

Plan

What is the Web?
Challenges and SOA
ARCOMEM platform
Crawler
Set up a campaign in the ARCOMEM Crawler Cockpit


Page 4: Arcomem training Specifying Crawls Beginners

Introduction: How does the Web work?

➔ The web is managed by protocols and standards :

• HTTP Hypertext Transfer Protocol

• HTML HyperText Markup Language

• URL Uniform Resource Locator

• DNS Domain Name System

➔ Each server has an address : IP address

• Example : http://213.251.150.222/ -> http://collections.europarchive.org
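The slide's example pairs an IP address with a hostname; the mapping between the two is exactly what a DNS lookup provides. A minimal sketch with Python's standard library (using "localhost" so it works without network access; substitute a real hostname such as collections.europarchive.org in practice):

```python
import socket

def resolve(hostname: str) -> str:
    """Return the IPv4 address that DNS (or the hosts file) reports for a hostname."""
    return socket.gethostbyname(hostname)

# "localhost" resolves locally, so this runs offline; on a standard
# configuration it prints the loopback address 127.0.0.1.
print(resolve("localhost"))
```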


Page 5: Arcomem training Specifying Crawls Beginners

WWW

The web is a large space of communication and information:
• managed by servers which talk to each other by convention (protocol) and through applications in a large network
• a naming space that is organized and controlled (ICANN)

World Wide Web: abbreviated as WWW and commonly known as the Web, a system of interlinked hypertext documents accessed via the Internet


Page 6: Arcomem training Specifying Crawls Beginners

HTTP - Hypertext Transfer Protocol

➔ Client/server notion
• a request-response protocol in the client-server computing model

➔ How does it work?
• The client asks for content
• The server hosts the content and delivers it
• The browser locates the server via DNS, connects to it and sends a request
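The request the browser sends in that last step is plain text. As an illustration, this sketch builds a minimal HTTP/1.1 GET request by hand (the hostname is taken from the slides' examples; a real client would then write these bytes to a TCP socket and read the server's response):

```python
def build_get_request(host: str, path: str = "/") -> str:
    """Return the raw text of a minimal HTTP/1.1 GET request."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"        # mandatory in HTTP/1.1
        "Connection: close\r\n"    # ask the server to close after responding
        "\r\n"                     # blank line ends the header block
    )

print(build_get_request("www.europarchive.org"))
```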


Page 7: Arcomem training Specifying Crawls Beginners

HTML - HyperText Markup Language

➔ Markup language for Web pages

➔ Written in the form of HTML elements

➔ Creates structured documents, denoting structural semantic elements for text such as headings, paragraphs, titles, links, quotes and other items

➔ Allows text and embedded objects such as images

➔ Example : http://www.w3.org/
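Those HTML elements are what a crawler actually reads. A small sketch with Python's standard-library parser, extracting the link targets from a toy page (the page content is invented for the example):

```python
from html.parser import HTMLParser

# A toy page with the structural elements the slide lists:
# a heading, a paragraph and a link.
PAGE = "<html><body><h1>Title</h1><p>Text with <a href='/about.php'>a link</a>.</p></body></html>"

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> element."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkExtractor()
parser.feed(PAGE)
print(parser.links)  # ['/about.php']
```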


Page 8: Arcomem training Specifying Crawls Beginners

URI - URL

➔ A Uniform Resource Locator (URL) specifies where an identified resource is available and the mechanism for retrieving it.

➔ Examples :

– http://host.domain.extension/path/pageORfile

– http://www.europarchive.org

– http://collections.europarchive.org/

– http://www.europarchive.org/about.php
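The parts named in the template above (host, domain, extension, path, page or file) can be split out mechanically; Python's standard library does this with `urlparse`:

```python
from urllib.parse import urlparse

# Decompose the last example URL into the template's parts.
parts = urlparse("http://www.europarchive.org/about.php")
print(parts.scheme)  # http         -> the retrieval mechanism
print(parts.netloc)  # www.europarchive.org -> host.domain.extension
print(parts.path)    # /about.php   -> path/pageORfile
```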


Samos 2013 – Workshop: The ARCOMEM Platform

Page 9: Arcomem training Specifying Crawls Beginners

Domain name and extension

➔ Managed by the Internet Corporation for Assigned Names and Numbers (ICANN), a non-profit organization; names are allocated by registrars.
• http://www.icann.org

➔ ICANN coordinates the allocation and assignment to ensure the universal resolvability of :

• Domain names (forming a system referred to as «DNS»)

• Internet protocol («IP») addresses

• Protocol port and parameter numbers.

➔ Several types of TLD
• First-level TLD: .com, .info, etc.
• gTLD: .aero, .biz, .coop, .info, .museum, .name and .pro
• ccTLD (country code Top Level Domain): .fr


Page 10: Arcomem training Specifying Crawls Beginners

What kind of contents?

➔ Different types of content: multimedia, text, video, images

➔ Different types of producers:

• public : institution, government, museum, TV....

• private : foundation, company, press, people, blog...

http://ec.europa.eu/index_fr.htm

http://iawebarchiving.wordpress.com/

http://www.nytimes.com/

➔ Each producer is in charge of its content

• Information can disappear: fragility

• Size


Page 11: Arcomem training Specifying Crawls Beginners

Social web

➔ Focus on people’s socialization and interaction

➔ Characteristics:
• Walled spaces in which users can interact
• Creation of social networks

➔ WEB ARCHIVE -> challenges in terms of content, privacy and technique

• Examples:
• Shared bookmarks (Del.icio.us, Digg), videos (Dailymotion, YouTube), photos (Flickr, Picasa)
• Communities (MySpace, Facebook)


Page 12: Arcomem training Specifying Crawls Beginners

Ex. of technical difficulties: Videos

➔ Standard HTTP protocol

• obfuscated links to the video files

• dynamic playlists and channels, or configuration files loaded by the player; several hops and redirects to the server of the video content

e.g.: YouTube

➔ Streaming protocols: RTSP, RTMP, MMS...

• real-time protocols implemented by the video players, suited for large video files (control commands) or live broadcasts

• sometimes proprietary protocols (e.g.: RTMP - Adobe)

Available tools: MPlayer, FLVStreamer, VLC


Page 13: Arcomem training Specifying Crawls Beginners

Deep /Hidden Web

• Deep web: content accessible behind passwords, databases, payment... and hidden from search engines


http://c.asselin.free.fr/french/schema_webinvisible.htm Diagram based on the figure "Distribution of Deep Web sites by content type" from the BrightPlanet study.

Page 14: Arcomem training Specifying Crawls Beginners

How do we archive it ?

➔ Challenges for archiving:
– dynamic websites

➔ Technical barriers:
• some JavaScript
• Flash animations
• pop-ups
• streaming video and audio
• restricted access

➔ Traps: spam and loops


Page 15: Arcomem training Specifying Crawls Beginners

What do users need to do web archiving?

➔ Define the target content (Website, URL, Topic…)

➔ A tool to manage their campaigns

➔ An intelligent crawler to archive the content


Page 16: Arcomem training Specifying Crawls Beginners

Management tools (1) Several tools already exist, developed by libraries that practise web archiving.

➔Netarchivesuite (http://netarchive.dk/suite/)

➔The NetarchiveSuite software was originally developed by the two national deposit libraries in Denmark, the Royal Library and the State and University Library, and has been running in production, harvesting the Danish web, since 2005. The French National Library and the Austrian National Library joined the project in 2008.

➔Web curator tool: http://webcurator.sourceforge.net

An open-source workflow management application for selective web archiving, developed by the National Library of New Zealand and the British Library and initiated by the International Internet Preservation Consortium

➔Archive-it http://www.archive-it.org/

A subscription service from the Internet Archive to build and preserve collections: allows users to harvest, catalogue, manage and browse archived collections

➔Archivethe.net http://archivethe.net/fr/

A service provided by the Internet Memory Foundation.

➔ARCOMEM crawler cockpit


Page 17: Arcomem training Specifying Crawls Beginners

How does a crawler work ?

➔ A crawler is a bot that parses web pages in order to index and/or archive them. The robot navigates by following links

➔ Links are at the center of the crawling problem

• Explicit links: source code is available and the full path is explicitly stated

• Variable links: source code is available but uses variables to encode the path

• Opaque links: source code is not available

Example: http://www.thetimes.co.uk/tto/news/
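The link-following loop described above can be sketched in a few lines. This toy crawler handles only the easy case (explicit links); the HTTP fetch is stubbed with a canned two-page "web" so the sketch runs offline, and all page content and URLs are invented for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Stand-in for the live web: URL -> page body. A real crawler would
# issue an HTTP request here instead of a dictionary lookup.
FAKE_WEB = {
    "http://www.site.com/": "<a href='/actu'>News</a> <a href='http://www.site.com/'>Home</a>",
    "http://www.site.com/actu": "<a href='/'>Back</a>",
}

class LinkParser(HTMLParser):
    """Collect href attributes of <a> elements (explicit links only)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def crawl(seed: str) -> set:
    """Breadth-first crawl from a seed URL, returning the visited set."""
    queue, visited = [seed], set()
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue                      # avoid loops (a trap the slides mention)
        visited.add(url)
        parser = LinkParser()
        parser.feed(FAKE_WEB.get(url, ""))  # stand-in for an HTTP fetch
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return visited

print(sorted(crawl("http://www.site.com/")))
```

Variable and opaque links are precisely what this sketch cannot handle: when the path is computed by JavaScript or hidden in a player, simple parsing finds nothing to follow.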


Page 18: Arcomem training Specifying Crawls Beginners

Parameters

➔ The scoping function is used to define how deep the crawl will go

• Complete or specific content of a website

• Discovery or focus crawl

➔ Politeness

• Follow the common rules of politeness

➔ Robots.txt

• Follow

➔ Frequency

• How often do I want to launch a crawl on this target?
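Honouring robots.txt, as the politeness parameter requires, is directly supported by Python's standard library. The robots.txt content and site below are invented for the example:

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt for a hypothetical site: everything is allowed
# except the /private/ directory.
robots = RobotFileParser()
robots.parse("User-agent: *\nDisallow: /private/\n".splitlines())

# A polite crawler checks before each fetch:
print(robots.can_fetch("MyCrawler", "http://www.site.com/actu"))       # True
print(robots.can_fetch("MyCrawler", "http://www.site.com/private/x"))  # False
```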


Page 19: Arcomem training Specifying Crawls Beginners

ARCOMEM Crawlers

• IMF Crawler
• Adaptive Heritrix
• API Crawler


Page 20: Arcomem training Specifying Crawls Beginners

IMF Crawler

• Component Name: IMF Large Scale Crawler

– The large scale crawler retrieves content from the web and stores it in an HBase repository. It aims at being scalable: crawling at a fast rate from the start and slowing down as little as possible as the amount of visited URLs grows to hundreds of millions, all while observing politeness conventions (rate regulation, robots.txt compliance, etc.).

• Output:

– Web resources written to WARC files. We also have developed an importer to load these WARC files into HBase. Some metadata is also extracted: HTTP status code, identified out links, MIME type, etc.


Page 21: Arcomem training Specifying Crawls Beginners

WARC: example
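The WARC example shown on this slide did not survive extraction. As a stand-in, this sketch assembles a minimal WARC/1.0 response record by hand: a block of WARC headers, a blank line, then the captured HTTP response as payload. The URL, date, record ID and payload are invented, and a real record carries additional headers:

```python
# The captured HTTP response (status line, headers, body) is the payload.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"

# WARC header block; Content-Length counts the payload bytes.
headers = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://www.example.com/\r\n"
    "WARC-Date: 2013-01-01T00:00:00Z\r\n"
    "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    "Content-Type: application/http;msgtype=response\r\n"
    f"Content-Length: {len(payload)}\r\n"
    "\r\n"
)

# A record ends with two CRLF pairs; records are concatenated into .warc files.
record = headers.encode("ascii") + payload + b"\r\n\r\n"
print(record.decode("ascii"))
```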


Page 22: Arcomem training Specifying Crawls Beginners

Adaptive Heritrix

➔ Component Name: Adaptive Heritrix

➔ Description: Adaptive Heritrix is a modified version of the open source crawler Heritrix that allows the dynamic reordering of queued URLs

➔ Application Aware Helper


Page 23: Arcomem training Specifying Crawls Beginners

ARCOMEM Crawler Cockpit

Page 24: Arcomem training Specifying Crawls Beginners

ARCOMEM Crawler Cockpit

• Requirements described by ARCOMEM user partners (SWR – DW)

• Designed and implemented by IMF

• A UI on top of the ARCOMEM system

• Demo: Crawler cockpit


Page 25: Arcomem training Specifying Crawls Beginners

How does it work ?


Page 26: Arcomem training Specifying Crawls Beginners

Crawler Cockpit: Functionality

• Set up a campaign by focus, event, keyword, entity and URL

• Focus on target content in Social Media Category (blog, forum, video, photo...)

• Run crawl by using API crawler (Twitter, Facebook, YouTube, Flickr)

• Get a campaign overview with qualified statistics

• Do some refinement at crawl time to focus better on the target content

• Decide what content to archive


• Launch crawls following scheduler specifications

• Monitor crawls and get real-time feedback on the progress of the crawlers

• Run crawl with HTML Crawler (Heritrix and IMF Crawler)

• Export the crawled content to a WARC file

Page 27: Arcomem training Specifying Crawls Beginners

Crawler Cockpit Navigation

• Set-up: A campaign is described by an intelligent crawl definition, which associates content target to crawl parameters (schedule and technical parameters).

• Monitor: the tab gives access to statistics provided by the crawler at run time

• Overview: global dashboard on a campaign. The information is organized following different topics: general description of the campaign, metadata, current status, crawl activity, statistics and analysis

• Inspector: a tool to access the content as it is stored in HBase.

• Report: specifications and parameters of a campaign


Page 28: Arcomem training Specifying Crawls Beginners

Set-up a campaign


• General description

• Distinct named entities (e.g. person, geo-location and organization), time period, free keywords and language

• A selection of up to nine SMC (Social Media Categories)

• Schedule: each campaign has a start and end date. Frequency of the crawl is defined by choosing an interval.

Page 29: Arcomem training Specifying Crawls Beginners

Focus on Scoping function


Domain: entire website
http://www.site.com

Path: only a specific directory of a website
http://www.site.com/actu

Sub-domain: http://sport.site.com

Page + context: http://www.site.com/home.html
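The scope options above amount to a predicate over URLs. This sketch is an assumption, not the cockpit's actual implementation: the function name and rules are invented, and a sub-domain scope is treated as a domain scope whose seed host is the sub-domain (e.g. sport.site.com):

```python
from urllib.parse import urlparse

def in_scope(url: str, scope: str, seed: str) -> bool:
    """Hypothetical scope check: does a URL fall inside the crawl scope?"""
    u, s = urlparse(url), urlparse(seed)
    if scope == "domain":   # entire website under the seed's host
        return u.netloc == s.netloc
    if scope == "path":     # only a specific directory of the website
        return u.netloc == s.netloc and u.path.startswith(s.path)
    return url == seed      # "page": a single page

print(in_scope("http://www.site.com/actu/a.html", "path", "http://www.site.com/actu"))  # True
print(in_scope("http://www.site.com/blog", "path", "http://www.site.com/actu"))         # False
```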

Page 30: Arcomem training Specifying Crawls Beginners

Focus on scheduler


Frequency: weekly, monthly, quarterly...
Interval: 1 to 9
Calendar: a campaign has a start date and an end date.
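A sketch of how such a scheduler could derive launch dates from these three settings. The interpretation is an assumption (e.g. that a weekly frequency with interval 2 means one crawl every two weeks), and the dates are invented:

```python
from datetime import date, timedelta

def crawl_dates(start: date, end: date, frequency_days: int, interval: int) -> list:
    """Hypothetical scheduler: launch dates between start and end,
    one crawl every frequency_days * interval days."""
    step = timedelta(days=frequency_days * interval)
    dates, d = [], start
    while d <= end:
        dates.append(d)
        d += step
    return dates

# Weekly frequency (7 days), interval 2 -> a crawl every two weeks.
dates = crawl_dates(date(2013, 1, 1), date(2013, 2, 28), 7, 2)
print(dates)  # crawls on 1 Jan, 15 Jan, 29 Jan, 12 Feb, 26 Feb
```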

Page 31: Arcomem training Specifying Crawls Beginners

Campaign Overview


Global dashboard on a campaign:
• General description of the campaign
• Crawl activity
• Keywords
• Statistics
• Refine Mode: the user can give more or less weight to a keyword.

Page 32: Arcomem training Specifying Crawls Beginners

CC Inspector Tab


The Inspector tab allows the user to:
• check the quality of the content before indexing
• access the content (from HBase), metadata and triples directly related to a resource
• browse a list of URLs ranked by on-line analysis scores

Page 33: Arcomem training Specifying Crawls Beginners

CC Monitor Tab


The Monitor tab gives real time statistics on the running crawl.