
Page 1: Web scraping tools, a real life application

Eurostat

THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

Web scraping tools, a real life application

ESTP course on Big Data Sources – Web, Social Media and Text Analytics, Day 1

Olav ten Bosch, Statistics Netherlands


Aim of this afternoon

• Build a web scraper for a web site of your choice with the CBS Robot Framework

• Learn about web technology (HTML, CSS, XPath)

• Learn about the Robot Framework

• Introduce some useful tools for inspecting web sites

• Hands-on experience with configuring and running the Robot Framework


Overview

• Introducing the Robot Framework

• Data extraction

• Coffee break

• Site navigation


The CBS Robot Framework

• Used for automated site navigation and data extraction

• Rule based configuration

• Does not require programming

• But: allows programming for advanced use

• Uses a full-blown headless browser (PhantomJS)

• Works with rendered pages, not page source

• Includes a JavaScript engine

• Generates CSV data files and extensive logs


Framework config

• Format: JSON

• (actually, a Node.js JavaScript module)

• Different sections

• startUrls

• extractionRules

• navigationRules

• (and some others, which are for advanced use)


JSON quick reference

name:value    assign value to named property
"string"      character string
number        number
{ }           object (set of properties)
[ ]           array of values

See also: http://www.json.org/
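
As a small illustration of the notation above, the following sketch parses a JSON text in Node.js and reads back each kind of value. The property names and values (`url`, `maxPages`, `tags`, `site`) are invented for this example and are not part of the framework.

```javascript
// Illustrative only: property names and values are invented for this example.
const text = '{ "url": "http://www.example.com/", "maxPages": 3, ' +
             '"tags": ["news", "politics"], "site": { "name": "example" } }';

const parsed = JSON.parse(text);

console.log(typeof parsed.url);          // a "string" value
console.log(typeof parsed.maxPages);     // a number value
console.log(Array.isArray(parsed.tags)); // [ ] is an array of values
console.log(parsed.site.name);           // { } is an object with its own properties

// Strict JSON requires double quotes around property names; a JavaScript
// module (which the framework config actually is) also accepts bare names.
const asModule = { url: "http://www.example.com/", maxPages: 3 };
console.log(asModule.maxPages === parsed.maxPages);
```

Because the config is a JavaScript module rather than strict JSON, the unquoted `name:value` form used on these slides is valid there.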


StartUrls

• One or more start URLs

• Each start URL is a separate object

• Must have a unique name

• Must contain url property

• May contain extractionContext and/or navigationContext properties


StartUrls quick reference

startUrls: {
    startVariable: "site",
    <any_site_name>: {
        url: "http://... ",
        extractionContext: "overview",
        navigationContext: "menu"
    },
    <any_site_name>: {
        ...
    }
}

see also: Framework user manual, section 2.3
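
A filled-in version of the quick reference might look like the sketch below. The site name `exampleShop`, the URL and the context names are placeholders chosen for illustration. How this section is embedded in the config module is not shown on the slides and is left out here. The loop at the end is not part of the framework. It only checks the rule that every start URL object must contain a `url` property.

```javascript
// Sketch of a startUrls section, following the quick reference above.
// "exampleShop", the URL and the context names are placeholders.
const startUrls = {
  startVariable: "site",
  exampleShop: {
    url: "http://www.example.com/products",
    extractionContext: "overview", // name of the extraction rules to use
    navigationContext: "menu"      // assumed to work analogously for navigation rules
  }
};

// Sanity check (not a framework feature): each start URL object needs a url.
for (const [name, entry] of Object.entries(startUrls)) {
  if (typeof entry === "object" && !entry.url) {
    throw new Error("start URL '" + name + "' has no url property");
  }
}
console.log(Object.keys(startUrls));
```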


Running the Framework

• Config directory: RobotConfig\ESTP

• The following commands are available:

• newrobot <robotname>

• initialises a new, empty framework config

• runrobot <robotname>

• runs a robot

• Output directories:

• RobotOutput\ESTP\<robotname>\data

• RobotOutput\ESTP\<robotname>\log


Exercise 1: "Hello, world"

a) Initialise a config file and run it. Inspect the output generated.

b) Choose a site to scrape, and choose one page of this site to extract data from.

c) Add the URL from b) as the start URL to your config file and run again.

d) Once more, inspect the output. What has changed since the previous run?


Items and properties

• Items: some item of interest on a web page

• Example, web shop: products sold

• Example, news site: articles published

• Property: one piece of information about an item

• Examples, web shop: name, description, brand, price

• Examples, news site: title, body text, author, date
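
The relation between items, properties and tabular output can be sketched as follows: each item becomes one row and each property one column. The product data is invented, and the exact layout of the CSV files written by the framework may differ from this simplified version.

```javascript
// Hypothetical web shop items; each object is one item, each key one property.
const items = [
  { name: "Kettle", brand: "Acme", price: "24.95" },
  { name: "Toaster", brand: "Acme", price: "19.50" }
];

// One CSV column per property, one row per item.
// Simplification: no quoting or escaping of commas inside values.
function toCsv(rows) {
  const columns = Object.keys(rows[0]);
  const lines = rows.map(row => columns.map(c => row[c]).join(","));
  return [columns.join(","), ...lines].join("\n");
}

console.log(toCsv(items));
```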


HTML syntax

• Tags

• Important tags: <a>, <p>, <h?>, <div>, <span>, <ul> / <li>, <table> / <tr> / <td>, <body>, <html>

• Text content

• Attributes

• id

• class


HTML Tags quick reference

<a>            Hyperlink
<p>            Paragraph
<h?>           Heading. “?” is a single digit between 1 and 6
<div>          Section; rectangular block of content
<span>         Inline run of text
<ul> / <li>    Unordered list / list item
<table>        Table
<tr> / <td>    Table row / table cell
<body>         Document body: visible part of the page
<html>         The entire HTML document

See also: http://www.w3schools.com/tags/


CSS selectors

• Originally used in “Cascading Style Sheets” to denote which tags have a specific layout

• In conjunction with HTML class attribute

• Layout often has semantic meaning

• E.g., product names, prices, … have specific layouts

• Class name often reflects this meaning

• Used in scrapers to select specific parts of web pages


CSS Selectors quick reference

tag                    select tags with the indicated tag name
#id                    select the tag with the indicated id
.class                 select tags with the indicated class
[attr=value]           select tags for which the attribute equals the value
tag.class              select tags with the indicated tag name and class
selector1 selector2    select tags matching selector2 within tags matching selector1 (at any depth)
selector1>selector2    as previous, but direct children only
selector1,selector2    select tags matching selector1 or selector2

See also: http://www.w3schools.com/cssref/css_selectors.asp
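
To make the matching rules concrete, here is a minimal sketch of how the simplest selector forms (`tag`, `#id`, `.class`, `tag.class` and the descendant form `selector1 selector2`) could be evaluated against a small hand-built element tree. This is not how a browser or the framework implements selectors. The tree, the helper names (`el`, `parseSimple`, `matches`, `select`, `selectDescendant`) and the class names are all invented for illustration.

```javascript
// Hand-built element tree standing in for a parsed page (invented data).
const el = (tag, id, classes, children = []) => ({ tag, id, classes, children });
const page = el("body", "", [], [
  el("div", "main", ["product"], [
    el("span", "", ["name"]),
    el("span", "", ["price"])
  ]),
  el("div", "", ["banner"], [el("span", "", ["price"])])
]);

// Split a simple selector (tag, #id, .class, tag.class) into its parts.
function parseSimple(selector) {
  const m = /^([a-z][a-z0-9]*)?(?:#([\w-]+))?(?:\.([\w-]+))?$/i.exec(selector);
  if (!m || selector === "") throw new Error("unsupported selector: " + selector);
  return { tag: m[1], id: m[2], cls: m[3] };
}

// True if the node satisfies every part that the selector specifies.
function matches(node, selector) {
  const { tag, id, cls } = parseSimple(selector);
  return (!tag || node.tag === tag) &&
         (!id || node.id === id) &&
         (!cls || node.classes.includes(cls));
}

// All descendants of root (at any depth) that match a simple selector.
function select(root, selector) {
  return root.children.flatMap(child =>
    (matches(child, selector) ? [child] : []).concat(select(child, selector)));
}

// "selector1 selector2": apply each part within the results of the previous one.
function selectDescendant(root, selector) {
  return selector.trim().split(/\s+/).reduce(
    (scopes, part) => [...new Set(scopes.flatMap(scope => select(scope, part)))],
    [root]);
}

console.log(select(page, "span").length);
console.log(select(page, ".price").length);
console.log(selectDescendant(page, "div.product .price").length);
```

In this tree `.price` matches both price spans, while `div.product .price` keeps only the one inside the product block. Narrowing a property selector to its item in this way is the same idea that extraction rules rely on.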


extractionRules

• First select items from which to extract data

• Then select, for each item, elements to extract

• Selection by means of CSS selectors

• extractionContext links start URLs and extraction rules

• Use the extraction rules with the same name as the extraction context


extractionRules quick reference

extractionRules: {
    <extraction_context_name>: {
        cssSelector: "<item selector>",
        <column_name>: {
            cssSelector: "<property selector>",
            operation: "getXmlValue"
        }
    }
}

see also: Framework user manual, section 2.7
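
The sketch below fills in the quick reference for an imaginary web shop and shows how the extraction context name ties a start URL to its rule set. The site name, URL, selectors and column names are placeholders. The lookup at the end only illustrates the naming convention and is not framework code.

```javascript
// Placeholders throughout: site name, URL, selectors and column names are invented.
const startUrls = {
  startVariable: "site",
  exampleShop: { url: "http://www.example.com/products", extractionContext: "overview" }
};

const extractionRules = {
  overview: {
    cssSelector: "div.product", // item selector: one match per product
    name:  { cssSelector: "span.name",  operation: "getXmlValue" },
    price: { cssSelector: "span.price", operation: "getXmlValue" }
  }
};

// The extraction context of a start URL names the rule set to use.
const context = startUrls.exampleShop.extractionContext;
const rules = extractionRules[context];

// Every key other than the item selector is a column in the output.
const columns = Object.keys(rules).filter(key => key !== "cssSelector");
console.log(context, columns);
```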


Exercise 2: Items of Interest

a) Identify the items on your chosen web page that you want to extract data from.

b) Compose a CSS selector to select these items. Test with Firebug & Firepath.

c) Add an extraction context to the config and include this CSS selector as item selector. Run the robot with this config.

d) Inspect the output: What has changed since the previous run?


Exercise 3: Gathering Data

a) Identify a single property from the items selected in exercise 2 that you want to extract.

b) Compose a CSS selector for this property.

c) Include this property in the config.

d) Run the config and inspect the output.

e) Repeat a) to d) with other properties of interest.


Site navigation overview

• Menus

• Top / Side menu: often hyperlinks

• Pulldown / mouseover menu: combination of CSS and JavaScript

• Multi-level menus

• Next page button

• Often implemented in JavaScript: AJAX

• Filters, facets

• Almost always implemented in JavaScript, sometimes client-side


XPath selectors

• XPath: language to select tags in [X/HT]ML code

• Similar to CSS selectors, but much more powerful

• Syntax somewhat comparable to directory names

• HTML can be seen as a hierarchy, just like a file system

• Example: html/body/div/h1/a
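
The directory analogy can be sketched in a few lines: starting from the root tag, each step of the path descends one level to the children with the given tag name. This is not an XPath engine. It handles only plain child steps, and the tree and the helper names (`node`, `followPath`) are invented for illustration.

```javascript
// Hand-built element tree (invented data) standing in for a parsed HTML document.
const node = (tag, children = []) => ({ tag, children });
const doc = node("html", [
  node("head"),
  node("body", [
    node("div", [node("h1", [node("a")]), node("p")]),
    node("div", [node("h1", [node("a")])])
  ])
]);

// Follow a path of child steps such as "html/body/div/h1/a", one level per step.
function followPath(root, path) {
  const [first, ...rest] = path.split("/");
  if (root.tag !== first) return [];
  return rest.reduce(
    (current, step) => current.flatMap(n => n.children.filter(c => c.tag === step)),
    [root]);
}

console.log(followPath(doc, "html/body/div/h1/a").length);
console.log(followPath(doc, "html/body/p").length);
```

Unlike a file path, one XPath can match several tags at once: here both `<div>` blocks contain an `h1/a`, so the first path yields two links.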


XPath syntax overview

/tag           find tags as children of the current tag
//tag          find tags as descendants of the current tag
[n]            select the nth tag of the indicated type
[condition]    select tags which satisfy the given condition
@attribute     select the indicated attribute of the current tag
text()         select the text contents of the current tag
=, !=          comparison operators: equal to / not equal to
id('<id>')     select the tag with the indicated id

See: http://www.w3schools.com/xsl/xpath_syntax.asp
     http://www.w3schools.com/xsl/xpath_operators.asp


XPath examples

• //ul[@class='nav2']//a[text()='Politics']

• Select all hyperlinks with link text "Politics" inside a <ul> tag with class "nav2"

• //div[contains(@class, 'next')]

• Selects all <div> tags for which the class attribute contains the text "next" (anywhere in the attribute value)

• (id('main-menu')//ul/li)[3]

• First, select all <li> tags which are children of <ul> tags inside a tag with id "main-menu", then select the 3rd of these.
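
One detail of the `contains(@class, 'next')` example is worth checking on a real site: `contains` is a plain substring test on the attribute value, so it also matches class names that merely include the letters "next". The sketch below mimics both the substring test and a stricter whole-class-name test on a few invented class values.

```javascript
// Invented class attribute values for three <div> tags.
const classValues = ["pager next", "nextissue-teaser", "pager previous"];

// What contains(@class, 'next') tests: a substring match on the attribute value.
const bySubstring = classValues.filter(value => value.includes("next"));

// A stricter test: "next" must be one of the whitespace-separated class names.
// A commonly used XPath 1.0 equivalent is
//   contains(concat(' ', normalize-space(@class), ' '), ' next ')
const byClassName = classValues.filter(
  value => value.trim().split(/\s+/).includes("next"));

console.log(bySubstring);
console.log(byClassName);
```

The substring test keeps the first two values, the class-name test only the first.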


Exercise 4: One small step

a) Find the link (probably in a menu) you followed to the web page you used in ex. 1-3. This link should be on a different page on the same site.

b) Compose an XPath selector to select this link.

c) Add a navigation rule with this XPath selector to the config and run it. What other parts of the config do you need to change for this test?

d) Inspect the output.


Exercise 5: A giant leap

a) Find some other pages on the site you chose for which you would like to extract data. Do they have the same structure as the one from ex. 1-3?

b) Find out how to navigate to these pages.

c) Add extra navigation rules to your config to visit these pages.

d) If necessary, add extra extraction contexts / rules.

e) Run config after each change and inspect output.