Eurostat
THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION
Web scraping tools, a real life application
ESTP course on Big Data Sources – Web, Social Media and Text Analytics, Day 1
Olav ten Bosch, Statistics Netherlands
Aim of this afternoon
• Build a web scraper for a web site of your choice with the CBS Robot Framework
• Learn about web technology (HTML, CSS, XPath)
• Learn about the Robot Framework
• Introduce some useful tools for inspecting web sites
• Hands-on experience with configuring and running the Robot Framework
Overview
• Introducing the Robot Framework
• Data extraction
• Coffee break
• Site navigation
The CBS Robot Framework
• Used for automated site navigation and data extraction
• Rule based configuration
• Does not require programming
• But: allows programming for advanced use
• Uses a full-blown headless browser (PhantomJS)
• Works with rendered pages, not page source
• Includes a JavaScript engine
• Generates CSV data files and extensive logs
Framework config
• Format: JSON
• (actually, a Node.js JavaScript module)
• Different sections
• startUrls
• extractionRules
• navigationRules
• (and some others, which are for advanced use)
JSON quick reference
name:value assign value to named property
"string" character string
number number
{ } object (set of properties)
[ ] array of values
See also: http://www.json.org/
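As a small illustration of the notation above, the following JavaScript sketch parses a JSON text that uses each construct once. The property names (name, pages, options, tags) are made up for this example and are not framework settings.

```javascript
// A JSON text using every construct from the quick reference.
// All property names here are invented for illustration only.
const text = `{
  "name": "demo",
  "pages": 3,
  "options": { "level": 2 },
  "tags": ["news", "sport"]
}`;

const parsed = JSON.parse(text);

console.log(typeof parsed.name);         // prints "string"
console.log(typeof parsed.pages);        // prints "number"
console.log(typeof parsed.options);      // prints "object" (set of properties)
console.log(Array.isArray(parsed.tags)); // prints true (array of values)
```

Note that strict JSON requires property names in double quotes. The quick references in these slides write names without quotes, which is valid in a JavaScript object literal but not in a pure JSON file.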
StartUrls
• One or more start URLs
• Each start URL is a separate object
• Must have a unique name
• Must contain url property
• May contain extractionContext and/or navigationContext properties
StartUrls quick reference
startUrls: {
    startVariable: "site",
    <any_site_name>: {
        url: "http://... ",
        extractionContext: "overview",
        navigationContext: "menu"
    },
    <any_site_name>: {
        ...
    }
}
see also: Framework user manual, section 2.3
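To make the quick reference concrete, here is a minimal sketch of a filled-in startUrls section, written as a plain JavaScript object. The site name exampleShop and the example.com URL are placeholders, and the helper findMissingUrls is a hypothetical check written for this example, not something the framework provides. How the framework expects the object to be exported from the config module is described in the user manual and is not shown here.

```javascript
// Sketch of a startUrls section. "exampleShop" and the example.com URL are
// placeholders; "overview" and "menu" are the context names used on the slide.
const startUrls = {
  startVariable: "site", // copied from the quick reference as-is
  exampleShop: {
    url: "http://www.example.com/products",
    extractionContext: "overview",
    navigationContext: "menu"
  }
};

// Hypothetical helper (not part of the framework): list start URL entries
// that lack the mandatory url property. Non-object entries such as
// startVariable are skipped, which is an assumption about the format.
function findMissingUrls(section) {
  return Object.keys(section).filter((name) => {
    const entry = section[name];
    const isSite = typeof entry === "object" && entry !== null;
    return isSite && typeof entry.url !== "string";
  });
}

console.log(findMissingUrls(startUrls)); // prints []
```

Because the site names are property keys of a single object, a repeated name in a JavaScript object literal silently overwrites the earlier entry. That is one practical reason for giving each start URL a unique name.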
Running the Framework
• Config directory: RobotConfig\ESTP
• The following commands are available:
• newrobot <robotname>
• initialises a new, empty framework config
• runrobot <robotname>
• runs a robot
• Output directories:
• RobotOutput\ESTP\<robotname>\data
• RobotOutput\ESTP\<robotname>\log
Exercise 1: "Hello, world"
a) Initialise a config file and run it. Inspect the output generated.
b) Choose a site to scrape, and choose one page of this site to extract data from.
c) Add the URL from b) as the start URL to your config file and run again.
d) Once more, inspect the output. What has changed since the previous run?
Items and properties
• Items: some item of interest on a web page
• Example, web shop: products sold
• Example, news site: articles published
• Property: one piece of information about an item
• Examples, web shop: name, description, brand, price
• Examples, news site: title, body text, author, date
HTML syntax
• Tags
• Important tags: <a>, <p>, <h?>, <div>, <span>, <ul> / <li>, <table> / <tr> / <td>, <body>, <html>
• Text content
• Attributes
• id
• class
HTML Tags quick reference
<a>            Hyperlink
<p>            Paragraph
<h?>           Header. “?” is a single digit between 1 and 6
<div>          Section; rectangular block of content
<span>         Inline piece of text
<ul> / <li>    Unordered list / List item
<table>        Table
<tr> / <td>    Table row / Table cell
<body>         Document body: visible part of the page
<html>         The entire HTML document
See also: http://www.w3schools.com/tags/
CSS selectors
• Originally used in “Cascading Style Sheets” to denote which tags get a specific layout
• In conjunction with HTML class attribute
• Layout often has semantic meaning
• E.g., product names, prices, … have specific layouts
• Class name often reflects this meaning
• Used in scrapers to select specific parts of web pages
CSS Selectors quick reference
tag Select tags with indicated tag name
#id Select tag with the indicated id
.class Select tags with indicated class
[attr=value] Select tags for which attribute equals value
tag.class Select tags with indicated tag name and class
selector1 selector2 Select tags obeying selector2 within tags obeying selector1
selector1>selector2 As previous, but direct children only
selector1,selector2 Select tags obeying selector1 or selector2
See also: http://www.w3schools.com/cssref/css_selectors.asp
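One way to try out the selectors in the table above, besides browser add-ons, is the JavaScript console of a browser's developer tools. The sketch below relies on the standard DOM call querySelectorAll. The helper name selectByCss and the selector strings (class names product and price, id main) are illustrative assumptions and have nothing to do with the framework itself.

```javascript
// Hypothetical helper: return all elements in doc matching a CSS selector.
// In a browser console, pass the global document as doc.
function selectByCss(selector, doc) {
  return Array.from(doc.querySelectorAll(selector));
}

// Example selectors, one per row of the quick reference (names are made up):
const exampleSelectors = [
  "h1",                 // tag
  "#main",              // #id
  ".product",           // .class
  "[type=checkbox]",    // [attr=value]
  "div.product",        // tag.class
  "div.product .price", // selector2 within selector1
  "ul>li",              // children only
  "h1,h2"               // selector1 or selector2
];

// Only runs where a DOM is available (e.g. pasted into a browser console).
if (typeof document !== "undefined") {
  for (const selector of exampleSelectors) {
    // Print each selector with the number of elements it matches.
    console.log(selector, selectByCss(selector, document).length);
  }
}
```

A selector that matches zero elements, or far more than the number of items visible on the page, is usually a sign that it needs to be made less or more specific.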
extractionRules
• First select items from which to extract data
• Then select, for each item, elements to extract
• Selection by means of CSS selectors
• extractionContext links start URLs and extraction rules
• Use the extraction rules with the same name as the extraction context
extractionRules quick reference
extractionRules: {
    <extraction_context_name>: {
        cssSelector: "<item selector>",
        <column_name>: {
            cssSelector: "<property selector>",
            operation: "getXmlValue"
        }
    }
}
see also: Framework user manual, section 2.7
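As a hedged illustration of the quick reference, the sketch below fills in extraction rules for an imaginary web shop page and shows how the extraction context name ties a start URL to its rules. The selectors, the column names name and price, the site name exampleShop and the helper findUnmatchedContexts are all assumptions made for this example; only cssSelector, operation and "getXmlValue" come from the slide.

```javascript
// Sketch of an extractionRules section for an imaginary web shop page.
// The selectors and the column names "name" and "price" are placeholders.
const extractionRules = {
  overview: {
    cssSelector: "div.product",  // item selector: one match per product
    name: {
      cssSelector: "h2.title",   // property selector, applied for each item
      operation: "getXmlValue"
    },
    price: {
      cssSelector: "span.price",
      operation: "getXmlValue"
    }
  }
};

// Placeholder start URL that refers to the rules above by context name.
const startUrls = {
  exampleShop: {
    url: "http://www.example.com/products",
    extractionContext: "overview"
  }
};

// Hypothetical check (not part of the framework): report extraction contexts
// that are referenced by a start URL but have no rules with the same name.
function findUnmatchedContexts(urls, rules) {
  return Object.values(urls)
    .filter((entry) => entry && typeof entry === "object")
    .map((entry) => entry.extractionContext)
    .filter((context) => context !== undefined && !(context in rules));
}

console.log(findUnmatchedContexts(startUrls, extractionRules)); // prints []
```

In this sketch each column name (name, price) would correspond to one property of the item, in line with the items and properties introduced earlier.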
Exercise 2: Items of Interest
a) Identify the items on your chosen web page that you want to extract data from.
b) Compose a CSS selector to select these items. Test with Firebug & Firepath.
c) Add an extraction context to the config and include this CSS selector as item selector. Run the robot with this config.
d) Inspect the output: What has changed since the previous run?
Exercise 3: Gathering Data
a) Identify a single property from the items selected in exercise 2 that you want to extract.
b) Compose a CSS selector for this property.
c) Include this property in the config.
d) Run the config and inspect the output.
e) Repeat a) to d) with other properties of interest.
Site navigation overview
• Menus
• Top / Side menu: often hyperlinks
• Pulldown / mouseover menu: combination of CSS and JavaScript
• Multi-level menus
• Next page button
• Often implemented in JavaScript: AJAX
• Filters, facets
• Almost always implemented in JavaScript, sometimes client-side
XPath selectors
• XPath: language to select tags in [X/HT]ML code
• Similar to CSS selectors, but much more powerful
• Syntax somewhat comparable to directory names
• HTML can be seen as a hierarchy, just like a file system
• Example: html/body/div/h1/a
XPath syntax overview
/tag find tags as children of the current tag
//tag find tags as descendants of current tag
[n] select the nth tag of the indicated type
[condition] select tags which obey the given condition
@attribute select the indicated attribute of the current tag
text() select the text contents of the current tag
=, != comparison operators: equal to / not equal to
id('<id>') select the tag with the indicated id
See: http://www.w3schools.com/xsl/xpath_syntax.asp
http://www.w3schools.com/xsl/xpath_operators.asp
XPath examples
• //ul[@class='nav2']//a[text()='Politics']
• Select all hyperlinks with link text "Politics" inside a <ul> tag with class "nav2"
• //div[contains(@class, 'next')]
• Selects all <div> tags for which the class attribute contains the text "next" (a substring match, so a class such as "next-page" also qualifies)
• (id('main-menu')//ul/li)[3]
• First, select all <li> tags which are children of <ul> tags inside a tag with id "main-menu", then select the 3rd of these.
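The expressions above can also be tried out in the JavaScript console of a browser, using the standard DOM call document.evaluate. The following is a minimal sketch: the helper name selectByXPath is hypothetical, and the expressions only return nodes on a page that actually uses the class names and id from the examples.

```javascript
// Hypothetical helper: return all nodes in doc matching an XPath expression.
// In a browser console, pass the global document as doc.
function selectByXPath(expression, doc) {
  // 7 is the value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
  const result = doc.evaluate(expression, doc, null, 7, null);
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}

// The three expressions from the examples above.
const exampleExpressions = [
  "//ul[@class='nav2']//a[text()='Politics']",
  "//div[contains(@class, 'next')]",
  "(id('main-menu')//ul/li)[3]"
];

// Only runs where a DOM is available (e.g. pasted into a browser console).
if (typeof document !== "undefined") {
  for (const expression of exampleExpressions) {
    // Print each expression with the number of nodes it selects.
    console.log(expression, selectByXPath(expression, document).length);
  }
}
```

For a navigation rule that should follow exactly one link, an expression selecting exactly one node is the result to aim for.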
Exercise 4: One small step
a) Find the link (probably in a menu) you followed to the web page you used in ex. 1-3. This link should be on a different page on the same site.
b) Compose an XPath selector to select this link.
c) Add a navigation rule with this XPath selector to the config and run it. What other parts of the config do you need to change for this test?
d) Inspect the output.
Exercise 5: A giant leap
a) Find some other pages on the site you chose for which you would like to extract data. Do they have the same structure as the one from ex. 1-3?
b) Find out how to navigate to these pages.
c) Add extra navigation rules to your config to visit these pages.
d) If necessary, add extra extraction contexts / rules.
e) Run config after each change and inspect output.