Retrieval and Extraction from Big Data Sources
Outline
1. Information Sources, Retrieval and Extraction
2. Connecting to big data sources via API
3. Retrieving information from websites (web scraping)
Information sources- Source types
The number of possible sources is practically unlimited, but they can be grouped into:
- Search engines
- RSS channels
- Open data
- Social media
Information sources- Search engines
Information sources- Search engines types
There are different kinds of search engines:
- General
  - Google, Yahoo!, Bing...
- Thematic
  - Carrot2: http://search.carrot2.org
- Patents
  - National Agencies (e.g. http://consultas2.oepm.es/InvenesWeb)
  - Google Patents (https://patents.google.com)
- Legal
  - The Public Library of Law (http://www.plol.org)
  - National Agencies (e.g. http://www.poderjudicial.es/search/indexAN.jsp)
- ...
Information sources- RSS channels
RSS (Rich Site Summary, or Really Simple Syndication) channels publish the latest news of a site as full or summarized text, together with metadata such as the publication date or the author's name.
Information sources- Open data
- National agencies (e.g. http://www.ine.es)
- Eurostat (http://epp.eurostat.ec.europa.eu/portal/page/portal/statistics/search_database)
- U.S. Census Bureau (http://www.census.gov/population/international/data/idb/informationGateway.php)
- U.S. Census Bureau, list of national statistics agencies (http://www.census.gov/population/international/links/stat_int.html)
- World Bank (http://data.worldbank.org)
- United Nations (http://data.un.org)
- CEPAL (ECLAC) statistics (http://websie.eclac.cl/infest/ajax/cepalstat.asp)
- Asian-Pacific statistics (http://www.unescap.org/stat/data/swweb_syb2011/DataExplorer.aspx)
- WIPO Patents (http://www.wipo.int/patentscope/search/en/search.jsf)
- WIPO Trademarks (http://www.wipo.int/madrid/en/romarin)
Information sources- Social media
Connecting via API - HTTP protocol
● Application-layer protocol
● Defines the syntax and semantics of web communication
● HTTP/1.0 & HTTP/1.1
● Disconnected protocol: each request/response exchange is independent
● Based on:
○ request <-> response
● Plain-text messages
● Stateless: the server keeps no state between requests
● No built-in sessions
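[Figure: protocol stack, top to bottom: applications (WEB, EMAIL, FTP, NEWS); application protocols (HTTP, POP3, SMTP, FTP, NEWS); TCP/IP; physical network. Client and server exchange request/response pairs.]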
Connecting via API - HTTP messages
HTTP MESSAGE
● Initial line
● Header
● CRLF (blank line)
● Body

REQUEST
GET /tramitation.jsp HTTP/1.1
Host: www.uv.es
CRLF
Body: data, or empty

RESPONSE
HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 45
CRLF
Body: HTML + Img + …, or empty
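The four-part message layout can be sketched in base R. This is only an illustration: parse_http_message is a hypothetical helper written for this example, and the response body is a made-up placeholder, so nothing here is sent over the network.

```r
# CRLF is the line terminator used by HTTP.
crlf <- "\r\n"

# A request with an initial line, one header, the blank line and an empty body.
request <- paste0("GET /tramitation.jsp HTTP/1.1", crlf,
                  "Host: www.uv.es", crlf,
                  crlf)

# A placeholder response whose Content-Length matches its body.
body <- "<html><body>Hello</body></html>"
response <- paste0("HTTP/1.1 200 OK", crlf,
                   "Content-Type: text/html", crlf,
                   "Content-Length: ", nchar(body, type = "bytes"), crlf,
                   crlf,
                   body)

# Hypothetical helper: split a raw message into initial line, headers and body.
parse_http_message <- function(raw) {
  parts <- strsplit(raw, "\r\n\r\n", fixed = TRUE)[[1]]
  head_lines <- strsplit(parts[1], "\r\n", fixed = TRUE)[[1]]
  header_lines <- head_lines[-1]
  headers <- sub("^[^:]+:\\s*", "", header_lines)
  names(headers) <- sub(":.*$", "", header_lines)
  list(
    initial_line = head_lines[1],
    headers = headers,
    body = if (length(parts) > 1) paste(parts[-1], collapse = "\r\n\r\n") else ""
  )
}

parsed_request <- parse_http_message(request)
parsed_response <- parse_http_message(response)
print(parsed_response$initial_line)
print(parsed_response$headers)
```

In practice a client library such as httr builds and parses these messages, so this kind of manual handling is useful mainly for understanding the format.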
Connecting via API - HTTP request methods
The two most commonly used HTTP request methods:
- GET: Requests data from a specified resource
- POST: Submits data to be processed to a specified resource
https://www.w3schools.com/tags/ref_httpmethods.asp
Connecting via API - HTTP response codes
1xx: Informational
- 100 Continue
- 101 Switching Protocols
2xx: Success
- 200 OK
- 201 Created
- 202 Accepted
- 204 No Content
- 206 Partial Content
3xx: Redirection
- 300 Multiple Choices
- 301 Moved Permanently
- 302 Found
- 304 Not Modified
4xx: Client Error
- 400 Bad Request
- 401 Unauthorized
- 403 Forbidden
- 404 Not Found
- 405 Method Not Allowed
- 408 Request Timeout
5xx: Server Error
- 500 Internal Server Error
- 501 Not Implemented
- 502 Bad Gateway
- 503 Service Unavailable
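Since the first digit of a status code determines its class, a client can branch on it. A minimal sketch in base R follows; status_class is a hypothetical helper name, not part of any library.

```r
# Hypothetical helper: map an HTTP status code to its class by the first digit.
status_class <- function(code) {
  switch(as.character(code %/% 100),
         "1" = "Informational",
         "2" = "Success",
         "3" = "Redirection",
         "4" = "Client Error",
         "5" = "Server Error",
         "Unknown")
}

# Classify a few of the codes listed above.
codes <- c(100, 200, 302, 404, 503)
classes <- sapply(codes, status_class)
print(data.frame(code = codes, class = classes))
```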
Connecting via API - Example: EuroStat API
http://ec.europa.eu/eurostat/web/json-and-unicode-web-services
http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/nama_gdp_c?precision=1&geo=EU28&unit=EUR_HAB&time=2010&time=2011&indic_na=B1GM

URL       http://ec.europa.eu/eurostat/wdds
Service   rest/data
Version   v2.1
Format    json
Lang      en
Dataset   nama_gdp_c
Filters   precision=1
          geo=EU28        <- European Union (28 countries)
          unit=EUR_HAB    <- Euros per inhabitant
          time=2010
          time=2011       <- Years 2010 and 2011
          indic_na=B1GM   <- National Account Indicator: gross domestic product at market prices
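The request URL can be assembled from these components in base R, as in the sketch below. The variable names are illustrative, and no request is issued because the endpoint may have changed since the slides were written.

```r
# Components of the request, taken from the breakdown above.
base_url <- "http://ec.europa.eu/eurostat/wdds"
path <- c("rest/data", "v2.1", "json", "en", "nama_gdp_c")

# Filters as key/value pairs; a key may repeat (time appears twice).
filters <- list(precision = "1", geo = "EU28", unit = "EUR_HAB",
                time = "2010", time = "2011", indic_na = "B1GM")

# Join the pairs as key=value separated by "&".
query <- paste(names(filters), unlist(filters), sep = "=", collapse = "&")

# Join base URL and path with "/", then append the query string.
url <- paste0(paste(c(base_url, path), collapse = "/"), "?", query)
print(url)
```

The values used here contain no reserved characters, so no URL encoding is applied; values with spaces or symbols would need encoding, for example with utils::URLencode.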
{"version":"2.0","label":"GDP and main components - Current prices","href":"http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/nama_gdp_c?precision=1&geo=EU28&unit=EUR_HAB&time=2010&time=2011&indic_na=B1GM","source":"Eurostat","updated":"2016-02-10","extension":{"datasetId":"nama_gdp_c","lang":"EN","description":null,"subTitle":null,"status":{"label":null}},"class":"dataset","value":{"0":24400,"1":25100},"dimension":{"unit":{"label":"unit","category":{"index":{"EUR_HAB":0},"label":{"EUR_HAB":"Euro per inhabitant"}}},"indic_na":{"label":"indic_na","category":{"index":{"B1GM":0},"label":{"B1GM":"Gross domestic product at market prices"}}},"geo":{"label":"geo","category":{"index":{"EU28":0},"label":{"EU28":"European Union (28 countries)"}}},"time":{"label":"time","category":{"index":{"2010":0,"2011":1},"label":{"2010":"2010","2011":"2011"}}}},"id":["unit","indic_na","geo","time"],"size":[1,1,1,2]}
Response in JSON format
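A response like this one can be turned into an R data frame. The sketch below assumes the jsonlite package (not mentioned in the slides) and uses a trimmed copy of the response that keeps only the value and time entries.

```r
library(jsonlite)

# Trimmed copy of the response above: only the fields needed here.
json_text <- '{"value":{"0":24400,"1":25100},
               "dimension":{"time":{"category":{"index":{"2010":0,"2011":1}}}}}'

resp <- fromJSON(json_text)

# Each year maps to a position; each position maps to a value.
time_index <- unlist(resp$dimension$time$category$index)
values <- unlist(resp$value)

gdp <- data.frame(
  year = names(time_index),
  eur_per_inhabitant = unname(values[as.character(time_index)])
)
print(gdp)
```

The lookup works because the response stores observations by position ("0", "1") and each dimension's index says which category sits at which position.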
Connecting via API - JSON format
https://www.hurl.it/
JSON: JavaScript Object Notation.
- A syntax for storing and exchanging data.
- Text written in JavaScript object notation.
- Widely used for messages exchanged with web APIs.
- Lightweight: small payload.
- Easily converted into JavaScript objects.
- Hard for humans to read at a glance when unformatted, so a formatter is needed.
Connecting via API - JSON formatter
http://jsonlint.com
Paste the JSON code in the text area and click Validate JSON.
The result is more human readable. It is a hierarchical format:
[ ] array
{ } object
"key": "value" attributes in key-value format
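The same validation and formatting can be done locally. The following is a sketch that assumes the jsonlite package, which the slides do not mention; the input is a fragment of the Eurostat response.

```r
library(jsonlite)

# A compact fragment of the Eurostat response.
compact <- '{"id":["unit","indic_na","geo","time"],"size":[1,1,1,2],"value":{"0":24400,"1":25100}}'

# validate() checks the syntax; prettify() adds line breaks and indentation.
is_valid <- validate(compact)
pretty <- prettify(compact, indent = 2)
cat(pretty)
```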
Webscraping- Definition
“
Web scraping, web harvesting or web data extraction is data scraping used
for extracting data from websites.
Web scraping software may access the World Wide Web directly using the
Hypertext Transfer Protocol (HTTP).
It is a form of copying, in which specific data is gathered and copied from the
web, typically into a central local database or spreadsheet, for later retrieval
or analysis.
”
https://en.wikipedia.org/wiki/Web_scraping
Webscraping- rvest
rvest is a package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.
There are some system dependencies to install first:
- libcurl4-openssl-dev
- libssl-dev
- libxml2-dev
Webscraping- Example: LEGO Film
http://www.imdb.com/title/tt1490017/
Let’s extract the following information about the LEGO film:
- User rating
- Cast
Webscraping- HTML + CSS
HTML: HyperText Markup Language
- A language to create webpages.
- Tag-based syntax.
- E.g. <body></body><table></table>...
- Should describe the content, not its appearance.
CSS: Cascading Style Sheets
- A language to describe how HTML elements should be displayed on screen, paper or in other media.
- Key-value-based syntax associated with HTML elements.
- E.g. body {background-color: black; color: white; font-family: verdana; }
Webscraping- CSS selector
http://selectorgadget.com
There are several tools for finding CSS selectors. Let’s install SelectorGadget for Chrome.
Webscraping- CSS selector
Click on “Add to Chrome” and accept.
A new icon will appear in Chrome toolbar.
Let’s click on it.
Move the mouse over the rating and click.
Copy the CSS value.
Write the next code in R.
Webscraping- Scraping rating
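A minimal sketch of scraping a rating with rvest. To stay runnable offline it parses an inline HTML fragment instead of the IMDb page; the markup, the selector ".rating strong span" and the value 7.5 are all placeholders. On the real page, use the URL given earlier and the selector copied from SelectorGadget.

```r
library(rvest)

# Placeholder markup standing in for the downloaded page.
page <- read_html('<html><body>
  <div class="rating"><strong><span>7.5</span></strong></div>
</body></html>')

# Select the node with the CSS selector, extract its text, convert to a number.
rating <- page %>%
  html_nodes(".rating strong span") %>%
  html_text() %>%
  as.numeric()
print(rating)
```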
Webscraping- Scraping cast
To scrape the cast, the CSS selection has to be made twice:
- Select the table
- Select the column
Then, write the following R code.
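The two-step selection can be sketched as follows. The inline table, the selectors "table.cast_list" and "td.name", and the names in it are placeholders invented for this example; the real selectors come from SelectorGadget.

```r
library(rvest)

# Placeholder markup standing in for the cast table of the page.
page <- read_html('<html><body>
  <table class="cast_list">
    <tr><td class="name"> Actor A </td><td class="character">Role A</td></tr>
    <tr><td class="name"> Actor B </td><td class="character">Role B</td></tr>
  </table>
</body></html>')

# First select the table, then the column inside it, then extract the text.
cast <- page %>%
  html_nodes("table.cast_list") %>%
  html_nodes("td.name") %>%
  html_text(trim = TRUE)
print(cast)
```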
Webscraping- Example: gold prices
https://www.measuringworth.com/datasets/gold/result.php
To obtain the table of results, the user has to choose some parameters:
- A series of markets
- Initial year
- Ending year
Hence, this is a dynamic webpage where the user sends parameters in a request to the server. There are two possibilities:
- GET
- POST
Webscraping- Firebug Tool
http://getfirebug.com/releases/lite/chrome/
Firebug lets you inspect the HTML source code of a page:
- Click the Firebug icon on the browser toolbar.
- An inspection window opens below the webpage.
- Click the Inspect button and inspect the page.
Webscraping- Inspecting the form
When inspecting the form, we observe
that the method is POST.
Webscraping- Looking for parameters
The needed parameters are the
following:
- london
- goldsilver
- newyork
- us
- British
- year_source
- year_result
Webscraping- Required libraries
To make a POST request in R, the following libraries are needed:
- rvest
- httr
Webscraping- POST request
Instead of making a direct request to the URL, which by default is sent as GET, we need to encode the request as a POST with the following parameters:
- The URL
- The query (parameters) as a list of key=value pairs
Then, the body of the response is retrieved with the content function:
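A hedged sketch of such a request with httr. The parameter names come from the slides, but every value below is a placeholder, and sending the parameters as a form-encoded body is an assumption about what the site expects. fetch_html_prices is a hypothetical helper, and the call is left commented out because it needs network access.

```r
library(httr)
library(rvest)

url <- "https://www.measuringworth.com/datasets/gold/result.php"

# Parameter names from the inspected form; all values are placeholders.
query <- list(
  london = "on",
  goldsilver = "on",
  newyork = "on",
  us = "on",
  British = "on",
  year_source = "1900",
  year_result = "2000"
)

# Hypothetical helper: send the POST and return the parsed HTML of the response.
fetch_html_prices <- function(url, query) {
  response <- POST(url, body = query, encode = "form")
  stop_for_status(response)  # fail early on 4xx/5xx responses
  read_html(content(response, as = "text", encoding = "UTF-8"))
}

# html_prices <- fetch_html_prices(url, query)  # requires network access
```

Checking the status before parsing avoids silently scraping an error page.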
Webscraping- Results formatting
To obtain the dataset, a chain of functions joined by the pipe operator (%>%) is used:
- html_prices contains the webpage HTML
- html_nodes("table") selects every element matching the table CSS selector
- .[[2]] selects the second element in the list
- html_table converts it to an R data frame
The result is stored in prices; then the first row is removed and the column names are assigned.
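The chain can be sketched offline on a placeholder page with two tables. The markup, the numbers and the column names "year" and "price" are invented for illustration; the real page has its own columns.

```r
library(rvest)

# Placeholder page: a layout table followed by the table of results.
html_prices <- read_html('<html><body>
  <table><tr><td>layout</td></tr></table>
  <table>
    <tr><td>Year</td><td>Price</td></tr>
    <tr><td>1900</td><td>10</td></tr>
    <tr><td>1901</td><td>20</td></tr>
  </table>
</body></html>')

# Select all tables, keep the second one and convert it to a data frame.
prices <- html_prices %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(header = FALSE)

# Drop the first row (the original header) and assign column names.
prices <- prices[-1, ]
names(prices) <- c("year", "price")
prices$price <- as.numeric(prices$price)
print(prices)
```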
Webscraping- The whole R code