Retrieval and Extraction from Big Data Sources ESTP... · 1. Information Sources, Retrieval and Extraction 2. Connecting to big data sources via API 3. Retrieving information from

Retrieval and Extraction from Big Data Sources

1. Information Sources, Retrieval and Extraction

2. Connecting to big data sources via API

3. Retrieving information from websites (web scraping)

Outline





Information sources- Source types

The number of possible sources is practically unlimited, but they can be grouped into:

- Search engines
- RSS channels
- Open data
- Social media

Information sources- Search engines

There are different kinds of search engines:

- General
  - Google, Yahoo!, Bing...
- Thematic
  - Carrot2: http://search.carrot2.org
- Patents
  - National Agencies (e.g. http://consultas2.oepm.es/InvenesWeb)
  - Google Patents (https://patents.google.com)
- Legal
  - The Public Library of Law (http://www.plol.org)
  - National Agencies (e.g. http://www.poderjudicial.es/search/indexAN.jsp)
- ...


Information sources- RSS channels

RSS (Rich Site Summary or Really Simple Syndication) channels publish the latest news of a site, as full or summarized text plus metadata such as the publication date or the author's name.
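As a minimal sketch of what reading such a channel can look like in R, the following parses a small RSS 2.0 document with the xml2 package. The feed content (titles, dates, author) is made up for illustration, and the choice of xml2 is an assumption, since the slides do not name a package for RSS.

```r
library(xml2)

# A tiny, made-up RSS 2.0 feed (hypothetical content, for illustration only).
feed_text <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example statistics office news</title>
    <item>
      <title>New population figures released</title>
      <pubDate>Mon, 01 Feb 2016 09:00:00 GMT</pubDate>
      <author>press@example.org (Press Office)</author>
      <description>Summary of the latest release.</description>
    </item>
    <item>
      <title>Quarterly accounts updated</title>
      <pubDate>Tue, 02 Feb 2016 09:00:00 GMT</pubDate>
      <author>press@example.org (Press Office)</author>
      <description>Summary of the quarterly update.</description>
    </item>
  </channel>
</rss>'

feed  <- read_xml(feed_text)
items <- xml_find_all(feed, "//item")   # one node per published news item

# Collect the text and the metadata of every item in a data frame.
news <- data.frame(
  title     = xml_text(xml_find_first(items, "title")),
  published = xml_text(xml_find_first(items, "pubDate")),
  author    = xml_text(xml_find_first(items, "author")),
  summary   = xml_text(xml_find_first(items, "description")),
  stringsAsFactors = FALSE
)
print(news)
```

For a real channel, the literal string would be replaced by the URL of the feed.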

Information sources- Open data

- National agencies (e.g. http://www.ine.es)
- Eurostat (http://epp.eurostat.ec.europa.eu/portal/page/portal/statistics/search_database)
- U.S. Census Bureau (http://www.census.gov/population/international/data/idb/informationGateway.php)
- U.S. Census Bureau, list of national statistics agencies (http://www.census.gov/population/international/links/stat_int.html)
- World Bank (http://data.worldbank.org)
- United Nations (http://data.un.org)
- CEPAL statistics (http://websie.eclac.cl/infest/ajax/cepalstat.asp)
- Asia-Pacific statistics (http://www.unescap.org/stat/data/swweb_syb2011/DataExplorer.aspx)
- WIPO patents (http://www.wipo.int/patentscope/search/en/search.jsf)
- WIPO trademarks (http://www.wipo.int/madrid/en/romarin)


Information sources- Social media





Connecting via API - HTTP protocol

● Application layer protocol
● Syntax and semantics for web communication
● HTTP/1.0 & HTTP/1.1
● Disconnected protocol
● Based on:
  ○ request <-> response
● Plain text messages
● There are no states
● There are no sessions

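[Figure: protocol stack. The applications WEB, EMAIL, FTP and NEWS use the application protocols HTTP, POP3/SMTP, FTP and NEWS, which run on top of TCP/IP and the physical network. Client and server exchange request/response pairs.]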

Connecting via API - HTTP messages

HTTP MESSAGE     REQUEST                          RESPONSE
● Initial line   GET /tramitation.jsp HTTP/1.1    HTTP/1.1 200 OK
● Header         Host: www.uv.es                  Content-Type: text/html
                                                  Content-Length: 45
● CRLF           CRLF                             CRLF
● Body           Data or empty                    HTML + Img + … or empty

In this example the request is sent to the server www.uv.es.
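The four parts of an HTTP message (initial line, header, CRLF, body) can be made concrete with a few lines of base R. This is only an illustrative sketch: the response is written by hand, `parse_http_message` is a hypothetical helper name, and real clients such as httr do this work internally.

```r
# Build a raw HTTP response by hand; "\r\n" is the CRLF line ending.
body <- "<html><body>Hello</body></html>"
raw_response <- paste0(
  "HTTP/1.1 200 OK\r\n",
  "Content-Type: text/html\r\n",
  "Content-Length: ", nchar(body, type = "bytes"), "\r\n",
  "\r\n",
  body
)

# Hypothetical helper: split a raw message into its parts.
parse_http_message <- function(msg) {
  # The first empty line (CRLF CRLF) separates the head from the body.
  split_at  <- as.integer(regexpr("\r\n\r\n", msg, fixed = TRUE))
  head_part <- substr(msg, 1, split_at - 1)
  body_part <- substr(msg, split_at + 4, nchar(msg))

  head_lines   <- strsplit(head_part, "\r\n", fixed = TRUE)[[1]]
  header_lines <- head_lines[-1]          # everything after the initial line

  list(
    initial_line = head_lines[1],
    headers = setNames(
      sub("^[^:]*:\\s*", "", header_lines),  # text after the first colon
      sub(":.*$", "", header_lines)          # text before the first colon
    ),
    body = body_part
  )
}

parsed <- parse_http_message(raw_response)
print(parsed$initial_line)
print(parsed$headers)
print(parsed$body)
```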

Connecting via API - HTTP request methods

The two most common HTTP request methods:

- GET: Requests data from a specified resource

- POST: Submits data to be processed to a specified resource

https://www.w3schools.com/tags/ref_httpmethods.asp

Connecting via API - HTTP response codes

1xx: Informational

- 100 Continue

- 101 Switching Protocols

2xx: Success

- 200 OK

- 201 Created

- 202 Accepted

- 204 No Content

- 206 Partial Content

3xx: Redirection

- 300 Multiple Choices

- 301 Moved Permanently

- 302 Found

- 304 Not Modified

4xx: Client Error

- 400 Bad Request

- 401 Unauthorized

- 403 Forbidden

- 404 Not Found

- 405 Method Not Allowed

- 408 Request Timeout

5xx: Server Error

- 500 Internal Server Error

- 501 Not Implemented

- 502 Bad Gateway

- 503 Service Unavailable
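Since the first digit of the code identifies the class of the response, a client can decide how to react without knowing every individual code. A minimal base R sketch, where `status_class` is a hypothetical helper name:

```r
# Hypothetical helper: map an HTTP status code to its class
# using only the first digit of the code.
status_class <- function(code) {
  classes <- c(
    "1" = "Informational",
    "2" = "Success",
    "3" = "Redirection",
    "4" = "Client Error",
    "5" = "Server Error"
  )
  # Integer division by 100 keeps the first digit; unknown classes give NA.
  unname(classes[as.character(code %/% 100)])
}

print(status_class(c(200, 301, 404, 503)))
```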

Connecting via API - Example: EuroStat API

http://ec.europa.eu/eurostat/web/json-and-unicode-web-services

http://www.scb.se/en_/









http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/nama_gdp_c?precision=1&geo=EU28&unit=EUR_HAB&time=2010&time=2011&indic_na=B1GM

URL      http://ec.europa.eu/eurostat/wdds
Service  rest/data
Version  v2.1
Format   json
Lang     en
Dataset  nama_gdp_c
Filters  precision=1
         geo=EU28          <- European Union (28 countries)
         unit=EUR_HAB      <- Euros per inhabitant
         time=2010
         time=2011         <- Years 2010 and 2011
         indic_na=B1GM     <- National Account Indicator: gross domestic product at market prices
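The request URL can be assembled from these components in R. The sketch below uses only base R and does not send the request; the variable names are illustrative, and the service may have changed since the slides were written.

```r
# Components of the request, as listed in the breakdown of the URL.
base_url <- "http://ec.europa.eu/eurostat/wdds"
path     <- paste("rest/data", "v2.1", "json", "en", "nama_gdp_c", sep = "/")

# Filters as name/value pairs; "time" appears twice, once per year.
filters <- list(
  precision = "1",
  geo       = "EU28",
  unit      = "EUR_HAB",
  time      = "2010",
  time      = "2011",
  indic_na  = "B1GM"
)

# Percent-encode every value and join the pairs with "&".
encoded <- vapply(filters, URLencode, character(1), reserved = TRUE)
query   <- paste0(names(filters), "=", encoded, collapse = "&")

request_url <- paste0(base_url, "/", path, "?", query)
print(request_url)
```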




{"version":"2.0","label":"GDP and main components - Current prices","href":"http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/nama_gdp_c?precision=1&geo=EU28&unit=EUR_HAB&time=2010&time=2011&indic_na=B1GM","source":"Eurostat","updated":"2016-02-10","extension":{"datasetId":"nama_gdp_c","lang":"EN","description":null,"subTitle":null,"status":{"label":null}},"class":"dataset","value":{"0":24400,"1":25100},"dimension":{"unit":{"label":"unit","category":{"index":{"EUR_HAB":0},"label":{"EUR_HAB":"Euro per inhabitant"}}},"indic_na":{"label":"indic_na","category":{"index":{"B1GM":0},"label":{"B1GM":"Gross domestic product at market prices"}}},"geo":{"label":"geo","category":{"index":{"EU28":0},"label":{"EU28":"European Union (28 countries)"}}},"time":{"label":"time","category":{"index":{"2010":0,"2011":1},"label":{"2010":"2010","2011":"2011"}}}},"id":["unit","indic_na","geo","time"],"size":[1,1,1,2]}

Response in JSON format
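As a sketch of how such a response can be turned into an R table, the following parses a shortened copy of the response with the jsonlite package (an assumption, since the slides do not name a JSON package). It relies on the fact that only the time dimension has more than one category here, so the position of each value equals its time index.

```r
library(jsonlite)

# Shortened copy of the response: only the fields used below are kept.
json_text <- '{
  "value": {"0": 24400, "1": 25100},
  "dimension": {
    "time": {"label": "time",
             "category": {"index": {"2010": 0, "2011": 1},
                          "label": {"2010": "2010", "2011": "2011"}}}
  },
  "id": ["unit", "indic_na", "geo", "time"],
  "size": [1, 1, 1, 2]
}'

res <- fromJSON(json_text)

values     <- unlist(res$value)                           # named by position: "0", "1"
time_index <- unlist(res$dimension$time$category$index)   # year -> position

# Match every year with the value stored at its position.
gdp <- data.frame(
  year               = as.integer(names(time_index)),
  eur_per_inhabitant = unname(values[as.character(time_index)])
)
print(gdp)
```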


Connecting via API - JSON format

JSON: JavaScript Object Notation.

- A syntax for storing and exchanging data.

- Text written in JavaScript object notation.

- Many web APIs nowadays exchange their messages in JSON.

- Lightweight: low payload overhead.

- Easily converted into JavaScript objects.

- Hard for humans to read at a glance when unformatted, so a formatter is needed.

https://www.hurl.it/

Connecting via API - JSON formatter

http://jsonlint.com

Paste the JSON code in the text area and click Validate JSON.

The result is more human-readable. It is a hierarchical format:

[ ] array
{ } object
"key": "value" attributes in key-value format
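The same formatting can also be done locally. The sketch below assumes the jsonlite package (not named in the slides) and a small made-up JSON text; it pretty-prints the text and shows how arrays and objects map to R structures.

```r
library(jsonlite)

# A compact JSON text, hard to read at a glance.
compact <- '{"source":"Eurostat","id":["unit","geo","time"],"value":{"0":24400,"1":25100}}'

# prettify() adds line breaks and indentation without changing the content.
cat(prettify(compact))

# fromJSON() converts the text into R structures:
#   [ ] array  -> vector
#   { } object -> named list
parsed <- fromJSON(compact)
str(parsed)
```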






Webscraping- Definition

"Web scraping, web harvesting or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol (HTTP). It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis."

https://en.wikipedia.org/wiki/Web_scraping

Webscraping- rvest

rvest is an R package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/). It is designed to work with magrittr (https://github.com/smbache/magrittr) so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.

There are some system dependencies to install first:

- libcurl4-openssl-dev
- libssl-dev
- libxml2-dev

Webscraping- Example: LEGO Film

http://www.imdb.com/title/tt1490017/

Let's extract the following information about the LEGO film:

- User rating
- Cast

Webscraping- HTML + CSS

HTML: HyperText Markup Language

- A language to create webpages.
- Tag-based syntax.
  - E.g. <body></body><table></table>...
- Should describe the contents.

CSS: Cascading Style Sheets

- A language to describe how HTML elements should be displayed on screen, paper or in other media.
- Key-value-based syntax associated with HTML elements.
  - E.g. body {background-color: black; color: white; font-family: verdana; }

Webscraping- CSS selector

http://selectorgadget.com

Several tools help to find the CSS selector of an element. Let's install SelectorGadget for Chrome:

- Click on "Add to Chrome" and accept. A new icon will appear in the Chrome toolbar.
- Click on the icon.
- Move the mouse over the rating and click.
- Copy the CSS value.
- Write the following code in R.

Webscraping- Scraping rating
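The code of the original slide is not available, so the following is only a sketch of the idea with rvest. The HTML fragment, the selector ".ratingValue span" and the rating value are made-up stand-ins; on the real page, the selector is the one copied from SelectorGadget and `read_html` receives the URL of the film.

```r
library(rvest)

# Hypothetical stand-in for the film page; the real markup will differ.
page_text <- '<html><body>
  <div class="ratingValue"><strong><span>7.5</span></strong></div>
</body></html>'

lego_movie <- read_html(page_text)

rating <- lego_movie %>%
  html_nodes(".ratingValue span") %>%   # CSS selector (assumed, see above)
  html_text() %>%                       # text inside the selected element
  as.numeric()

print(rating)
```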

Webscraping- Scraping cast

To scrape the cast, the CSS selection must be made twice:

- Select the table
- Select the column

Then, write the following R code.
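The code of the original slide is not available; the sketch below illustrates the two-step selection with rvest on a made-up HTML fragment. The class names "cast_list" and "name" and the actor names are hypothetical; the real selectors come from SelectorGadget.

```r
library(rvest)

# Hypothetical stand-in for the cast section of the film page.
page_text <- '<html><body>
<table class="cast_list">
  <tr><td class="name">Actor One</td><td class="character">Role A</td></tr>
  <tr><td class="name">Actor Two</td><td class="character">Role B</td></tr>
</table>
<table class="other">
  <tr><td class="name">Not an actor</td></tr>
</table>
</body></html>'

cast <- read_html(page_text) %>%
  html_nodes("table.cast_list") %>%   # first selection: the table
  html_nodes("td.name") %>%           # second selection: the column
  html_text(trim = TRUE)

print(cast)
```

Selecting the table first keeps cells with the same class in other tables out of the result.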

Webscraping- Example: gold prices

https://www.measuringworth.com/datasets/gold/result.php

To obtain the table of results, the user has to choose some parameters:

- A series of markets
- Initial year
- Ending year

Hence, this is a dynamic webpage where the user sends parameters in a request to the server. The request can use one of two methods:

- GET
- POST


Webscraping- Firebug Tool

http://getfirebug.com/releases/lite/chrome/

Firebug allows inspecting the HTML source code of a page:

- Click the Firebug icon on the browser toolbar.
- An inspection window opens below the webpage.
- Click the Inspect button and inspect the web page.

Webscraping- Inspecting the form

When inspecting the form, we observe that the method is POST.

Webscraping- Looking for parameters

The needed parameters are the following:

- london
- goldsilver
- newyork
- us
- British
- year_source
- year_result

Webscraping- Required libraries

To make a POST request in R, the following libraries are needed:

- rvest
- httr

Webscraping- POST request

Instead of making a direct request to the URL, which by default is made as GET, we need to encode the request within a POST object with the following parameters:

- The URL
- The query (parameters) as a list of key=value pairs

Then the request is sent and the body of the response is read with the content function.
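The code of the original slide is not available, so this is a hedged sketch of such a request with httr. `fetch_prices_page` is a hypothetical helper, and sending the parameters as a form-encoded body is an assumption about what the site expects. The parameter values in the commented call are placeholders, not taken from the site.

```r
library(httr)
library(rvest)

url <- "https://www.measuringworth.com/datasets/gold/result.php"

# Hypothetical helper: send the form parameters with POST and
# return the parsed HTML of the response.
fetch_prices_page <- function(url, params) {
  # `params` is a named list built from the fields found in the form.
  response <- POST(url, body = params, encode = "form")
  stop_for_status(response)   # stop with an error on 4xx/5xx codes
  # content() extracts the body of the response, here as text.
  read_html(content(response, as = "text", encoding = "UTF-8"))
}

# Example call (placeholder values, not verified against the site):
# html_prices <- fetch_prices_page(
#   url,
#   list(newyork = "on", year_source = "1900", year_result = "2000")
# )
```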

Webscraping- Results formatting

To obtain the dataset, a chain of functions (%>%) is used:

- html_prices contains the HTML of the webpage
- html_nodes("table") selects the list of table elements in the page
- .[[2]] selects the second element in the list
- html_table converts it to an R data frame

The result is stored in prices; then the first row is removed and the column names are assigned.
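The chain can be sketched as follows. The HTML fragment, the prices and the column names are made up so that the example runs without a network connection; in the real case, html_prices is the parsed response of the POST request.

```r
library(rvest)

# Hypothetical stand-in for the result page: the data sit in the second table.
page_text <- '<html><body>
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><td>Year</td><td>Price</td></tr>
  <tr><td>1900</td><td>10.5</td></tr>
  <tr><td>1901</td><td>11.25</td></tr>
</table>
</body></html>'

html_prices <- read_html(page_text)

prices <- html_prices %>%
  html_nodes("table") %>%       # all table elements of the page
  .[[2]] %>%                    # keep the second one
  html_table(header = FALSE)    # convert it to a data frame

# The first row holds the column titles: remove it and name the columns.
prices <- prices[-1, ]
names(prices) <- c("year", "price")

# The columns were read as text because of the title row.
prices$year  <- as.integer(prices$year)
prices$price <- as.numeric(prices$price)

print(prices)
```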

Webscraping- The whole R code

References

● HTTP: The Definitive Guide. David Gourley, Brian Totty, Marjorie Sayer, Anshu Aggarwal, Sailu Reddy. O'Reilly Media. http://shop.oreilly.com/product/9781565925090.do

● Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis. John Wiley. https://www.amazon.com/Automated-Data-Collection-Practical-Scraping/dp/111883481X

● HTML & CSS: Design and Build Web Sites. Jon Duckett. John Wiley. https://www.amazon.co.uk/gp/product/1118008189
