Using web scraping for Applied Economics
Morgan Raux
Empirical and Econometric Methods Sessions, December 4, 2018
Using web scraping for Applied Economics Morgan Raux 1 / 22
Outline Introduction Issues Description
1 Why use web-scraped data?
2 Web scraping: contributions and issues
3 The technology: how does it work? What are the main challenges?
Context
Web scraping: a programming method to collect data online. It automates the copy/paste of websites’ source code.
Advantages:
It gives access to new sources of data,
At a very large scale,
At low (monetary) cost (compared to other data collection techniques).
Drawbacks:
Information is limited, and taking advantage of these data is difficult.
Collecting data is time-consuming.
Data sources on the web
Source of data | Application example | Reference
Social media | Studying migration through Twitter and Facebook data | Zagheni et al. (2015, 2016)
Job boards | Studying job search using data from Indeed and CareerBuilder | Marinescu et al. (2017, 2018)
Sharing platforms | Assessing discrimination with data on Airbnb | Laouenan & Rathelot (2017)
Review platforms | Measuring the business cycle via restaurant openings on Yelp | Glaeser et al. (2017)
Assessing the data contribution
Most important question:
Do scraped data contribute something to the research compared to any other source of data I could use?
Compared to:
Usual data sources (surveys, administrative data, etc.).
Other internet data (especially research projects that have direct access to the website’s database).
Assessing the data contribution
Are the scraped data sufficient on their own for the analysis? If not:
Case 1: I focus on the website I am scraping.
Can I obtain direct access to these data (and more) by contacting the website?
→ Would the website be interested in my research?
Case 2: I am scraping this website to get data on another phenomenon.
Can I match them to other data sources?
→ What is the right unit of analysis?
What is web scraping?
Web scraping: automating the copy/paste of websites’ source code.
1) Collect the data,
1. Connect to a webpage containing the information of interest,
2. Copy/paste the source code into a .txt file,
3. Loop over all webpages.
2) Parse the data,
1. Open the first .txt file,
2. Identify the information of interest,
3. Transfer this information into a dataframe,
4. Loop over all .txt files.
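The two stages above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the presenter's code: the function names, the User-Agent string, and the `<h2 class="title">` pattern are all hypothetical, and the "dataframe" is represented as a list of dictionaries written to CSV.

```python
import csv
import re
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_source(url, timeout=30):
    """Steps 1.1-1.2: connect to a page and return its source code as text."""
    # The User-Agent string is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "research-scraper-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def collect(urls, out_dir, fetch=fetch_source):
    """Stage 1: save the source code of every page to its own .txt file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls):
        (out_dir / f"page_{i:05d}.txt").write_text(fetch(url), encoding="utf-8")


# Hypothetical pattern: assumes each job title sits in <h2 class="title">...</h2>.
TITLE_PATTERN = re.compile(r'<h2 class="title">(.*?)</h2>', re.DOTALL)


def parse(in_dir):
    """Stage 2: open each .txt file and pull out the information of interest."""
    rows = []
    for path in sorted(Path(in_dir).glob("*.txt")):
        source = path.read_text(encoding="utf-8")
        for title in TITLE_PATTERN.findall(source):
            rows.append({"file": path.name, "title": title.strip()})
    return rows


def save_rows(rows, csv_path):
    """Write the parsed rows to a CSV file, one row per observation."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "title"])
        writer.writeheader()
        writer.writerows(rows)
```

Keeping collection and parsing separate means the raw source code stays on disk, so the parsing rules can be revised later without downloading the pages again. If pandas is available, `pandas.DataFrame(rows)` turns the parsed rows into a dataframe.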
The database I want to obtain
Indeed Job listing database
Collect the data: (1) Connect to the webpage
Collect the data: (2) Copy/Paste the source code
Visualizing the source code
Collect the data: (3) Loop over all webpages
URLs are constructed in a uniform way:
https://www.indeed.com/jobs?q=Economist&l=Boston%2C+MA&start=0
Domain name: https://www.indeed.com/
Job: q=Economist
Location: l=Boston%2C+MA
Page number: start=0
To collect Indeed’s entire job listing for the US:
1. Loop over jobs
2. Loop over locations
3. Loop over pages
Remarks:
Websites prevent you from accessing the full set of information.
The way you organize your loops lets you address this challenge.
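The three nested loops can be sketched as a small URL generator. This is an illustration under assumptions: `build_urls` is a hypothetical name, and the step of 10 results per page for the `start` parameter is a guess, not something the slides state. Splitting the search into many narrow job-by-location queries is one plausible reading of the remark about organizing the loops.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_urls(jobs, locations, n_pages, page_size=10):
    """Yield one search URL per (job, location, page) combination.

    page_size is an assumed number of results per page; the `start`
    parameter is taken to be the offset of the first result shown.
    """
    for job, location, page in product(jobs, locations, range(n_pages)):
        # urlencode escapes special characters, e.g. "Boston, MA" -> Boston%2C+MA
        query = urlencode({"q": job, "l": location, "start": page * page_size})
        yield f"{BASE_URL}?{query}"
```

For example, `list(build_urls(["Economist"], ["Boston, MA"], 2))` yields the two URLs ending in `start=0` and `start=10` for that job and location.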
Parse the data: Identify the information of interest
Correspondence between webpage and source code
Two necessary conditions
1) Information must be accessible in the source code:
Collecting and parsing HTML code is easy.
Collecting and parsing JavaScript code is feasible but much more complicated.
2) Information must be organized in a uniform way:
Depending on the coding language of the website, tags can identify the information of interest.
If there are no precise tags, the information has to follow some pattern.
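The two ways of locating information, by tag or by pattern, can be illustrated with Python's standard library. This is a sketch under assumptions: the tag and class names in the usage example and the salary pattern are hypothetical and do not come from any real website. The tag extractor is deliberately simple and does not handle an element nested inside another element of the same tag.

```python
import re
from html.parser import HTMLParser


class TagExtractor(HTMLParser):
    """Collect the text inside every <tag class="css_class"> element."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.values = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._inside = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.values[-1] += data


def extract_by_tag(source, tag, css_class):
    """Case 1: a precise tag identifies the information of interest."""
    parser = TagExtractor(tag, css_class)
    parser.feed(source)
    parser.close()
    return [value.strip() for value in parser.values]


# Case 2: no precise tag, so rely on a recurring textual pattern.
# Hypothetical pattern: a salary written as "$" followed by digits and commas.
SALARY_PATTERN = re.compile(r"\$\d[\d,]*")


def extract_salaries(source):
    return SALARY_PATTERN.findall(source)
```

With a page fragment such as `<span class="company">Acme</span> pays $85,000`, `extract_by_tag(fragment, "span", "company")` picks up the company name, while `extract_salaries(fragment)` recovers the salary even though no tag marks it.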
Information must be organized in a uniform way:
Clean source code (Indeed) versus dirty source code (EJM)
Technical issues with web-scraping
1. Matching data with other sources (no common identifier)
2. Avoiding being blocked or banned by websites.
3. Storing the (big) data.
4. Legal issues.
Technical issues with web-scraping (1)
1) Matching data with other sources
Information obtained from one website is usually very partial.
There is never a common identifier across websites.
This requires thinking about the optimal unit of analysis for merging different sources of data.
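One possible way to merge sources that share no identifier is to build a merge key from names and accept close, rather than exact, matches. The sketch below assumes the firm is the unit of analysis; the suffix list, the similarity cutoff of 0.85, and the function names are all hypothetical choices that would need tuning and manual checking on real data.

```python
import re
from difflib import get_close_matches

# Hypothetical list of legal suffixes to ignore when comparing firm names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co", "corporation", "company"}


def normalize_name(name):
    """Lowercase, drop punctuation and legal suffixes to build a merge key."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def match_names(scraped_names, reference_names, cutoff=0.85):
    """Map each scraped name to its closest reference name, or None."""
    reference_keys = {normalize_name(n): n for n in reference_names}
    matches = {}
    for name in scraped_names:
        candidates = get_close_matches(
            normalize_name(name), list(reference_keys), n=1, cutoff=cutoff
        )
        matches[name] = reference_keys[candidates[0]] if candidates else None
    return matches
```

Names left unmatched (mapped to `None`) are worth inspecting by hand, since a cutoff that is too strict drops true matches and one that is too loose creates false ones.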
Technical issues with web-scraping (2)
2) Escaping the blacklist
Web scraping involves sending a large number of requests to the website.
Most websites defend themselves against such behavior (by temporarily blocking the IP address, or banning it permanently...).
To limit these risks, your code must mimic human behavior:
Pause the scraping process for a few minutes.
Loop over different IP addresses (proxies, Tor, remote servers...).
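The two tactics listed above, pausing and cycling over IP addresses, can be sketched as follows. All names, waiting times, and the User-Agent string are hypothetical defaults chosen for illustration; proxy addresses would have to be supplied by the user. Slowing the request rate also keeps the load on the website low.

```python
import random
import time
from itertools import cycle
from urllib.request import ProxyHandler, Request, build_opener


def fetch_via_proxy(url, proxy, timeout=30):
    """Download one page, optionally routing the request through a proxy."""
    handlers = [ProxyHandler({"http": proxy, "https": proxy})] if proxy else []
    opener = build_opener(*handlers)
    request = Request(url, headers={"User-Agent": "research-scraper-sketch"})
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_scrape(urls, proxies=(None,), min_wait=2.0, max_wait=10.0,
                  long_pause_every=100, long_pause=300.0,
                  fetch=fetch_via_proxy, sleep=time.sleep):
    """Fetch pages one by one with random waits and periodic long pauses."""
    proxy_pool = cycle(proxies)  # proxies=(None,) means a direct connection
    pages = {}
    for count, url in enumerate(urls, start=1):
        pages[url] = fetch(url, next(proxy_pool))
        if count % long_pause_every == 0:
            sleep(long_pause)  # stop the process for a few minutes
        else:
            sleep(random.uniform(min_wait, max_wait))  # irregular short wait
    return pages
```

The `fetch` and `sleep` arguments are injectable so the loop can be tried offline with stand-in functions before it is pointed at a real website.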
Technical issues with web-scraping (3)
3) Storing the (big) data
Because web scraping allows you to collect data at a very large scale, you often end up with a large amount of data.
Storing these data requires a lot of storage space. Don’t forget to save a backup copy!
One technical solution is to rent servers online.
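Two inexpensive habits help with the storage problem: compressing the raw source code, which is highly repetitive text, and verifying each backup copy. The sketch below uses only the standard library; the function names and directory layout are hypothetical, and the backup directory is assumed to sit outside the source directory.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def save_compressed(text, path):
    """Write one page's source code to a gzip-compressed file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(text)


def load_compressed(path):
    """Read back a page saved with save_compressed."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()


def sha256_of(path):
    """Checksum of a file, read in 1 MiB chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup(src_dir, backup_dir):
    """Copy every file to backup_dir and verify each copy by checksum."""
    src_dir, backup_dir = Path(src_dir), Path(backup_dir)
    copied = 0
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        dst = backup_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        if sha256_of(src) != sha256_of(dst):
            raise IOError(f"backup of {src} is corrupted")
        copied += 1
    return copied
```

Pointing `backup_dir` at a different disk or a rented remote server, rather than the same machine, is what protects against hardware failure.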
Technical issues with web-scraping (4)
4) Legal issues
Most of the time, websites’ Terms of Use explicitly forbid you to web-scrape their data.
In France, the Loi Lemaire (2016) allows researchers to exploit proprietary data. Elsewhere, the legal framework is not clear.
There are two basic requirements to limit legal issues:
Do not harm the functioning of the website.
Do not use these data for a commercial activity.
Do not scrape LinkedIn!
Learning resources
Programming languages:
Python
R
Resources:
MOOC: Using Python to Access Web Data (Coursera)
Book: Web Scraping with Python, by Ryan Mitchell