A talk I gave at the Barcelona Python Meetup, May 2012.
Overview of Python web scraping tools
Maik Röder
Barcelona Python Meetup Group
17.05.2012
Friday, May 18, 2012
Data Scraping
• Automated Process
• Explore and download pages
• Grab content
• Store in a database or in a text file
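The four steps above can be sketched end to end in modern Python 3. This is a minimal illustration, not the talk's code: the page content, file path, and database schema are made-up stand-ins, and a local file replaces the remote site so the sketch runs offline.

```python
import pathlib
import sqlite3
import tempfile
from urllib.request import urlopen

# Made-up local page standing in for a remote site
page = pathlib.Path(tempfile.mkdtemp()) / "index.html"
page.write_text("<html><body>23</body></html>")

# 1. Explore and download the page
html = urlopen(page.as_uri()).read().decode()

# 2. Grab the content (here: whatever sits between the body tags)
content = html.split("<body>")[1].split("</body>")[0]

# 3. Store it in a database
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scraped (url TEXT, content TEXT)")
db.execute("INSERT INTO scraped VALUES (?, ?)", (page.as_uri(), content))
row = db.execute("SELECT content FROM scraped").fetchone()
print(row[0])  # -> 23
```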
urlparse
• Manipulate URL strings
urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
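A quick illustration of these three functions; note that in Python 3 the module was renamed to urllib.parse (the example URL for urljoin is made up):

```python
from urllib.parse import urljoin, urlparse, urlunparse

# Split a URL into its components
parts = urlparse("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
print(parts.netloc)  # www.wunderground.com

# Resolve a relative link against a base URL
joined = urljoin("http://example.com/a/b.html", "c.html")
print(joined)  # http://example.com/a/c.html

# Reassemble the components into a URL string
rebuilt = urlunparse(parts)
print(rebuilt)
```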
urllib
• Download data through different protocols
• HTTP, FTP, ...
urllib.urlopen()
urllib.urlretrieve()
urllib.urlencode()
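These are the Python 2 names; in Python 3 the download functions live in urllib.request. A small sketch of urlopen() and urlretrieve() using a made-up local file as the resource, so it runs without network access:

```python
import pathlib
import tempfile
from urllib.request import urlopen, urlretrieve

# Hypothetical local file standing in for a remote resource
src = pathlib.Path(tempfile.mkdtemp()) / "data.txt"
src.write_text("hello")

# urlopen(): read the resource into memory
body = urlopen(src.as_uri()).read().decode()
print(body)  # hello

# urlretrieve(): save the resource to a local file
dest, headers = urlretrieve(src.as_uri(), src.with_name("copy.txt"))
print(pathlib.Path(dest).read_text())  # hello
```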
Scrape a web site
• Example: http://www.wunderground.com/
Preparation
>>> from StringIO import StringIO
>>> from urllib2 import urlopen
>>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
>>> p = f.read()
>>> d = StringIO(p)
>>> f.close()
BeautifulSoup
• HTML/XML parser
• designed for quick turnaround projects like screen-scraping
• http://www.crummy.com/software/BeautifulSoup
BeautifulSoup
from BeautifulSoup import *
a = BeautifulSoup(d).findAll('a')
[x['href'] for x in a]
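For reference, the same link extraction with the current beautifulsoup4 (bs4) package; the HTML snippet is a made-up stand-in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Made-up page fragment in place of the real download
html = '<a href="/one">1</a> <a href="/two">2</a>'
soup = BeautifulSoup(html, "html.parser")

# findAll() is spelled find_all() in bs4
links = [x["href"] for x in soup.find_all("a")]
print(links)  # ['/one', '/two']
```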
Faster BeautifulSoup
from BeautifulSoup import *
p = SoupStrainer('a')
a = BeautifulSoup(d, parseOnlyThese=p)
[x['href'] for x in a]
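SoupStrainer survives in bs4, where it is passed as the parse_only argument so that only the matching tags are ever parsed. A sketch with a made-up HTML fragment:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up page fragment; only <a> tags will be parsed at all
html = '<p>ignored</p><a href="/one">1</a><a href="/two">2</a>'
only_a = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_a)
hrefs = [x["href"] for x in soup.find_all("a")]
print(hrefs)  # ['/one', '/two']
```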
Inspect the Element
• Inspect the Maximum temperature
Find the node
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class': 'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23
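The same class-based lookup in bs4, against a made-up fragment imitating the relevant part of the page rather than the live site:

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the temperature markup
html = '<span class="nobr"><span>23</span></span>'
soup = BeautifulSoup(html, "html.parser")
temperature = soup.find(attrs={"class": "nobr"}).span.string
print(temperature)  # 23
```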
htmllib.HTMLParser
• Interesting only for historical reasons
• based on sgmllib
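htmllib and sgmllib were removed in Python 3, but the same event-driven style lives on in the standard library's html.parser. A sketch of link extraction in that style (LinkCollector is a made-up name):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> start tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

p = LinkCollector()
p.feed('<a href="/one">1</a><a href="/two">2</a>')
print(p.links)  # ['/one', '/two']
```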
html5lib
• Using the custom simpletree format
• a built-in DOM-ish tree type (pythonic idioms)
from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
t = parse(d)
a = [x for x in t if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]
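Current html5lib releases have dropped simpletree and build an xml.etree tree by default, so the same extraction now looks like this (the HTML snippet is made up):

```python
import html5lib

# Default tree builder is xml.etree; disable the XHTML namespace
# so tags can be matched by their plain names
doc = html5lib.parse('<a href="/one">1</a><a href="/two">2</a>',
                     namespaceHTMLElements=False)
links = [a.get("href") for a in doc.iter("a")]
print(links)  # ['/one', '/two']
```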
lxml
• Library for processing XML and HTML
• Based on C libraries
sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev
• Extends the ElementTree API
• e.g. with XPath
lxml
from lxml import etree
t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
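A self-contained variant of the same XPath query, using etree.fromstring instead of a file on disk (the markup is made up):

```python
from lxml import etree

# Made-up document in place of t.xml
root = etree.fromstring('<div><a href="/one">1</a><a href="/two">2</a></div>')

# XPath support comes with the ElementTree-style API
hrefs = [a.get("href") for a in root.xpath("//a")]
print(hrefs)  # ['/one', '/two']
```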
twill
• Simple
• No JavaScript
• http://twill.idyll.org
• Some more interesting concepts
• Pages, Scenarios
• State Machines
twill
• Commonly used methods:
go()
code()
show()
showforms()
formvalue() (or fv())
submit()
Twill
>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()
Twill - acknowledge_equiv_refresh
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
...
twill.errors.TwillException: infinite refresh loop discovered; aborting.
Try turning off acknowledge_equiv_refresh...
Twill
>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'
mechanize
• Stateful programmatic web browsing
• navigation history
• HTML form state
• cookies
• ftp:, http: and file: URL schemes
• redirections
• proxies
• Basic and Digest HTTP authentication
mechanize - robots.txt
>>> import mechanize
>>> browser = mechanize.Browser()
>>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
mechanize - robots.txt
• Do not handle robots.txt
browser.set_handle_robots(False)
• Do not handle equiv
browser.set_handle_equiv(False)
browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
Selenium
• http://seleniumhq.org
• Support for JavaScript
Selenium
from selenium import webdriver
from selenium.common.exceptions \
    import NoSuchElementException
from selenium.webdriver.common.keys \
    import Keys
import time
Selenium
>>> browser = webdriver.Firefox()
>>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
>>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
>>> browser.close()
>>> print a
23
Phantom JS
• http://www.phantomjs.org/
Recommended