Overview of Python web scraping tools


A talk I gave at the Barcelona Python Meetup May 2012.

Overview of Python web scraping tools

Maik Röder
Barcelona Python Meetup Group

17.05.2012

Data Scraping

• Automated Process

• Explore and download pages

• Grab content

• Store in a database or in a text file

urlparse

• Manipulate URL strings

urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
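
A minimal sketch of the three calls (Python 2 urlparse; the Wunderground URLs are only illustrative):

import urlparse

parts = urlparse.urlparse('http://www.wunderground.com/history/airport/BCN?units=metric')
print parts.scheme   # 'http'
print parts.netloc   # 'www.wunderground.com'
print parts.path     # '/history/airport/BCN'

# resolve a relative link against the page it was found on
print urlparse.urljoin('http://www.wunderground.com/history/', 'airport/BCN')

# rebuild a URL from its six components
print urlparse.urlunparse(('http', 'www.wunderground.com',
                           '/history/airport/BCN', '', 'units=metric', ''))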

urllib

• Download data through different protocols

• HTTP, FTP, ...

urllib.urlencode()
urllib.urlopen()
urllib.urlretrieve()
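
A minimal sketch of downloading with urlopen() and urlretrieve() (Python 2 urllib; the URLs are only examples):

import urllib

# read a page into a string
f = urllib.urlopen('http://www.wunderground.com/')
html = f.read()
f.close()

# or save a resource directly to a local file
urllib.urlretrieve('http://www.wunderground.com/robots.txt', 'robots.txt')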

Scrape a web site

• Example: http://www.wunderground.com/
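
The parsing slides below work on an HTML string d; a hedged sketch of how it might be fetched (the 2012 page layout is assumed, not verified):

import urllib

url = ('http://www.wunderground.com/history/airport/BCN'
       '/2007/5/17/DailyHistory.html')
d = urllib.urlopen(url).read()

# keep a copy on disk, e.g. for debugging the parser offline
open('daily_history.html', 'w').write(d)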

BeautifulSoup

• HTML/XML parser

• designed for quick turnaround projects like screen-scraping

• http://www.crummy.com/software/BeautifulSoup

BeautifulSoup

from BeautifulSoup import *

a = BeautifulSoup(d).findAll('a')

[x['href'] for x in a]

Faster BeautifulSoup

from BeautifulSoup import *

p = SoupStrainer('a')

a = BeautifulSoup(d, parseOnlyThese=p)

[x['href'] for x in a]

Inspect the Element

• Inspect the Maximum temperature

Find the node

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class': 'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23

htmllib.HTMLParser

• Interesting only for historical reasons

• based on sgmllib
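
For reference, a minimal sketch of the historical API (Python 2 only; assumes the HTML string d from the earlier download):

import htmllib
import formatter

p = htmllib.HTMLParser(formatter.NullFormatter())
p.feed(d)
p.close()
print p.anchorlist   # href values collected while parsing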

html5lib

• Using the custom simpletree format

• a built-in DOM-ish tree type (pythonic idioms)

from html5lib import parse
from html5lib import treebuilders

e = treebuilders.simpletree.Element
tree = parse(d)
a = [x for x in tree if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]

lxml

• Library for processing XML and HTML

• Based on C libraries

sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev

• Extends the ElementTree API

• e.g. with XPath

lxml

from lxml import etree

t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
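
The slide parses an XML file; applied to the scraped HTML string d, a hedged equivalent uses lxml's HTML parser and an XPath over the href attributes:

from lxml import etree

root = etree.HTML(d)            # lenient HTML parser, returns the root element
print root.xpath('//a/@href')   # list of all link targets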

twill

• Simple

• No JavaScript

• http://twill.idyll.org

• Some more interesting concepts

• Pages, Scenarios

• State Machines

twill

• Commonly used methods: go(), code(), show(), showforms(), formvalue() (or fv()), submit()

Twill

>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()

Twill

>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'

mechanize

• Stateful programmatic web browsing (see the sketch after this list)

• navigation history

• HTML form state

• cookies

• ftp:, http: and file: URL schemes

• redirections

• proxies

• Basic and Digest HTTP authentication
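
The deck lists the features without code; a minimal hedged sketch of the same Google-search scenario as the twill slides:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)      # skip robots.txt for this demo
br.open('http://www.google.com')

br.select_form(nr=0)             # first form on the page
br['q'] = 'Python'               # fill the search field
response = br.submit()
html = response.read()

br.back()                        # navigation history is kept, like a browser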

Selenium

• http://seleniumhq.org

• Support for JavaScript

Selenium

from selenium import webdriver
from selenium.common.exceptions \
    import NoSuchElementException
from selenium.webdriver.common.keys \
    import Keys
import time
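
Continuing from those imports, a hedged usage sketch in the 2012-era API (the field name 'query' is only a guess at the Wunderground search box):

driver = webdriver.Firefox()     # a real browser, so JavaScript is executed
driver.get('http://www.wunderground.com/')

try:
    box = driver.find_element_by_name('query')   # hypothetical field name
    box.send_keys('Barcelona')
    box.send_keys(Keys.RETURN)
    time.sleep(2)                # crude wait for the page to update
    print driver.page_source[:200]
except NoSuchElementException:
    print 'search box not found'
finally:
    driver.quit()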

PhantomJS

• http://www.phantomjs.org/

Recommended