A talk I gave at the Barcelona Python Meetup, May 2012.
Overview of Python web scraping tools
Maik Röder
Barcelona Python Meetup Group
17.05.2012
Friday, May 18, 2012
Data Scraping
• Automated Process
• Explore and download pages
• Grab content
• Store in a database or in a text file
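The four steps above can be sketched end to end in modern Python 3. This is a minimal illustration, not the talk's code: the page content, file path, and database schema are made-up stand-ins, and a local file replaces the remote site so the sketch runs offline.

```python
import pathlib
import sqlite3
import tempfile
from urllib.request import urlopen

# Made-up local page standing in for a remote site
page = pathlib.Path(tempfile.mkdtemp()) / "index.html"
page.write_text("<html><body>23</body></html>")

# 1. Explore and download the page
html = urlopen(page.as_uri()).read().decode()

# 2. Grab the content (here: whatever sits between the body tags)
content = html.split("<body>")[1].split("</body>")[0]

# 3. Store it in a database
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scraped (url TEXT, content TEXT)")
db.execute("INSERT INTO scraped VALUES (?, ?)", (page.as_uri(), content))
row = db.execute("SELECT content FROM scraped").fetchone()
print(row[0])  # -> 23
```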
urlparse
• Manipulate URL strings
urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
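A quick illustration of these three functions; note that in Python 3 the module was renamed to urllib.parse (the example URL for urljoin is made up):

```python
from urllib.parse import urljoin, urlparse, urlunparse

# Split a URL into its components
parts = urlparse("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
print(parts.netloc)  # www.wunderground.com

# Resolve a relative link against a base URL
joined = urljoin("http://example.com/a/b.html", "c.html")
print(joined)  # http://example.com/a/c.html

# Reassemble the components into a URL string
rebuilt = urlunparse(parts)
print(rebuilt)
```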
urllib
• Download data through different protocols
• HTTP, FTP, ...
urllib.urlopen()
urllib.urlretrieve()
urllib.urlencode()
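These are the Python 2 names; in Python 3 the download functions live in urllib.request. A small sketch of urlopen() and urlretrieve() using a made-up local file as the resource, so it runs without network access:

```python
import pathlib
import tempfile
from urllib.request import urlopen, urlretrieve

# Hypothetical local file standing in for a remote resource
src = pathlib.Path(tempfile.mkdtemp()) / "data.txt"
src.write_text("hello")

# urlopen(): read the resource into memory
body = urlopen(src.as_uri()).read().decode()
print(body)  # hello

# urlretrieve(): save the resource to a local file
dest, headers = urlretrieve(src.as_uri(), src.with_name("copy.txt"))
print(pathlib.Path(dest).read_text())  # hello
```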
Scrape a web site
• Example: http://www.wunderground.com/
Preparation
>>> from StringIO import StringIO
>>> from urllib2 import urlopen
>>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
>>> p = f.read()
>>> d = StringIO(p)
>>> f.close()
BeautifulSoup
• HTML/XML parser
• designed for quick turnaround projects like screen-scraping
• http://www.crummy.com/software/BeautifulSoup
BeautifulSoup
from BeautifulSoup import *
a = BeautifulSoup(d).findAll('a')
[x['href'] for x in a]
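For reference, the same link extraction with the current beautifulsoup4 (bs4) package; the HTML snippet is a made-up stand-in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Made-up page fragment in place of the real download
html = '<a href="/one">1</a> <a href="/two">2</a>'
soup = BeautifulSoup(html, "html.parser")

# findAll() is spelled find_all() in bs4
links = [x["href"] for x in soup.find_all("a")]
print(links)  # ['/one', '/two']
```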
Faster BeautifulSoup
from BeautifulSoup import *
p = SoupStrainer('a')
a = BeautifulSoup(d, parseOnlyThese=p)
[x['href'] for x in a]
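SoupStrainer survives in bs4, where it is passed as the parse_only argument so that only the matching tags are ever parsed. A sketch with a made-up HTML fragment:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up page fragment; only <a> tags will be parsed at all
html = '<p>ignored</p><a href="/one">1</a><a href="/two">2</a>'
only_a = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_a)
hrefs = [x["href"] for x in soup.find_all("a")]
print(hrefs)  # ['/one', '/two']
```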
Inspect the Element
• Inspect the Maximum temperature
Find the node
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class': 'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23
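The same class-based lookup in bs4, against a made-up fragment imitating the relevant part of the page rather than the live site:

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the temperature markup
html = '<span class="nobr"><span>23</span></span>'
soup = BeautifulSoup(html, "html.parser")
temperature = soup.find(attrs={"class": "nobr"}).span.string
print(temperature)  # 23
```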
htmllib.HTMLParser
• Interesting only for historical reasons
• based on sgmllib
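htmllib and sgmllib were removed in Python 3, but the same event-driven style lives on in the standard library's html.parser. A sketch of link extraction in that style (LinkCollector is a made-up name):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> start tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

p = LinkCollector()
p.feed('<a href="/one">1</a><a href="/two">2</a>')
print(p.links)  # ['/one', '/two']
```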
html5lib
• Using the custom simpletree format
• a built-in DOM-ish tree type (pythonic idioms)
from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
t = parse(d)
a = [x for x in t if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]
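Current html5lib releases have dropped simpletree and build an xml.etree tree by default, so the same extraction now looks like this (the HTML snippet is made up):

```python
import html5lib

# Default tree builder is xml.etree; disable the XHTML namespace
# so tags can be matched by their plain names
doc = html5lib.parse('<a href="/one">1</a><a href="/two">2</a>',
                     namespaceHTMLElements=False)
links = [a.get("href") for a in doc.iter("a")]
print(links)  # ['/one', '/two']
```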
lxml
• Library for processing XML and HTML
• Based on C libraries
sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev
• Extends the ElementTree API
• e.g. with XPath
lxml
from lxml import etree
t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
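A self-contained variant of the same XPath query, using etree.fromstring instead of a file on disk (the markup is made up):

```python
from lxml import etree

# Made-up document in place of t.xml
root = etree.fromstring('<div><a href="/one">1</a><a href="/two">2</a></div>')

# XPath support comes with the ElementTree-style API
hrefs = [a.get("href") for a in root.xpath("//a")]
print(hrefs)  # ['/one', '/two']
```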
twill
• Simple
• No JavaScript
• http://twill.idyll.org
• Some more interesting concepts
• Pages, Scenarios
• State Machines
twill
• Commonly used methods:
go()
code()
show()
showforms()
formvalue() (or fv())
submit()
Twill
>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()
Twill - acknowledge_equiv_refresh
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
...
twill.errors.TwillException: infinite refresh loop discovered; aborting.
Try turning off acknowledge_equiv_refresh...
Twill
>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'
mechanize
• Stateful programmatic web browsing
• navigation history
• HTML form state
• cookies
• ftp:, http: and file: URL schemes
• redirections
• proxies
• Basic and Digest HTTP authentication
mechanize - robots.txt
>>> import mechanize
>>> browser = mechanize.Browser()
>>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
mechanize - robots.txt
• Do not handle robots.txt
browser.set_handle_robots(False)
• Do not handle equiv
browser.set_handle_equiv(False)
browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
Selenium
• http://seleniumhq.org
• Support for JavaScript
Selenium
from selenium import webdriver
from selenium.common.exceptions \
    import NoSuchElementException
from selenium.webdriver.common.keys \
    import Keys
import time
Selenium
>>> browser = webdriver.Firefox()
>>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
>>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
>>> browser.close()
>>> print a
23
Phantom JS
• http://www.phantomjs.org/
Recommended