Web Scraping with
Matthew Turland, Acadiana Open Source Group
April 30, 2009
What Is It?
Normal Web Browsing
Difference #1: Immediate Audience
Difference #2: Consumption Method
Why Is It Useful?
Data Without Web Services
Integration Testing
Crawlers
With plain text, we give ourselves the
ability to manipulate knowledge, both
manually and programmatically, using
virtually every tool at our disposal.
3.14 The Power of Plain Text, The Pragmatic Programmer
Disadvantages
Potential Lack of Stability
Reverse Engineering Required
More Requests
No Nice Neat Data Package
Step #1: Retrieval
Speaking the Language
The Web We Weave
GET / HTTP/1.1
User-Agent: ...
HTTP/1.1 200 OK
Content-Type: ...
GET /index.php?foo=bar HTTP/1.1
<a href="/index.php?foo=bar">Index</a>
<form method="post" action="/index.php">
    <input name="foo" value="bar" />
</form>
POST /index.php HTTP/1.1
foo=bar
Browsing → Requests
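As a rough sketch of the mapping above, the same parameters can be serialized with http_build_query and placed either in the query string of a GET request line or in the body of a POST request. The helper names buildGetRequestLine and buildPostRequest are hypothetical and not from the slides, and a complete HTTP/1.1 request would also carry a Host header.

```php
<?php
// Hypothetical helpers illustrating how a link and a form become raw requests.

// A link such as <a href="/index.php?foo=bar"> becomes a GET request line.
function buildGetRequestLine($path, array $params)
{
    $query = http_build_query($params);
    $target = $query === '' ? $path : $path . '?' . $query;
    return 'GET ' . $target . ' HTTP/1.1';
}

// A submitted form with method="post" becomes a POST request whose body
// carries the URL-encoded fields.
function buildPostRequest($path, array $fields)
{
    $body = http_build_query($fields);
    return "POST $path HTTP/1.1\r\n"
        . "Content-Type: application/x-www-form-urlencoded\r\n"
        . 'Content-Length: ' . strlen($body) . "\r\n"
        . "\r\n"
        . $body;
}

echo buildGetRequestLine('/index.php', array('foo' => 'bar')), PHP_EOL;
echo buildPostRequest('/index.php', array('foo' => 'bar')), PHP_EOL;
```

The only difference between the two cases is where the encoded parameters travel: in the request target for GET, in the message body for POST.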
HTTP/1.1 200 OK
Content-Type: image/gif
Content-Length: 8558
Responses → Rendered Elements
<img src="/intl/en_ALL/images/logo.gif" />
GET /intl/en_ALL/images/logo.gif HTTP/1.1
Host: google.com
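To illustrate how one response leads to further requests, the sketch below finds img elements in a page with the DOM extension and builds the follow-up request for each one. The function buildResourceRequest is a hypothetical helper, and it assumes the src value is a root-relative path like the one shown above; relative or absolute URLs would need proper resolution first.

```php
<?php
// Hypothetical helper: turn a root-relative src found in a page into the
// request a browser would issue for that resource.
function buildResourceRequest($pageUrl, $src)
{
    $host = parse_url($pageUrl, PHP_URL_HOST);
    return "GET $src HTTP/1.1\r\nHost: $host\r\n\r\n";
}

// Markup standing in for the body of the first response.
$html = '<html><body><img src="/intl/en_ALL/images/logo.gif" /></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($html);

// Every embedded image implies one more request to retrieve.
foreach ($doc->getElementsByTagName('img') as $img) {
    echo buildResourceRequest('http://google.com/', $img->getAttribute('src'));
}
```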
Not As Easy As It Looks
Redirections
Referer [sic]
Cookies
User Agent Sniffing
robots.txt
Caching
HTTP Authentication
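Several of the items above map onto options of PHP's HTTP stream wrapper. The following is a minimal sketch, assuming a hypothetical helper named buildScraperHttpOptions and a made-up user agent string; it covers redirections, the Referer header, cookies, the user agent, and HTTP Basic authentication, but not robots.txt or caching, which need logic of their own.

```php
<?php
// Hypothetical helper: build HTTP stream context options covering several of
// the concerns listed above. The option keys belong to PHP's HTTP stream
// wrapper; the function and its parameters are illustrative only.
function buildScraperHttpOptions($referer, array $cookies, $username = null, $password = null)
{
    $headers = array('Referer: ' . $referer);

    // Cookies are sent back as a single "name=value; name=value" header.
    if ($cookies) {
        $pairs = array();
        foreach ($cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }

    // HTTP Basic authentication is a base64-encoded "user:password" pair.
    if ($username !== null) {
        $headers[] = 'Authorization: Basic ' . base64_encode($username . ':' . $password);
    }

    return array(
        'http' => array(
            'method'          => 'GET',
            'user_agent'      => 'ExampleScraper/0.1', // assumed identifier
            'follow_location' => 1, // follow redirections...
            'max_redirects'   => 5, // ...but not indefinitely
            'header'          => implode("\r\n", $headers),
        ),
    );
}

$options = buildScraperHttpOptions(
    'http://www.example.com/',
    array('session' => 'abc123'),
    'user',
    'secret'
);
$context = stream_context_create($options);
// The context would then be passed as the third argument, for example:
// $body = file_get_contents('http://www.example.com/some/resource', false, $context);
```

Keeping the options in a plain array makes them easy to inspect before any request is sent.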
PHP: Glue for the Web
HTTP Client Libraries
PEAR::HTTP_Client
pecl_http
Zend_Http_Client
Streams, cURL
Simple Streams Example
$uri = 'http://www.example.com/some/resource';

// GET request
$get = file_get_contents($uri);

// POST request with URL-encoded form data
$context = stream_context_create(array(
    'http' => array(
        'method' => 'POST',
        'header' => 'Content-Type: ' .
            'application/x-www-form-urlencoded',
        'content' => http_build_query(array(
            'var1' => 'value1',
            'var2' => 'value2'
        ))
    )
));
$post = file_get_contents($uri, false, $context);
pecl_http Example
$http = new HttpRequest($uri);
$http->enableCookies();
$http->setMethod(HTTP_METH_POST);
$http->addPostFields(array('var1' => 'value1'));
// setOptions() takes a single associative array
$http->setOptions(array(
    'useragent' => 'PHP ' . phpversion(),
    'referer' => 'http://example.com/some/referer'
));
$response = $http->send();
$headers = $response->getHeaders();
$body = $response->getBody();
pecl_http Request Pooling
$pool = new HttpRequestPool;
foreach ($urls as $url) {
    $request = new HttpRequest($url, HTTP_METH_GET);
    $pool->attach($request);
}
// Send all attached requests in parallel
$pool->send();
foreach ($pool as $request) {
    echo $request->getUrl(), PHP_EOL;
    echo $request->getResponseBody(), PHP_EOL;
}
HTTP Resources
RFC 2616 HyperText Transfer Protocol
RFC 3986 Uniform Resource Identifiers
"HTTP: The Definitive Guide" (ISBN 1565925092)
"HTTP Pocket Reference: HyperText Transfer Protocol" (ISBN 1565928628)
"HTTP Developer's Handbook" (ISBN 0672324547) by Chris Shiflett
Ben Ramsey's blog series on HTTP
Step #2: Analysis
Tidy Extension
$config = array('output-xhtml' => true);
// Parse from a string...
$tidy = tidy_parse_string($markupString, $config);
// ...or from a file
$tidy = tidy_parse_file($markupFilePath, $config);
$output = tidy_get_output($tidy);
DOM Extension
$doc = new DOMDocument;
// Load from a string...
$doc->loadHTML($htmlString);
// ...or from a file
$doc->loadHTMLFile($htmlFilePath);

// Select elements by tag name...
$listItems = $doc->getElementsByTagName('li');
// ...or with an XPath expression
$xpath = new DOMXPath($doc);
$listItems = $xpath->query('//ul/li');

foreach ($listItems as $listItem) {
    echo $listItem->nodeValue, PHP_EOL;
}
SimpleXML Extension
// Load from a string...
$sxe = new SimpleXMLElement($markupString);
// ...or from a file
$sxe = new SimpleXMLElement($filePath, null, true);

// Access elements
echo $sxe->body->ul->li[0], PHP_EOL;
$children = $sxe->body->ul->li;
$children = $sxe->body->ul->children();
foreach ($children as $li) {
    echo $li, PHP_EOL;
}

// Access attributes
echo $sxe->body->ul['id'];
$attributes = $sxe->body->ul->attributes();
foreach ($attributes as $name => $value) {
    echo $name, '=', $value, PHP_EOL;
}
XMLReader Extension
// Read from a string...
$doc = XMLReader::xml($xmlString);
// ...or from a file
$doc = XMLReader::open($filePath);

while ($doc->read()) {
    if ($doc->nodeType == XMLReader::ELEMENT) {
        var_dump($doc->localName);
        var_dump($doc->hasValue);
        var_dump($doc->value);
        var_dump($doc->hasAttributes);
        var_dump($doc->getAttribute('id'));
    }
}
CSS Selector Libraries
phpQuery
Simple HTML DOM Parser
Zend_Dom_Query
$doc1 = phpQuery::newDocumentFile($markupFilePath);
$doc2 = phpQuery::newDocument($markupString);
$listItems = pq('ul > li'); // uses $doc2
$listItems = pq('ul > li', $doc1);
PCRE Extension
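The slides give no example for PCRE, so the following is only a sketch of one way it might be used: pulling href values out of anchor tags with preg_match_all. The helper extractLinks and the sample markup are hypothetical, and the pattern assumes double-quoted href attributes; regular expressions tend to be brittle against arbitrary markup, so this suits small, well-understood fragments.

```php
<?php
// Hypothetical helper: extract href values from anchor tags using PCRE.
// Assumes double-quoted href attributes; other quoting styles are ignored.
function extractLinks($markup)
{
    $pattern = '/<a\s[^>]*href="([^"]*)"/i';
    preg_match_all($pattern, $markup, $matches);
    // $matches[1] holds the contents of the first capture group per match.
    return $matches[1];
}

$markup = '<ul><li><a href="/index.php?foo=bar">Index</a></li>'
    . '<li><a class="ext" href="http://www.example.com/">Example</a></li></ul>';

foreach (extractLinks($markup) as $href) {
    echo $href, PHP_EOL;
}
```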
Best Practices
Approximate Human Behavior
Minimize Requests
Batch Jobs, Non-Peak Hours
Account for Unavailability
Aim for Parallelism
Validate Data
Test, Test, Test!
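Two of the practices above, approximating human behavior and accounting for unavailability, can be combined in a small retry wrapper. This is a minimal sketch under stated assumptions: fetchWithRetry is a hypothetical helper, the callback is expected to return false on failure (as file_get_contents does), and the delay values are arbitrary.

```php
<?php
// Hypothetical helper: call $fetch until it succeeds or attempts run out,
// pausing longer (with random jitter) after each failure.
function fetchWithRetry($fetch, $maxAttempts = 3, $baseDelayMs = 500)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = call_user_func($fetch);
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            // The delay doubles on each attempt; the jitter avoids a
            // perfectly regular, machine-like request rhythm.
            $delayMs = $baseDelayMs * pow(2, $attempt - 1) + mt_rand(0, $baseDelayMs);
            usleep($delayMs * 1000);
        }
    }
    return false;
}

// Usage with a stand-in callback that fails twice before succeeding.
$calls = 0;
$result = fetchWithRetry(function () use (&$calls) {
    $calls++;
    return $calls < 3 ? false : '<html>ok</html>';
}, 3, 10);
echo $result, PHP_EOL;
```

In real use the callback would wrap an actual retrieval call, and the returned markup would still need validating before it is trusted.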
Questions