Scrapy Models
Modeling web crawlers with Scrapy and ScrapyModels
Web Crawling
- Capture unstructured content from the web: HTML, XML(?), plain text…
- Parse, validate, and store it
- Automate the process
{'links': [u'http://www.python.org/~guido/',
           u'http://neopythonic.blogspot.com/',
           u'http://www.artima.com/weblogs/index.jsp?blogger=guido',
           u'http://python-history.blogspot.com/',
           u'http://www.python.org/doc/essays/cp4e.html',
           u'http://www.twit.tv/floss11',
           u'http://www.computerworld.com.au/index.php/id;66665771',
           u'http://www.stanford.edu/class/ee380/Abstracts/081105.html',
           u'http://stanford-online.stanford.edu/courses/ee380/081105-ee380-300.asx'],
 'name': u'Guido van Rossum',
 'nationality': u'Dutch',
 'photo_url': 'http://en.m.wikipedia.org//wiki/File:Guido_van_Rossum_OSCON_2006.jpg',
 'url': 'http://en.m.wikipedia.org/wiki/Guido_van_Rossum'}
PYTHON ROCKS!!!
- LXML
- HTMLParser
- Beautiful Soup
- Scrapy
- XMLToDict
- Requests
Beautiful Soup

import requests
from bs4 import BeautifulSoup

html = requests.get("http://schblaums.com").text
soup = BeautifulSoup(html)
user_pictures = soup.find_all("img")
# [<Soup Object ...>, ...]
user_pictures[0]
# <img src="/user/picture.jpg" />
LXML
from lxml import etree

tree = etree.HTML(html)  # parse the HTML string fetched above
user_pictures = tree.xpath("//img")
# [<tree node>, <tree node>, ...]
CSS / jQuery
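The slide names the idea only; as a minimal sketch (not in the original deck), lxml ships CSS/jQuery-style selection via cssselect, reusing the tree from the LXML example:

from lxml.cssselect import CSSSelector

select_imgs = CSSSelector("img")   # CSS/jQuery-style query, compiled to XPath
user_pictures = select_imgs(tree)  # apply to the tree from the LXML example
user_pictures[0].get("src")        # '/user/picture.jpg'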
Scrapy
- Web crawling framework
- Process automation
- Validation
- Data mapping
- Selectors with XPath or CSS support (sketched below)
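A minimal sketch of that last point (not in the original deck), using Scrapy's standalone Selector on the html string fetched earlier:

from scrapy.selector import Selector

sel = Selector(text=html)
sel.css("img::attr(src)").extract()  # CSS selector
sel.xpath("//img/@src").extract()    # equivalent XPath selector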
Scrapy Model
# Model base class from the ORM of your choice (mongoengine, django, ...)
from mongoengine import Document as Model, StringField, ListField, ImageField

class Person(Model):
    name = StringField()
    links = ListField()
    picture = ImageField()
What is scrapy_model? It is a helper for creating scrapers with Scrapy Selectors, letting you select elements by CSS or by XPath and structure your scraper via Models (just like an ORM model), which can be plugged into an ORM model via the populate method.
Import BaseFetcherModel and CSSField or XPathField (you can use both):
from scrapy_model import BaseFetcherModel, CSSField
Go to a webpage you want to scrape and use Chrome dev tools or Firebug to figure out the CSS paths. Then, suppose you want to get the following fragment from some page:
<span id="person">Bruno Rocha <a href="http://brunorocha.org">website</a></span>
class MyFetcher(BaseFetcherModel):
    name = CSSField('span#person')
    website = CSSField('span#person a')
    # XPathField('//xpath_selector_here')
Multiple queries in a single field
You can use multiple queries for a single field:
name = XPathField(
    ['//*[@id="8"]/div[2]/div/div[2]/div[2]/ul',
     '//*[@id="8"]/div[2]/div/div[3]/div[2]/ul']
)
In that case, the parser will try the first query and return as soon as it finds a match; otherwise it tries the subsequent queries until it finds something, or returns an empty selector.
Finding the best match by a query validator
If you want to run multiple queries and also validate the best match, you can pass a validator function, which takes the Scrapy selector and should return a boolean.
For example, take the "name" field defined above and suppose you want to validate each query, ensuring the result has an <li> with the text "Schblaums" in it.
def has_schblaums(selector):
    for li in selector.css('li'):  # take each <li> inside the ul selector
        li_text = li.css('::text').extract()  # extract only the text
        if "Schblaums" in li_text:  # check if "Schblaums" is there
            return True  # this selector is valid!
    return False  # invalid query, take the next or the default value
class Fetcher(BaseFetcherModel):
    name = XPathField(
        ['//*[@id="8"]/div[2]/div/div[2]/div[2]/ul',
         '//*[@id="8"]/div[2]/div/div[3]/div[2]/ul'],
        query_validator=has_schblaums,
        default="undefined_name"
    )
Every method named parse_<field> will run after all the fields are fetched, once for each field.
def parse_name(self, selector):
    # here selector is the scrapy selector for 'span#person'
    name = selector.css('::text').extract()
    return name

def parse_website(self, selector):
    # here selector is the scrapy selector for 'span#person a'
    website_url = selector.css('::attr(href)').extract()
    return website_url
Once defined, you need to run the scraper:
fetcher = MyFetcher(url='http://.....')  # optionally use cached_fetch=True to cache requests on redis
fetcher.parse()
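After parse() runs, the extracted values can be pushed into an ORM object via the populate method mentioned earlier. A minimal sketch, assuming the Person model defined above; populate is named in the project description, but the exact call shown here is an assumption:

person = Person()
fetcher.populate(person)  # copy the scraped fields onto the ORM object (signature assumed)
person.name               # the value extracted by parse_name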
https://github.com/rochacbruno/scrapy_model