Frontera: open source, large scale web crawling framework
Alexander Sibiryakov, May 20, 2016, PyData Berlin 2016 [email protected]
About myself
• Software Engineer @ Scrapinghub
• Born in Yekaterinburg, RU
• 5 years at Yandex, search quality department: social and QA search, snippets.
• 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts.
We help turn web content into useful data
• Over 2 billion requests per month (~800/sec.)
• Focused crawls & broad crawls
{ "content": [ { "title": { "text": "'Extreme poverty' to fall below 10% of world population for first time", "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time" }, "points": "9 points", "time_ago": { "text": "2 hours ago", "href": "https://news.ycombinator.com/item?id=10352189" }, "username": { "text": "hliyan", "href": "https://news.ycombinator.com/user?id=hliyan" } },
Broad crawl usages
• Lead generation (extracting contact information)
• News analysis
• Topical crawling
• Plagiarism detection
• Sentiment analysis (popularity, likability)
• Due diligence (profile/business data)
• Track criminal activity & find lost persons (DARPA)
Saatchi Global Gallery Guide (www.globalgalleryguide.com)
• Discover 11K online galleries.
• Extract general information, art samples, descriptions.
• NLP-based extraction.
• Find more galleries on the web.
Frontera recipes
• Multiple websites data collection automation
• «Grep» of an internet segment
• Topical crawling
• Extracting data from arbitrary documents
Multiple websites data collection automation
• Scrapers for multiple websites.
• Data items are collected and kept updated.
• Frontera can be used to (see the settings sketch below):
  • crawl in parallel and scale the process,
  • schedule revisiting (within a fixed time),
  • prioritize the URLs during crawling.
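A minimal sketch of hooking a Scrapy project up to Frontera for such a crawl, following the Scrapy integration described in the Frontera docs of that era; module paths, setting names and the 'myproject.frontera_settings' module name are assumptions to verify against your Frontera version.

# settings.py of the Scrapy project (a sketch; verify names against your Frontera version)
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}
# Let Frontera decide what to crawl next, in what order, and when to revisit.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
# Separate module holding the Frontera backend, revisiting and prioritization settings
# ('myproject.frontera_settings' is a hypothetical name).
FRONTERA_SETTINGS = 'myproject.frontera_settings'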
«Grep» of an internet segment
• An alternative to Google:
• collect the zone files from registrars (.com/.net/.org),
• set up Frontera in distributed mode,
• implement text processing in the spider code (a sketch follows below),
• output items with matched pages.
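A minimal sketch of the «grep» spider idea: search the page body for a pattern and emit an item for every match. The pattern and the seed URL are placeholders; in a real run the seeds would come from the zone files fed into Frontera.

import re
import scrapy

PATTERN = re.compile(r'web crawling', re.I)   # placeholder search pattern

class GrepSpider(scrapy.Spider):
    name = 'grep'
    start_urls = ['http://example.com/']      # placeholder seed

    def parse(self, response):
        if PATTERN.search(response.text):
            # Output only the pages that matched.
            yield {'url': response.url, 'pattern': PATTERN.pattern}
        # Keep walking the segment.
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)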
Topical crawling
• a document topic classifier & a seed URL list,
• if a document is classified as positive, the crawler follows its extracted links,
• Frontera in distributed mode,
• the topic classifier code is put in the spider (a sketch follows below).
Extensions: link classifier, follow/final classifiers.
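A minimal sketch of the topical spider, with a trivial stand-in for the classifier; a real setup would plug in a trained model instead.

import scrapy

def is_on_topic(text):
    # Stand-in for a real document topic classifier (e.g. a trained linear model).
    return 'gallery' in text.lower()

class TopicalSpider(scrapy.Spider):
    name = 'topical'
    start_urls = ['http://example.com/']      # seed URL list (placeholder)

    def parse(self, response):
        positive = is_on_topic(response.text)
        yield {'url': response.url, 'on_topic': positive}
        if positive:
            # Follow links only from positively classified documents.
            for href in response.css('a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)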
Extracting data from arbitrary documents
• A tough problem; it can't be solved completely.
• Can be seen as a structured prediction problem, using:
  • Conditional Random Fields (CRF) or
  • Hidden Markov Models (HMM).
• A tagged sequence of tokens and HTML tags can be used to predict the data field boundaries.
• Tooling: the Webstruct library and the WebAnnotator Firefox extension.
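A toy sketch of the sequence-labeling view using sklearn-crfsuite (a generic CRF library chosen for illustration, not something the slides prescribe); the token features and BIO labels are made up.

import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        'lower': tok.lower(),
        'is_digit': tok.isdigit(),
        'is_title': tok.istitle(),
        'prev': tokens[i - 1].lower() if i > 0 else '<s>',
        'next': tokens[i + 1].lower() if i < len(tokens) - 1 else '</s>',
    }

# One tiny training sequence: page tokens and their BIO field labels.
tokens = ['Gallery', 'Nova', ',', 'Berlin', ',', '+49', '30', '1234567']
labels = ['B-NAME', 'I-NAME', 'O', 'B-CITY', 'O', 'B-PHONE', 'I-PHONE', 'I-PHONE']

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])   # predicted field boundaries for the same sequence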
Task
• Spanish web: hosts and their size statistics.
• Only the .es ccTLD.
• Breadth-first strategy: first the 1-click environment, then 2 clicks, 3, …
• Finishing condition: 100 docs per host max., all hosts.
• Low costs.
Spanish, Russian, German and world Web in 2012

                          Domains   Web servers   Hosts    DMOZ*
Spanish (.es)             1.5M      280K          4.2M     122K
Russian (.ru, .рф, .su)   4.8M      2.6M          ?        105K
German (.de)              15.0M     3.7M          20.4M    466K
World                     233M      62M           890M     3.9M

Sources: OECD Communications Outlook 2013, statdom.ru
* current period (October 2015)
Solution
• Scrapy (based on Twisted) - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning, scalability).
• twisted.internet - async primitives used in the workers.
• Snappy - efficient compression algorithm for IO-bound applications.
Architecture (diagram): Kafka topics connecting the crawling strategy workers (SW), the storage workers and the DB.
1. Big and small hosts problem
• The queue gets flooded with URLs from the same host
• → underuse of spider resources.
• Solution: an additional per-host (per-IP) queue and a metering algorithm (sketched below).
• URLs from big hosts are cached in memory.
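A minimal sketch of the per-host queue with simple round-robin metering, assuming Python 3; this illustrates the idea rather than Frontera's actual implementation.

from collections import defaultdict, deque
from urllib.parse import urlparse   # 'urlparse' module on Python 2

class PerHostQueue(object):
    """Round-robin over hosts so one big host cannot starve the others."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def push(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_batch(self, per_host=2):
        """Take at most `per_host` URLs from every host that has pending URLs."""
        batch = []
        for host, q in list(self.queues.items()):
            for _ in range(min(per_host, len(q))):
                batch.append(q.popleft())
            if not q:
                del self.queues[host]
        return batch

q = PerHostQueue()
for url in ['http://big.es/1', 'http://big.es/2', 'http://big.es/3', 'http://small.es/']:
    q.push(url)
print(q.next_batch())   # mixes hosts instead of draining big.es first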
2. DDoSing the Amazon AWS DNS service
• Breadth-first strategy → first visits to unknown hosts → a huge amount of DNS requests.
• Solution: a recursive DNS server on every spider node, with upstream to Verizon & OpenDNS. We used dnsmasq.
3. Tuning the Scrapy thread pool for efficient DNS resolution
• The OS DNS resolver makes blocking calls,
• Scrapy uses a thread pool to resolve DNS names to IPs,
• numerous errors and timeouts 🆘
• Solution: a patch allowing thread pool size and timeout adjustment (sketched below).
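The knobs that patch corresponds to exist as regular settings in later Scrapy versions; a sketch, with the values picked for illustration only.

# settings.py (check your Scrapy version for the exact names and defaults)
REACTOR_THREADPOOL_MAXSIZE = 20    # threads available to the blocking OS resolver
DNS_TIMEOUT = 15                   # seconds before a DNS resolution attempt fails
DNSCACHE_ENABLED = True            # cache resolved names inside the process
DNSCACHE_SIZE = 100000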
4. Overloaded HBase region servers during state check
• ~10^3 links per doc,
• state check: CRAWLED / NOT CRAWLED / ERROR,
• HDDs.
• Small volumes are fine 🆗, but as the table size grows, response times and the disk queue grow too.
• Solution: a host-local fingerprint function for keys in HBase (sketched below), and tuning the HBase block cache to fit the average host's states into one block.
3 TB of metadata: URLs, timestamps, … (~275 bytes/doc).
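A sketch of the host-local key idea: put a hash of the hostname first in the row key, so all URLs of one host sit next to each other in HBase. The exact hash choices here are illustrative, not necessarily the ones Frontera uses.

import hashlib
import struct
from zlib import crc32
from urllib.parse import urlparse   # 'urlparse' module on Python 2

def hbase_row_key(url):
    """Host hash first, URL hash second: rows of one host become adjacent,
    so a host's states are likely to land in the same HBase block."""
    host = urlparse(url).netloc
    host_hash = crc32(host.encode('utf-8')) & 0xffffffff
    url_hash = hashlib.sha1(url.encode('utf-8')).digest()
    return struct.pack('>I', host_hash) + url_hash

print(hbase_row_key('http://example.es/page').hex())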
5. Intensive network traffic from workers to services
• Throughput between the workers and Kafka/HBase is about 1 Gbit/s.
• Solutions: the Thrift compact protocol for HBase, message compression in Kafka with Snappy.
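A hedged sketch of those two knobs using the happybase and kafka-python client libraries (the slide does not say which clients were used); hostnames, the topic name and the payload are placeholders.

import happybase
from kafka import KafkaProducer

# Compact Thrift protocol for HBase (framed transport is usually required with it).
hbase = happybase.Connection('hbase-thrift-host', protocol='compact', transport='framed')

# Snappy-compressed messages on the Kafka data bus (needs the python-snappy package).
producer = KafkaProducer(bootstrap_servers=['kafka-host:9092'],
                         compression_type='snappy')
producer.send('spider-log', b'serialized crawl event')   # placeholder topic/payload
producer.flush()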
6. Further query and traffic optimizations to HBase
• The state check causes lots of requests and network traffic,
• and consistency is needed.
• Solution: a local state cache in the strategy worker; for consistency, the spider log was partitioned by host (sketched below).
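A minimal sketch of host-based partitioning: every event about a given host lands in the same partition, so a single strategy worker owns that host's state.

from zlib import crc32

def partition_for_host(host, n_partitions):
    """Same host → same partition → the same strategy worker sees all its events."""
    return crc32(host.encode('utf-8')) % n_partitions

print(partition_for_host('example.es', 8))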
State cache
• All operations are batched:
  - if a key is not in the cache → read from HBase,
  - every ~4K docs → flush.
• When close to 3M (~1 GB) elements → flush & cleanup.
• Least-Recently-Used (LRU) eviction 👍
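A minimal Python 3 sketch of such a cache; the dict standing in for HBase and the thresholds mirror the slide, but the code is illustrative rather than Frontera's implementation.

from collections import OrderedDict

HBASE = {}  # stand-in for the HBase state table in this sketch

class StateCache(object):
    """LRU state cache with batched flushes."""

    def __init__(self, flush_every=4000, max_size=3000000):
        self.flush_every = flush_every
        self.max_size = max_size
        self.cache = OrderedDict()
        self.dirty = {}
        self.ops_since_flush = 0

    def get(self, fingerprint):
        if fingerprint not in self.cache:
            # Cache miss -> read the state from storage (HBase in Frontera).
            self.cache[fingerprint] = HBASE.get(fingerprint, 'NOT CRAWLED')
        self.cache.move_to_end(fingerprint)            # mark as recently used
        return self.cache[fingerprint]

    def set(self, fingerprint, state):
        self.cache[fingerprint] = state
        self.cache.move_to_end(fingerprint)
        self.dirty[fingerprint] = state
        self.ops_since_flush += 1
        if self.ops_since_flush >= self.flush_every:   # every ~4K docs
            self.flush()
        while len(self.cache) > self.max_size:         # close to ~3M elements
            self.cache.popitem(last=False)             # LRU eviction

    def flush(self):
        HBASE.update(self.dirty)                       # one batched write
        self.dirty = {}
        self.ops_since_flush = 0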
Spider priority queue (slot)
• A cell is an array of: (fingerprint, crc32(hostname), URL, score).
• Dequeueing takes the top N.
• Prone to huge hosts.
• Scoring model: document count per host.
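A toy sketch of the cell layout and top-N dequeueing with a heap; the fingerprints and scores are made up for illustration.

import heapq
from zlib import crc32

queue = []

def push(fingerprint, hostname, url, score):
    # heapq is a min-heap, so negate the score to pop the highest-scored first.
    heapq.heappush(queue, (-score, fingerprint, crc32(hostname.encode('utf-8')), url))

def pop_top(n):
    return [heapq.heappop(queue)[1:] for _ in range(min(n, len(queue)))]

push('a' * 40, 'example.es', 'http://example.es/', 0.8)
push('b' * 40, 'example.org', 'http://example.org/', 0.5)
print(pop_top(10))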
7. The problem of big and small hosts (strikes back!)
• Discovered a few very huge hosts (>20M docs),
• all queue partitions were flooded with URLs from them.
• Solution: two MapReduce jobs: queue shuffling, and limiting all hosts to 100 docs max.
Spanish (.es) internet crawl results
• fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites,
• 68.7K domains found (~600K expected),
• 46.5M pages crawled overall,
• 1.5 months,
• 22 websites with more than 50M pages.
Where are the rest of the web servers?!
Bow-tie model (diagram)
A. Broder et al., Computer Networks 33 (2000) 309-320.

Y. Hirate, S. Kato, and H. Yamana, "Web Structure in 2005" (diagram).

12 years of dynamics (diagram)
"Graph Structure in the Web — Revisited", Meusel, Vigna, WWW 2014.
Hardware requirements (distributed backend + spiders)
• A single-threaded Scrapy spider gives 1200 pages/min from about 100 websites crawled in parallel.
• The spiders-to-strategy-workers ratio is 4:1 (without content).
• 1 GB of RAM for every SW (state cache, tunable).
• Example: 12 spiders ~ 14.4K pages/min, 3 SW and 3 DB workers, 18 cores in total.
Software requirements

Single process:      Python 2.7+, Scrapy 1.0.4+; sqlite or any other RDBMS.
Distributed spiders: the above, plus ZeroMQ or Kafka as the message bus.
Distributed backend: the above, plus HBase (or an RDBMS) for storage and a DNS service.
Main features
• Online operation: scheduling of new batches, updating of the DB state.
• Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included).
• Run modes: single process, distributed spiders, distributed backend.
• Scrapy ecosystem: good documentation, big community, ease of customization.
Main features (continued)
• Message bus abstraction (ZeroMQ and Kafka are available out of the box).
• Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module.
• Polite by design: each website is downloaded by at most one spider.
• Canonical URL resolution abstraction: each document has many URLs; which one to use?
• Python: workers, spiders.
References
GitHub: https://github.com/scrapinghub/frontera
RTD: http://frontera.readthedocs.org/
Google Groups: Frontera (https://goo.gl/ak9546)
Future plans
• Python 3 support,
• Docker images,
• Web UI,
• watchdog solution: tracking website content changes,
• PageRank or HITS strategy,
• own HTML and URL parsers,
• integration into Scrapinghub services,
• testing on larger volumes.
Run your business using Frontera
SCALABLE, OPEN, CUSTOMIZABLE
Made in Scrapinghub (authors of Scrapy)
Contribute!
• A web-scale crawler,
• historically the first such attempt in Python,
• a truly resource-intensive task: CPU, network, disks.
Mandatory sales slide
Crawl the web, at scale:
• cloud-based platform,
• smart proxy rotator.
Get data, hassle-free:
• off-the-shelf datasets,
• turn-key web scraping.
try.scrapinghub.com/PDB16
Questions? Thank you!
Alexander Sibiryakov, [email protected]