Frontera: open source, large scale web crawling framework Alexander Sibiryakov, May 20, 2016, PyData Berlin 2016 [email protected]


Page 1: Alexander Sibiryakov- Frontera

Frontera: open source, large scale web crawling framework

Alexander Sibiryakov, May 20, 2016, PyData Berlin 2016 [email protected]

Page 2: Alexander Sibiryakov- Frontera

About myself

• Software Engineer @ Scrapinghub.

• Born in Yekaterinburg, RU.

• 5 years at Yandex, search quality department: social and QA search, snippets.

• 2 years at Avast! antivirus, research team: automatic false-positive solving, large-scale prediction of malicious download attempts.

Page 3: Alexander Sibiryakov- Frontera

We help turn web content into useful data

• Over 2 billion requests per month (~800/sec.).

• Focused crawls & broad crawls.

{ "content": [ {
    "title": {
      "text": "'Extreme poverty' to fall below 10% of world population for first time",
      "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time" },
    "points": "9 points",
    "time_ago": {
      "text": "2 hours ago",
      "href": "https://news.ycombinator.com/item?id=10352189" },
    "username": {
      "text": "hliyan",
      "href": "https://news.ycombinator.com/user?id=hliyan" } },

Page 4: Alexander Sibiryakov- Frontera

Broad crawl usages

• Lead generation (extracting contact information)

• News analysis

• Topical crawling

• Plagiarism detection

• Sentiment analysis (popularity, likability)

• Due diligence (profile/business data)

• Tracking criminal activity & finding lost persons (DARPA)

Page 5: Alexander Sibiryakov- Frontera

Saatchi Global Gallery Guide
www.globalgalleryguide.com

• Discover 11K online galleries.

• Extract general information, art samples, descriptions.

• NLP-based extraction.

• Find more galleries on the web.

Page 6: Alexander Sibiryakov- Frontera

Frontera recipes

• Multiple websites data collection automation

• «Grep» of the internet segment

• Topical crawling

• Extracting data from arbitrary document

Page 7: Alexander Sibiryakov- Frontera

Multiple websites data collection automation

• Scrapers from multiple websites.

• Data items collected and updated.

• Frontera can be used to:

  • crawl in parallel and scale the process,

  • schedule revisiting (within a fixed time),

  • prioritize the URLs during crawling.
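The three bullets above can be sketched as a toy scheduling queue: prioritized by score, with a revisit rescheduled after a fixed interval. All names here are illustrative, not Frontera's actual backend API.

```python
import heapq
import time

class RevisitQueue:
    """Toy sketch of prioritized crawling with fixed-interval revisits."""

    def __init__(self, revisit_interval=3600.0):
        self.revisit_interval = revisit_interval
        self._heap = []  # entries: (-score, not_before_timestamp, url)

    def schedule(self, url, score, not_before=0.0):
        heapq.heappush(self._heap, (-score, not_before, url))

    def next_batch(self, n, now=None):
        """Return up to n due URLs, highest score first; reschedule each for revisit."""
        now = time.time() if now is None else now
        batch, deferred = [], []
        while self._heap and len(batch) < n:
            neg_score, not_before, url = heapq.heappop(self._heap)
            if not_before <= now:
                batch.append(url)
                # schedule the revisit within a fixed time
                heapq.heappush(self._heap, (neg_score, now + self.revisit_interval, url))
            else:
                deferred.append((neg_score, not_before, url))
        for item in deferred:  # put not-yet-due entries back
            heapq.heappush(self._heap, item)
        return batch
```

Scaling the process then means running several consumers of such a queue in parallel, which is what the distributed run mode provides.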

Page 8: Alexander Sibiryakov- Frontera

«Grep» of the internet segment

• an alternative to Google,

• collect the zone files from registrars (.com/.net/.org),

• setup Frontera in distributed mode,

• implement text processing in spider code,

• output items with matched pages.
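The text-processing step that would live in the spider code can be as simple as the following sketch; the pattern and item fields are illustrative.

```python
import re

# Hypothetical "grep" callback body: match a pattern against the page
# and emit an item only for pages that contain it.
PATTERN = re.compile(r"bitcoin|blockchain", re.IGNORECASE)

def grep_page(url, body):
    matches = PATTERN.findall(body)
    if not matches:
        return None  # non-matching pages produce no item
    return {"url": url, "matches": sorted({m.lower() for m in matches})}
```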

Page 9: Alexander Sibiryakov- Frontera

Topical crawling

• document topic classifier & seed URL list,

• if a document is classified as positive, the crawler follows its extracted links,

• Frontera in distributed mode,

• topic classifier code put in spider.

Extensions: link classifier, follow/final classifiers.
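The classify-then-follow gate above can be sketched like this; the keyword scorer is a stand-in for the real topic classifier, and all names are illustrative.

```python
# Toy topic classifier: counts on-topic terms instead of a trained model.
TOPIC_TERMS = {"art", "gallery", "painting", "exhibition"}

def is_on_topic(text, threshold=2):
    words = text.lower().split()
    return sum(w.strip(".,") in TOPIC_TERMS for w in words) >= threshold

def links_to_follow(text, extracted_links):
    # positive classification -> schedule the page's extracted links
    return list(extracted_links) if is_on_topic(text) else []
```

A link classifier, as mentioned in the extensions, would apply a similar gate per outgoing link rather than per page.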

Page 10: Alexander Sibiryakov- Frontera

Extracting data from arbitrary document

• Tough problem. Can’t be solved completely.

• Can be seen as a structured prediction problem:

• Conditional Random Fields (CRF) or

• Hidden Markov Models (HMM).

• A tagged sequence of tokens and HTML tags can be used to predict the data field boundaries.

• Tools: the Webstruct library and the WebAnnotator Firefox extension.
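The structured-prediction framing above is usually expressed as BIO labels over the token sequence (B-egin, I-nside, O-utside a field); a CRF or HMM predicts the labels, and a decoder like this sketch reads them back into field spans. The function below is illustrative, not Webstruct's API.

```python
# Decode a BIO-labeled token sequence into (field, text) spans.
def spans_from_bio(tokens, labels):
    spans, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):           # a new field begins
            if current:
                spans.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)         # field continues
        else:                                # outside any field
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(field, " ".join(toks)) for field, toks in spans]
```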

Page 11: Alexander Sibiryakov- Frontera

Task

• Spanish web: statistics on hosts and their sizes.

• Only the .es ccTLD.

• Breadth-first strategy:

  • first the 1-click environment,

  • then 2,

  • then 3,

  • …

• Finishing condition: at most 100 docs per host, for all hosts.

• Low costs.
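The crawl ordering described above can be sketched as a breadth-first frontier restricted to .es with a per-host document cap; `fetch_links` is a hypothetical stand-in for the real fetch-and-parse step.

```python
from collections import deque
from urllib.parse import urlparse

MAX_DOCS_PER_HOST = 100  # finishing condition per host

def bfs_crawl(seeds, fetch_links):
    per_host, seen = {}, set()
    queue = deque(seeds)
    order = []
    while queue:
        url = queue.popleft()
        host = urlparse(url).netloc
        if url in seen or not host.endswith(".es"):
            continue  # only the .es ccTLD
        if per_host.get(host, 0) >= MAX_DOCS_PER_HOST:
            continue
        seen.add(url)
        per_host[host] = per_host.get(host, 0) + 1
        order.append(url)
        queue.extend(fetch_links(url))  # next "click" environment
    return order
```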

Page 12: Alexander Sibiryakov- Frontera

Spanish, Russian, German and world Web in 2012

                         Domains   Web servers   Hosts    DMOZ*
Spanish (.es)            1.5M      280K          4.2M     122K
Russian (.ru, .рф, .su)  4.8M      2.6M          ?        105K
German (.de)             15.0M     3.7M          20.4M    466K
World                    233M      62M           890M     3.9M

Sources: OECD Communications Outlook 2013, statdom.ru
* current period (October 2015)

Page 13: Alexander Sibiryakov- Frontera

Solution

• Scrapy (based on Twisted) - network operations.

• Apache Kafka - data bus (offsets, partitioning).

• Apache HBase - storage (random access, linear scanning, scalability).

• twisted.internet - async primitives for use in the workers.

• Snappy - efficient compression algorithm for IO-bound applications.

Page 14: Alexander Sibiryakov- Frontera

Architecture

[Diagram: Kafka topics connect the spiders with the crawling strategy workers (SW) and the storage workers, which persist crawl state to the DB.]

Page 15: Alexander Sibiryakov- Frontera

1. Big and small hosts problem

• Queue is flooded with URLs from the same host,

• → underuse of spider resources.

• Solution: an additional per-host (per-IP) queue and a metering algorithm.

• URLs from big hosts are cached in memory.
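The per-host metering idea can be sketched as follows: instead of one global queue that a big host can flood, keep a small queue per host and build spider batches round-robin, so small hosts are never starved. Illustrative only, not Frontera's implementation.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class PerHostQueue:
    def __init__(self):
        self.queues = defaultdict(deque)  # one queue per host

    def push(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_batch(self, n):
        """Take up to n URLs, cycling over hosts round-robin."""
        batch = []
        while len(batch) < n and any(self.queues.values()):
            for host in list(self.queues):
                if self.queues[host]:
                    batch.append(self.queues[host].popleft())
                    if len(batch) == n:
                        break
        return batch
```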

Page 16: Alexander Sibiryakov- Frontera

2. DDoSing the Amazon AWS DNS service

• Breadth-first strategy → first visits to unknown hosts → a huge amount of DNS requests.

• Solution: a recursive DNS server on every spider node, with upstream to Verizon & OpenDNS.

• We used dnsmasq.

Page 17: Alexander Sibiryakov- Frontera

3. Tuning the Scrapy thread pool for efficient DNS resolution

• OS DNS resolver,

• blocking calls,

• a thread pool to resolve DNS names to IPs,

• numerous errors and timeouts 🆘

• Solution: a patch for thread pool size and timeout adjustment.
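In later Scrapy releases the knobs that patch introduced are regular settings; a sketch of the relevant `settings.py` fragment (the values are illustrative, not recommendations):

```python
# settings.py - DNS-related tuning in Scrapy
REACTOR_THREADPOOL_MAXSIZE = 20  # more threads for blocking resolver calls
DNS_TIMEOUT = 30                 # seconds before a lookup is abandoned
DNSCACHE_ENABLED = True          # cache resolved names in-process
DNSCACHE_SIZE = 10000
```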

Page 18: Alexander Sibiryakov- Frontera

4. Overloaded HBase region servers during state check

• 10^3 links per doc,

• state check: CRAWLED / NOT CRAWLED / ERROR,

• HDDs.

• Small volume 🆗

• With ⬆ table size, response times ⬆, disk queue ⬆

• Solution: a host-local fingerprint function for keys in HBase,

• tuning the HBase block cache to fit the average host's states into one block.

3 TB of metadata: URLs, timestamps, … (~275 bytes/doc)
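A host-local fingerprint can be sketched like this: prefixing the row key with a hash of the hostname keeps all of a host's URLs adjacent in HBase, so one state-check batch for a host touches few blocks. The key layout is illustrative, not Frontera's exact format.

```python
import hashlib
from urllib.parse import urlparse

def hostlocal_fingerprint(url):
    """Row key: short host hash prefix + full URL hash."""
    host = urlparse(url).netloc
    host_part = hashlib.sha1(host.encode()).hexdigest()[:8]
    url_part = hashlib.sha1(url.encode()).hexdigest()
    return host_part + url_part
```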

Page 19: Alexander Sibiryakov- Frontera

5. Intensive network traffic from workers to services

• Throughput between workers and Kafka/HBase ~ 1 Gbit/s.

• Solution: the Thrift compact protocol for HBase,

• message compression in Kafka with Snappy.

Page 20: Alexander Sibiryakov- Frontera

6. Further query and traffic optimizations to HBase

• State check: lots of requests and network traffic,

• consistency is needed.

• Solution: a local state cache in the strategy worker.

• For consistency, the spider log was partitioned by host.

Page 21: Alexander Sibiryakov- Frontera

State cache

• All ops are batched:

  – if a key is not in the cache → read HBase,

  – every ~4K docs → flush.

• Close to 3M (~1 Gb) elements → flush & cleanup.

• Least-Recently-Used (LRU) eviction 👍
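The state cache above can be sketched as an LRU dict of fingerprint → crawl state, flushed to storage every few thousand updates and trimmed when it grows past a limit. The sizes here are toy values, and the in-memory `store` stands in for HBase.

```python
from collections import OrderedDict

class StateCache:
    def __init__(self, flush_every=4000, max_size=3_000_000, store=None):
        self.cache = OrderedDict()
        self.flush_every = flush_every
        self.max_size = max_size
        self.dirty = 0
        self.store = store if store is not None else {}  # stand-in for HBase

    def update(self, fingerprint, state):
        self.cache[fingerprint] = state
        self.cache.move_to_end(fingerprint)   # mark as recently used
        self.dirty += 1
        if self.dirty >= self.flush_every:    # batched write-back
            self.flush()
        while len(self.cache) > self.max_size:
            self.cache.popitem(last=False)    # evict least recently used

    def flush(self):
        self.store.update(self.cache)
        self.dirty = 0

    def get(self, fingerprint):
        if fingerprint not in self.cache and fingerprint in self.store:
            self.cache[fingerprint] = self.store[fingerprint]  # read-through
        return self.cache.get(fingerprint)
```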

Page 22: Alexander Sibiryakov- Frontera

Spider priority queue (slot)

• Cell: an array of (fingerprint, Crc32(hostname), URL, score).

• Dequeueing the top N.

• Prone to huge hosts.

• Scoring model: document count per host.
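The cell layout and top-N dequeueing can be sketched as follows; this is an illustrative in-memory model, not the on-disk HBase format.

```python
import heapq
import zlib
from urllib.parse import urlparse

def make_cell(fingerprint, url, score):
    """Queue cell: (fingerprint, crc32(hostname), url, score)."""
    host = urlparse(url).netloc
    return (fingerprint, zlib.crc32(host.encode()), url, score)

def dequeue_top_n(cells, n):
    """Remove and return the n highest-scoring cells."""
    top = heapq.nlargest(n, cells, key=lambda cell: cell[3])
    for cell in top:
        cells.remove(cell)
    return top
```

With document-count-per-host scoring, a huge host dominates the top-N, which is exactly the "prone to huge hosts" weakness noted above.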

Page 23: Alexander Sibiryakov- Frontera

7. Problem of big and small hosts (strikes back!)

• Discovered a few very huge hosts (>20M docs each),

• all queue partitions were flooded with URLs from huge hosts.

• Solution: two MapReduce jobs:

  – queue shuffling,

  – limiting every host to 100 docs max.
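The second job's logic can be sketched in plain Python standing in for the MapReduce reducer: group queued URLs by host after the shuffle/sort phase and keep at most 100 per host.

```python
from itertools import groupby
from urllib.parse import urlparse

MAX_PER_HOST = 100

def cap_hosts(urls):
    # sort by host = the shuffle/sort phase of the MapReduce job
    keyed = sorted(urls, key=lambda u: urlparse(u).netloc)
    capped = []
    for host, group in groupby(keyed, key=lambda u: urlparse(u).netloc):
        capped.extend(list(group)[:MAX_PER_HOST])  # reducer: cap per host
    return capped
```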

Page 24: Alexander Sibiryakov- Frontera

Spanish (.es) internet crawl results

• fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites,

• 68.7K domains found (~600K expected),

• 46.5M pages crawled overall,

• 1.5 months,

• 22 websites with more than 50M pages.

Page 25: Alexander Sibiryakov- Frontera

Where are the rest of the web servers?!

Page 26: Alexander Sibiryakov- Frontera

Bow-tie model

A. Broder et al. / Computer Networks 33 (2000) 309-320

Page 27: Alexander Sibiryakov- Frontera

Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005


Page 28: Alexander Sibiryakov- Frontera

12 years of dynamics

Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014

Page 29: Alexander Sibiryakov- Frontera

Hardware requirements (distributed backend + spiders)

• A single-threaded Scrapy spider gives 1200 pages/min. from about 100 websites in parallel.

• Spiders-to-workers ratio is 4:1 (without content).

• 1 Gb of RAM for every SW (state cache, tunable).

• Example:

  • 12 spiders ~ 14.4K pages/min.,

  • 3 SW and 3 DB workers,

  • 18 cores total.

Page 30: Alexander Sibiryakov- Frontera

Software requirements

• Single process: Python 2.7+, Scrapy 1.0.4+, sqlite or any other RDBMS.

• Distributed spiders: the above, plus ZeroMQ or Kafka.

• Distributed backend: Python 2.7+, Scrapy 1.0.4+, HBase (or an RDBMS), ZeroMQ or Kafka, a DNS service.

Page 31: Alexander Sibiryakov- Frontera

Main features

• Online operation: scheduling of new batches, updating of DB state.

• Storage abstraction: write your own backend (SQLAlchemy and HBase are included).

• Run modes: single process, distributed spiders, distributed backend.

• Scrapy ecosystem: good documentation, big community, ease of customization.

Page 32: Alexander Sibiryakov- Frontera

Main features

• Message bus abstraction (ZeroMQ and Kafka are available out of the box).

• Crawling strategy abstraction: the crawling goal, URL ordering, and scoring model are coded in a separate module.

• Polite by design: each website is downloaded by at most one spider.

• Canonical URL resolution abstraction: each document has many URLs; which one to use?

• Python: workers, spiders.
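The "polite by design" property follows from partitioning by host: hashing the hostname to a fixed spider partition guarantees that at most one spider ever downloads from a given website. A minimal sketch of that assignment (illustrative, not Frontera's exact partitioner):

```python
import zlib

def spider_for_host(hostname, n_spiders):
    """Deterministically assign every host to exactly one spider partition."""
    return zlib.crc32(hostname.encode()) % n_spiders
```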

Page 33: Alexander Sibiryakov- Frontera

References

GitHub: https://github.com/scrapinghub/frontera

RTD: http://frontera.readthedocs.org/

Google Groups: Frontera (https://goo.gl/ak9546)

Page 34: Alexander Sibiryakov- Frontera

Future plans

• Python 3 support,

• Docker images,

• Web UI,

• watchdog solution: tracking website content changes,

• PageRank or HITS strategy,

• own HTML and URL parsers,

• integration into Scrapinghub services,

• testing on larger volumes.

Page 35: Alexander Sibiryakov- Frontera

Run your business using Frontera

SCALABLE • OPEN • CUSTOMIZABLE

Made in Scrapinghub (authors of Scrapy)

Page 36: Alexander Sibiryakov- Frontera

Contribute!

• A web-scale crawler,

• historically the first such attempt in Python,

• a truly resource-intensive task: CPU, network, disks.

Page 37: Alexander Sibiryakov- Frontera

We're hiring!

http://scrapinghub.com/jobs/

Page 38: Alexander Sibiryakov- Frontera

Mandatory sales slide

Crawl the web, at scale:

• cloud-based platform,

• smart proxy rotator.

Get data, hassle-free:

• off-the-shelf datasets,

• turn-key web scraping.

try.scrapinghub.com/PDB16

Page 39: Alexander Sibiryakov- Frontera

Questions? Thank you!

Alexander Sibiryakov, [email protected]