Frontera: open source, large scale web crawling framework
Alexander Sibiryakov, May 20, 2016, PyData Berlin 2016 [email protected]
About myself
• Software Engineer @ Scrapinghub
• Born in Yekaterinburg, RU
• 5 years at Yandex, search quality department: social and QA search, snippets.
• 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts.
We help turn web content into useful data
• Over 2 billion requests per month (~800/sec.)
• Focused crawls & broad crawls
{ "content": [ { "title": { "text": "'Extreme poverty' to fall below 10% of world population for first time", "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time" }, "points": "9 points", "time_ago": { "text": "2 hours ago", "href": "https://news.ycombinator.com/item?id=10352189" }, "username": { "text": "hliyan", "href": "https://news.ycombinator.com/user?id=hliyan" } },
Broad crawl usages
• Lead generation (extracting contact information)
• News analysis
• Topical crawling
• Plagiarism detection
• Sentiment analysis (popularity, likability)
• Due diligence (profile/business data)
• Track criminal activity & find lost persons (DARPA)
Saatchi Global Gallery Guide (www.globalgalleryguide.com)
• Discover 11K online galleries.
• Extract general information, art samples, descriptions.
• NLP-based extraction.
• Find more galleries on the web.
Frontera recipes
• Multiple websites data collection automation
• «Grep» of an internet segment
• Topical crawling
• Extracting data from arbitrary documents
Multiple websites data collection automation
• Scrapers for multiple websites.
• Data items are collected and kept updated.
• Frontera can be used to (see the settings sketch below):
  • crawl in parallel and scale the process,
  • schedule revisiting (within a fixed time),
  • prioritize the URLs during crawling.
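A minimal sketch of hooking a Scrapy project up to Frontera for such a crawl, following the Scrapy integration described in the Frontera docs of that era; module paths, setting names and the 'myproject.frontera_settings' module name are assumptions to verify against your Frontera version.

# settings.py of the Scrapy project (a sketch; verify names against your Frontera version)
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}
# Let Frontera decide what to crawl next, in what order, and when to revisit.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
# Separate module holding the Frontera backend, revisiting and prioritization settings
# ('myproject.frontera_settings' is a hypothetical name).
FRONTERA_SETTINGS = 'myproject.frontera_settings'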
«Grep» of an internet segment
• An alternative to Google:
• collect the zone files from registrars (.com/.net/.org),
• set up Frontera in distributed mode,
• implement text processing in the spider code (a sketch follows below),
• output items with matched pages.
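A minimal sketch of the «grep» spider idea: search the page body for a pattern and emit an item for every match. The pattern and the seed URL are placeholders; in a real run the seeds would come from the zone files fed into Frontera.

import re
import scrapy

PATTERN = re.compile(r'web crawling', re.I)   # placeholder search pattern

class GrepSpider(scrapy.Spider):
    name = 'grep'
    start_urls = ['http://example.com/']      # placeholder seed

    def parse(self, response):
        if PATTERN.search(response.text):
            # Output only the pages that matched.
            yield {'url': response.url, 'pattern': PATTERN.pattern}
        # Keep walking the segment.
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)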
Topical crawling
• a document topic classifier & a seed URL list,
• if a document is classified as positive, the crawler follows its extracted links,
• Frontera in distributed mode,
• the topic classifier code is put in the spider (a sketch follows below).
Extensions: link classifier, follow/final classifiers.
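A minimal sketch of the topical spider, with a trivial stand-in for the classifier; a real setup would plug in a trained model instead.

import scrapy

def is_on_topic(text):
    # Stand-in for a real document topic classifier (e.g. a trained linear model).
    return 'gallery' in text.lower()

class TopicalSpider(scrapy.Spider):
    name = 'topical'
    start_urls = ['http://example.com/']      # seed URL list (placeholder)

    def parse(self, response):
        positive = is_on_topic(response.text)
        yield {'url': response.url, 'on_topic': positive}
        if positive:
            # Follow links only from positively classified documents.
            for href in response.css('a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)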
Extracting data from arbitrary documents
• A tough problem; it can't be solved completely.
• Can be seen as a structured prediction problem, using:
  • Conditional Random Fields (CRF) or
  • Hidden Markov Models (HMM).
• A tagged sequence of tokens and HTML tags can be used to predict the data field boundaries.
• Tooling: the Webstruct library and the WebAnnotator Firefox extension.
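A toy sketch of the sequence-labeling view using sklearn-crfsuite (a generic CRF library chosen for illustration, not something the slides prescribe); the token features and BIO labels are made up.

import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        'lower': tok.lower(),
        'is_digit': tok.isdigit(),
        'is_title': tok.istitle(),
        'prev': tokens[i - 1].lower() if i > 0 else '<s>',
        'next': tokens[i + 1].lower() if i < len(tokens) - 1 else '</s>',
    }

# One tiny training sequence: page tokens and their BIO field labels.
tokens = ['Gallery', 'Nova', ',', 'Berlin', ',', '+49', '30', '1234567']
labels = ['B-NAME', 'I-NAME', 'O', 'B-CITY', 'O', 'B-PHONE', 'I-PHONE', 'I-PHONE']

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])   # predicted field boundaries for the same sequence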
Task
• Spanish web: hosts and their size statistics.
• Only the .es ccTLD.
• Breadth-first strategy: first the 1-click environment, then 2 clicks, 3, …
• Finishing condition: 100 docs per host max., all hosts.
• Low costs.
Spanish, Russian, German and world Web in 2012

                          Domains   Web servers   Hosts    DMOZ*
Spanish (.es)             1.5M      280K          4.2M     122K
Russian (.ru, .рф, .su)   4.8M      2.6M          ?        105K
German (.de)              15.0M     3.7M          20.4M    466K
World                     233M      62M           890M     3.9M

Sources: OECD Communications Outlook 2013, statdom.ru
* current period (October 2015)
Solution
• Scrapy (based on Twisted) - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning, scalability).
• twisted.internet - async primitives used in the workers.
• Snappy - efficient compression algorithm for IO-bound applications.
Architecture (diagram): Kafka topics connecting the crawling strategy workers (SW), the storage workers and the DB.
1. Big and small hosts problem
• The queue gets flooded with URLs from the same host
• → underuse of spider resources.
• Solution: an additional per-host (per-IP) queue and a metering algorithm (sketched below).
• URLs from big hosts are cached in memory.
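A minimal sketch of the per-host queue with simple round-robin metering, assuming Python 3; this illustrates the idea rather than Frontera's actual implementation.

from collections import defaultdict, deque
from urllib.parse import urlparse   # 'urlparse' module on Python 2

class PerHostQueue(object):
    """Round-robin over hosts so one big host cannot starve the others."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def push(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_batch(self, per_host=2):
        """Take at most `per_host` URLs from every host that has pending URLs."""
        batch = []
        for host, q in list(self.queues.items()):
            for _ in range(min(per_host, len(q))):
                batch.append(q.popleft())
            if not q:
                del self.queues[host]
        return batch

q = PerHostQueue()
for url in ['http://big.es/1', 'http://big.es/2', 'http://big.es/3', 'http://small.es/']:
    q.push(url)
print(q.next_batch())   # mixes hosts instead of draining big.es first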
2. DDoSing the Amazon AWS DNS service
• Breadth-first strategy → first visits to unknown hosts → a huge amount of DNS requests.
• Solution: a recursive DNS server on every spider node, with upstream to Verizon & OpenDNS. We used dnsmasq.
3. Tuning the Scrapy thread pool for efficient DNS resolution
• The OS DNS resolver makes blocking calls,
• Scrapy uses a thread pool to resolve DNS names to IPs,
• numerous errors and timeouts 🆘
• Solution: a patch allowing thread pool size and timeout adjustment (sketched below).
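The knobs that patch corresponds to exist as regular settings in later Scrapy versions; a sketch, with the values picked for illustration only.

# settings.py (check your Scrapy version for the exact names and defaults)
REACTOR_THREADPOOL_MAXSIZE = 20    # threads available to the blocking OS resolver
DNS_TIMEOUT = 15                   # seconds before a DNS resolution attempt fails
DNSCACHE_ENABLED = True            # cache resolved names inside the process
DNSCACHE_SIZE = 100000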
4. Overloaded HBase region servers during state check
• ~10^3 links per doc,
• state check: CRAWLED / NOT CRAWLED / ERROR,
• HDDs.
• Small volumes are fine 🆗, but as the table size grows, response times and the disk queue grow too.
• Solution: a host-local fingerprint function for keys in HBase (sketched below), and tuning the HBase block cache to fit the average host's states into one block.
3 TB of metadata: URLs, timestamps, … (~275 bytes/doc).
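A sketch of the host-local key idea: put a hash of the hostname first in the row key, so all URLs of one host sit next to each other in HBase. The exact hash choices here are illustrative, not necessarily the ones Frontera uses.

import hashlib
import struct
from zlib import crc32
from urllib.parse import urlparse   # 'urlparse' module on Python 2

def hbase_row_key(url):
    """Host hash first, URL hash second: rows of one host become adjacent,
    so a host's states are likely to land in the same HBase block."""
    host = urlparse(url).netloc
    host_hash = crc32(host.encode('utf-8')) & 0xffffffff
    url_hash = hashlib.sha1(url.encode('utf-8')).digest()
    return struct.pack('>I', host_hash) + url_hash

print(hbase_row_key('http://example.es/page').hex())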
5. Intensive network traffic from workers to services
• Throughput between the workers and Kafka/HBase is about 1 Gbit/s.
• Solutions: the Thrift compact protocol for HBase, message compression in Kafka with Snappy.
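A hedged sketch of those two knobs using the happybase and kafka-python client libraries (the slide does not say which clients were used); hostnames, the topic name and the payload are placeholders.

import happybase
from kafka import KafkaProducer

# Compact Thrift protocol for HBase (framed transport is usually required with it).
hbase = happybase.Connection('hbase-thrift-host', protocol='compact', transport='framed')

# Snappy-compressed messages on the Kafka data bus (needs the python-snappy package).
producer = KafkaProducer(bootstrap_servers=['kafka-host:9092'],
                         compression_type='snappy')
producer.send('spider-log', b'serialized crawl event')   # placeholder topic/payload
producer.flush()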
6. Further query and traffic optimizations to HBase
• The state check causes lots of requests and network traffic,
• and consistency is needed.
• Solution: a local state cache in the strategy worker; for consistency, the spider log was partitioned by host (sketched below).
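A minimal sketch of host-based partitioning: every event about a given host lands in the same partition, so a single strategy worker owns that host's state.

from zlib import crc32

def partition_for_host(host, n_partitions):
    """Same host → same partition → the same strategy worker sees all its events."""
    return crc32(host.encode('utf-8')) % n_partitions

print(partition_for_host('example.es', 8))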
State cache
• All operations are batched:
  - if a key is not in the cache → read from HBase,
  - every ~4K docs → flush.
• When close to 3M (~1 GB) elements → flush & cleanup.
• Least-Recently-Used (LRU) eviction 👍
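A minimal Python 3 sketch of such a cache; the dict standing in for HBase and the thresholds mirror the slide, but the code is illustrative rather than Frontera's implementation.

from collections import OrderedDict

HBASE = {}  # stand-in for the HBase state table in this sketch

class StateCache(object):
    """LRU state cache with batched flushes."""

    def __init__(self, flush_every=4000, max_size=3000000):
        self.flush_every = flush_every
        self.max_size = max_size
        self.cache = OrderedDict()
        self.dirty = {}
        self.ops_since_flush = 0

    def get(self, fingerprint):
        if fingerprint not in self.cache:
            # Cache miss -> read the state from storage (HBase in Frontera).
            self.cache[fingerprint] = HBASE.get(fingerprint, 'NOT CRAWLED')
        self.cache.move_to_end(fingerprint)            # mark as recently used
        return self.cache[fingerprint]

    def set(self, fingerprint, state):
        self.cache[fingerprint] = state
        self.cache.move_to_end(fingerprint)
        self.dirty[fingerprint] = state
        self.ops_since_flush += 1
        if self.ops_since_flush >= self.flush_every:   # every ~4K docs
            self.flush()
        while len(self.cache) > self.max_size:         # close to ~3M elements
            self.cache.popitem(last=False)             # LRU eviction

    def flush(self):
        HBASE.update(self.dirty)                       # one batched write
        self.dirty = {}
        self.ops_since_flush = 0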
Spider priority queue (slot)
• A cell is an array of: (fingerprint, crc32(hostname), URL, score).
• Dequeueing takes the top N.
• Prone to huge hosts.
• Scoring model: document count per host.
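A toy sketch of the cell layout and top-N dequeueing with a heap; the fingerprints and scores are made up for illustration.

import heapq
from zlib import crc32

queue = []

def push(fingerprint, hostname, url, score):
    # heapq is a min-heap, so negate the score to pop the highest-scored first.
    heapq.heappush(queue, (-score, fingerprint, crc32(hostname.encode('utf-8')), url))

def pop_top(n):
    return [heapq.heappop(queue)[1:] for _ in range(min(n, len(queue)))]

push('a' * 40, 'example.es', 'http://example.es/', 0.8)
push('b' * 40, 'example.org', 'http://example.org/', 0.5)
print(pop_top(10))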
7. The problem of big and small hosts (strikes back!)
• Discovered a few very huge hosts (>20M docs),
• all queue partitions were flooded with URLs from them.
• Solution: two MapReduce jobs: queue shuffling, and limiting all hosts to 100 docs max.
Spanish (.es) internet crawl results
• fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites,
• 68.7K domains found (~600K expected),
• 46.5M pages crawled overall,
• 1.5 months,
• 22 websites with more than 50M pages.
Where are the rest of the web servers?!
Bow-tie model (diagram)
A. Broder et al., Computer Networks 33 (2000) 309-320.

Y. Hirate, S. Kato, and H. Yamana, "Web Structure in 2005" (diagram).

12 years of dynamics (diagram)
"Graph Structure in the Web — Revisited", Meusel, Vigna, WWW 2014.
Hardware requirements (distributed backend + spiders)
• A single-threaded Scrapy spider gives 1200 pages/min from about 100 websites crawled in parallel.
• The spiders-to-strategy-workers ratio is 4:1 (without content).
• 1 GB of RAM for every SW (state cache, tunable).
• Example: 12 spiders ~ 14.4K pages/min, 3 SW and 3 DB workers, 18 cores in total.
Software requirements

Single process:      Python 2.7+, Scrapy 1.0.4+; sqlite or any other RDBMS.
Distributed spiders: the above, plus ZeroMQ or Kafka as the message bus.
Distributed backend: the above, plus HBase (or an RDBMS) for storage and a DNS service.
Main features
• Online operation: scheduling of new batches, updating of the DB state.
• Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included).
• Run modes: single process, distributed spiders, distributed backend.
• Scrapy ecosystem: good documentation, big community, ease of customization.
Main features (continued)
• Message bus abstraction (ZeroMQ and Kafka are available out of the box).
• Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module.
• Polite by design: each website is downloaded by at most one spider.
• Canonical URL resolution abstraction: each document has many URLs; which one to use?
• Python: workers, spiders.
References
GitHub: https://github.com/scrapinghub/frontera
RTD: http://frontera.readthedocs.org/
Google Groups: Frontera (https://goo.gl/ak9546)
Future plans
• Python 3 support,
• Docker images,
• Web UI,
• watchdog solution: tracking website content changes,
• PageRank or HITS strategy,
• own HTML and URL parsers,
• integration into Scrapinghub services,
• testing on larger volumes.
Run your business using Frontera
SCALABLE, OPEN, CUSTOMIZABLE
Made in Scrapinghub (authors of Scrapy)
Contribute!
• A web-scale crawler,
• historically the first such attempt in Python,
• a truly resource-intensive task: CPU, network, disks.
Mandatory sales slide
Crawl the web, at scale:
• cloud-based platform,
• smart proxy rotator.
Get data, hassle-free:
• off-the-shelf datasets,
• turn-key web scraping.
try.scrapinghub.com/PDB16
Questions? Thank you!
Alexander Sibiryakov, [email protected]