Scraping the Web with Scrapinghub For Startups

Scrapinghub Deck for Startups


Page 1: Scrapinghub Deck for Startups

Scraping the Web with Scrapinghub

For Startups

Page 2: Scrapinghub Deck for Startups

“Getting information off the Internet is like taking a drink from a fire hydrant.”

– Mitchell Kapor

Page 3: Scrapinghub Deck for Startups

Who Uses Web Scraping

Web scraping is used by everyone from individuals to multinational companies:

● Monitor your competitors’ prices by scraping product information

● Detect fraudulent reviews and sentiment changes by scraping product reviews

● Track online reputation by scraping social media profiles

● Create apps that use public data

● Track SEO by scraping search engine results

Page 4: Scrapinghub Deck for Startups

Web Scraping Traffic

Page 5: Scrapinghub Deck for Startups

Scrapinghub

Our products empower our users to scrape data quickly and effectively using open source technologies. We offer:

● A cloud-based platform to help you scale your crawlers

● A smart proxy rotator to crawl the web even faster

● Professional Services to handle web scraping and data mining for you

● Off-the-shelf datasets so you can get data hassle-free

Page 6: Scrapinghub Deck for Startups

Scrapy

Scrapy is a web scraping framework that gets the dirty work related to web crawling out of your way.

Benefits

● No platform lock-in: Open Source

● Very popular (13k+ ★)

● Battle tested

● Highly extensible

● Great documentation

Page 7: Scrapinghub Deck for Startups

Portia

Portia is a Visual Scraping tool that lets you get data without needing to write code.

Benefits

● No platform lock-in: Open Source

● JavaScript dynamic content generation

● Ideal for non-developers

● Extensible

● It’s as easy as annotating a page

Page 8: Scrapinghub Deck for Startups

Portia

Page 9: Scrapinghub Deck for Startups

Large Scale Infrastructure

Meet Scrapy Cloud, our PaaS for web crawlers:

● Scalable: Crawlers run on EC2 instances or dedicated servers

● Crawlera add-on

● Control your spiders: Command line, API or web UI

● Machine learning integration: BigML, MonkeyLearn, among others

● No lock-in: scrapyd to run Scrapy spiders on your own infrastructure
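As a rough illustration of the API route to running spiders on your own infrastructure, the sketch below builds a request for scrapyd's `schedule.json` endpoint using only the Python standard library. The host, port, project name and spider name are placeholders chosen for this example, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(project, spider, scrapyd_url="http://localhost:6800"):
    """Build (but do not send) a POST request asking scrapyd to run a spider."""
    payload = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(f"{scrapyd_url}/schedule.json", data=payload, method="POST")


# Hypothetical project and spider names, for illustration only.
request = build_schedule_request("myproject", "products")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit the job, assuming a scrapyd instance is listening at that address.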

Page 10: Scrapinghub Deck for Startups

Broad Crawls

Frontera allows us to build large scale web crawlers in Python:

● Scrapy support out of the box

● Distribute and scale custom web crawlers across servers

● Crawl Frontier Framework: large scale URL prioritization logic

● Aduana to prioritize URLs based on link analysis (PageRank, HITS)
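The idea of a crawl frontier with URL prioritization can be sketched in a few lines. This is a toy, single-process illustration built on `heapq`, not Frontera's or Aduana's API; the class name, method names and scores are all hypothetical, and in practice the scores might come from link analysis.

```python
import heapq
import itertools


class CrawlFrontier:
    """Toy crawl frontier: hands out the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, url, score):
        """Queue a URL once; higher scores are crawled earlier."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, next(self._counter), url))
        return True

    def next_url(self):
        """Return the best pending URL, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


frontier = CrawlFrontier()
frontier.add("https://example.com/", 0.9)
frontier.add("https://example.com/about", 0.2)
frontier.add("https://example.com/products", 0.7)
print(frontier.next_url())
```

A distributed frontier would replace the in-memory heap and seen-set with shared storage, which is the part a framework handles for you.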

Page 11: Scrapinghub Deck for Startups

Web Scraping Pitfalls

Page 12: Scrapinghub Deck for Startups

Bot Countermeasures

Websites are using increasingly sophisticated techniques to protect against bad bots.

Unfortunately, these same technologies often prevent harmless bots from scraping content.

Common countermeasures include:

● IP address-based bans

● JavaScript and session-based countermeasures

Page 13: Scrapinghub Deck for Startups

Blocked Crawlers

Servers identify and block crawlers that continuously fire many requests to a website.

Solution: Meet Crawlera, our smart proxy rotator for web crawlers.

● Routes requests through a pool of 50k+ IPs

● Detects, logs and handles bans

● Polite scraping: Automatically throttles requests to websites
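To make the idea of proxy rotation with ban handling concrete, here is a minimal sketch. It is not how Crawlera works internally; the `ProxyPool` class, the `looks_banned` heuristic (treating HTTP 403 and 429 as bans) and the proxy addresses are assumptions made for this example.

```python
from itertools import cycle


class ProxyPool:
    """Toy round-robin proxy pool that skips proxies marked as banned."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._banned = set()
        self._rotation = cycle(self._proxies)

    def mark_banned(self, proxy):
        """Stop handing out a proxy that a website has blocked."""
        self._banned.add(proxy)

    def next_proxy(self):
        """Return the next usable proxy in round-robin order."""
        # Try each proxy at most once per call so the loop always ends.
        for _ in range(len(self._proxies)):
            proxy = next(self._rotation)
            if proxy not in self._banned:
                return proxy
        raise RuntimeError("all proxies are banned")


def looks_banned(status_code):
    """Assumed heuristic: treat 403 and 429 responses as bans."""
    return status_code in (403, 429)


# Placeholder proxy addresses, for illustration only.
pool = ProxyPool(["10.0.0.1:8000", "10.0.0.2:8000", "10.0.0.3:8000"])
first = pool.next_proxy()
if looks_banned(403):  # pretend the first request came back as 403
    pool.mark_banned(first)
print(pool.next_proxy())
```

A real rotator would also throttle requests per website and let banned proxies recover after a cool-down period.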

Page 14: Scrapinghub Deck for Startups

JavaScript in Web Pages

Dynamic content generated by JavaScript is often used by websites to render the page (SPA) or to avoid being scraped by naive crawlers.

For simple instances, you can emulate the AJAX requests in Scrapy.
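Emulating an AJAX request usually means calling the JSON endpoint that the page's JavaScript calls, then parsing the response directly. The sketch below uses the Python standard library rather than Scrapy to stay self-contained; the endpoint URL, the `page` parameter and the JSON layout are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_ajax_request(endpoint, page):
    """Build the request a browser would send for one page of results."""
    url = f"{endpoint}?{urlencode({'page': page})}"
    headers = {
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # header many AJAX libraries send
    }
    return Request(url, headers=headers)


def parse_products(body):
    """Extract (name, price) pairs from the assumed JSON payload."""
    data = json.loads(body)
    return [(item["name"], item["price"]) for item in data.get("products", [])]


# Hypothetical endpoint and response body, for illustration only.
request = build_ajax_request("https://example.com/api/products", page=2)
sample_body = '{"products": [{"name": "ThinkPad X220", "price": 599.89}]}'
print(request.full_url)
print(parse_products(sample_body))
```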

For complex cases, you can use Splash:

● Works through an HTTP API

● Lua Scripts simulate user interaction

● No lock-in, it’s an open source project!
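Because Splash works through an HTTP API, rendering a page comes down to requesting a URL. The sketch below builds a `render.html` URL, assuming a Splash instance on `localhost:8050`; the target page and the wait time are placeholders.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, wait=0.5, splash_base="http://localhost:8050"):
    """Return the Splash render.html URL that renders page_url with JavaScript."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


# Placeholder target page, for illustration only.
print(splash_render_url("https://example.com/app"))
```

Fetching that URL with any HTTP client should return the HTML of the page after its JavaScript has run.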

Page 15: Scrapinghub Deck for Startups

Duplicate Content

The web is full of duplicate content.

Duplicate Content negatively impacts:

● Storage

● Re-crawl performance

● Quality of data

Efficient algorithms for Near Duplicate Detection, like SimHash, are applied to estimate similarity between web pages to avoid scraping duplicated content.
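A minimal SimHash sketch, for illustration: every word votes on each of 64 bits, and pages whose fingerprints differ in only a few bits are treated as near duplicates. The tokenizer, the equal word weights and the use of SHA-256 as the word hash are simplifying assumptions, not a production design.

```python
import hashlib
import re

BITS = 64  # fingerprint size


def simhash(text):
    """Compute a 64-bit SimHash fingerprint from the words of a text."""
    votes = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            votes[i] += 1 if (token_hash >> i) & 1 else -1
    # Bit i of the fingerprint is set when the votes for it are positive.
    return sum(1 << i for i, vote in enumerate(votes) if vote > 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


page_a = "Saint Fin Barre's Cathedral, begun in 1863 by William Burges"
page_b = "St. Finbarr's Cathedral, designed by William Burges"
print(hamming_distance(simhash(page_a), simhash(page_b)))
```

The distance below which two pages count as duplicates is a tuning decision that depends on the dataset.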

Page 16: Scrapinghub Deck for Startups

Near Duplicate Detection Uses

Compare prices of products scraped from different retailers by finding near duplicates in a dataset:

Title | Store | Price
ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB) | Acme Store | 599.89
Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5’’, HDD 320) | XYZ Electronics | 559.95

Merge similar items to avoid duplicate entries:

Name | Summary | Location
Saint Fin Barre’s Cathedral | Begun in 1863, the cathedral was the first major work of the Victorian architect William Burges… | 51.8944, -8.48064
St. Finbarr’s Cathedral Cork | Designed by William Burges and consecrated in 1870, ... | 51.894401550293, -8.48064041137695
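One simple way to spot rows like these as near duplicates is token overlap. The sketch below scores the two laptop titles with Jaccard similarity; the tokenizer and the idea of comparing against a cut-off are assumptions for illustration, and this is a simpler technique than SimHash.

```python
import re


def tokens(title):
    """Lowercase a title and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity of two titles: shared tokens / all tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


acme = "ThinkPad X220 Laptop Lenovo (i7 2.8GHz, 12.5 LED, 320 GB)"
xyz = "Lenovo Thinkpad Notebook Model X220 (i7 2.8, 12.5'', HDD 320)"
unrelated = "Saint Fin Barre's Cathedral"

print(jaccard(acme, xyz))        # 0.5
print(jaccard(acme, unrelated))  # 0.0
```

The two laptop listings share half of their tokens, while the unrelated title shares none, so a cut-off somewhere between those values would pair the listings for price comparison.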

Page 17: Scrapinghub Deck for Startups

Examples of Web Scraping Usage

Page 18: Scrapinghub Deck for Startups

Competitor Monitoring

E-commerce companies use web scraping to monitor the price fluctuations and the ratings of competitors:

● Scrape online retailers

● Structure the data in a search engine or DB

● Create an interface to search for products

● Sentiment analysis for product rankings

Page 19: Scrapinghub Deck for Startups

Monitor Resellers

We help electronics companies monitor the activities of their resellers:

● Tracking and watching out for stolen goods

● Pricing agreement violations

● Customer support responses on complaints

● Product line quality checks

Page 20: Scrapinghub Deck for Startups

Lead Generation

Mine scraped data to identify who to target in a company for your outbound sales campaigns:

● Locate possible leads in your target market

● Identify the right contacts within each one

● Augment the information you already have on them

● Use data science to guess their email address

Page 21: Scrapinghub Deck for Startups

Human Resources

Reduce the time spent on HR tasks by creating a select pool of applicants:

● Mine scraped data to locate candidates

● Match requisite skills and background

● Spot and rescue employees that are shopping for a new job

Page 22: Scrapinghub Deck for Startups

Thank you!

scrapinghub.com