How to scrape content from the web for a location-based mobile app

Scraping content from the web for a location-based mobile app

Nguyen Hong Diep, founder, magik.vn

Summary

1. Web Scraping
– Definitions
– Value added
– Analysis of a sample case

2. Scrapy Framework
– Overview
– Architecture
– A simple Scrapy program

3. Build an automatic scraping system for location-based apps
– Extract LatLng from address
– Extract phone number
– Real-time update, running continuously 24/7
– Prevent duplicate data
– Deploy without a dedicated server or VPS

Web crawler

An Internet bot that systematically browses the World Wide Web, typically for web indexing.

Sources: wikipedia.org

Scrape

Crawl websites and extract structured data from pages.

Sources: wikipedia.org

Added Value?

giamua.com – “groupon”

baomoi.com

Added Value?

same user experience, but

more content than

oizoioi.vn – price comparison for electronics

Added Value?

Make new knowledge from many pieces of information

Wisdom

Knowledge

Information

Data

DIKW Hierarchy

Nha Tro Tot

Added Value?

The smartphone revolution: a new platform needs a new user experience

Source: www.widexconnect.ca

And more

Sources: Laban.vn

Analysis of a sample case

Need

(1) Collect [homes for sale] records from the web
(2) from many websites in Vietnam
(3) as soon as they are posted
(4) continuously, 24/7

Step 1: List the sources

Step 2: Build a general database

Step 3: Ctrl+C, Ctrl+V

• For every site:
– Find the link to the page that lists the latest records.
– For every record:
  • Check whether the record is new.
  – If it is, copy and paste its fields into a new record in my DB.

Step 3: Let’s Scrapy

Scrapy Framework

• Overview
• Architecture
• XPath
• Make a simple Scrapy program

• Scrapy is a fast high-level screen scraping and web crawling framework.

• Open-source, 100% Python => Portable

Scrapy’s GitHub info

• From 2008

• Stats

Architecture

Source: http://doc.scrapy.org/en/0.12/topics/architecture.html

XPath

XPath is a language for navigating through elements and attributes in an XML document.
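As a small illustration of navigating elements and attributes, here is a sketch that uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath (Scrapy's own selectors accept full XPath 1.0 expressions). The sample document and its tag names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag and attribute names are illustrative only.
doc = ET.fromstring("""
<listings>
  <home id="1"><address>12 Le Loi, Hue</address><price currency="VND">950000000</price></home>
  <home id="2"><address>3 Tran Phu, Da Nang</address><price currency="VND">1200000000</price></home>
</listings>
""")

# Path expression: every <address> that is a child of a <home>.
addresses = [element.text for element in doc.findall("./home/address")]

# Attribute predicate: the <home> whose id attribute equals "2".
second_home = doc.find("./home[@id='2']")

# Attributes are read from the matched element.
currency = second_home.find("price").get("currency")
```

In a Scrapy callback the equivalent queries would go through `response.xpath(...)`, where attribute values can be selected directly with expressions such as `//price/@currency`.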

Simple Scrapy Program

• (1) Pick a website – http://www.mininova.org/today

• (2) Define the data you want to scrape

Simple Scrapy Program (cont.)

• (3) Write a Spider to extract the data

Simple Scrapy Program (cont.)

• (4) Run the spider to extract the data

• (5) Review scraped data

Build an automatic scraping system for location-based apps

• Extract LatLng from address
• Extract phone number
• Real-time update, running continuously 24/7
• Prevent duplicate data
• Deploy without a dedicated server or VPS

Extract LatLng from address

• Use the Google Geocoding API
• https://maps.googleapis.com/maps/api/geocode/json?address=xxx&sensor=true_or_false&key=API_KEY
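A minimal sketch of how the request URL might be built and how a latitude/longitude pair could be read from the JSON response, using only the standard library. The helper names are hypothetical, and the `sensor` parameter from the URL above is left out on the assumption that current versions of the API no longer require it.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(address, api_key):
    """Build the request URL; urlencode escapes spaces, commas and non-ASCII text."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def extract_latlng(payload):
    """Return (lat, lng) of the first result, or None when nothing was found."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]
```

The URL can be fetched with `urllib.request.urlopen` and decoded with `json.load` before being passed to `extract_latlng`. Keeping the parsing separate from the network call makes it easy to test against a saved response.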

Extract Phone Number

• Use the Python port of libphonenumber (the phonenumbers package).

• Sample

“Real time” update and continuous 24/7.

• Task Scheduler (Windows)

• Cron jobs (Linux)

Prevent duplicate data

• Make a middleware that ignores items that already exist: IgnoreExistsMiddleWare
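The slides do not show the middleware itself, so this is only a sketch of what such a spider middleware could look like. It assumes each item carries a `url` field that identifies the record, and it uses a small SQLite table as a stand-in for the real database lookup. The class name spelling and the table layout are illustrative choices.

```python
import sqlite3
from collections.abc import Mapping


class IgnoreExistsMiddleware:
    """Spider middleware sketch: drop scraped items whose URL is already stored."""

    def __init__(self, db_path=":memory:"):
        # Stand-in for the real database; one row per record URL already seen.
        self.db = sqlite3.connect(db_path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Requests (anything that is not a dict-like item) pass through untouched.
            if not isinstance(obj, Mapping):
                yield obj
                continue
            cursor = self.db.execute(
                "INSERT OR IGNORE INTO seen (url) VALUES (?)", (obj["url"],)
            )
            self.db.commit()
            # rowcount is 1 when the row was inserted, 0 when the URL already existed.
            if cursor.rowcount == 1:
                yield obj
```

A class like this would be enabled through the `SPIDER_MIDDLEWARES` setting of the Scrapy project. An item pipeline that raises `DropItem` is a common alternative for the same job.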

Without a dedicated server or VPS

• Problem: my server side runs on cPanel web hosting, so Scrapy can’t be deployed there.

• Solution:
– Make a web service for syncing new record data:
  • /get_head_revision
  • /sync
– Scrapy runs on my PC, then syncs with the server.
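The slides name the two endpoints but not their formats, so the following client-side sketch rests on assumptions: that `/get_head_revision` returns JSON like `{"revision": N}`, that `/sync` accepts a JSON array of records, and that each local record carries an integer `revision`. All function names are hypothetical.

```python
import json
from urllib import request


def get_head_revision(base_url):
    """Ask the server for the newest revision it already holds (assumed JSON reply)."""
    with request.urlopen(f"{base_url}/get_head_revision") as resp:
        return json.load(resp)["revision"]


def post_sync(base_url, records):
    """Send new records to the server as a JSON array (assumed request format)."""
    req = request.Request(
        f"{base_url}/sync",
        data=json.dumps(records).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.status


def sync(local_records, fetch_head, push):
    """Push only the records the server has not seen yet; return how many were sent."""
    head = fetch_head()
    pending = [record for record in local_records if record["revision"] > head]
    if pending:
        push(pending)
    return len(pending)
```

On the PC that runs Scrapy, a call such as `sync(records, lambda: get_head_revision(base), lambda recs: post_sync(base, recs))` could be scheduled right after each crawl. Passing the two network calls in as functions keeps `sync` testable without a server.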