Web Scraping for Code-ophobes

DESCRIPTION

Learn to scrape data in Google Docs using ImportFeed, ImportHTML, and ImportXML. Annie Cushing, Senior SEO at SEER Interactive (@AnnieCushing on Twitter), isn't a developer, so she breaks the process down into easy-to-understand steps and provides a link to a Google Doc where you can follow along and learn!

Web Scraping for Code-ophobes

@AnnieCushing

What I’m not

What I am

THE WIND BENEATH MY WEB-SCRAPING WINGS

@djchrisle

@ethanlyon

3 WAYS TO SCRAPE IN GOOGLE DOCS

• ImportFeed
• ImportHTML
• ImportXML

=ImportFeed

ImportFeed

=ImportFeed(URL, query, headers, numItems)

http://bit.ly/importfeed

=ImportFeed("http://feeds.searchengineland.com/searchengineland")

OR

=ImportFeed(C4) (my preference)
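For anyone curious what a feed import boils down to, here is a rough Python sketch using only the standard library. It is not how Google implements ImportFeed; the inline sample feed, the `import_feed` name, and its `headers` and `num_items` parameters are assumptions made up for illustration, loosely mirroring the headers and numItems arguments above.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline RSS sample so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
    <item><title>Third post</title><link>http://example.com/3</link></item>
  </channel>
</rss>"""


def import_feed(rss_text, headers=False, num_items=None):
    """Return feed items as rows of [title, link] (illustrative helper)."""
    root = ET.fromstring(rss_text)
    # Optional header row, similar in spirit to the headers argument.
    rows = [["Title", "URL"]] if headers else []
    items = root.findall("./channel/item")
    if num_items is not None:
        items = items[:num_items]  # keep only the first num_items entries
    for item in items:
        rows.append([item.findtext("title"), item.findtext("link")])
    return rows


rows = import_feed(SAMPLE_RSS, headers=True, num_items=2)
for row in rows:
    print(row)
```

Each row here corresponds to one spreadsheet row that the formula would fill in.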

http://slidesha.re/stalker-wil

STALKING FOR LINKS

BY @WILREYNOLDS

=ImportHTML

ImportHTML

TWO OPTIONS

• Table
• List

=ImportHtml(URL, query, index)

URL: “www.domain.com/whatever” OR cell reference
query: “table” or “list” OR cell reference
index: If multiple lists or tables, which one (3 = 3rd table)
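To make the index argument concrete, here is a minimal Python sketch that pulls the Nth table out of a page using only the standard library. The sample HTML, the `TableGrabber` class, and the `import_html_table` function are hypothetical names for illustration, not part of Google Docs. The sketch covers only the "table" query and ignores nested tables.

```python
from html.parser import HTMLParser

# Hypothetical page with two tables; we want the second one.
SAMPLE_HTML = """
<html><body>
<table><tr><td>skip</td></tr></table>
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
</body></html>
"""


class TableGrabber(HTMLParser):
    """Collect the rows of the table at a 1-based index."""

    def __init__(self, index):
        super().__init__()
        self.index = index
        self.seen = 0         # number of <table> tags opened so far
        self.active = False   # True while inside the wanted table
        self.in_cell = False  # True while inside a <td> or <th>
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.seen += 1
            self.active = self.seen == self.index
        elif self.active and tag == "tr":
            self.rows.append([])
        elif self.active and tag in ("td", "th"):
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.active = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.active and self.in_cell:
            self.rows[-1][-1] += data.strip()


def import_html_table(html_text, index):
    parser = TableGrabber(index)
    parser.feed(html_text)
    return parser.rows


table = import_html_table(SAMPLE_HTML, 2)  # 2 = 2nd table on the page
for row in table:
    print(row)
```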

Table Example of ImportHTML

List Example of ImportHTML

=ImportXML

ImportXML

http://bit.ly/xpath-tutorial

=ImportXML(URL, query)

Simple Explanation of XPath

XPath uses path expressions to select nodes or node-sets in an XML document.

7 Types of Nodes

Simple Explanation of XPath

ELEMENTS

<div>, <p>, <blockquote>, <price>, <ul>

PARENT-CHILD NODES

• As you drill down, you separate nodes with /
• Ex: /html/div/ul/li/a

ATTRIBUTES

class, id, size

Look for the = sign

Simple Explanation of XPath

KEY CHARACTERS

/: Starts at the root
//: Starts wherever
@: Selects attributes
[]: Answers the question “Which one?”
[*]: All
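The key characters can be tried out in a few lines of Python. This is a sketch under assumptions: the sample page is invented, and it uses the standard library's ElementTree, which supports only a subset of XPath and wants paths to start with "." (the root element), so the full XPath each line stands for is given in the comments. The `*` wildcard is shown on its own as a loose stand-in for "All".

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html>
  <body>
    <div class="main">
      <ul>
        <li><a href="/one">One</a></li>
        <li><a href="/two">Two</a></li>
      </ul>
    </div>
    <div class="footer">
      <a href="/about">About</a>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# "/" drills down one level at a time. XPath: /html/body/div/ul/li/a
drilled = [a.text for a in root.findall("./body/div/ul/li/a")]

# "//" starts wherever: any <a> at any depth. XPath: //a
anywhere = [a.text for a in root.findall(".//a")]

# "[]" answers "which one?" and "@" names an attribute.
# XPath: //div[@class='footer']/a
footer = [a.text for a in root.findall(".//div[@class='footer']/a")]

# "*" matches every child element. XPath: /html/body/*
children = [el.tag for el in root.findall("./body/*")]

print(drilled, anywhere, footer, children)
```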

Let’s Start Simple

Magic!

Grab the URLs

Because it’s an @tribute!
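Here is a small sketch of the same idea outside a spreadsheet: the link text is the content of the `<a>` element, while the URL lives in its href attribute. The sample page is made up, and the `//a/@href` query in the comment is an assumed example of what an ImportXML query for URLs could look like. ElementTree cannot return attributes directly from a path, so the sketch selects the elements and reads the attribute with `.get()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: two real links and one <a> without an href.
PAGE = """<html><body>
  <p><a href="http://example.com/a">Post A</a></p>
  <p><a href="http://example.com/b">Post B</a></p>
  <p><a>No link here</a></p>
</body></html>"""

root = ET.fromstring(PAGE)

# Element content (the anchor text). XPath: //a
anchor_text = [a.text for a in root.findall(".//a")]

# Attribute value (the URL). Assumed XPath equivalent: //a/@href
# "[@href]" keeps only the <a> elements that actually have an href.
urls = [a.get("href") for a in root.findall(".//a[@href]")]

print(anchor_text)
print(urls)
```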

Let’s dial it up

http://bit.ly/distilled-xml

What if your child nodes look like this?

Let’s dial it up

Could do it this way

At your own risk

Better plan

The world according to Annie

// = blah blah yada yada

Can even be in the middle of the XPath

//div[@class='main']//blockquote[2]
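A quick way to see the mid-path `//` at work is to run the expression against a small page. The sketch below assumes an invented sample page and uses Python's standard-library ElementTree, which needs a leading "." on the path. The second `//` skips over the `<section>` wrapper without naming it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: the blockquotes sit inside an extra wrapper.
PAGE = """<html><body>
  <div class="sidebar"><blockquote>Sidebar quote</blockquote></div>
  <div class="main">
    <section>
      <blockquote>First quote</blockquote>
      <p>Some text</p>
      <blockquote>Second quote</blockquote>
    </section>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# XPath: //div[@class='main']//blockquote[2]
# Find the "main" div anywhere, then the 2nd blockquote anywhere below it.
matches = root.findall(".//div[@class='main']//blockquote[2]")
quotes = [bq.text for bq in matches]

print(quotes)
```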

Other ways to tell “which one” in XPath

STARTS-WITH

Other ways to tell “which one” in XPath

CONTAINS

Other ways to tell “which one” in XPath

Other ways to tell “which one” in XPath

INDEX VALUE

Other ways to tell “which one” in XPath

LAST()
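The four "which one" options can be compared side by side in a short Python sketch. The sample list and class names are invented, and the XPath in each comment is an assumed example rather than the exact query from the slides. The standard library's ElementTree handles index values and last() directly but has no starts-with() or contains(), so those two are emulated here with ordinary Python string checks.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample list with a mix of class names.
PAGE = """<html><body>
  <ul>
    <li class="post-featured">Alpha</li>
    <li class="post-regular">Beta</li>
    <li class="ad-banner">Gamma</li>
    <li class="widget post-archive">Delta</li>
  </ul>
</body></html>"""

root = ET.fromstring(PAGE)
items = root.findall(".//li")

# STARTS-WITH. XPath: //li[starts-with(@class, 'post')]
# Emulated with str.startswith because ElementTree lacks starts-with().
starts = [li.text for li in items if li.get("class", "").startswith("post")]

# CONTAINS. XPath: //li[contains(@class, 'post')]
# Emulated with the "in" operator because ElementTree lacks contains().
contains = [li.text for li in items if "post" in li.get("class", "")]

# INDEX VALUE. XPath: //li[2] (counting starts at 1)
second = [li.text for li in root.findall(".//li[2]")]

# LAST(). XPath: //li[last()]
last = [li.text for li in root.findall(".//li[last()]")]

print(starts, contains, second, last)
```

Note how "Delta" matches contains but not starts-with, since its class value begins with "widget".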

Become a scraping FOOL

@NicoMiceli

• Pull queries from Topsy
• Pull product feeds
• Pull specific elements from a sitemap
• Scrape Twitter followers
• Pull GA metrics
• Scrape HTML tables (e.g., list of countries from Wikipedia)
• Scrape lists (e.g., scraped lists of consumer review sites to create a custom search engine, top sports blogs, etc.)
• Scrape rankings
• Scrape GA codes / Adsense IDs / IPs / IP Country Codes
• Find de-indexed sites
• Scrape directories
• Scrape Yahoo / Google for relevant pages from directory listings
• Scraping title / h1 / meta descriptions
• Scrape page URLs to find if someone is linking to you
• Scrape Google to find snippets of text on a list of domains (for link networks)
• Scrape Quora

SEE IMPORT FUNCTIONS IN THEIR NATURAL HABITAT!

http://bit.ly/annies-gdoc

AWWW YEAHHH!

TO PLAY …

1. Log in
2. File > Make a copy…
3. Poke around and test

RESOURCES

XPath Tutorial: http://bit.ly/xpath-tutorial
Annie’s Gdoc: http://bit.ly/annies-gdoc
Distilled Guide: http://bit.ly/distilled-guide
SEER Cookbook: http://bit.ly/seer-cookbook
