Almost Scraping: Web Scraping for Non-Programmers
Michelle Minkoff, PBSNews.org
Matt Wynn, Omaha World-Herald
What is Web scraping?
The *all-knowing* Wikipedia says:
“Web scraping (also called Web harvesting or Web data extraction) is a computer software technique of extracting information from websites. …Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Uses of Web scraping include online price comparison, weather data monitoring, website change detection, Web research, Web content mashup and Web data integration.”
Why do I want to Web scrape?
Journalists like to find stories
Editors like stories that are exclusive
Downloading a dataset is like going to a press conference: anyone can grab and use it.
Web scraping is like an enterprise story: less likely to be picked up by everyone else.
Puts more control back into your hands
What kind of data can I get?
Laws (summary of same-sex marriage laws for each state, PDFs)
Photos (pictures of every player on a team you’re highlighting, or of all mayoral candidates)
Recipe ingredients (NYT story about peanut butter)
Health care (see ProPublica’s Dollars for Docs project)
Links, images, dates, names, categories, tags, anything with some sort of repeatable structure
Yahoo Pipes
Access and manipulate RSS feeds, which are often a flurry of information
Sort, filter, combine your information
Format that info to fit your needs (date formatter)
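Pipes does all of this point-and-click, but the same sort/filter/format steps can be sketched in a few lines of Python with only the standard library. The feed below is a made-up stand-in for a real RSS feed you would fetch by URL:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# A tiny hand-made RSS fragment standing in for a real feed.
FEED = """<rss><channel>
  <item><title>Budget vote delayed</title>
    <pubDate>Tue, 01 Mar 2011 09:00:00 -0600</pubDate></item>
  <item><title>Mayor announces budget</title>
    <pubDate>Mon, 28 Feb 2011 17:30:00 -0600</pubDate></item>
  <item><title>Weather: sunny</title>
    <pubDate>Tue, 01 Mar 2011 06:00:00 -0600</pubDate></item>
</channel></rss>"""

root = ET.fromstring(FEED)
items = []
for item in root.iter("item"):
    title = item.findtext("title")
    when = parsedate_to_datetime(item.findtext("pubDate"))
    items.append((when, title))

# Filter: keep only budget stories; sort newest first; reformat the date.
budget = sorted((i for i in items if "budget" in i[1].lower()), reverse=True)
for when, title in budget:
    print(when.strftime("%Y-%m-%d"), title)
```

The filter keyword, field names, and dates here are illustrative; the point is that sort, filter, and date-reformat are each one line once the feed is parsed.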
Yahoo Pipes
Pair with Versionista, which can create an RSS feed of changes to a Web site to keep tabs on what’s changing. This was done to great effect by ProPublica’s team in late 2009, esp. by Scott Klein and then-intern Brian Boyer, now at Chicago Tribune
Needlebase
For sites that follow a repetitive formula spanning multiple pages, like an index page & detail page, maybe with a search results page in the middle
Like a good employee, train it once, and then let it churn.
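Needlebase handles this by point-and-click training, but the index-page/detail-page pattern it automates can be sketched in Python. The pages below are inline stand-ins for real URLs (a real scraper would fetch them over HTTP):

```python
import re

# Stand-ins for fetched pages: an index page linking to two detail pages.
PAGES = {
    "/players": '<ul><li><a href="/players/1">Smith</a></li>'
                '<li><a href="/players/2">Jones</a></li></ul>',
    "/players/1": "<h1>Smith</h1><p>Position: Guard</p>",
    "/players/2": "<h1>Jones</h1><p>Position: Center</p>",
}

def fetch(url):
    # In a real scraper this would be an HTTP request (e.g. urllib).
    return PAGES[url]

# Step 1: pull every detail-page link off the index page.
index_html = fetch("/players")
links = re.findall(r'href="(/players/\d+)"', index_html)

# Step 2: visit each detail page and extract the repeated fields.
records = []
for link in links:
    detail = fetch(link)
    name = re.search(r"<h1>(.*?)</h1>", detail).group(1)
    position = re.search(r"Position: (\w+)", detail).group(1)
    records.append({"name": name, "position": position})

print(records)
```

The URLs and field names are made up for the sketch; the two-step shape (harvest links from the index, then loop over detail pages) is what "train it once, let it churn" boils down to.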
Needlebase
Query, select and filter your data in the Web app, then export in the format of your choice.
Can check your data and stay up to date on your data set
Will go more in depth on Needle in Sat.’s hands-on lab at 10 a.m.
InfoExtractor: http://www.infoextractor.org
iRobotSoft: http://irobotsoft.com
iMacros: https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/
iMacros
Record repetitive tasks that you do every day, and keep the results as a data set
Think of it like a bookmark, but if you could include logging in, or entering a search term, as part of that bookmark
Useful for stats you check every day: scores for your local sports team, stocks if you’re a biz reporter, etc.
More complex function allows you to extract multiple data points on a page, like from an HTML table.
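iMacros does this through its recorder, but pulling the cells out of an HTML table is the same idea you can sketch with Python’s standard-library parser. The table below is a made-up example:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, one list per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

HTML = """<table>
<tr><th>Team</th><th>Wins</th></tr>
<tr><td>Omaha</td><td>12</td></tr>
<tr><td>Lincoln</td><td>9</td></tr>
</table>"""

p = TableParser()
p.feed(HTML)
print(p.rows)  # one list of cell texts per row
```

Once the rows are in a plain list of lists, they drop straight into a CSV file or spreadsheet.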
OutWit Hub: http://www.outwit.com/products/hub
OutWit Hub
Versatile Firefox extension
Can use it for certain defaults (links, images)
OutWit Hub
Dig through the HTML hierarchy tree
Structural elements (<h3>)
Stylistic elements (<strong>)
Download list of attached files or files themselves
More options if you buy Pro version
Will discuss in depth and use in the hands-on lab on Saturday at 10 a.m.
Python
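Python is the usual next step once the point-and-click tools run out. A minimal sketch of the whole loop, parse a page and write the results out as CSV, using only the standard library. The HTML string stands in for a page you would download (e.g. with urllib), and the class names and fields are made up:

```python
import csv
import io
import re

# In real use you'd fetch this with urllib.request.urlopen(url).read().
HTML = """
<div class="doc"><span class="state">Iowa</span>
  <a href="/laws/ia.pdf">Summary (PDF)</a></div>
<div class="doc"><span class="state">Nebraska</span>
  <a href="/laws/ne.pdf">Summary (PDF)</a></div>
"""

# Grab every (state, PDF link) pair that follows the page's repeated structure.
rows = re.findall(r'class="state">(.*?)</span>\s*<a href="(.*?)"', HTML)

# Write the scraped pairs out as CSV, ready for a spreadsheet.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["state", "pdf_url"])
writer.writerows(rows)
print(buf.getvalue())
```

Regular expressions are the quickest sketch; for messier real-world pages, a proper parser (html.parser, or a third-party library) is sturdier.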
Wrap-Up
Non-programming scrapers can’t do everything, but they have the power to get you started. Some say “program or be programmed”; this is a compromise.
Legal permissions still apply, so don’t use scraped info you don’t have the right to use.
Something to consider: How does this apply to what you do every day, and how could scraping contribute to your job?
“The businesses that win will be those that understand how to build value from data from wherever it comes. Information isn’t power. The right information is.” – media consultant Neil Perkin, writing in Marketing Week