Warcbase: Building a Scalable Web Archiving Platform on Hadoop
and HBase Jimmy Lin University of Maryland @lintool IIPC 2015
General Assembly Tuesday, April 28, 2015
Slide 2
Source: Wikipedia (All Souls College, Oxford) From the Ivory
Tower
Slide 3
Source: Wikipedia (Factory) to building sh*t that works
Slide 4
I worked on analytics infrastructure to support data science, and data products to surface relevant content to users
Slide 5
Gupta et al. WTF: The Who to Follow Service at Twitter. WWW 2013.
Lin and Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012.
Busch et al. Earlybird: Real-Time Search at Twitter. ICDE 2012.
Mishne et al. Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture. SIGMOD 2013.
Leibert et al. Automatic Management of Partitioned, Replicated Search Services. SoCC 2011.
I worked on analytics infrastructure to support data science, and data products to surface relevant content to users
Circa 2010: ~150 people total; ~60 Hadoop nodes; ~6 people use the analytics stack daily
Circa 2012: ~1400 people total; 10s of Ks of Hadoop nodes across multiple DCs; 10s of PBs total Hadoop DW capacity; ~100 TB ingest daily; dozens of teams use Hadoop daily; 10s of Ks of Hadoop jobs daily
Slide 8
Source: Wikipedia (All Souls College, Oxford) And back!
Slide 9
Web archives are an important part of our cultural heritage... but they're underused
Source: http://images3.nick.com/nick-assets/shows/images/house-of-anubis/flipbooks/hidden-room/hidden-room-04.jpg
Slide 10
Why? Restrictive use regimes? But I don't think that's all
Slide 11
Source: http://www.flickr.com/photos/cheryne/8417457803/
Users can't do much with current web archives
Hard to develop tools for non-existent needs
Slide 12
We need deep collaborations between:
Users (e.g., archivists, journalists, historians, digital humanists, etc.)
Tool builders (me and my colleagues)
Goal: tools to support exploration and discovery in web archives
Beyond browsing. Beyond searching.
Source: http://waterloocyclingclub.ca/wp-content/uploads/2013/05/Help-Wanted-Sign.jpg
Slide 13
Source: Google What would a web archiving platform built on
modern big data infrastructure look like?
Slide 14
Desiderata, and how existing tools meet them:
Scalable storage of archived data: Petabox by the Internet Archive; NAS, SAN, etc. by others
Efficient random access: OpenWayback, a monolithic Tomcat application
Scalable processing and analytics: some work by the Internet Archive, Common Crawl, and others
Scalable storage and access of derived data: ad hoc storage in flat text WAT files
Slide 15
Desiderata, mapped onto the Hadoop stack:
Scalable storage of archived data: HDFS
Efficient random access: HBase
Scalable processing and analytics: Hadoop
Scalable storage and access of derived data: HBase
Existing tools aren't adequate!
Slide 16
HDFS architecture (figure): an application's HDFS client asks the namenode to resolve (file name) to block ids, and (block id) to block locations; the namenode maintains the file namespace (e.g., /foo/bar maps to block 3df2) and sends instructions to datanodes, which report their state back. The client then requests (block id, byte range) from a datanode, which returns block data from its local Linux file system.
Stores data blocks across commodity servers
Scales to 100s of PBs of data
Open source implementation of the Google File System
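The read path above can be mocked in a few lines. This is a toy model with hypothetical class names, not the real HDFS client API: the namenode resolves file names to block ids and locations, and datanodes serve byte ranges of block data.

```python
# Toy model of the HDFS read path (illustrative names, not real APIs):
# the namenode holds metadata only; datanodes hold the actual bytes.

class NameNode:
    def __init__(self):
        self.namespace = {}   # file name -> ordered list of block ids
        self.locations = {}   # block id -> list of datanode addresses

    def add_file(self, name, blocks):
        self.namespace[name] = [bid for bid, _ in blocks]
        for bid, nodes in blocks:
            self.locations[bid] = nodes

    def lookup(self, name):
        # (file name) -> [(block id, block locations), ...]
        return [(bid, self.locations[bid]) for bid in self.namespace[name]]

class DataNode:
    def __init__(self):
        self.blocks = {}      # block id -> bytes on the local file system

    def read(self, block_id, start, end):
        # (block id, byte range) -> block data
        return self.blocks[block_id][start:end]

# A client resolves metadata at the namenode, then reads from a datanode:
nn = NameNode()
dn = DataNode()
dn.blocks["3df2"] = b"hello, archived web!"
nn.add_file("/foo/bar", [("3df2", ["datanode-1"])])
(bid, where), = nn.lookup("/foo/bar")
data = dn.read(bid, 0, 5)   # -> b"hello"
```

The key design point, preserved here, is that bulk data never flows through the namenode: it answers metadata queries only, which is what lets HDFS scale to 100s of PBs.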
Slide 17
MapReduce (figure): mappers emit intermediate (key, value) pairs; the shuffle-and-sort phase aggregates values by key; reducers process each key's values to produce final output.
Suitable for batch processing on HDFS data
Open source implementation of Google's framework
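The map, shuffle-and-sort, and reduce phases in the figure can be sketched in pure Python (no Hadoop required; all names here are illustrative), using the classic word count as the workload:

```python
# Minimal in-memory sketch of the map -> shuffle-and-sort -> reduce flow.
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # map: emit intermediate (key, value) pairs -- (word, 1) for word count
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # reduce: aggregate all values that share a key
    yield (key, sum(values))

def mapreduce(records, mapper, reducer):
    # shuffle and sort: group intermediate pairs by key
    intermediate = sorted(
        (kv for r in records for kv in mapper(r)), key=itemgetter(0))
    out = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        out.extend(reducer(key, (v for _, v in group)))
    return out

print(mapreduce(["a b a", "b c"], mapper, reducer))
# -> [('a', 2), ('b', 2), ('c', 1)]
```

In real Hadoop the shuffle is distributed across machines, but the programmer's contract is exactly this: write a mapper and a reducer, and the framework handles grouping by key.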
Slide 18
~ Google's Bigtable: a collection of tables, each of which represents a sparse, distributed, persistent multidimensional sorted map
Source: Bonaldo Big Table by Alain Gilles
Slide 19
Image Source: Chang et al., OSDI 2006
In a nutshell: (row key, column family, column qualifier, timestamp) maps to value
Row keys are lexicographically sorted
Column families define locality groups
Client-side operations: gets, puts, deletes, range scans
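The quadruple-to-value model above can be sketched as a plain sorted map. This is an illustrative toy, not the HBase client API; it shows versioned cells, newest-version gets, and range scans over lexicographically sorted row keys:

```python
# Toy model of the Bigtable/HBase data model:
# (row key, column family, column qualifier, timestamp) -> value
import bisect

class Table:
    def __init__(self):
        self.cells = {}   # (row, family, qualifier) -> {timestamp: value}
        self.rows = []    # row keys kept in lexicographic order

    def put(self, row, family, qualifier, timestamp, value):
        self.cells.setdefault((row, family, qualifier), {})[timestamp] = value
        i = bisect.bisect_left(self.rows, row)
        if i == len(self.rows) or self.rows[i] != row:
            self.rows.insert(i, row)

    def get(self, row, family, qualifier):
        # return the newest version (HBase's default behavior)
        versions = self.cells.get((row, family, qualifier), {})
        return versions[max(versions)] if versions else None

    def scan(self, start, stop):
        # range scan over sorted row keys: [start, stop)
        lo = bisect.bisect_left(self.rows, start)
        hi = bisect.bisect_left(self.rows, stop)
        return self.rows[lo:hi]

t = Table()
t.put("org.example/", "c", "text/html", 20030101, b"<html>v1</html>")
t.put("org.example/", "c", "text/html", 20040101, b"<html>v2</html>")
latest = t.get("org.example/", "c", "text/html")   # newest version wins
hits = t.scan("org.example", "org.examplez")       # rows with this prefix
```

Sorted row keys are what make prefix range scans cheap, and they are the reason the row-key design (see the Warcbase data model later in the deck) matters so much.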
Slide 20
Warcbase: an open-source platform for managing web archives, built on Hadoop and HBase
http://warcbase.org/
Source: Wikipedia (Archive)
Slide 21
WARC data Ingestion Applications and Services Processing &
Analytics
Slide 22
Scalability? "We got 99 problems but scalability ain't one" (after Jay-Z)
Scalability of Warcbase is limited only by Hadoop/HBase
Applications are lightweight clients
Slide 23
WARC data Ingestion Applications and Services Processing &
Analytics text analysis, link analysis,
Slide 24
Sample dataset: crawl of the 108th U.S. Congress
Monthly snapshots, January 2003 to January 2005
1.15 TB gzipped ARC files
29 million captures of 7.8 million unique URLs; 23.8 million captures are HTML files
Hadoop/HBase cluster: 16 nodes, dual quad-core processors, three 2 TB disks each
Slide 25
OpenWayback + Warcbase Integration
Implementation: OpenWayback frontend for rendering
Offloads storage management to HBase
Transparently scales out with HBase
Slide 26
Slide 27
Slide 28
Topic Model Explorer
Implementation: LDA on each temporal slice; adaptation of the Termite visualization
Slide 29
Slide 30
Webgraph Explorer
Implementation: link extraction with Hadoop, site-level aggregation; computation of standard graph statistics; d3 interactive visualization
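The deck builds the webgraph with Hadoop (and later Pig); as a hedged single-machine illustration of the same idea, the sketch below (regex-based link extraction and hypothetical helper names, not Warcbase's actual code) extracts page-to-page links and aggregates them up to the site (host) level:

```python
# Sketch: extract hyperlinks from pages, aggregate edges by site (host).
import re
from collections import Counter
from urllib.parse import urljoin, urlparse

HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def extract_links(page_url, html):
    # yield (source host, target host) for every hyperlink on the page
    src = urlparse(page_url).netloc
    for href in HREF.findall(html):
        dst = urlparse(urljoin(page_url, href)).netloc
        if dst:
            yield (src, dst)

def site_level_webgraph(pages):
    # pages: iterable of (url, html); returns Counter of site-level edges
    edges = Counter()
    for url, html in pages:
        edges.update(extract_links(url, html))
    return edges

pages = [("http://a.gov/x",
          '<a href="http://b.gov/y">b</a><a href="/z">self</a>')]
graph = site_level_webgraph(pages)   # both edges counted once
```

At scale, the same map (extract links) and reduce (count edges per site pair) steps run as a Hadoop job; the edge counts then feed the graph statistics and the d3 visualization.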
Slide 31
Slide 32
Slide 33
Slide 34
We need deep collaborations between: Users (e.g., archivists,
journalists, historians, digital humanists, etc.) Tool builders (me
and my colleagues)
Slide 35
Warcbase: here.
Slide 36
Warcbase in a box
Wide Web scrape (2011): Internet Archive wide0002 crawl, sample of 100 WARC files (~100 GB)
Warcbase running on the Mac Pro cylinder:
Ingestion in ~1 hour
Extraction of webgraph using Pig in ~55 minutes
Result: 17M links; .ca subset of 1.7M links
Visualization in Gephi
Slide 37
Internet Archive, Wide Web Scrape from 2011
Slide 38
What's the big deal?
Historians probably can't afford Hadoop clusters... but they can probably afford a Mac Pro
How will this change historical scholarship?
Visual graph analysis on longitudinal data; select subsets for further textual analysis; all on your desktop!
Drill down to examine individual pages
Slide 39
Warcbase: here. Bonus!
Slide 40
Raspberry Pi Experiments
Columbia University's Human Rights Web Archive: 43 GB of crawls from 2008 (1.68 million records)
Warcbase running on the Raspberry Pi (standalone mode):
Ingestion in ~27 hours (17.3 records/second)
Random browsing (avg over 100 pages): 2.1s page load; same pages in the Internet Archive: 2.9s page load
Draws 2.4 Watts
Jimmy Lin. Scaling Down Distributed Infrastructure on Wimpy Machines for Personal Web Archiving. Temporal Web Analytics Workshop 2015.
Slide 41
What's the big deal?
Store every page you've ever visited in your pocket! Throw in search, lightweight analytics, ...
What will you do with the web in your pocket?
How will this change how you interact with the web? When was the last time you searched to re-find?
Slide 42
We need deep collaborations between:
Users (e.g., archivists, journalists, historians, digital humanists, etc.)
Tool builders (me and my colleagues)
Warcbase is just the first step
Goal: tools to support exploration and discovery
Slide 43
Source: Wikipedia (Hammer) Questions?
Slide 44
Bigtable use case: storing web crawls!
Row key: domain-reversed URL
Column family: contents (raw data); value: raw HTML; column qualifier: (none)
Column family: anchor (derived data); value: anchor text; column qualifier: source URL
Slide 45
Warcbase: HBase data model
Row key: domain-reversed URL
Column family: c
Column qualifier: MIME type
Timestamp: crawl date
Value: raw source
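Why domain-reversed row keys? Reversing the host groups a site and all of its subdomains together under HBase's lexicographic row-key sort, so one range scan retrieves an entire site's captures. A small sketch of the key transformation (a hypothetical helper, not Warcbase's actual implementation):

```python
# Sketch: build a domain-reversed row key from a URL, so that
# gov.house..., gov.house.www..., gov.senate... sort adjacently.
from urllib.parse import urlparse

def domain_reversed(url):
    parts = urlparse(url)
    host = ".".join(reversed(parts.netloc.split(".")))
    return host + parts.path

key = domain_reversed("http://www.house.gov/committees/")
# -> "gov.house.www/committees/"
```

With keys shaped like this, scanning the range ["gov.house", "gov.housf") returns every capture under house.gov and its subdomains in a single pass.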