Warcbase: Building a Scalable Web Archiving Platform on Hadoop
and HBase Jimmy Lin University of Maryland @lintool IIPC 2015
General Assembly Tuesday, April 28, 2015
Slide 2
Source: Wikipedia (All Souls College, Oxford) From the Ivory
Tower
Slide 3
Source: Wikipedia (Factory) to building sh*t that works
Slide 4
I worked on analytics infrastructure to support data science, and data products to surface relevant content to users
Slide 5
Gupta et al. WTF: The Who to Follow Service at Twitter. WWW 2013.
Lin and Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012.
Busch et al. Earlybird: Real-Time Search at Twitter. ICDE 2012.
Mishne et al. Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture. SIGMOD 2013.
Leibert et al. Automatic Management of Partitioned, Replicated Search Services. SoCC 2011.
I worked on analytics infrastructure to support data science, and data products to surface relevant content to users
Circa 2010: ~150 people total; ~60 Hadoop nodes; ~6 people use the analytics stack daily
Circa 2012: ~1400 people total; 10s of Ks of Hadoop nodes across multiple DCs; 10s of PBs total Hadoop DW capacity; ~100 TB ingest daily; dozens of teams use Hadoop daily; 10s of Ks of Hadoop jobs daily
Slide 8
Source: Wikipedia (All Souls College, Oxford) And back!
Slide 9
Web archives are an important part of our cultural heritage... but they're underused
Source: http://images3.nick.com/nick-assets/shows/images/house-of-anubis/flipbooks/hidden-room/hidden-room-04.jpg
Slide 10
Why? Restrictive use regimes? But I don't think that's all
Slide 11
Source: http://www.flickr.com/photos/cheryne/8417457803/
Users can't do much with current web archives
Hard to develop tools for non-existent needs
Slide 12
We need deep collaborations between:
Users (e.g., archivists, journalists, historians, digital humanists, etc.)
Tool builders (me and my colleagues)
Goal: tools to support exploration and discovery in web archives
Beyond browsing. Beyond searching.
Source: http://waterloocyclingclub.ca/wp-content/uploads/2013/05/Help-Wanted-Sign.jpg
Slide 13
Source: Google What would a web archiving platform built on
modern big data infrastructure look like?
Slide 14
Desiderata, and how existing tools meet them:
Scalable storage of archived data: Petabox by the Internet Archive; NAS, SAN, etc. by others
Efficient random access: OpenWayback, a monolithic Tomcat application
Scalable processing and analytics: some work by the Internet Archive, Common Crawl, and others
Scalable storage and access of derived data: ad hoc storage in flat text WAT files
Slide 15
Desiderata, mapped onto the Hadoop stack:
Scalable storage of archived data: HDFS
Efficient random access: HBase
Scalable processing and analytics: Hadoop
Scalable storage and access of derived data: HBase
Existing tools aren't adequate!
Slide 16
HDFS architecture (figure): an application's HDFS client asks the namenode to resolve (file name) to block ids, and (block id) to block locations; the namenode maintains the file namespace (e.g., /foo/bar maps to block 3df2) and sends instructions to datanodes, which report their state back. The client then requests (block id, byte range) from a datanode, which returns block data from its local Linux file system.
Stores data blocks across commodity servers
Scales to 100s of PBs of data
Open source implementation of the Google File System
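The read path above can be mocked in a few lines. This is a toy model with hypothetical class names, not the real HDFS client API: the namenode resolves file names to block ids and locations, and datanodes serve byte ranges of block data.

```python
# Toy model of the HDFS read path (illustrative names, not real APIs):
# the namenode holds metadata only; datanodes hold the actual bytes.

class NameNode:
    def __init__(self):
        self.namespace = {}   # file name -> ordered list of block ids
        self.locations = {}   # block id -> list of datanode addresses

    def add_file(self, name, blocks):
        self.namespace[name] = [bid for bid, _ in blocks]
        for bid, nodes in blocks:
            self.locations[bid] = nodes

    def lookup(self, name):
        # (file name) -> [(block id, block locations), ...]
        return [(bid, self.locations[bid]) for bid in self.namespace[name]]

class DataNode:
    def __init__(self):
        self.blocks = {}      # block id -> bytes on the local file system

    def read(self, block_id, start, end):
        # (block id, byte range) -> block data
        return self.blocks[block_id][start:end]

# A client resolves metadata at the namenode, then reads from a datanode:
nn = NameNode()
dn = DataNode()
dn.blocks["3df2"] = b"hello, archived web!"
nn.add_file("/foo/bar", [("3df2", ["datanode-1"])])
(bid, where), = nn.lookup("/foo/bar")
data = dn.read(bid, 0, 5)   # -> b"hello"
```

The key design point, preserved here, is that bulk data never flows through the namenode: it answers metadata queries only, which is what lets HDFS scale to 100s of PBs.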
Slide 17
MapReduce (figure): mappers emit intermediate (key, value) pairs; the shuffle-and-sort phase aggregates values by key; reducers process each key's values to produce final output.
Suitable for batch processing on HDFS data
Open source implementation of Google's framework
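The map, shuffle-and-sort, and reduce phases in the figure can be sketched in pure Python (no Hadoop required; all names here are illustrative), using the classic word count as the workload:

```python
# Minimal in-memory sketch of the map -> shuffle-and-sort -> reduce flow.
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # map: emit intermediate (key, value) pairs -- (word, 1) for word count
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # reduce: aggregate all values that share a key
    yield (key, sum(values))

def mapreduce(records, mapper, reducer):
    # shuffle and sort: group intermediate pairs by key
    intermediate = sorted(
        (kv for r in records for kv in mapper(r)), key=itemgetter(0))
    out = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        out.extend(reducer(key, (v for _, v in group)))
    return out

print(mapreduce(["a b a", "b c"], mapper, reducer))
# -> [('a', 2), ('b', 2), ('c', 1)]
```

In real Hadoop the shuffle is distributed across machines, but the programmer's contract is exactly this: write a mapper and a reducer, and the framework handles grouping by key.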
Slide 18
~ Google's Bigtable: a collection of tables, each of which represents a sparse, distributed, persistent multidimensional sorted map
Source: Bonaldo Big Table by Alain Gilles
Slide 19
Image Source: Chang et al., OSDI 2006
In a nutshell: (row key, column family, column qualifier, timestamp) maps to value
Row keys are lexicographically sorted
Column families define locality groups
Client-side operations: gets, puts, deletes, range scans
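The quadruple-to-value model above can be sketched as a plain sorted map. This is an illustrative toy, not the HBase client API; it shows versioned cells, newest-version gets, and range scans over lexicographically sorted row keys:

```python
# Toy model of the Bigtable/HBase data model:
# (row key, column family, column qualifier, timestamp) -> value
import bisect

class Table:
    def __init__(self):
        self.cells = {}   # (row, family, qualifier) -> {timestamp: value}
        self.rows = []    # row keys kept in lexicographic order

    def put(self, row, family, qualifier, timestamp, value):
        self.cells.setdefault((row, family, qualifier), {})[timestamp] = value
        i = bisect.bisect_left(self.rows, row)
        if i == len(self.rows) or self.rows[i] != row:
            self.rows.insert(i, row)

    def get(self, row, family, qualifier):
        # return the newest version (HBase's default behavior)
        versions = self.cells.get((row, family, qualifier), {})
        return versions[max(versions)] if versions else None

    def scan(self, start, stop):
        # range scan over sorted row keys: [start, stop)
        lo = bisect.bisect_left(self.rows, start)
        hi = bisect.bisect_left(self.rows, stop)
        return self.rows[lo:hi]

t = Table()
t.put("org.example/", "c", "text/html", 20030101, b"<html>v1</html>")
t.put("org.example/", "c", "text/html", 20040101, b"<html>v2</html>")
latest = t.get("org.example/", "c", "text/html")   # newest version wins
hits = t.scan("org.example", "org.examplez")       # rows with this prefix
```

Sorted row keys are what make prefix range scans cheap, and they are the reason the row-key design (see the Warcbase data model later in the deck) matters so much.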
Slide 20
Warcbase: an open-source platform for managing web archives, built on Hadoop and HBase
http://warcbase.org/
Source: Wikipedia (Archive)
Slide 21
WARC data Ingestion Applications and Services Processing &
Analytics
Slide 22
Scalability? "We got 99 problems but scalability ain't one" (after Jay-Z)
Scalability of Warcbase is limited only by Hadoop/HBase
Applications are lightweight clients
Slide 23
WARC data Ingestion Applications and Services Processing &
Analytics text analysis, link analysis,
Slide 24
Sample dataset: crawl of the 108th U.S. Congress
Monthly snapshots, January 2003 to January 2005
1.15 TB gzipped ARC files
29 million captures of 7.8 million unique URLs; 23.8 million captures are HTML files
Hadoop/HBase cluster: 16 nodes, dual quad-core processors, three 2 TB disks each
Slide 25
OpenWayback + Warcbase Integration
Implementation: OpenWayback frontend for rendering
Offloads storage management to HBase
Transparently scales out with HBase
Slide 26
Slide 27
Slide 28
Topic Model Explorer
Implementation: LDA on each temporal slice; adaptation of the Termite visualization
Slide 29
Slide 30
Webgraph Explorer
Implementation: link extraction with Hadoop, site-level aggregation; computation of standard graph statistics; d3 interactive visualization
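The deck builds the webgraph with Hadoop (and later Pig); as a hedged single-machine illustration of the same idea, the sketch below (regex-based link extraction and hypothetical helper names, not Warcbase's actual code) extracts page-to-page links and aggregates them up to the site (host) level:

```python
# Sketch: extract hyperlinks from pages, aggregate edges by site (host).
import re
from collections import Counter
from urllib.parse import urljoin, urlparse

HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def extract_links(page_url, html):
    # yield (source host, target host) for every hyperlink on the page
    src = urlparse(page_url).netloc
    for href in HREF.findall(html):
        dst = urlparse(urljoin(page_url, href)).netloc
        if dst:
            yield (src, dst)

def site_level_webgraph(pages):
    # pages: iterable of (url, html); returns Counter of site-level edges
    edges = Counter()
    for url, html in pages:
        edges.update(extract_links(url, html))
    return edges

pages = [("http://a.gov/x",
          '<a href="http://b.gov/y">b</a><a href="/z">self</a>')]
graph = site_level_webgraph(pages)   # both edges counted once
```

At scale, the same map (extract links) and reduce (count edges per site pair) steps run as a Hadoop job; the edge counts then feed the graph statistics and the d3 visualization.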
Slide 31
Slide 32
Slide 33
Slide 34
We need deep collaborations between: Users (e.g., archivists,
journalists, historians, digital humanists, etc.) Tool builders (me
and my colleagues)
Slide 35
Warcbase: here.
Slide 36
Warcbase in a box
Wide Web scrape (2011): Internet Archive wide0002 crawl, sample of 100 WARC files (~100 GB)
Warcbase running on the Mac Pro cylinder:
Ingestion in ~1 hour
Extraction of webgraph using Pig in ~55 minutes
Result: 17M links; .ca subset of 1.7M links
Visualization in Gephi
Slide 37
Internet Archive, Wide Web Scrape from 2011
Slide 38
What's the big deal?
Historians probably can't afford Hadoop clusters... but they can probably afford a Mac Pro
How will this change historical scholarship?
Visual graph analysis on longitudinal data; select subsets for further textual analysis; all on your desktop!
Drill down to examine individual pages
Slide 39
Warcbase: here. Bonus!
Slide 40
Raspberry Pi Experiments
Columbia University's Human Rights Web Archive: 43 GB of crawls from 2008 (1.68 million records)
Warcbase running on the Raspberry Pi (standalone mode):
Ingestion in ~27 hours (17.3 records/second)
Random browsing (avg over 100 pages): 2.1s page load; same pages in the Internet Archive: 2.9s page load
Draws 2.4 Watts
Jimmy Lin. Scaling Down Distributed Infrastructure on Wimpy Machines for Personal Web Archiving. Temporal Web Analytics Workshop 2015.
Slide 41
What's the big deal?
Store every page you've ever visited in your pocket! Throw in search, lightweight analytics, ...
What will you do with the web in your pocket?
How will this change how you interact with the web? When was the last time you searched to re-find?
Slide 42
We need deep collaborations between:
Users (e.g., archivists, journalists, historians, digital humanists, etc.)
Tool builders (me and my colleagues)
Warcbase is just the first step
Goal: tools to support exploration and discovery
Slide 43
Source: Wikipedia (Hammer) Questions?
Slide 44
Bigtable use case: storing web crawls!
Row key: domain-reversed URL
Column family: contents (raw data); value: raw HTML; column qualifier: (none)
Column family: anchor (derived data); value: anchor text; column qualifier: source URL
Slide 45
Warcbase: HBase data model
Row key: domain-reversed URL
Column family: c
Column qualifier: MIME type
Timestamp: crawl date
Value: raw source
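Why domain-reversed row keys? Reversing the host groups a site and all of its subdomains together under HBase's lexicographic row-key sort, so one range scan retrieves an entire site's captures. A small sketch of the key transformation (a hypothetical helper, not Warcbase's actual implementation):

```python
# Sketch: build a domain-reversed row key from a URL, so that
# gov.house..., gov.house.www..., gov.senate... sort adjacently.
from urllib.parse import urlparse

def domain_reversed(url):
    parts = urlparse(url)
    host = ".".join(reversed(parts.netloc.split(".")))
    return host + parts.path

key = domain_reversed("http://www.house.gov/committees/")
# -> "gov.house.www/committees/"
```

With keys shaped like this, scanning the range ["gov.house", "gov.housf") returns every capture under house.gov and its subdomains in a single pass.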