Warcbase: Building a Scalable Web Archiving Platform on Hadoop and HBase (Jimmy Lin, University of Maryland)
- Slide 1
- Warcbase: Building a Scalable Web Archiving Platform on Hadoop
and HBase Jimmy Lin University of Maryland @lintool IIPC 2015
General Assembly Tuesday, April 28, 2015
- Slide 2
- Source: Wikipedia (All Souls College, Oxford) From the Ivory
Tower
- Slide 3
- Source: Wikipedia (Factory) to building sh*t that works
- Slide 4
- I worked on analytics infrastructure to support data science, and on data products to surface relevant content to users
- Slide 5
- Gupta et al. WTF: The Who to Follow Service at Twitter. WWW 2013. Lin and Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012. Busch et al. Earlybird: Real-Time Search at Twitter. ICDE 2012. Mishne et al. Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture. SIGMOD 2013. Leibert et al. Automatic Management of Partitioned, Replicated Search Services. SoCC 2011. I worked on analytics infrastructure to support data science, and on data products to surface relevant content to users.
- Slide 6
- Source:
https://www.flickr.com/photos/bongtongol/3491316758/
- Slide 7
- circa 2010: ~150 people total; ~60 Hadoop nodes; ~6 people use the analytics stack daily. circa 2012: ~1400 people total; 10s of Ks of Hadoop nodes across multiple DCs; 10s of PBs total Hadoop DW capacity; ~100 TB ingest daily; dozens of teams use Hadoop daily; 10s of Ks of Hadoop jobs daily.
- Slide 8
- Source: Wikipedia (All Souls College, Oxford) And back!
- Slide 9
- Web archives are an important part of our cultural heritage, but they're underused. Source: http://images3.nick.com/nick-assets/shows/images/house-of-anubis/flipbooks/hidden-room/hidden-room-04.jpg
- Slide 10
- Why? Restrictive use regimes? But I don't think that's all
- Slide 11
- Source: http://www.flickr.com/photos/cheryne/8417457803/ Users can't do much with current web archives. Hard to develop tools for non-existent needs.
- Slide 12
- We need deep collaborations between: users (e.g., archivists, journalists, historians, digital humanists, etc.) and tool builders (me and my colleagues). Goal: tools to support exploration and discovery in web archives. Beyond browsing. Beyond searching. Source: http://waterloocyclingclub.ca/wp-content/uploads/2013/05/Help-Wanted-Sign.jpg
- Slide 13
- Source: Google What would a web archiving platform built on
modern big data infrastructure look like?
- Slide 14
- Desiderata: scalable storage of archived data; efficient random access; scalable processing and analytics; scalable storage and access of derived data. Existing approaches: OpenWayback, a monolithic Tomcat application; some work by the Internet Archive, Common Crawl, and others; Petabox by the Internet Archive, NAS, SAN, etc. by others; ad hoc storage in flat text WAT files.
- Slide 15
- Desiderata, revisited: scalable storage of archived data: HDFS. Efficient random access: HBase. Scalable processing and analytics: Hadoop. Scalable storage and access of derived data: HBase. Existing tools aren't adequate!
- Slide 16
- HDFS in a nutshell: the application's HDFS client asks the namenode to resolve a path in the file namespace (e.g., /foo/bar) into block ids, and block ids into block locations; it then requests (block id, byte range) from a datanode, which serves block data off its local Linux file system. Datanodes report their state to the namenode and receive instructions from it. Stores data blocks across commodity servers. Scales to 100s of PBs of data. Open source implementation of the Google File System.
- Slide 17
- MapReduce in a nutshell: mappers emit intermediate key-value pairs; the shuffle-and-sort phase aggregates values by key; reducers process each key with its list of values. Suitable for batch processing on HDFS data. Open source implementation of Google's framework.
- Slide 18
- HBase ~ Google's Bigtable: a collection of tables, each of which represents a sparse, distributed, persistent multidimensional sorted map. Source: Bonaldo Big Table by Alain Gilles
- Slide 19
- HBase in a nutshell: (row key, column family, column qualifier, timestamp) → value. Row keys are lexicographically sorted; column families define locality groups. Client-side operations: gets, puts, deletes, range scans. Image source: Chang et al., OSDI 2006.
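The sorted-map semantics above can be made concrete with a toy table (a plain dict keyed by the 4-tuple; this illustrates the semantics only, and the real HBase client API looks quite different):

```python
class ToyTable:
    """Toy model of (row key, column family, column qualifier, timestamp) -> value."""

    def __init__(self):
        self.cells = {}  # (row, family, qualifier, timestamp) -> value

    def put(self, row, family, qualifier, timestamp, value):
        self.cells[(row, family, qualifier, timestamp)] = value

    def get(self, row, family, qualifier):
        # Return the newest version of the cell, as HBase does by default.
        versions = {ts: v for (r, f, q, ts), v in self.cells.items()
                    if (r, f, q) == (row, family, qualifier)}
        return versions[max(versions)] if versions else None

    def scan(self, start_row, stop_row):
        # Range scan: because row keys are lexicographically sorted, a
        # contiguous key range is a contiguous slice of the table.
        return sorted({r for (r, _, _, _) in self.cells
                       if start_row <= r < stop_row})

t = ToyTable()
t.put("org.example.www/a", "c", "text/html", 1, "v1")
t.put("org.example.www/a", "c", "text/html", 2, "v2")
t.put("org.example.www/b", "c", "text/html", 1, "x")
print(t.get("org.example.www/a", "c", "text/html"))  # v2 (newest version wins)
print(t.scan("org.example.www/a", "org.example.www/z"))
```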
- Slide 20
- Warcbase: an open-source platform for managing web archives, built on Hadoop and HBase. http://warcbase.org/ Source: Wikipedia (Archive)
- Slide 21
- WARC data Ingestion Applications and Services Processing &
Analytics
- Slide 22
- Scalability? "We got 99 problems but scalability ain't one." (after Jay-Z) The scalability of Warcbase is limited only by Hadoop/HBase; applications are lightweight clients.
- Slide 23
- WARC data Ingestion Applications and Services Processing & Analytics: text analysis, link analysis, ...
- Slide 24
- Sample dataset: crawl of the 108th U.S. Congress. Monthly snapshots, January 2003 to January 2005. 1.15 TB of gzipped ARC files. 29 million captures of 7.8 million unique URLs; 23.8 million captures are HTML files. Hadoop/HBase cluster: 16 nodes, dual quad-core processors, three 2 TB disks each.
- Slide 25
- OpenWayback + Warcbase integration. Implementation: OpenWayback frontend for rendering; offloads storage management to HBase; transparently scales out with HBase.
- Slide 26
- Slide 27
- Slide 28
- Topic Model Explorer. Implementation: LDA on each temporal slice; adaptation of the Termite visualization.
- Slide 29
- Slide 30
- Webgraph Explorer. Implementation: link extraction with Hadoop, with site-level aggregation; computation of standard graph statistics; interactive visualization in d3.
- Slide 31
- Slide 32
- Slide 33
- Slide 34
- We need deep collaborations between: users (e.g., archivists, journalists, historians, digital humanists, etc.) and tool builders (me and my colleagues).
- Slide 35
- Warcbase: available at http://warcbase.org/
- Slide 36
- Warcbase in a box. Wide Web scrape (2011): Internet Archive wide0002 crawl, sample of 100 WARC files (~100 GB). Warcbase running on the Mac Pro (cylinder): ingestion in ~1 hour; extraction of the webgraph using Pig in ~55 minutes. Result: 17m links; .ca subset of 1.7m links. Visualization in Gephi.
- Slide 37
- Internet Archive, Wide Web Scrape from 2011
- Slide 38
- What's the big deal? Historians probably can't afford Hadoop clusters, but they can probably afford a Mac Pro. How will this change historical scholarship? Visual graph analysis on longitudinal data; select subsets for further textual analysis; drill down to examine individual pages: all on your desktop!
- Slide 39
- Bonus! Warcbase: available at http://warcbase.org/
- Slide 40
- Raspberry Pi experiments: Columbia University's Human Rights Web Archive, 43 GB of crawls from 2008 (1.68 million records). Warcbase running on the Raspberry Pi (standalone mode): ingestion in ~27 hours (17.3 records/second); random browsing (avg over 100 pages): 2.1s page load; same pages in the Internet Archive: 2.9s page load. Draws 2.4 watts. Jimmy Lin. Scaling Down Distributed Infrastructure on Wimpy Machines for Personal Web Archiving. Temporal Web Analytics Workshop 2015.
- Slide 41
- What's the big deal? Throw in search, lightweight analytics, ... Store every page you've ever visited in your pocket! What will you do with the web in your pocket? How will this change how you interact with the web? When was the last time you searched to re-find?
- Slide 42
- We need deep collaborations between: users (e.g., archivists, journalists, historians, digital humanists, etc.) and tool builders (me and my colleagues). Warcbase is just the first step. Goal: tools to support exploration and discovery.
- Slide 43
- Source: Wikipedia (Hammer) Questions?
- Slide 44
- Bigtable use case: storing web crawls! Row key: domain-reversed URL. Raw data: column family "contents"; column qualifier: (empty); value: raw HTML. Derived data: column family "anchor"; column qualifier: source URL; value: anchor text.
- Slide 45
- Warcbase: HBase data model. Row key: domain-reversed URL. Column family: "c". Column qualifier: MIME type. Timestamp: crawl date. Value: raw source.
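Because every capture of a URL lives in one row under its crawl-date timestamp, wayback-style replay reduces to picking the capture nearest a requested date. The at-or-before policy below (with an earliest-capture fallback) is one plausible choice for illustration, not necessarily Warcbase's exact behavior:

```python
import bisect

def closest_capture(timestamps, requested):
    # timestamps: sorted crawl dates (e.g. "YYYYMMDD" strings) for one row key.
    # Return the latest capture at or before the requested date,
    # falling back to the earliest capture if the request predates them all.
    i = bisect.bisect_right(timestamps, requested)
    return timestamps[i - 1] if i > 0 else timestamps[0]

captures = ["20030115", "20030615", "20041201"]
print(closest_capture(captures, "20040101"))  # 20030615
print(closest_capture(captures, "20020101"))  # 20030115 (earliest fallback)
```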