William Y. Arms, Manuel Calimlim
Lucy Walle, Felix Weigel
January 23, 2007
Research Seminar: The Web Lab
http://weblab.infosci.cornell.edu/
Cornell Information Science
2
The Web Lab: A Joint Project of Cornell University and the Internet Archive
Faculty
William Arms, Johannes Gehrke, Dan Huttenlocher, Jon Kleinberg, Michael Macy, David Strang,...
Researchers
Manuel Calimlim, Dave Lifka, Ruth Mitchell, Lucia Walle, Felix Weigel,...
Students
Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more than 50 M.Eng. and undergraduate students from Information Science and Computer Science
Internet Archive
Brewster Kahle, Tracey Jacquith, Michael Stack, Kris Carpenter,...
3
Introduction to the Web Lab
Mining the History of the Web
The Internet Archive's Web Collection
• Complete crawls of the Web, every two months since 1996
• Total archive is about 110,000,000,000 pages (110 billion)
• Recent crawls are about 60+ TByte (compressed)
• Total archive is about 1,900 TByte (compressed)
• Metadata contains format, links, anchor text
4
The Library Stacks: the Internet Archive
5
The Wayback Machine
Demo:
http://www.archive.org/
6
Research using Metadata about Web Pages
Current NSF grant
Research using anchor text
• links to microsoft.com and google.com
Changes to the link structure of the Web
• differences between crawls
• densification (increases in average node degree)
Formation of online groups
7
Example of Past Work: Social and Information Networks, Joining a Community
Close to one billion (user, community) instances
Work by: Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan
8
The Never-ending Research Dialog
Dialog between RESEARCHER and INFORMATION SCIENTIST:
Here's an analysis we would like to do...
Not as you suggest it, but here's another idea...
We don't know how to do that analysis. Would this be any use to you?
That might be possible, with the following modification...
Let's try it and see.
9
The Role of Web Data for Social Science Research
Social networks are an important research topic
– Emergence of global phenomena from local effects
• Viral spreading of rumors
– Behavior of individuals in a community
• Roles in discussion threads, herd behavior in opinion polls
– Network structure and dynamics
• Strength of weak ties, triangle relations, homophily
10
How to Observe a Social Network?
• Social network research before the web
– Talk to people, make notes
– Distribute questionnaires, gather statistics
• Problems with this approach
– Tedious task
– Small scale
• The Internet Archive is a great resource for research
– Contains web pages with social networks
– Records the history of the pages
11
Social Networks on the Web
The web contains many social networks
– Sites for social networking, social bookmarking, file sharing
• MySpace, Facebook, Flickr, Delicious
– Community portals
• Yahoo Groups, DBLife
– Encyclopedia and folksonomy projects
• Wikipedia, Wikia
– Review sites and customer comments
• Amazon, Netflix
– Blogs, web forums, Usenet
12
The Bliss and Curse of Digital Data
Opportunities
– Collecting network data at an unprecedented scale
– Verifying hypotheses in many different networks
– Monitoring communities at a finer granularity
– Mining and searching social networks
Challenges
– Finding suitable information on the web
– Extracting information from web pages
– Making web data persistent
– Processing very large data sets
– Access rights and privacy
13
Web Lab and Social Science Research
• Collaboration with Cornell’s Institute for the Social Sciences
• Our goal: Make data available to researchers
– Large web graph database with multiple crawls
– Packaged subsets of crawls for analysis
– Visual extraction tool for creating new data sets (ongoing)
– Small-scale crawling for adding new web sites (starting)
– Full-text indexing (planned)
Demo of the extraction tool available at
http://www.cs.cornell.edu/~weigel/WrapperDemo/
14
Web Data Extraction
Researchers often care less about whole web pages than about specific substructures inside the pages
– Blog postings
– Web forums
– Social tagging
– News headlines
– Tables of contents
– Bibliographies
– Product details
– Customer reviews
15
Web Data Collaboration Server
Data extraction
• Writing extraction code is a tedious task
• Create tools to make the data easily accessible in a structured format (e.g., tables in a database)
Data sharing
• Extracting the same data repeatedly is a waste of time and storage space
• Let users share their data and extraction rules
Data curation
• Web data is often incomplete and erroneous
• Let users collaborate to correct and complete the data
16
Demonstration
Demo of the extraction tool available at http://www.cs.cornell.edu/~weigel/WrapperDemo/
17
The Web Lab System
[Diagram of system components. Labels: INTERNET ARCHIVE (Wayback Machine, Web Collection, text indexes); CORNELL UNIVERSITY (file server, computer cluster, text indexes, page store, structure database); national supercomputers.]
18
Technical Processing: the Web Lab
Networking: Internet 2, National Lambda Rail
Wayback Machine: Commodity computers with local file systems
Structure database: Relational database system on large shared memory computer
Data analysis: Specialized Linux cluster with Hadoop distributed file system and MapReduce programming
Different types of computer for different functions
19
The Research Process
Select a sub-set for analysis
• SQL query the relational database directly
• Use the GetPages tool on the Web site to send an SQL query
Download the sub-set
• To the researcher's computer
• To the Web Lab file server
Clean up the data
• MapReduce tasks on the Hadoop cluster
Data analysis
• MapReduce tasks on the Hadoop cluster
20
Selection Methods
By known identifier (Wayback Machine)
web pages with the URL http://www.nsf.gov/
By character string (full text indexing) -- future
all pages containing "Internet is doubling every six months"
all pages containing the SARS-CoV genetic sequence
By metadata criteria
all web pages that link to microsoft.com but not to google.com
all email addresses that I used to receive mail from but have not had mail from recently*
* Example provided by Marc Smith
21
Benefits of Using a Relational Database
• Simple query language for retrieving data
• Transaction support
• Concurrency control for parallel queries
• Multiple indices for high performance
• Reliability since databases have built-in recovery functionality
22
Metadata Loading
• The crawler outputs compressed metadata files (DAT files).
• Each DAT file has a set of crawled pages with page metadata, including things like crawl time, IP address, mime type, language encoding, etc.
• Most importantly, the outgoing links from each page are parsed, including the full URL and associated anchor text.
23
Database Schema
Crawl – Name of the crawl from which data is loaded
Page – Metadata about each webpage plus fields to help find and extract the full html text
Link – The outgoing links from crawled pages
Url – Lookup table for unique URLs
Host – Lookup table for unique hostnames
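The five tables above can be sketched with an in-memory SQLite database. This is only an illustration: every column name and sample row below is an assumption, not the Web Lab's actual schema. The query is the metadata example "all web pages that link to microsoft.com but not to google.com".

```python
import sqlite3

# Hypothetical, simplified versions of the Crawl, Page, Link, Url and Host
# tables. All column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crawl (crawl_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE host  (host_id  INTEGER PRIMARY KEY, hostname TEXT UNIQUE);
CREATE TABLE url   (url_id   INTEGER PRIMARY KEY, url TEXT UNIQUE,
                    host_id  INTEGER REFERENCES host(host_id));
CREATE TABLE page  (page_id  INTEGER PRIMARY KEY,
                    crawl_id INTEGER REFERENCES crawl(crawl_id),
                    url_id   INTEGER REFERENCES url(url_id));
CREATE TABLE link  (from_page_id INTEGER REFERENCES page(page_id),
                    to_url_id    INTEGER REFERENCES url(url_id),
                    anchor_text  TEXT);

-- Made-up sample data: page 1 links only to microsoft.com,
-- page 2 links to both microsoft.com and google.com.
INSERT INTO crawl VALUES (1, 'demo');
INSERT INTO host VALUES (1, 'a.example'), (2, 'b.example'),
                        (3, 'microsoft.com'), (4, 'google.com');
INSERT INTO url VALUES (1, 'http://a.example/', 1), (2, 'http://b.example/', 2),
                       (3, 'http://microsoft.com/', 3), (4, 'http://google.com/', 4);
INSERT INTO page VALUES (1, 1, 1), (2, 1, 2);
INSERT INTO link VALUES (1, 3, 'Microsoft'), (2, 3, 'MS'), (2, 4, 'Google');
""")

# Subquery: does page p have at least one outgoing link to the given host?
LINKS_TO_HOST = """
    EXISTS (SELECT 1 FROM link l
            JOIN url t  ON t.url_id  = l.to_url_id
            JOIN host h ON h.host_id = t.host_id
            WHERE l.from_page_id = p.page_id AND h.hostname = ?)
"""

def pages_linking_to_but_not(conn, wanted, unwanted):
    """Return URLs of pages that link to `wanted` but not to `unwanted`."""
    sql = f"""
        SELECT u.url FROM page p JOIN url u ON u.url_id = p.url_id
        WHERE {LINKS_TO_HOST} AND NOT {LINKS_TO_HOST}
        ORDER BY u.url
    """
    return [row[0] for row in conn.execute(sql, (wanted, unwanted))]

result = pages_linking_to_but_not(conn, "microsoft.com", "google.com")
print(result)
```

Keeping Url and Host as lookup tables means each URL string is stored once and Link rows carry only integer keys, which matters when the Link table holds billions of rows.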
24
Crawls Loaded Into SQL DB
Crawl   | Period                          | Database size | Pages       | Links       | Urls        | Hosts
DJ      | Jan-April 2002                  | 2.5 TB        | 1.1 billion | 26 billion  | 250 million | 16 million
DV      | Jan-April 2004                  | 15 TB         | 1.3 billion | 110 billion | TBD         | TBD
EB      | Jan-March 2005                  | 20 TB         | 3 billion   | 130 billion | 20 billion  | 380 million
Amazon  | Jan-April 2004, Jan-August 2005 | 570 GB        | 40 million  | 3 billion   | 35 million  | 356
Cornell | Jan-April 2002, Jan-April 2004  | 5 GB          | 800,000     | 12 million  | 750,000     | 40,000
25
Selection from the Database
• SQL query the relational database directly
(Contact Manuel Calimlim)
• Use the GetPages tool on the Web site to send an SQL query -- work in progress
26
Demonstration
Demonstration of the Web Lab web site
http://weblab.infosci.cornell.edu/
and the GetPages tool
27
Massive Data Analysis by Non-Specialists
A typical scientist or social scientist:
• Has deep domain knowledge
• Has good algorithmic understanding
• Is often a competent computer user or has a research assistant who is familiar with languages such as Fortran, Python, and Matlab, or applications packages such as SAS and Excel.
But...
• Has limited understanding of large-scale data analysis
• Is not skilled at any form of computing that requires parallel computing or concurrency
Typical problem of scale: Given 100 billion URLs, how do you identify duplicates?
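One common way to approach the duplicate problem is hash partitioning. The sketch below is an illustration of that general idea, not the Web Lab's method: URLs are assigned to partitions by a hash, so all copies of a URL land in the same partition and each partition can be checked independently. The function names and the partition count are assumptions.

```python
import zlib
from collections import Counter

def partition_of(url, num_partitions):
    # A stable hash, so the same URL always maps to the same partition.
    return zlib.crc32(url.encode("utf-8")) % num_partitions

def find_duplicates(urls, num_partitions=4):
    """Return the set of URLs that occur more than once."""
    buckets = [[] for _ in range(num_partitions)]
    for url in urls:
        buckets[partition_of(url, num_partitions)].append(url)

    duplicates = set()
    # Each bucket is independent, so on a cluster every bucket could be
    # processed by a different machine with only that bucket in memory.
    for bucket in buckets:
        counts = Counter(bucket)
        duplicates.update(u for u, n in counts.items() if n > 1)
    return duplicates

dupes = find_duplicates(["a", "b", "a", "c", "b", "d"])
print(sorted(dupes))
```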
28
Hadoop and MapReduce Programming
Hadoop
An open source distributed file system similar to the Google File System. It supports MapReduce programming.
http://lucene.apache.org/hadoop/
MapReduce
A functional programming style to support large-scale data analysis without the need for global data structures.
In the 1960s, Fortran gave scientists a simple way to translate mathematical problems into efficient computer codes.
MapReduce programming gives researchers a simple way to run massive data analysis on large computer clusters.
29
The MapReduce Paradigm
[Diagram: input data split into files (split 0 to split 4), M map tasks, intermediate files, R reduce tasks, output files (Output 0, Output 1)]
Each intermediate file is divided into R partitions
Each reduce task corresponds to one partition
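The flow from splits to partitions to outputs can be imitated on one machine. This is a toy sketch, not Hadoop's API: `run_mapreduce`, the CRC-based partitioner, and the word-count example are all assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, num_reducers):
    """Toy single-process imitation of the map, partition and reduce stages."""
    # One map task per input split. Each map task writes one intermediate
    # "file", divided into R partitions by a hash of the key.
    intermediate = []
    for split in splits:
        partitions = [[] for _ in range(num_reducers)]
        for record in split:
            for key, value in map_fn(record):
                r = zlib.crc32(key.encode("utf-8")) % num_reducers
                partitions[r].append((key, value))
        intermediate.append(partitions)

    # One reduce task per partition. Task r reads partition r of every
    # intermediate file, groups the values by key, and writes one output.
    outputs = []
    for r in range(num_reducers):
        grouped = defaultdict(list)
        for partitions in intermediate:
            for key, value in partitions[r]:
                grouped[key].append(value)
        outputs.append({k: reduce_fn(k, vs) for k, vs in grouped.items()})
    return outputs

# Example: word count over two input splits (M = 2) with R = 2 reduce tasks.
splits = [["a b", "b c"], ["a a"]]
outputs = run_mapreduce(
    splits,
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, counts: sum(counts),
    num_reducers=2,
)
merged = {k: v for out in outputs for k, v in out.items()}
print(merged)
```

Because the partition is chosen from the key alone, every value for a given key reaches the same reduce task, which is why no global data structure is needed.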
30
A Web Graph Example
[Figure: a small web graph with six nodes, numbered 1 to 6]
31
Building the Web Graph
URLs, pages, and links:
• URLs contained in Web pages may link to pages never crawled
• URLs not canonicalized: different URLs may refer to same page
• Links are from a page to a URL
Web graph from crawl data:
• Nodes are union of pages crawled and URLs seen
• Each node and edge has time interval(s) over which it exists
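A minimal sketch of URL canonicalization, using Python's standard `urllib.parse`. The specific rules here (lowercase scheme and host, drop default ports and fragments, use "/" for an empty path) are assumptions for illustration; the slides do not say which rules the Web Lab applies.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url):
    """Map some equivalent spellings of a URL to one canonical string."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # hostname drops any user info
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"                # empty path means the root page
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))

print(canonicalize("HTTP://WWW.NSF.GOV"))
print(canonicalize("http://www.nsf.gov:80/#top"))
```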
32
Web Graph Example
Problem:
Given a set of URL pairs in uncanonicalized form (u0, v0), create a list of all the edges that point to each node of the web graph:
• Replace each u0 or v0 with its canonicalized form u or v.
• Create a list of all nodes of the graph, i.e., the set of unique u.
• Discard all (u, v) pairs, where u = v, or v is not a node of the graph.
• Discard all duplicate edges.
• For each node v, create a list (v, {u}), where {u} is the set of nodes that have edges to node v.
Each step is a simple programming task for a small number of links on a single computer. How can this simplicity be retained with huge numbers of links on a very large computer cluster?
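For a small number of links, the five steps above fit in a few lines on a single computer. This is a sketch: `build_in_links` and the toy `canon` function are hypothetical names, and the sample pairs are made up.

```python
def build_in_links(raw_pairs, canonicalize):
    """Return {v: set of u} for every node v, following the five steps."""
    # Step 1: replace each u0 or v0 with its canonicalized form.
    pairs = [(canonicalize(u0), canonicalize(v0)) for u0, v0 in raw_pairs]
    # Step 2: the nodes of the graph are the unique from-URLs u.
    nodes = {u for u, _ in pairs}
    # Steps 3 and 4: drop self-links and links to non-nodes; building a
    # set also discards duplicate edges.
    edges = {(u, v) for u, v in pairs if u != v and v in nodes}
    # Step 5: for each node v, collect the nodes u that have an edge to v.
    in_links = {v: set() for v in nodes}
    for u, v in edges:
        in_links[v].add(u)
    return in_links

# Toy canonicalizer, for illustration only.
canon = lambda s: s.strip().lower().rstrip("/")

raw_pairs = [("A/", "b"), ("a", "B"), ("b", "a"), ("a", "a"),
             ("a", "c"), ("c/", "a"), ("a", "x")]
in_links = build_in_links(raw_pairs, canon)
print(in_links)
```

Here "x" never appears as a from-URL, so it is not a node and the pair ("a", "x") is discarded.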
33
MapReduce Example
Map task
Input: (u0, v0)
Output: (u, d) // Indicate that u is a from-URL
(v, u) // Indicate that v is a to-URL with link from u
d is a dummy marker. Do not output if u = v.
This is simple application code to write.
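A sketch of the map task as a Python generator. Two assumptions are made here: the dummy marker d is represented by `None` so it cannot collide with a real URL, and for a self-link only the (v, u) pair is suppressed while the marker is still emitted, since u is still a from-URL. The names `map_link` and `canon` are hypothetical.

```python
MARKER = None  # stands in for the dummy marker d; cannot collide with a URL

def map_link(u0, v0, canonicalize):
    """Map one uncanonicalized link (u0, v0) to intermediate key/value pairs."""
    u, v = canonicalize(u0), canonicalize(v0)
    yield (u, MARKER)      # u is a from-URL, hence a node of the graph
    if u != v:             # assumption: only the edge is dropped for self-links
        yield (v, u)       # v is a to-URL with a link from u

# Toy canonicalizer, for illustration only.
canon = lambda s: s.strip().lower().rstrip("/")

print(list(map_link("A/", "b", canon)))
print(list(map_link("a", "A/", canon)))
```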
34
A MapReduce Example
Merge
The input to the reduce process merges the output values from the map task that correspond to each URL.
For each URL, w, it creates a list:
w, {d, ... , d, u1, ..., uk}
This merge is performed automatically by the system libraries.
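The application programmer does not write the merge, but its effect can be pictured as grouping the map output by key. A conceptual sketch, with `None` standing in for the marker d and made-up sample pairs:

```python
from collections import defaultdict

D = None  # stands in for the dummy marker d

def merge_by_key(pairs):
    """Group (key, value) pairs into {key: [values]}, as the merge does."""
    merged = defaultdict(list)
    for key, value in pairs:
        merged[key].append(value)
    return dict(merged)

# Map output for the links a->b (seen twice), b->a and a->c, where c was
# never seen as a from-URL.
map_output = [("a", D), ("b", "a"), ("a", D), ("b", "a"),
              ("b", D), ("a", "b"), ("a", D), ("c", "a")]
merged = merge_by_key(map_output)
print(merged)
```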
35
A MapReduce Example
Reduce
Input: w, {d, ... , d, u1, ..., uk}, where w is any URL.
Output:
If there is no marker d in the list, discard and do not output. This corresponds to a URL that never appears as the first element of a (u, v) pair, so it is not a node of the graph.
Otherwise remove duplicates from u1, ..., uk and output.
The output is a to-URL and a list of the nodes that link to it:
v, {u1, ..., uk}
This is simple application code to write.
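A sketch of the reduce task, again with `None` standing in for the marker d. The function name `reduce_in_links` is hypothetical, and returning `None` is just one way to express "do not output".

```python
D = None  # stands in for the dummy marker d

def reduce_in_links(w, values):
    """Return (w, set of linking nodes), or None if w is not a node."""
    if D not in values:
        return None                              # w was never a from-URL
    sources = {u for u in values if u is not D}  # a set removes duplicates
    return (w, sources)

print(reduce_in_links("b", ["a", "a", D]))
print(reduce_in_links("c", ["a"]))
```

A node that has the marker but no incoming links is still output, with an empty set.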
36
For the Future: Examples of Tools and Services
The Web Lab is steadily building a set of tools for researchers
• API and Web services
• GetPages Web forms to select dataset by query of a relational database with indexes by date, URL, domain name, file type, anchor text, etc.
• Focused Web crawling (modification of Heritrix crawler)
• Extraction of Web graph from subset and calculations, e.g., PageRank, hubs and authorities
• Graph visualization
• Natural language processing of anchor text
37
The Web Lab is Ready for Use
We are ready to work with a number of researchers:
Systems
Relational database operational
Hadoop pilot cluster (large cluster soon)
File server and web server operational
People
Manuel Calimlim (database)
Lucy Walle (Hadoop + MapReduce)
Tools
A variety of tools in prototype
Experience with large volumes of anchor text and URLs
38
Thanks
This work would not be possible without the forethought and long-standing commitment of Brewster Kahle and the Internet Archive to capture and preserve the content of the Web for future generations.
This work has been funded in part by the National Science Foundation, grants CNS-0403340, DUE-0127308, SES-0537606, IIS-0634677, and IIS-0705774.