William Y. Arms Manuel Calimlim Lucy Walle Felix Weigel January 23, 2007 Research Seminar: The Web Lab Cornell Information

William Y. Arms, Manuel Calimlim

Lucy Walle, Felix Weigel

January 23, 2007

Research Seminar: The Web Lab

http://weblab.infosci.cornell.edu/

Cornell Information Science

2

The Web Lab: A Joint Project of Cornell University and the Internet Archive

Faculty

William Arms, Johannes Gehrke, Dan Huttenlocher, Jon Kleinberg, Michael Macy, David Strang,...

Researchers

Manuel Calimlim, Dave Lifka, Ruth Mitchell, Lucia Walle, Felix Weigel,...

Students

Selcuk Aya, Pavel Dmitriev, Blazej Kot, and more than 50 M.Eng. and undergraduate students from Information Science and Computer Science

Internet Archive

Brewster Kahle, Tracey Jaquith, Michael Stack, Kris Carpenter,...

3

Introduction to the Web Lab: Mining the History of the Web

The Internet Archive's Web Collection

• Complete crawls of the Web, every two months since 1996

• Total archive is about 110,000,000,000 pages (110 billion)

• Recent crawls are about 60+ TByte (compressed)

• Total archive is about 1,900 TByte (compressed)

• Metadata contains format, links, anchor text

4

The Library Stacks: the Internet Archive

5

The Wayback Machine

Demo:

http://www.archive.org/

6

Research using Metadata about Web Pages

Current NSF grant

Research using anchor text

• links to microsoft.com and google.com

Changes to the link structure of the Web

• differences between crawls

• densification (increases in average node degree)
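Both measurements reduce to simple set arithmetic once a crawl is available as a set of links. A minimal sketch, assuming each crawl is a set of (from_url, to_url) pairs; the function names and the toy data are illustrative and are not Web Lab code:

```python
def average_degree(edges):
    """Average out-degree: number of edges divided by number of nodes.

    Nodes are every URL that appears at either end of an edge.
    """
    nodes = {u for u, _ in edges} | {v for _, v in edges}
    return len(edges) / len(nodes) if nodes else 0.0


def crawl_diff(old_edges, new_edges):
    """Return (edges added, edges removed) between two crawls."""
    return new_edges - old_edges, old_edges - new_edges


# Toy data standing in for two crawls of the same small site.
crawl_2002 = {("a", "b"), ("b", "c")}
crawl_2004 = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")}

added, removed = crawl_diff(crawl_2002, crawl_2004)
densified = average_degree(crawl_2004) > average_degree(crawl_2002)
```

In this toy case the node set is unchanged while the edge count doubles, so the average degree rises between the two crawls.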

Formation of online groups

7

Example of Past Work: Social and Information Networks, Joining a Community

Close to one billion (user, community) instances

Work by: Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan

8

The Never-ending Research Dialog

RESEARCHER: Here's an analysis we would like to do...

INFORMATION SCIENTIST: We don't know how to do that analysis. Would this be any use to you?

RESEARCHER: Not as you suggest it, but here's another idea...

INFORMATION SCIENTIST: That might be possible, with the following modification...

Let's try it and see.

9

The Role of Web Data for Social Science Research

Social networks are an important research topic

– Emergence of global phenomena from local effects

• Viral spreading of rumors

– Behavior of individuals in a community

• Roles in discussion threads, herd behavior in opinion polls

– Network structure and dynamics

• Strength of weak ties, triangle relations, homophily

10

How to Observe a Social Network?

• Social network research before the web

– Talk to people, make notes

– Distribute questionnaires, gather statistics

• Problems with this approach

– Tedious task

– Small scale

• The Internet Archive is a great resource for research

– Contains web pages with social networks

– Records the history of the pages

11

Social Networks on the Web

The web contains many social networks

– Sites for social networking, social bookmarking, file sharing

• MySpace, Facebook, Flickr, Delicious

– Community portals

• Yahoo Groups, DBLife

– Encyclopedia and folksonomy projects

• Wikipedia, Wikia

– Review sites and customer comments

• Amazon, Netflix

– Blogs, web forums, Usenet

12

The Bliss and Curse of Digital Data

Opportunities

– Collecting network data at an unprecedented scale

– Verifying hypotheses in many different networks

– Monitoring communities at a finer granularity

– Mining and searching social networks

Challenges

– Finding suitable information on the web

– Extracting information from web pages

– Making web data persistent

– Processing very large data sets

– Access rights and privacy

13

Web Lab and Social Science Research

• Collaboration with Cornell’s Institute for the Social Sciences

• Our goal: Make data available to researchers

– Large web graph database with multiple crawls

– Packaged subsets of crawls for analysis

– Visual extraction tool for creating new data sets (ongoing)

– Small-scale crawling for adding new web sites (starting)

– Full-text indexing (planned)

Demo of the extraction tool available at

http://www.cs.cornell.edu/~weigel/WrapperDemo/

14

Web Data Extraction

Researchers often care less about whole web pages than about specific substructures inside the pages

– Blog postings

– Web forums

– Social tagging

– News headlines

– Tables of contents

– Bibliographies

– Product details

– Customer reviews

15

Web Data Collaboration Server

Data extraction

• Writing extraction code is a tedious task

• Create tools to make the data easily accessible in a structured format (e.g., tables in a database)

Data sharing

• Extracting the same data repeatedly is a waste of time and storage space

• Let users share their data and extraction rules

Data curation

• Web data is often incomplete and erroneous

• Let users collaborate to correct and complete the data

16

Demonstration

Demo of the extraction tool available at http://www.cs.cornell.edu/~weigel/WrapperDemo/

17

The Web Lab System

[Diagram. Labels: INTERNET ARCHIVE; CORNELL UNIVERSITY; Wayback Machine; Web Collection; Text indexes; File server; Computer cluster; Page store; Structure database; National supercomputers]

18

Technical Processing: the Web Lab

Networking: Internet 2, National Lambda Rail

Wayback Machine: Commodity computers with local file systems

Structure database: Relational database system on large shared memory computer

Data analysis: Specialized Linux cluster with Hadoop distributed file system and MapReduce programming

Different types of computer for different functions

19

The Research Process

Select a sub-set for analysis

• SQL query the relational database directly

• Use the GetPages tool on the Web site to send an SQL query

Download the sub-set

• To the researcher's computer

• To the Web Lab file server

Clean up the data

• MapReduce tasks on the Hadoop cluster

Data analysis

• MapReduce tasks on the Hadoop cluster

20

Selection Methods

By known identifier (Wayback Machine)

web pages with the URL http://www.nsf.gov/

By character string (full text indexing) -- future

all pages containing "Internet is doubling every six months"

all pages containing the SARS-CoV genetic sequence

By metadata criteria

all web pages that link to microsoft.com but not to google.com

all email addresses that I used to receive mail from but have not had mail from recently*

* Example provided by Marc Smith

21

Benefits of Using a Relational Database

• Simple query language for retrieving data

• Transaction support

• Concurrency control for parallel queries

• Multiple indices for high performance

• Reliability since databases have built-in recovery functionality

22

Metadata Loading

• The crawler outputs compressed metadata files (DAT files).

• Each DAT file has a set of crawled pages with page metadata, including things like crawl time, IP address, mime type, language encoding, etc.

• Most importantly, the outgoing links from each page are parsed, including the full URL and associated anchor text.

23

Database Schema

Crawl – Name of the crawl from which data is loaded

Page – Metadata about each webpage plus fields to help find and extract the full html text

Link – The outgoing links from crawled pages

Url – Lookup table for unique URLs

Host – Lookup table for unique hostnames
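The five tables can be sketched in SQLite to show how a metadata selection such as "all web pages that link to microsoft.com but not to google.com" might be written. This is an illustration only: the column names, keys, and toy rows are assumptions, and the production schema is not shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical columns; only the five table names come from the slide.
conn.executescript("""
    CREATE TABLE crawl (crawl_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE host  (host_id  INTEGER PRIMARY KEY, hostname TEXT UNIQUE);
    CREATE TABLE url   (url_id   INTEGER PRIMARY KEY, url TEXT UNIQUE,
                        host_id  INTEGER REFERENCES host(host_id));
    CREATE TABLE page  (page_id  INTEGER PRIMARY KEY,
                        crawl_id INTEGER REFERENCES crawl(crawl_id),
                        url_id   INTEGER REFERENCES url(url_id),
                        mime_type TEXT);
    CREATE TABLE link  (from_page_id INTEGER REFERENCES page(page_id),
                        to_url_id    INTEGER REFERENCES url(url_id),
                        anchor_text  TEXT);
""")

# Toy rows: two crawled pages on example.org and two link targets.
conn.execute("INSERT INTO crawl VALUES (1, 'toy')")
conn.executemany("INSERT INTO host VALUES (?, ?)",
                 [(1, "example.org"), (2, "microsoft.com"), (3, "google.com")])
conn.executemany("INSERT INTO url VALUES (?, ?, ?)",
                 [(1, "http://example.org/a", 1),
                  (2, "http://example.org/b", 1),
                  (3, "http://microsoft.com/", 2),
                  (4, "http://google.com/", 3)])
conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, "text/html"), (2, 1, 2, "text/html")])
conn.executemany("INSERT INTO link VALUES (?, ?, ?)",
                 [(1, 3, "Microsoft"), (2, 3, "MS"), (2, 4, "Google")])

# Pages with at least one link to the first host and none to the second.
QUERY = """
    SELECT u.url
    FROM page p
    JOIN url u ON u.url_id = p.url_id
    WHERE EXISTS (
        SELECT 1 FROM link l
        JOIN url t  ON t.url_id  = l.to_url_id
        JOIN host h ON h.host_id = t.host_id
        WHERE l.from_page_id = p.page_id AND h.hostname = ?)
      AND NOT EXISTS (
        SELECT 1 FROM link l
        JOIN url t  ON t.url_id  = l.to_url_id
        JOIN host h ON h.host_id = t.host_id
        WHERE l.from_page_id = p.page_id AND h.hostname = ?)
    ORDER BY u.url
"""
result = [row[0] for row in conn.execute(QUERY, ("microsoft.com", "google.com"))]
```

Keeping URLs and hostnames in lookup tables means the Link table stores only integer identifiers, which matters when a crawl has more than a hundred billion links.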

24

Crawls Loaded Into SQL DB

Crawl | Period | Database size | Pages | Links | Urls | Hosts
DJ | Jan-April 2002 | 2.5 TB | 1.1 billion | 26 billion | 250 million | 16 million
DV | Jan-April 2004 | 15 TB | 1.3 billion | 110 billion | TBD | TBD
EB | Jan-March 2005 | 20 TB | 3 billion | 130 billion | 20 billion | 380 million
Amazon | Jan-April 2004, Jan-August 2005 | 570 GB | 40 million | 3 billion | 35 million | 356
Cornell | Jan-April 2002, Jan-April 2004 | 5 GB | 800,000 | 12 million | 750,000 | 40,000

25

Selection from the Database

• SQL query the relational database directly

(Contact Manuel Calimlim)

• Use the GetPages tool on the Web site to send an SQL query -- work in progress

26

Demonstration

Demonstration of the Web Lab web site and the GetPages tool

27

Massive Data Analysis by Non-Specialists

A typical scientist or social scientist:

• Has deep domain knowledge

• Has good algorithmic understanding

• Is often a competent computer user or has a research assistant who is familiar with languages such as Fortran, Python, and Matlab, or applications packages such as SAS and Excel.

But...

• Has limited understanding of large-scale data analysis

• Is not skilled at any form of computing that requires parallel computing or concurrency

Typical problem of scale: Given 100 billion URLs, how do you identify duplicates?
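One common answer is to partition by hash: identical URLs always hash to the same bucket, so each bucket can be checked independently and in parallel. A minimal in-memory sketch of the idea; the function names and the bucket count are illustrative, and a real run would write buckets to files on a cluster rather than hold them in a dictionary:

```python
import hashlib
from collections import defaultdict


def bucket_of(url, num_buckets):
    """Stable bucket number for a URL (same URL always gets the same bucket)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def find_duplicates(urls, num_buckets=4):
    """Return the set of URLs that occur more than once."""
    buckets = defaultdict(list)
    for url in urls:                      # pass 1: scatter into buckets
        buckets[bucket_of(url, num_buckets)].append(url)

    duplicates = set()
    for bucket in buckets.values():       # pass 2: each bucket on its own
        seen = set()
        for url in bucket:
            if url in seen:
                duplicates.add(url)
            seen.add(url)
    return duplicates


dupes = find_duplicates(["a", "b", "a", "c", "b", "a"])
```

No bucket needs to know about any other bucket, which is the property that lets the work spread across many machines.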

28

Hadoop and MapReduce Programming

Hadoop

An open source distributed file system similar to the Google File System. It supports MapReduce programming.

http://lucene.apache.org/hadoop/

MapReduce

A functional programming style to support large-scale data analysis without the need for global data structures.

In the 1960s, Fortran gave scientists a simple way to translate mathematical problems into efficient computer codes.

MapReduce programming gives researchers a simple way to run massive data analysis on large computer clusters.

29

The MapReduce Paradigm

Input data split into files (split 0, split 1, split 2, split 3, split 4)

M map tasks

Intermediate files

Each intermediate file is divided into R partitions

R reduce tasks

Each reduce task corresponds to one partition

Output files (Output 0, Output 1)
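The data flow can be imitated on a single machine to make the roles of M, R, and the partitions concrete. A minimal sketch, not Hadoop code: run_mapreduce, the CRC-based partitioner, and the host-counting example are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers):
    """Simulate M = len(splits) map tasks and R = num_reducers reduce tasks."""
    # Map phase: one map task per input split. Each task writes one
    # intermediate "file" that is divided into R partitions by key.
    intermediate = []
    for split in splits:
        partitions = [[] for _ in range(num_reducers)]
        for record in split:
            for key, value in map_fn(record):
                r = zlib.crc32(key.encode("utf-8")) % num_reducers
                partitions[r].append((key, value))
        intermediate.append(partitions)

    # Reduce phase: reduce task r reads partition r of every intermediate file.
    outputs = []
    for r in range(num_reducers):
        grouped = defaultdict(list)
        for partitions in intermediate:
            for key, value in partitions[r]:
                grouped[key].append(value)
        outputs.append({key: reduce_fn(key, values)
                        for key, values in sorted(grouped.items())})
    return outputs


# Example: count how often each hostname occurs across three input splits.
splits = [["a.com", "b.org"], ["a.com", "c.net"], ["a.com"]]
outputs = run_mapreduce(splits,
                        map_fn=lambda host: [(host, 1)],
                        reduce_fn=lambda key, values: sum(values),
                        num_reducers=2)
totals = {key: count for out in outputs for key, count in out.items()}
```

Because the partition is chosen from the key alone, every value for a given key reaches the same reduce task, and no global data structure is needed.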

30

A Web Graph Example

[Figure: example web graph with six nodes, numbered 1 to 6]

31

Building the Web Graph

URLs, pages, and links:

• URLs contained in Web pages may link to pages never crawled

• URLs not canonicalized: different URLs may refer to same page

• Links are from a page to a URL

Web graph from crawl data:

• Nodes are union of pages crawled and URLs seen

• Each node and edge has time interval(s) over which it exists
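Canonicalization maps the different spellings of a URL to one form so that they become a single node. A minimal sketch using Python's standard library; the rules chosen here (lowercase scheme and host, drop default ports and fragments, drop any user information, use "/" for an empty path) are illustrative assumptions, not the Web Lab's actual rules:

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url):
    """Return one normalized spelling of a URL (illustrative rules only)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it differs from the scheme's default.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = f"{host}:{parts.port}"

    path = parts.path or "/"
    # The empty last field drops the fragment, which never reaches the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


same_page = (canonicalize("HTTP://WWW.NSF.GOV:80")
             == canonicalize("http://www.nsf.gov/#top"))
```

Rules like these are deliberately conservative: they merge spellings that must name the same resource, and leave query strings and path case untouched.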

32

Web Graph Example

Problem:

Given a set of URL pairs in uncanonicalized form (u0, v0), create a list of all the edges that point to each node of the web graph:

• Replace each u0 or v0 with its canonicalized form u or v.

• Create a list of all nodes of the graph, i.e., the set of unique u.

• Discard all (u, v) pairs, where u = v, or v is not a node of the graph.

• Discard all duplicate edges.

• For each node v, create a list (v, {u}), where {u} is the set of nodes that have edges to node v.

Each step is a simple programming task for a small number of links on a single computer. How can this simplicity be retained with huge numbers of links on a very large computer cluster?
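For a small number of links, the five steps fit in a few lines. A minimal single-machine sketch: the function name is illustrative, and str.lower is only a stand-in for a real canonicalization function.

```python
def incoming_links(raw_pairs, canonicalize=str.lower):
    """Map each node v to the set of nodes u that have edges to v."""
    # Step 1: canonicalize both ends of each pair.
    pairs = [(canonicalize(u0), canonicalize(v0)) for u0, v0 in raw_pairs]
    # Step 2: the nodes are the unique from-URLs u.
    nodes = {u for u, _ in pairs}
    # Steps 3 and 4: drop self-links and links to non-nodes;
    # building a set discards duplicate edges.
    edges = {(u, v) for u, v in pairs if u != v and v in nodes}
    # Step 5: for each node v, collect the nodes u that link to it.
    incoming = {v: set() for v in nodes}
    for u, v in edges:
        incoming[v].add(u)
    return incoming


raw_pairs = [("A", "b"), ("a", "B"), ("b", "a"),
             ("a", "a"), ("b", "x"), ("c", "a")]
graph = incoming_links(raw_pairs)
```

Here "x" is dropped because it never appears as a from-URL, and "c" is kept as a node with no incoming links. Every step relies on holding the full node set in memory, which is what stops working at the scale of a full crawl.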

33

MapReduce Example

Map task

Input: (u0, v0)

Output: (u, d) // Indicate that u is a from-URL

(v, u) // Indicate that v is a to-URL with link from u

d is a dummy marker. Do not output if u = v.

This is simple application code to write.

34

A MapReduce Example

Merge

The input to the reduce process merges the output values from the map task that correspond to each URL.

For each URL, w, it creates a list:

w, {d, ... , d, u1, ..., uk}

This merge is performed automatically by the system libraries.

35

A MapReduce Example

Reduce

Input: w, {d, ... , d, u1, ..., uk}, where w is any URL.

Output:

If there is no marker d in the list, discard and do not output. This corresponds to a URL that never appears as the first element of a (u, v) pair, so it is not a node of the graph.

Otherwise remove duplicates from u1, ..., uk and output.

The output is a to-URL and a list of the nodes that link to it:

v, {u1, ..., uk}

This is simple application code to write.
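The map, merge, and reduce steps can be written out and run on one machine to check the logic. A minimal sketch under stated assumptions: merge() stands in for the grouping that the system libraries perform automatically, None plays the role of the dummy marker d, and str.lower is only a placeholder for real canonicalization.

```python
from collections import defaultdict

DUMMY = None  # the marker d; None cannot collide with a URL string


def map_task(pair, canonicalize=str.lower):
    """Emit (u, d) and (v, u) for one raw link; emit nothing if u = v."""
    u, v = canonicalize(pair[0]), canonicalize(pair[1])
    if u == v:
        return []
    return [(u, DUMMY), (v, u)]


def merge(map_outputs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in map_outputs:
        grouped[key].append(value)
    return grouped


def reduce_task(w, values):
    """Return (w, set of linking nodes), or None if w is not a node."""
    if DUMMY not in values:
        return None  # w was never a from-URL, so it is discarded
    return w, {u for u in values if u is not DUMMY}


raw_pairs = [("A", "b"), ("a", "B"), ("b", "a"),
             ("a", "a"), ("b", "x"), ("c", "a")]
map_outputs = [kv for pair in raw_pairs for kv in map_task(pair)]
reduced = (reduce_task(w, values) for w, values in merge(map_outputs).items())
results = dict(out for out in reduced if out is not None)
```

Neither function looks at more than one key's worth of data at a time, so the same code applies whether the input is six links or a hundred billion.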

36

For the Future: Examples of Tools and Services

The Web Lab is steadily building a set of tools for researchers

• API and Web services

• GetPages Web forms to select dataset by query of a relational database with indexes by date, URL, domain name, file type, anchor text, etc.

• Focused Web crawling (modification of Heritrix crawler)

• Extraction of Web graph from subset and calculations, e.g., PageRank, hubs and authorities

• Graph visualization

• Natural language processing of anchor text
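As an example of the graph calculations listed above, PageRank on an extracted subset can be sketched as a power iteration. This is a small in-memory illustration, not a Web Lab tool: the function signature, the damping factor of 0.85, and the fixed iteration count are assumptions.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Compute PageRank for a graph given as {node: list of unique target nodes}."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[u] for u in nodes if not out_links.get(u))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node passes its rank on in equal shares along its out-links.
        for u, targets in out_links.items():
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this three-node graph, "c" receives links from both other nodes and ends up with the highest rank, while "b" receives only half of "a"'s rank and ends up lowest.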

37

The Web Lab is Ready for Use

We are ready to work with a number of researchers:

Systems

Relational database operational

Hadoop pilot cluster (large cluster soon)

File server and web server operational

People

Manuel Calimlim (database)

Lucy Walle (Hadoop + MapReduce)

Tools

A variety of tools in prototype

Experience with large volumes of anchor text and URLs

38

Thanks

This work would not be possible without the forethought and long-standing commitment of Brewster Kahle and the Internet Archive to capture and preserve the content of the Web for future generations.

This work has been funded in part by the National Science Foundation, grants CNS-0403340, DUE-0127308, SES-0537606, IIS-0634677, and IIS-0705774.
