
1

Algorithms for Large Data Sets

Ziv Bar-Yossef
Lecture 1

March 9, 2005

http://www.ee.technion.ac.il/courses/049011

2

Course Overview
Part 1: Web

• Architecture of search engines
• Information retrieval: index construction, Vector Space Model, evaluation criteria
• Ranking methods: Google’s PageRank, Kleinberg’s Hubs & Authorities
• Spectral methods: Latent Semantic Indexing
• Web measurements: properties of the web graph, random sampling of web pages
• Random graph models for the web
• Rank aggregation

3

Course Overview
Part 2: Algorithms

• Random sampling: mean, quantiles, property testing
• Data streams: distinct elements, Lp distances
• Sketching: shingling, Hamming distance
• Dimension reduction: low distortion embeddings, nearest neighbor schemes
• Complexity of massive data sets: communication complexity, lower bound for frequency moments

No class on 15/6/05

4

Prerequisites

• Algorithms and data structures: complexity analysis, hashing
• Probability theory: conditional probabilities, expectation, variance
• Linear algebra: matrices, eigenvalues, vector spaces
• Combinatorics
• Graph theory

5

Course Requirements & Grading

• Research project: 80% of final grade
• Scribing lecture notes: 10% of final grade (due 1 week after lecture)
• Readings & participation: 10% of final grade

6

Research Projects
Option 1: Applied research project

Identify an interesting research problem in web search / web mining / information retrieval. Propose a solution, implement the solution, run experiments.

Should be done in groups of 3-4.

Deliverables:
• 1 page proposal (due: 20/4/05)
• 10 page final report (due: 29/6/05)
• 20 minute class presentation (29/6/05)

Hopefully, will lead to a paper…

7

Research Projects
Option 2: Papers review

Write a critical review of 2-3 papers on a subject not studied in class.

Should be done individually.

Deliverables:
• Choice of papers to review (due: 1/6/05)
• 5 page review report (due: 29/6/05)
• 20 minute class presentation (29/6/05)

8

Textbooks

• Mining the Web, by Soumen Chakrabarti
• Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
• Randomized Algorithms, by Rajeev Motwani and Prabhakar Raghavan

9

Instructor

Ziv Bar-Yossef
Tel: 5737
Email: zivby@ee
Office hours: Sundays, 15:30-17:30 at 917 Meyer
Mailing list: ee049011s-l

10

Large Data Sets: Examples, Challenges, and Models

11

Examples of Large Data Sets: Astronomy

• Astronomical sky surveys

• 120 Gigabytes/week

• 6.5 Terabytes/year

The Hubble Telescope

12

Examples of Large Data Sets: Genomics

• 25,000 genes in human genome

• 3 billion bases

• 3 Gigabytes of genetic data

13

Examples of Large Data Sets: Phone call billing records

• 250M calls/day

• 60G calls/year

• 40 bytes/call

• 2.5 Terabytes/year

14

Examples of Large Data Sets: Credit card transactions

• 142 billion transactions in 2004 in US alone

• 115 Terabytes of data transmitted to processing center in 2004

15

Examples of Large Data Sets: Internet traffic

Traffic in a typical router:

• 42 Kilobytes/second

• 3.5 Gigabytes/day

• 1.3 Terabytes/year

16

Examples of Large Data Sets: The World-Wide Web

• 8 Billion pages indexed

• 10kB/Page

• 80 Terabytes of indexed text data

• “Deep web” is supposedly 100 times as large

17

Reasons for the Emergence of Large Data Sets: Better technology

Storage & disks:
• Cheaper
• More volume
• Physically smaller
• More efficient

Large data sets are affordable

18

Reasons for the Emergence of Large Data Sets: Better networking

• High speed Internet
• Cellular phones
• Wireless LAN

More data consumers
More data producers

19

Reasons for the Emergence of Large Data Sets: Better IT

More processes are automated:
• E-commerce and V-commerce
• Online and telephone banking
• Online and telephone customer service
• E-learning
• Chats, news, blogs
• Online journals
• Digital libraries

More enterprises are computerized:
• Companies
• Banks
• Governmental institutions
• Universities

More data is available in digital form

World’s yearly production of data: 5 billion Gigabytes

20

Reasons for the Emergence of Large Data Sets: Growing needs

Science:
• Astronomy
• Earth and environmental studies
• Meteorology
• Genetics

Business:
• Billing
• Mining customer data

Intelligence:
• Emails
• Web sites
• Phone calls

Search:
• Web pages
• Images
• Audio & video

More incentive to construct large data sets

21

Characteristics of large data sets

• Huge
• Distributed: dispersed over many servers
• Dynamic: items are added/deleted/modified continuously
• Heterogeneous: many agents access/update the data
• Noisy: inherent, unintentional, or malicious noise
• Unstructured / semi-structured: no database schema

22

New challenges: Restricted access

• Large data sets are kept on magnetic devices
• Access to data is sequential
• Random access is costly

23

New challenges: Stringent efficiency requirements

Traditionally, “efficient” algorithms:
• Run in (small) polynomial time.
• Use linear space.

For large data sets, efficient algorithms:
• Must run in linear or even sub-linear time.
• Must use at most poly-logarithmic space.

24

New challenges: Search the data

Traditionally, input data is:
• Either small, and thus easily searchable, or
• Moderately large, but organized in database tables.

In large data sets, input data is:
• Immense
• Disorganized, unstructured, non-standardized

Hard to find what you want

25

New challenges: Mine the data

• Association rules (“beers and diapers”)
• Patterns
• Statistical data
• Graph structure

26

New challenges: Clean the data

Noise in data distorts:
• Computation results
• Search results
• Mining results

Need automatic methods for “cleaning” the data:
• Spam filters
• Duplicate elimination
• Quality evaluation

27

Abstract Model of Computing

[Diagram: a computer program reads the data x = (x1, x2, …, xn), where n is very large, and outputs an approximation of f(x).]

• Approximation of f(x) is sufficient
• Program can be randomized

Examples: Mean, Parity

28

Models for Computing over Large Data Sets

• Random sampling
• Data streams
• Sketching

29

Random Sampling

[Diagram: the program queries a few items of the data x (n is very large) and outputs an approximation of f(x).]

Examples:
• Mean: O(1) queries
• Parity: n queries
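To make the “Mean: O(1) queries” row concrete, here is a minimal Python sketch (illustrative, not from the original lecture; the function name and parameters are assumptions). For values in [0, 1], Hoeffding's inequality says that about ln(2/δ)/(2ε²) uniform samples give an additive-ε estimate of the mean with probability at least 1 − δ, independent of n.

import math
import random

def estimate_mean(data, epsilon=0.05, delta=0.01):
    """Estimate the mean of [0, 1]-valued data by random sampling.

    The sample size k depends only on the accuracy parameters,
    not on len(data): sub-linear in the strongest sense.
    """
    k = math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
    sample = [data[random.randrange(len(data))] for _ in range(k)]
    return sum(sample) / k

For ε = 0.05 and δ = 0.01 this queries about 1,060 items, whether the data set holds a million items or a trillion.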

30

Random Sampling

Advantages:
• Ultra-efficient: sub-linear running time & space (could even be independent of data set size)

Disadvantages:
• May require random access
• Doesn’t fit many problems
• Hard to sample from disorganized data sets

31

Data Streams

[Diagram: the program streams through the data x (n is very large) using limited memory and outputs an approximation of f(x).]

Examples:
• Mean: O(1) memory
• Parity: 1 bit of memory
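Both stream examples fit in a few lines of Python; the sketch below (illustrative, not from the slides) makes one sequential pass and keeps only constant-size state: a running sum and count for the mean, a single bit for the parity.

def mean_and_parity(stream):
    """One sequential pass over the stream, O(1) state."""
    total, count, parity = 0, 0, 0
    for x in stream:
        total += x        # running sum for the mean
        count += 1
        parity ^= x & 1   # a single bit suffices for parity
    return total / count, parity

print(mean_and_parity(iter([3, 1, 4, 1, 5])))  # (2.8, 0)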

32

Data Streams

Advantages:
• Sequential access
• Limited memory

Disadvantages:
• Running time is at least linear
• Too restricted for some problems

33

Sketching

[Diagram: two data segments, Data1 and Data2 (n is very large), are each compressed into a small sketch; the program computes over Sketch1 and Sketch2 and outputs an approximation of f(Data1, Data2).]

• Compress each data segment into a small “sketch”
• Compute over the sketches

Examples:
• Equality: O(1) size sketch
• Hamming distance: O(1) size sketch
• Lp distance (p > 2): Ω(n^(1-2/p)) size sketch
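As an illustration of the “Equality: O(1) size sketch” row, here is a minimal Python sketch under one stated assumption: a keyed BLAKE2 hash stands in for the random fingerprint function. Each party compresses its segment to 8 bytes using a shared random key; equal segments always produce equal sketches, and unequal segments collide with probability about 2^-64 (heuristically, treating the keyed hash as random).

import hashlib
import os

def fingerprint(segment, key):
    """Constant-size sketch of a (possibly huge) data segment."""
    return hashlib.blake2b(segment, digest_size=8, key=key).digest()

key = os.urandom(16)                 # shared random key
s1 = fingerprint(b"a copy of a huge file", key)
s2 = fingerprint(b"a copy of a huge file", key)
print(s1 == s2)  # True, and only 8 bytes crossed the network

Only the two 8-byte sketches need to be communicated, which is what makes the model attractive for distributed data sets.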

34

Sketching

Advantages:
• Appropriate for distributed data sets
• Useful for “dimension reduction”

Disadvantages:
• Too restricted for some problems
• Usually, at least linear running time

35

Algorithms for Large Data Sets

Sampling:
• Mean and other moments
• Median and other quantiles
• Volume estimations
• Histograms
• Graph problems
• Property testing

Data Streams:
• Distinct elements
• Frequency moments
• Lp distances
• Geometric problems
• Graph problems
• Database problems

Sketching:
• Equality
• Hamming distance
• Edit distance
• Resemblance

36

Relations among the Models

[Diagram: non-adaptive sampling, adaptive sampling, data streams, and sketching, connected by two kinds of arrows: “always derives” and “sometimes derives”.]

37

Web
History and Architecture of Search Engines

38

A Brief History of the Internet

1961: First paper on packet switching (Kleinrock, MIT)
1966: ARPANET (first design of a wide-area computer network)
1969: First packet sent, from UCLA to SRI
1971: First e-mail
1974: Transmission Control Protocol (TCP)
1978: TCP splits into TCP and IP (Internet Protocol)
1979: USENET (newsgroups)
1984: Domain Name Servers (DNS)
1988: First Internet worm
1990: The World-Wide Web (Tim Berners-Lee, CERN)

39

A Brief History of the Web

1945: Hypertext (Vannevar Bush)
1980: Enquire (first hypertext browser)
1990: WorldWideWeb (first web browser)
1991: HTML and HTTP
1993: Mosaic (Marc Andreessen)
1994: First WWW conference
1994: W3C
1994: Lycos (first commercial search engine)
1994: Yahoo! (first web directory, Jerry Yang and David Filo)
1995: AltaVista (DEC)
1997: Google (first link-based search engine, Sergey Brin and Larry Page)

40

Basic Terminology

Hypertext: document connected to other documents by links.

World-Wide Web: corpus of billions of hypertext documents (“pages”) that are stored on computers connected to the Internet.
• Documents are written in HTML
• Documents can be viewed using Web browsers

41

Information Retrieval

Information Retrieval System: a system that allows a user to retrieve documents that match her “information need” from a large corpus.
Example: Get documents about Java, except for ones that are about Java coffee.

Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus.
Example: Get all documents containing the term “Java” but not containing the term “coffee”.
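A toy Python illustration of the data-retrieval query above (the three-document corpus is hypothetical): exact boolean matching over terms, with no notion of relevance or ranking.

corpus = {
    1: "Java virtual machine internals",
    2: "best Java coffee roasts",
    3: "coffee brewing guide",
}
# Exact match: "Java" must occur and "coffee" must not.
hits = [doc for doc, text in corpus.items()
        if "Java" in text and "coffee" not in text]
print(hits)  # [1]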

42

Information Retrieval vs. Data Retrieval

                 Information Retrieval      Data Retrieval
Data             Free text, unstructured    Database tables, structured
Queries          Keywords, free text        SQL, relational algebras
Results          Approximate matches        Exact matches
Ordering         Ordered by relevance       Unordered
Accessibility    Non-expert humans          Knowledgeable users or automatic processes

43

Information Retrieval Systems

[Diagram of an IR system: the user’s query goes to a query processor, which produces a system query; documents from the corpus pass through a text processor, and the tokenized docs go to an indexer, which builds the index (postings); the system query is run against the index, and the retrieved docs are passed to a ranking procedure, which returns the ranked retrieved docs to the user.]
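A minimal illustration of the indexer box above (assuming naive whitespace tokenization; the names are illustrative): it produces the postings, a map from each term to the sorted list of documents that contain it.

from collections import defaultdict

def build_index(docs):
    """Toy indexer: term -> sorted ids of docs containing the term."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # naive tokenizer
            postings[token].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_index({1: "java coffee", 2: "java data streams"})
print(index["java"])  # [1, 2]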

44

Search Engines

[Diagram of a search engine: the same pipeline as the IR system above, except that the corpus is replaced by the Web: a crawler fetches pages into a repository, a global analyzer processes the repository, and the text processor, indexer, index (postings), query processor, and ranking procedure operate as before to return ranked retrieved docs to the user.]

45

Classical IR vs. Web IR

                     Classical IR       Web IR
Volume               Large              Huge
Data quality         Clean, no dups     Noisy, dups
Data change rate     Infrequent         In flux
Data accessibility   Accessible         Partially accessible
Format diversity     Homogeneous        Widely diverse
Documents            Text               HTML
# of matches         Small              Large
IR techniques        Content-based      Link-based

46

End of Lecture 1

47

Basic Terminology

HTML (Hypertext Markup Language): format for writing hypertext documents on the World-Wide Web.

<HTML>
<HEAD>
<TITLE> This is the title of the page </TITLE>
</HEAD>
<BODY BGCOLOR="WHITE">
<CENTER>
<H1> The header of the page </H1>
</CENTER>
Here is an image: <IMG SRC="http://www.cnn.com/logo.jpg">.
And here is a <A HREF="http://www.ee.technion.ac.il">hyperlink</A>.
</BODY>
</HTML>

48

Basic Terminology

HTTP (Hypertext Transport Protocol): protocol used between a Web server and a client (e.g., a Web browser) to transfer documents from the server to the client.

>> GET /http.html HTTP/1.1
>> Host: www.http.header.free.fr
>> Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
>> Accept-Language: Fr
>> Accept-Encoding: gzip, deflate
>> User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)
>> Connection: Keep-Alive

<< HTTP/1.1 200 OK
<< Date: Mon, 12 Mar 2001 19:12:16 GMT
<< Server: Apache/1.3.12 (Unix) Debian/GNU mod_perl/1.24
<< Last-Modified: Fri, 22 Sep 2000 14:16:18
<< ETag: "dd7b6e-d29-39cb69b2"
<< Accept-Ranges: bytes
<< Content-Length: 3369
<< Connection: close
<< Content-Type: text/html
<<
<< <HTML><HEAD>…
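The same exchange can be reproduced in a few lines of Python using only the standard library; this is an illustrative sketch (the host and path are taken from the trace above).

import http.client

conn = http.client.HTTPConnection("www.http.header.free.fr")
conn.request("GET", "/http.html", headers={"Accept-Language": "Fr"})
resp = conn.getresponse()
print(resp.status, resp.reason)        # e.g., 200 OK
print(resp.getheader("Content-Type"))  # e.g., text/html
body = resp.read()                     # the HTML document itself
conn.close()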