Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 1
March 9, 2005
http://www.ee.technion.ac.il/courses/049011
Course Overview, Part 1: The Web
• Architecture of search engines
• Information retrieval: index construction, the Vector Space Model, evaluation criteria
• Ranking methods: Google's PageRank, Kleinberg's Hubs & Authorities
• Spectral methods: Latent Semantic Indexing
• Web measurements: properties of the web graph, random sampling of web pages
• Random graph models for the web
• Rank aggregation
Course Overview, Part 2: Algorithms
• Random sampling: mean, quantiles, property testing
• Data streams: distinct elements, Lp distances
• Sketching: shingling, Hamming distance
• Dimension reduction: low-distortion embeddings, nearest neighbor schemes
• Complexity of massive data sets: communication complexity, lower bound for frequency moments

No class on 15/6/05
Prerequisites
• Algorithms and data structures: complexity analysis, hashing
• Probability theory: conditional probabilities, expectation, variance
• Linear algebra: matrices, eigenvalues, vector spaces
• Combinatorics
• Graph theory
Course Requirements & Grading
• Research project: 80% of final grade
• Scribing lecture notes: 10% of final grade (due 1 week after the lecture)
• Readings & participation: 10% of final grade
Research Projects, Option 1: Applied research project
• Identify an interesting research problem in web search / web mining / information retrieval. Propose a solution, implement the solution, run experiments.
• Should be done in groups of 3-4.
• Deliverables:
  • 1-page proposal (due 20/4/05)
  • 10-page final report (due 29/6/05)
  • 20-minute class presentation (29/6/05)
• Hopefully, will lead to a paper…
Research Projects, Option 2: Papers review
• Write a critical review of 2-3 papers on a subject not studied in class.
• Should be done individually.
• Deliverables:
  • Choice of papers to review (due 1/6/05)
  • 5-page review report (due 29/6/05)
  • 20-minute class presentation (29/6/05)
Textbooks
• Mining the Web, by Soumen Chakrabarti
• Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
• Randomized Algorithms, by Rajeev Motwani and Prabhakar Raghavan
Instructor
Ziv Bar-Yossef
Tel: 5737
Email: zivby@ee
Office hours: Sundays, 15:30-17:30, at 917 Meyer
Mailing list: ee049011s-l
Examples of Large Data Sets: Astronomy
• Astronomical sky surveys
• 120 Gigabytes/week
• 6.5 Terabytes/year
[Image: the Hubble Telescope]
Examples of Large Data Sets: Genomics
• 25,000 genes in the human genome
• 3 billion bases
• 3 Gigabytes of genetic data
Examples of Large Data Sets: Phone call billing records
• 250M calls/day
• 60G calls/year
• 40 bytes/call
• 2.5 Terabytes/year
Examples of Large Data Sets: Credit card transactions
• 142 billion transactions in 2004 in the US alone
• 115 Terabytes of data transmitted to the processing center in 2004
Examples of Large Data Sets: Internet traffic
Traffic in a typical router:
• 42 Kilobytes/second
• 3.5 Gigabytes/day
• 1.3 Terabytes/year
Examples of Large Data Sets: The World-Wide Web
• 8 billion pages indexed
• 10kB/page
• 80 Terabytes of indexed text data
• The "deep web" is supposedly 100 times as large
Reasons for the Emergence of Large Data Sets: Better technology
Storage & disks:
• Cheaper
• More volume
• Physically smaller
• More efficient
Large data sets are affordable
Reasons for the Emergence of Large Data Sets: Better networking
• High-speed Internet
• Cellular phones
• Wireless LAN
More data consumers, more data producers
Reasons for the Emergence of Large Data Sets: Better IT
More processes are automatic:
• E-commerce and V-commerce
• Online and telephone banking
• Online and telephone customer service
• E-learning
• Chats, news, blogs
• Online journals
• Digital libraries
More enterprises are computerized:
• Companies
• Banks
• Governmental institutions
• Universities
More data is available in digital form.
World's yearly production of data: 5 billion Gigabytes
Reasons for the Emergence of Large Data Sets: Growing needs
• Science: astronomy, earth and environmental studies, meteorology, genetics
• Business: billing, mining customer data
• Intelligence: emails, web sites, phone calls
• Search: web pages, images, audio & video
More incentive to construct large data sets
Characteristics of large data sets
• Huge
• Distributed: dispersed over many servers
• Dynamic: items are added/deleted/modified continuously
• Heterogeneous: many agents access/update the data
• Noisy: inherent, unintentional, malicious
• Unstructured / semi-structured: no database schema
New challenges: Restricted access
• Large data sets are kept on magnetic devices
• Access to the data is sequential
• Random access is costly
New challenges: Stringent efficiency requirements
Traditionally, "efficient" algorithms:
• Run in (small) polynomial time
• Use linear space
For large data sets, efficient algorithms:
• Must run in linear or even sub-linear time
• Must use at most poly-logarithmic space
New challenges: Search the data
Traditionally, input data is:
• Either small, and thus easily searchable
• Or moderately large, but organized in database tables
In large data sets, input data is:
• Immense
• Disorganized, unstructured, non-standardized
Hard to find what you want
New challenges: Mine the data
• Association rules ("beers and diapers")
• Patterns
• Statistical data
• Graph structure
New challenges: Clean the data
Noise in the data distorts:
• Computation results
• Search results
• Mining results
Need automatic methods for "cleaning" the data:
• Spam filters
• Duplicate elimination
• Quality evaluation
Abstract Model of Computing
[Diagram: a computer program accesses a data set x = (x1, …, xn), where n is very large, and outputs an approximation of f(x).]
• An approximation of f(x) is sufficient
• The program can be randomized
Examples: Mean, Parity
Random Sampling
[Diagram: the program queries a few data items from the data set (n is very large) and outputs an approximation of f(x).]
Examples:
• Mean: O(1) queries
• Parity: n queries
Random Sampling
Advantages:
• Ultra-efficient: sub-linear running time & space (could even be independent of the data set size)
Disadvantages:
• May require random access
• Doesn't fit many problems
• Hard to sample from disorganized data sets
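To make the Mean vs. Parity contrast concrete, here is a small illustrative sketch in Python (my own example, not from the lecture): the mean of a 0/1 data set can be estimated from O(1/ε²) uniform random queries, independent of n, whereas flipping any single bit flips the parity, so parity cannot be approximated with fewer than n queries.

```python
import random

def sampled_mean(data, k, seed=None):
    """Estimate the mean of `data` from k uniformly random queries.
    By the Hoeffding bound, k = O(1/eps^2) queries give an
    additive-eps approximation, independent of len(data)."""
    rng = random.Random(seed)
    n = len(data)
    return sum(data[rng.randrange(n)] for _ in range(k)) / k

def parity(data):
    """Parity has no sublinear sampling algorithm: changing any one
    bit changes the answer, so all n items must be read."""
    p = 0
    for x in data:
        p ^= x & 1
    return p

data = [1, 0] * 500_000          # true mean is exactly 0.5
estimate = sampled_mean(data, k=4000, seed=17)
assert abs(estimate - 0.5) < 0.05   # holds with overwhelming probability
assert parity(data) == 0
```

Note that the number of queries (4000) does not grow with the data set size, which is exactly the appeal of the sampling model.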
Data Streams
[Diagram: the program streams through the data set (n is very large) using limited memory, and outputs an approximation of f(x).]
Examples:
• Mean: O(1) memory
• Parity: 1 bit of memory
Data Streams
Advantages:
• Sequential access
• Limited memory
Disadvantages:
• Running time is at least linear
• Too restricted for some problems
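The two streaming examples above can be coded directly; this is an illustrative sketch of mine, not the lecture's code. The mean needs one pass and two counters (O(1) memory words), and parity needs a single bit:

```python
def stream_mean(stream):
    """One pass over the stream, O(1) memory: a count and a running sum."""
    count, total = 0, 0.0
    for x in stream:
        count += 1
        total += x
    return total / count

def stream_parity(stream):
    """One pass over the stream, a single bit of memory."""
    bit = 0
    for x in stream:
        bit ^= x & 1
    return bit

print(stream_mean(iter(range(101))))   # 50.0
print(stream_parity(iter([1, 1, 1])))  # 1
```

Unlike sampling, both routines read every item, so the running time is at least linear, matching the disadvantage noted above.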
Sketching
[Diagram: each data segment (Data1, Data2; n is very large) is compressed into a small "sketch" (Sketch1, Sketch2); an approximation of f is computed over the sketches.]
Examples:
• Equality: O(1)-size sketch
• Hamming distance: O(1)-size sketch
• Lp distance (p > 2): Ω(n^(1-2/p))-size sketch
Sketching
Advantages:
• Appropriate for distributed data sets
• Useful for "dimension reduction"
Disadvantages:
• Too restricted for some problems
• Usually at least linear running time
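As one concrete instance of the model, here is a sketch for Equality (my own illustration; the function names and constants are assumptions, not from the lecture). Each party compresses its segment into a constant-size polynomial fingerprint using shared randomness; equality is then decided by comparing fingerprints, with a small false-positive probability.

```python
import random

MERSENNE_P = (1 << 61) - 1   # prime modulus for the fingerprint

def sketch(segment, seed):
    """Compress a data segment into an O(1)-size fingerprint.

    Evaluates the segment, viewed as a polynomial, at a random point r
    derived from the shared seed. Equal segments always agree; distinct
    segments of length m collide with probability at most m / MERSENNE_P."""
    r = random.Random(seed).randrange(2, MERSENNE_P)
    acc = 0
    for x in segment:
        acc = (acc * r + x + 1) % MERSENNE_P   # +1 so zero items contribute
    return acc

a = list(range(100_000))
b = list(range(100_000))
c = a[:]
c[12345] ^= 1                  # a single-bit difference

assert sketch(a, seed=7) == sketch(b, seed=7)   # equal data, equal sketches
assert sketch(a, seed=7) != sketch(c, seed=7)   # difference detected w.h.p.
```

The point of the model: the parties exchange only the constant-size fingerprints, never the segments themselves, which is what makes sketching attractive for distributed data sets.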
Algorithms for Large Data Sets

Sampling:
• Mean and other moments
• Median and other quantiles
• Volume estimations
• Histograms
• Graph problems
• Property testing

Data Streams:
• Distinct elements
• Frequency moments
• Lp distances
• Geometric problems
• Graph problems
• Database problems

Sketching:
• Equality
• Hamming distance
• Edit distance
• Resemblance
Relations among the Models
[Diagram: non-adaptive sampling, adaptive sampling, data streams, and sketching, connected by arrows, some labeled "always derives" and some "sometimes derives".]
A Brief History of the Internet
• 1961: First paper on packet switching (Kleinrock, MIT)
• 1966: ARPANET (first design of a wide-area computer network)
• 1969: First packet sent, from UCLA to SRI
• 1971: First e-mail
• 1974: Transmission Control Protocol (TCP)
• 1978: TCP splits into TCP and IP (Internet Protocol)
• 1979: USENET (newsgroups)
• 1984: Domain Name Servers (DNS)
• 1988: First Internet worm
• 1990: The World-Wide Web (Tim Berners-Lee, CERN)
A Brief History of the Web
• 1945: Hypertext (Vannevar Bush)
• 1980: Enquire (first hypertext browser)
• 1990: WorldWideWeb (first web browser)
• 1991: HTML and HTTP
• 1993: Mosaic (Marc Andreessen)
• 1994: First WWW conference
• 1994: W3C
• 1994: Lycos (first commercial search engine)
• 1994: Yahoo! (first web directory, Jerry Yang and David Filo)
• 1995: AltaVista (DEC)
• 1997: Google (first link-based search engine, Sergey Brin and Larry Page)
Basic Terminology
• Hypertext: a document connected to other documents by links.
• World-Wide Web: a corpus of billions of hypertext documents ("pages") stored on computers connected to the Internet.
  • Documents are written in HTML
  • Documents can be viewed using Web browsers
Information Retrieval
• Information Retrieval System: a system that allows a user to retrieve documents that match her "information need" from a large corpus.
  Example: Get documents about Java, except for ones that are about Java coffee.
• Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus.
  Example: Get all documents containing the term "Java" but not containing the term "coffee".
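The data-retrieval example is an exact boolean query, and a toy sketch makes its all-or-nothing semantics explicit (the corpus and function names below are my own illustration, not from the lecture):

```python
docs = {
    1: "the Java programming language and the JVM",
    2: "brewing strong Java coffee on the island of Java",
    3: "a tutorial on Java generics",
    4: "coffee prices in South America",
}

def data_retrieval(docs, must, must_not):
    """Return ALL documents matching the exact boolean predicate, unordered."""
    return {
        doc_id for doc_id, text in docs.items()
        if must in text.lower() and must_not not in text.lower()
    }

assert data_retrieval(docs, "java", "coffee") == {1, 3}
```

An information retrieval system, by contrast, must judge relevance: it should rank a page about the Java language above a page about Java coffee even when the query terms match both only approximately.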
Information Retrieval vs. Data Retrieval

               Information Retrieval     Data Retrieval
Data           Free text, unstructured   Database tables, structured
Queries        Keywords, free text       SQL, relational algebras
Results        Approximate matches       Exact matches
Result order   Ordered by relevance      Unordered
Accessibility  Non-expert humans         Knowledgeable users or automatic processes
Information Retrieval Systems
[Diagram: an IR system. The user's query passes through a query processor, producing a system query against the index. On the other side, the corpus passes through a text processor; the resulting tokenized docs are fed to an indexer, which builds the index (postings). Retrieved docs are ordered by a ranking procedure and returned to the user as ranked retrieved docs.]
Search Engines
[Diagram: a search engine has the same components as an IR system (query processor, text processor, indexer, index, postings, ranking procedure), except that the corpus is replaced by the Web, fetched by a crawler into a repository, with a global analyzer feeding the pipeline.]
Classical IR vs. Web IR

                    Classical IR      Web IR
Volume              Large             Huge
Data quality        Clean, no dups    Noisy, dups
Data change rate    Infrequent        In flux
Data accessibility  Accessible        Partially accessible
Format diversity    Homogeneous       Widely diverse
Documents           Text              HTML
# of matches        Small             Large
IR techniques       Content-based     Link-based
Basic Terminology
HTML (Hypertext Markup Language): the format for writing hypertext documents on the World-Wide Web.

<HTML>
  <HEAD>
    <TITLE> This is the title of the page </TITLE>
  </HEAD>
  <BODY BGCOLOR="WHITE">
    <CENTER>
      <H1> The header of the page </H1>
    </CENTER>
    Here is an image: <IMG SRC="http://www.cnn.com/logo.jpg">.
    And here is a <A HREF="http://www.ee.technion.ac.il">hyperlink</A>.
  </BODY>
</HTML>
Basic Terminology
HTTP (Hypertext Transfer Protocol): the protocol used between a Web server and a client (e.g., a Web browser) to transfer documents from the server to the client.

>> GET /http.html HTTP/1.1
>> Host: www.http.header.free.fr
>> Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
>> Accept-Language: Fr
>> Accept-Encoding: gzip, deflate
>> User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)
>> Connection: Keep-Alive

<< HTTP/1.1 200 OK
<< Date: Mon, 12 Mar 2001 19:12:16 GMT
<< Server: Apache/1.3.12 (Unix) Debian/GNU mod_perl/1.24
<< Last-Modified: Fri, 22 Sep 2000 14:16:18
<< ETag: "dd7b6e-d29-39cb69b2"
<< Accept-Ranges: bytes
<< Content-Length: 3369
<< Connection: close
<< Content-Type: text/html
<<
<< <HTML><HEAD>…