og/le optimal guesswork/luck-based engine


Circa 1300 BC:

Ten Commandments (1956)

Circa 1971:

Courtesy of “An Atlas of Cyberspaces” (http://www.cybergeography.org/atlas/historical.html)

Circa 1999:

So, why do we need search engines?

o The web is too big.

o There is too much irrelevant information.

o Search engines bring order to this chaos-filled land.

What does a search engine require?

o Know the Data

o Store the Data

o Retrieve the Results

o Order the Results

Anatomy of Streaker

[Diagram: Streaker's components: To Do, Page Retriever, URL Stream List, Quarantine, Main, Parser, The Database, Link Cache, Word Cache]

http://www.carleton.edu


Playing nicely with the network

o Is the server responding?

o Is the server overloaded?

o How much info are we requesting?

o How fast are we sending our requests?

Throttling Streaker

Formulas:

Pause Time: DELAY * 2

Ave. Delay: DELAY + (lastFetchTime – DELAY) * .5

Example (Streaker fetching from a web server):

Fetch takes 1 second: DELAY = 1, then pause 2 seconds

Fetch takes 3 seconds: DELAY = 2, then pause 4 seconds

Fetch takes 4 seconds: DELAY = 3
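The two throttling formulas can be tried out in a few lines of Python. This is only a sketch: the function names are made up for illustration, and seeding DELAY with the first fetch time is an assumption (the slide only shows DELAY = 1 after a 1 second fetch).

```python
def next_delay(delay, last_fetch_time):
    """Ave. Delay: DELAY + (lastFetchTime - DELAY) * .5"""
    return delay + (last_fetch_time - delay) * 0.5


def pause_time(delay):
    """Pause Time: DELAY * 2"""
    return delay * 2


def simulate(fetch_times):
    """Return a (DELAY, pause) pair after each fetch.

    Assumption: the very first fetch time is used as the initial DELAY.
    """
    delay = None
    history = []
    for fetch_time in fetch_times:
        if delay is None:
            delay = fetch_time
        else:
            delay = next_delay(delay, fetch_time)
        history.append((delay, pause_time(delay)))
    return history


if __name__ == "__main__":
    # Fetch times of 1, 3, and 4 seconds, as in the slide's example.
    for delay, pause in simulate([1, 3, 4]):
        print(f"DELAY = {delay:g}, pause for {pause:g} seconds")
```

A slow server pushes DELAY up, so the crawler waits longer before its next request; a fast server pulls it back down.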


Before

<html><head><title>og/le</title></head>
<body>
<table width="95%"><tr>
<td>&nbsp;&nbsp;&nbsp;&nbsp;</td>
<td> <font size="+4"><b>og/le</b></font><br> <font size="+1">optimal guesswork/luck-based engine</font></td>
<td align="right"> <font size="+3"> : carleton search</font></td>
</tr></table>
<br><br>
<center>
<a href="http://dictionary.reference.com/search?q=ogle">About</a> &nbsp;&nbsp;&nbsp;
<a href="instructions.html">Instructions for Testers</a> &nbsp;&nbsp;&nbsp;
<a href="stats.php">Statistics</a>
<br><br>
<form name="Ogle"> <input type="text" name="query" size="50" /><input type='hidden' name='pagenum' value='1'> <br><input type='submit' value='Ogle Carleton' /> </form>
</center>
<br><center><font size="-1"><p>ogling 25,49 pages</font> </center> </p><p><center><font size="-1"> <img src="streaker.gif"><br>Powered by Streaker<br><br> &copy; 2004 Josh Allen, Andrew Drummer, Brendan Foote, Aaron Miller, Mike Ottum</font></center></p>
</body></html>

After

Page object: Page Text, Page Header, Page URL, Etc…

Word Object(s): The Word, Word Position, Other info

Link Object(s): Link URL, Link Position, Page URL, Link text
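To make the before/after picture concrete, here is a rough Python sketch of turning raw HTML into Page, Word, and Link objects. It is not Streaker's actual parser: the class names, fields, and the set of tracked tags are assumptions chosen to mirror the object lists above, and it leans on the standard library's `html.parser`, which tolerates sloppy markup.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Word:
    word: str
    position: int   # 1-based word position on the page
    tags: tuple     # tracked tags enclosing the word, e.g. ("b", "i")


@dataclass
class Link:
    url: str        # link target
    position: int   # position of the first word of the link text
    page_url: str   # page the link was found on
    text: str = ""


@dataclass
class Page:
    url: str
    words: list = field(default_factory=list)
    links: list = field(default_factory=list)


class SketchParser(HTMLParser):
    TRACKED = {"title", "b", "i"}  # hypothetical choice of "important" tags

    def __init__(self, page_url):
        super().__init__()
        self.page = Page(page_url)
        self.open_tags = []
        self.current_link = None

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.open_tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                position = len(self.page.words) + 1
                self.current_link = Link(href, position, self.page.url)
                self.page.links.append(self.current_link)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Tolerate badly nested HTML: drop the most recent matching open tag.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]
        if tag == "a":
            self.current_link = None

    def handle_data(self, data):
        for token in data.split():
            position = len(self.page.words) + 1
            self.page.words.append(Word(token.lower(), position, tuple(self.open_tags)))
            if self.current_link is not None:
                self.current_link.text = (self.current_link.text + " " + token).strip()


def parse(page_url, html):
    parser = SketchParser(page_url)
    parser.feed(html)
    parser.close()
    return parser.page
```

Calling `parse(url, html)` on the og/le front page would yield one Word per token and one Link for each of the About, Instructions, and Statistics anchors.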

Brief HTML Introduction

Which elements of a page are important?

o Text
   o Individual Words
      o Position
      o Tag Information

o Links
   o Link target
   o Link text

Parsing Challenges

o Identical pages with different URLs
   o Especially common with dynamically-generated pages
   o Solution: Compute a checksum as we parse and then compare it to previously seen pages
      o CRC-32 Checksum Algorithm

o HTML is not a strict language
   o The Parser must be flexible enough to allow for many different types of coding, especially in tags.
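The checksum idea might look something like this in Python, using `zlib.crc32` from the standard library. The helper names and the dictionary of seen checksums are illustrative assumptions, not Streaker's own code. Note that CRC-32 is only 32 bits wide, so two different pages can occasionally share a checksum.

```python
import zlib


def page_checksum(chunks):
    """Compute a running CRC-32 over the text chunks as they are parsed."""
    crc = 0
    for chunk in chunks:
        # Passing the previous value continues the same checksum.
        crc = zlib.crc32(chunk.encode("utf-8"), crc)
    return crc


def is_duplicate(url, chunks, seen):
    """Return True if a page with the same checksum was already recorded.

    `seen` maps checksum -> first URL that produced it (hypothetical structure).
    """
    crc = page_checksum(chunks)
    if crc in seen:
        return True
    seen[crc] = url
    return False
```

Because the checksum is updated chunk by chunk, it can be computed during parsing without holding a second copy of the page.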


Pages Indexed: 54,752

Fetch Errors: 43,862


Unique is Good

Word            Word ID
philanderer     251
philanthropist  252
philanderer     253
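One way to keep each word down to a single ID is to check an in-memory cache before handing out a new number. The sketch below is an assumption about how such a cache could work (the class name and starting ID are hypothetical); it is not taken from Streaker's Word Cache.

```python
class WordCache:
    """Map each distinct word to exactly one ID."""

    def __init__(self, first_id=1):
        self._ids = {}
        self._next_id = first_id

    def word_id(self, word):
        # Only assign a new ID the first time a word is seen.
        if word not in self._ids:
            self._ids[word] = self._next_id
            self._next_id += 1
        return self._ids[word]

    def __len__(self):
        return len(self._ids)
```

With this in place, a second "philanderer" gets 251 again rather than a fresh 253.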

MySQL queries take a long time!

We have MANY queries to make.

Our current database contains

206,493 unique words

54,752 unique urls

Google stores the complete text of 6 billion web pages in memory

THEN: 459 pages/hour

NOW: 3422 pages/hour

MySQL

Brief Databases Introduction

o Why use databases?
   o Data to store is too big for main memory
   o Optimize disk accesses through intelligent organization of data

o Relational Database Model
   o Data is stored in tables according to relationships
   o Data is retrieved using Structured Query Language (SQL)

Relational Example

o Relate Words to Pages

o Information that we care about:
   o word (string)
   o url (string)
   o position (integer)
   o HTML tag attributes (set)

The Non-Relational Way

word     url                             pos  tags
college  http://www.carleton.edu         1    <b>
a        http://www.mathcs.carleton.edu  3    <b,i>
college  http://www.carleton.edu         4    <>

Why is this method bad?

o Wasted space
   o The word “college” and the URL “http://www.carleton.edu” appear twice in this example
   o In our actual crawl, the word “carleton” appears 85,496 times

o String comparisons are slow

Our Database Tables - Word

wid word

1 carleton

2 college

3 is

4 a

5 great

6 place

7 fhqwhgads

Our Database Tables - URL

urlid url

1 http://www.carleton.edu

2 http://www.mathcs.carleton.edu

3 http://www.carleton.edu/student/

4 http://violet.mathcs.carleton.edu/ogle/search.php

WordToURL Table Captures a Relation

o Relates Word entries to URL entries

wid urlid pos tags

2 1 1 <b>

4 2 3 <b,i>

2 1 4 <>

Executing a join Operation

o Combine the information from multiple tables to produce something meaningful

[Diagram: the Word Table (wid, word string), the WordToURL Table (wid, urlid, pos, tags), and the URL Table (urlid, url string) are combined one step at a time into the Desired Output]

Join Result

word     url               pos  tags
college  www.carleton.edu  1    <b>
college  www.carleton.edu  4    <>
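For readers who want to run the join themselves, here is a small sketch using Python's built-in `sqlite3` module in place of MySQL (an assumption made so the example is self-contained). The table and column names follow the slides; the `lookup` helper and the exact SQL are illustrative, not the project's actual queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Word (wid INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE URL (urlid INTEGER PRIMARY KEY, url TEXT UNIQUE);
CREATE TABLE WordToURL (wid INTEGER, urlid INTEGER, pos INTEGER, tags TEXT);
""")

# Sample rows copied from the Word, URL, and WordToURL tables in the slides.
conn.executemany("INSERT INTO Word VALUES (?, ?)", [
    (1, "carleton"), (2, "college"), (3, "is"), (4, "a"),
    (5, "great"), (6, "place"), (7, "fhqwhgads"),
])
conn.executemany("INSERT INTO URL VALUES (?, ?)", [
    (1, "http://www.carleton.edu"),
    (2, "http://www.mathcs.carleton.edu"),
    (3, "http://www.carleton.edu/student/"),
    (4, "http://violet.mathcs.carleton.edu/ogle/search.php"),
])
conn.executemany("INSERT INTO WordToURL VALUES (?, ?, ?, ?)", [
    (2, 1, 1, "<b>"), (4, 2, 3, "<b,i>"), (2, 1, 4, "<>"),
])


def lookup(connection, word):
    """Join the three tables to list every occurrence of one word."""
    query = """
        SELECT w.word, u.url, r.pos, r.tags
        FROM Word AS w
        JOIN WordToURL AS r ON r.wid = w.wid
        JOIN URL AS u ON u.urlid = r.urlid
        WHERE w.word = ?
        ORDER BY r.pos
    """
    return connection.execute(query, (word,)).fetchall()


if __name__ == "__main__":
    for row in lookup(conn, "college"):
        print(row)
```

The integer IDs are compared inside the join, so the slow string comparison happens only once, when the query word is matched against the Word table.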

Heuristics

o Tools by which we return search results

o Must be accurate

o Must be fast

Problem: In general, the more complex a heuristic is, the slower it performs.

How heuristics work

o Obtain search query from user
o Use query to “pull out” relevant data
o Use data to retrieve all relevant pages
o Use specific heuristic to order pages
o Output ordered pages to user

Basic Heuristics

o Word Occurrence
   o Pages ordered by the number of times the words in the query appear on the page

o Frequency
   o Pages ordered by the number of times words in the query appear over the total number of words on the page

o Proximity
   o Pages ordered by the number of times words in the query appear in the same order on the page
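The three basic heuristics are simple enough to sketch directly. The Python below is an illustration, not og/le's implementation: pages are assumed to be lists of lowercase words, and "proximity" is interpreted here as the number of times the whole query appears as a consecutive phrase, which is only one possible reading of the slide.

```python
def occurrence_score(query, page_words):
    """Number of times the query words appear on the page."""
    return sum(page_words.count(word) for word in query)


def frequency_score(query, page_words):
    """Occurrences divided by the total number of words on the page."""
    if not page_words:
        return 0.0
    return occurrence_score(query, page_words) / len(page_words)


def proximity_score(query, page_words):
    """Number of times the query words appear consecutively, in order."""
    n = len(query)
    if n == 0:
        return 0
    return sum(
        1
        for i in range(len(page_words) - n + 1)
        if page_words[i:i + n] == query
    )


def rank(query, pages, score):
    """Order page URLs from best to worst under the given heuristic."""
    return sorted(pages, key=lambda url: score(query, pages[url]), reverse=True)


# Hypothetical two-page corpus for trying the heuristics out.
PAGES = {
    "page_a": "carleton college is a great place".split(),
    "page_b": "college college college".split(),
}
QUERY = ["carleton", "college"]
```

On this toy corpus the heuristics disagree: raw occurrence favors the page that repeats "college", while proximity favors the page that actually contains the phrase.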

Meta Heuristics

o Tags
   o Words on a page are weighted depending on their HTML tags
   o Pages are ordered by the sum of the weighted words that appear on the page
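A tag-weighted score could be sketched as follows. The weights are invented for illustration (the slides do not give og/le's actual values), and taking the largest weight among a word's enclosing tags is likewise an assumption.

```python
# Hypothetical weights: words in these tags count for more than plain text.
TAG_WEIGHTS = {"title": 5.0, "b": 2.0, "i": 1.5}


def word_weight(tags):
    """Weight of one word: the largest weight among its tags, or 1.0 if untagged."""
    return max((TAG_WEIGHTS.get(tag, 1.0) for tag in tags), default=1.0)


def tag_score(query, tagged_words):
    """Sum the weights of every query word that appears on the page.

    `tagged_words` is a list of (word, tags) pairs, e.g. ("college", ("b",)).
    """
    return sum(word_weight(tags) for word, tags in tagged_words if word in query)
```

For example, a page containing "college" once in bold and once in plain text scores 2.0 + 1.0 = 3.0 for the query "college" under these made-up weights.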

Ultimate Heuristic

A combination of data and context

o frequency
o proximity
o tag heuristics

o Rank of pages factored into heuristic

Problem: Using all these factors slows down the search process

Vector Space Models

A table with relationships between terms and documents:

       doc1  doc2  doc3
Term1  1     17    20
Term2  0     0     5
Term3  7     0     2

Now consider the table to be a matrix.

Then

o The columns can be seen as document vectors

o The terms serve as a basis for the vector space

o We can compare documents using vector functions

Comparing Vectors

Recall:

$\cos\theta = \dfrac{a \cdot b}{\|a\|\,\|b\|}$

If we set a threshold on $\cos\theta$, we find the set of vectors that are within a cone around a.
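As a quick illustration of comparing vectors by the cosine of the angle between them, the sketch below reuses the 3-term, 3-document table from the Vector Space Models slide. The query vector, the threshold value, and the function names are assumptions made for the example.

```python
import math

DOC_NAMES = ["doc1", "doc2", "doc3"]
# Rows are Term1..Term3, columns are doc1..doc3 (from the slide's table).
MATRIX = [
    [1, 17, 20],
    [0, 0, 5],
    [7, 0, 2],
]


def column(matrix, j):
    """Extract document j as a vector over the terms."""
    return [row[j] for row in matrix]


def cosine(a, b):
    """cos(theta) = (a . b) / (|a| |b|); defined as 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def within_cone(query, matrix, threshold):
    """Names of the documents whose cosine with the query meets the threshold."""
    return [
        DOC_NAMES[j]
        for j in range(len(matrix[0]))
        if cosine(query, column(matrix, j)) >= threshold
    ]


if __name__ == "__main__":
    query = [1, 0, 0]  # hypothetical query containing only Term1
    for j, name in enumerate(DOC_NAMES):
        print(name, round(cosine(query, column(MATRIX, j)), 3))
```

Because the cosine ignores vector length, doc2 (17 occurrences of Term1 and nothing else) matches a Term1-only query perfectly, while doc1 is dominated by Term3 and falls outside a tight cone.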

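Normalizing the Data

Since the length of the document vectors and the values both affect this calculation, we can do some pre-processing to help the heuristic.

Here $f_{ij}$ is the number of times term i appears in document j, and $\chi(f)$ is 1 if $f > 0$ and 0 otherwise.

Local Term Weighting Schemes ($l_{ij}$):
Binary: $\chi(f_{ij})$
Term frequency: $f_{ij}$
Logarithmic: $\log(1 + f_{ij})$
Augmented Normalized: $\left(\chi(f_{ij}) + f_{ij} / \max_k f_{kj}\right) / 2$

Global Term Weighting Schemes ($g_i$):
Normal: $\left(\sum_j f_{ij}^2\right)^{-1/2}$

Document Normalization Schemes:
Cosine: $\left(\sum_i (g_i\, l_{ij})^2\right)^{-1/2}$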
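The weighting schemes can be combined into one pre-processing pass over the term-document matrix. The Python below is a sketch under stated assumptions: the binary scheme is taken to mean 1 when a term occurs and 0 otherwise, "log" is the natural logarithm, the global weight is the "Normal" scheme (one over the length of the term's row), and every name is made up for illustration.

```python
import math


def chi(f):
    """1 if the term occurs at all, else 0."""
    return 1 if f > 0 else 0


def local_weight(f, max_f, scheme):
    """Local weight of a term with frequency f in a document whose largest frequency is max_f."""
    if scheme == "binary":
        return chi(f)
    if scheme == "tf":
        return f
    if scheme == "log":
        return math.log(1 + f)
    if scheme == "augmented":
        return (chi(f) + f / max_f) / 2 if max_f else 0.0
    raise ValueError(f"unknown scheme: {scheme}")


def normal_global_weight(row):
    """Normal global weight: 1 / sqrt(sum of squared frequencies of the term)."""
    norm = math.sqrt(sum(f * f for f in row))
    return 1 / norm if norm else 0.0


def weight_matrix(freqs, scheme="log"):
    """Apply local weight, global weight, then cosine document normalization."""
    n_terms, n_docs = len(freqs), len(freqs[0])
    col_max = [max(row[j] for row in freqs) for j in range(n_docs)]
    g = [normal_global_weight(row) for row in freqs]
    weighted = [
        [g[i] * local_weight(freqs[i][j], col_max[j], scheme) for j in range(n_docs)]
        for i in range(n_terms)
    ]
    for j in range(n_docs):
        length = math.sqrt(sum(weighted[i][j] ** 2 for i in range(n_terms)))
        if length:
            for i in range(n_terms):
                weighted[i][j] /= length
    return weighted
```

After cosine normalization every non-empty document vector has length 1, so long pages no longer win simply by containing more words.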

Latent Semantic Indexing

Matrix Decomposition

If the matrix A has rank k, we can represent the matrix using k column vectors.

This has the effect of smooshing together like documents, creating relationships between terms that do not appear on the same page.

Example: if a user searches for “Samuel Clemens”, the terms appear on the same page as “Mark Twain” often enough that documents only containing “Mark Twain” will match.

o Heuristics concerning text

o Heuristics concerning the context of the text

o Heuristics concerning the context of the pages

Page Rank

What makes Page Rank different?

o Link-based
o Independent of search terms
o Fewer database queries during search
o Patented

Example Network

[Diagram: a network of five linked pages: A, B, C, D, and E]

Ranking a page

What do you need to rank a page?

o Pages that link to your page
o The ranks of those pages
o The links on those pages

o Rank = 0.15 + 0.85 * Σ(Ri/Li)

o Fifty iterations
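The ranking formula can be iterated with a short loop. In the sketch below, Ri is read as the current rank of a page that links to the page being ranked and Li as the number of links on that page. The link graph is hypothetical (the edges of the example network are not recoverable from the slides), and all ranks are updated together at the end of each pass, which may differ from how the slide's numbers were produced.

```python
def page_rank(links, iterations=50):
    """Iterate Rank = 0.15 + 0.85 * sum(Ri / Li) over a link graph.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    ranks = {page: 1.0 for page in pages}  # every page starts at 1.0

    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = 0.15 + 0.85 * incoming
        ranks = new_ranks  # synchronous update (an assumption)
    return ranks


# Hypothetical five-page network; these edges are NOT taken from the slides.
EXAMPLE_LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],
    "E": ["A", "D"],
}

if __name__ == "__main__":
    for page, rank in sorted(page_rank(EXAMPLE_LINKS).items()):
        print(page, round(rank, 2))
```

Since the ranks do not depend on the search terms, they can be computed once after a crawl and simply looked up at query time.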

Before Ranking

[Diagram: pages A, B, C, D, and E, each with a starting rank of 1.0]

Total = 5.00

After First Iteration

[Diagram: pages A, B, C, D, and E; ranks shown: 1.85, 0.97, 1.28, 0.77, 0.54]

Total = 5.41

After Second Iteration

[Diagram: pages A, B, C, D, and E; ranks shown: 1.66, 1.05, 1.26, 0.72, 0.50]

Total = 5.19

After Fifth Iteration

[Diagram: pages A, B, C, D, and E; ranks shown: 1.62, 0.49, 1.23, 0.70, 0.49]

Total = 5.06

And in Conclusion . . .

How did we do?

og/le makes your laundry whiter than any other leading brand!

Competitors:

Google (the big boys)

ht://Dig (Carleton’s current search engine)

Search times:

Query          Google       ht://Dig    og/le
Dave Musicant  .31 seconds  .5 seconds  1.54 seconds
Aaron Miller   .40 seconds  .5 seconds  .06 seconds

Future Goals

o Support Stemming

o Link Referrals

o Better Hardware

o T-shirts, coffee mugs

How to og/le

Visit us at:

http://violet.mathcs.carleton.edu/ogle/
