Circa 1971:
Courtesy of “An Atlas of Cyberspaces” (http://www.cybergeography.org/atlas/historical.html)
So, why do we need search engines?
o The web is too big.
o There is too much irrelevant information.
o Search engines bring order to this chaos-filled land.
What does a search engine require?
o Know the Data
o Store the Data
o Retrieve the Results
o Order the Results
Anatomy of Streaker
To Do
Page Retriever
URL Stream List
Quarantine
Main
Parser
The Database
Link Cache
Word Cache
http://www.carleton.edu
Playing nicely with the network
o Is the server responding?
o Is the server overloaded?
o How much info are we requesting?
o How fast are we sending our requests?
Throttling Streaker
[Diagram: Streaker repeatedly fetches from a web server. Response times of 1, 3, and 4 seconds push DELAY from 1 to 2 to 3, with pauses of 2 and 4 seconds between requests.]
Formulas:
Pause Time: DELAY * 2
Ave. Delay: DELAY + (lastFetchTime – DELAY) * 0.5
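The two formulas above can be sketched as a small class. The name `AdaptiveThrottle` and the seconds-based units are our assumptions; only the Pause Time and Ave. Delay formulas come from the slides.

```python
class AdaptiveThrottle:
    """Sketch of the slide's throttling scheme (illustrative name)."""

    def __init__(self, delay=1.0):
        self.delay = delay  # running average of server response time, in seconds

    def pause_time(self):
        # Pause Time: DELAY * 2
        return self.delay * 2

    def record_fetch(self, last_fetch_time):
        # Ave. Delay: DELAY + (lastFetchTime - DELAY) * 0.5
        self.delay = self.delay + (last_fetch_time - self.delay) * 0.5

throttle = AdaptiveThrottle(delay=1.0)
throttle.record_fetch(3.0)    # the server took 3 s to respond
print(throttle.delay)         # 2.0
print(throttle.pause_time())  # 4.0
```

A second slow fetch of 4 seconds would push DELAY from 2 to 3, matching the diagram: the crawler backs off on servers that respond slowly.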
Before

<html><head><title>og/le</title></head>
<body>
<table width="95%"><tr>
<td> </td>
<td> <font size="+4"><b>og/le</b></font><br>
<font size="+1">optimal guesswork/luck-based engine</font></td>
<td align="right"> <font size="+3"> : carleton search</font></td>
</tr></table>
<br><br>
<center>
<a href="http://dictionary.reference.com/search?q=ogle">About</a>
<a href="instructions.html">Instructions for Testers</a>
<a href="stats.php">Statistics</a>
<br><br>
<form name="Ogle">
<input type="text" name="query" size="50" /><input type='hidden' name='pagenum' value='1'>
<br><input type='submit' value='Ogle Carleton' />
</form>
</center>
<br><center><font size="-1"><p>ogling 25,49 pages</font> </center>
</p><p><center><font size="-1">
<img src="streaker.gif"><br>Powered by Streaker<br><br>
© 2004 Josh Allen, Andrew Drummer, Brendan Foote, Aaron Miller, Mike Ottum</font></center></p>
</body></html>
Page object
o Page Text
o Page Header
o Page URL
o Etc…

Word Object(s)
o The Word
o Word Position
o Other info

Link Object(s)
o Link URL
o Link Position
o Page URL
o Link text

After
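A minimal sketch of what these parser-output objects might look like. The field lists come from the slides; the dataclass form, type annotations, and extra container fields are our assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Word:
    word: str        # the word itself
    position: int    # word position on the page

@dataclass
class Link:
    url: str         # link target
    position: int    # link position on the page
    page_url: str    # page the link was found on
    text: str        # anchor text

@dataclass
class Page:
    url: str
    header: str
    text: str
    words: List[Word] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)

page = Page(url="http://www.carleton.edu", header="", text="carleton college ...")
page.words.append(Word("college", 1))
print(page.words[0].word)  # college
```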
Which elements of a page are important?
o Text
  o Individual Words
  o Position
  o Tag Information
o Links
  o Link target
  o Link text
Parsing Challenges
o Identical pages with different URLs
  o Especially common with dynamically-generated pages
  o Solution: Compute a checksum as we parse and then compare it to previously seen pages
  o CRC-32 Checksum Algorithm
o HTML is not a strict language
  o The Parser must be flexible enough to allow for many different types of coding, especially in tags.
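The duplicate-page check can be sketched with Python's built-in CRC-32. The function name and the in-memory set of seen checksums are our assumptions; the slides specify only the algorithm and the compare-against-previous-pages idea.

```python
import zlib

seen_checksums = set()  # checksums of every page parsed so far

def is_duplicate(page_text: str) -> bool:
    """Return True if an identical page body was already seen (sketch)."""
    checksum = zlib.crc32(page_text.encode("utf-8"))
    if checksum in seen_checksums:
        return True
    seen_checksums.add(checksum)
    return False

print(is_duplicate("<html>same page</html>"))  # False (first sighting)
print(is_duplicate("<html>same page</html>"))  # True (same body, perhaps a different URL)
```

Because the checksum is computed from the page body, two URLs serving identical dynamically-generated content collide, which is exactly the case the slide describes.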
Brief Databases Introduction
o Why use databases?
  o Data to store is too big for main memory
  o Optimize disk accesses through intelligent organization of data
o Relational Database Model
  o Data is stored in tables according to relationships
  o Data is retrieved using Structured Query Language (SQL)
Relational Example
o Relate Words to Pages
o Information that we care about:
  o word (string)
  o url (string)
  o position (integer)
  o HTML tag attributes (set)
The Non-Relational Way
word url pos tags
college http://www.carleton.edu 1 <b>
a http://www.mathcs.carleton.edu 3 <b,i>
college http://www.carleton.edu 4 <>
Why is this method bad?
o Wasted space
  o The word “college” and the URL “http://www.carleton.edu” appear twice in this example
  o In our actual crawl, the word “carleton” appears 85,496 times
o String comparisons are slow
Our Database Tables - URL
urlid url
1 http://www.carleton.edu
2 http://www.mathcs.carleton.edu
3 http://www.carleton.edu/student/
4 http://violet.mathcs.carleton.edu/ogle/search.php
WordToUrl Table Captures a Relation
o Relates Word entries to URL entries
wid urlid pos tags
2 1 1 <b>
4 2 3 <b,i>
2 1 4 <>
Executing a join Operation
o Combine the information from multiple tables to produce something meaningful
Word Table: wid, word (string)
URL Table: urlid, url (string)
WordToURL Table: wid, urlid, pos, tags
→ Desired Output: word, url, pos, tags
Word Table
wid word
1 carleton
2 college
3 is
4 a
5 great
6 place
7 fhqwhgads

WordToURL Table
wid urlid pos tags
2 1 1 <b>
4 2 3 <b,i>
2 1 4 <>
WordToURL Table
wid urlid pos tags
2 1 1 <b>
4 2 3 <b,i>
2 1 4 <>

URL Table
urlid url
1 www.carleton.edu
2 www.mathcs…
3 www.carleton…
4 violet.mathcs…
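The join walkthrough above can be reproduced with an in-memory SQLite database. This is a sketch: the slides do not name the database engine, and the table/column names simply follow the figures. The sample rows are the ones shown in the Word, URL, and WordToURL tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE word (wid INTEGER PRIMARY KEY, word TEXT)")
cur.execute("CREATE TABLE url (urlid INTEGER PRIMARY KEY, url TEXT)")
cur.execute("CREATE TABLE wordtourl (wid INTEGER, urlid INTEGER, pos INTEGER, tags TEXT)")

cur.executemany("INSERT INTO word VALUES (?, ?)",
                [(1, "carleton"), (2, "college"), (3, "is"), (4, "a"),
                 (5, "great"), (6, "place"), (7, "fhqwhgads")])
cur.executemany("INSERT INTO url VALUES (?, ?)",
                [(1, "http://www.carleton.edu"),
                 (2, "http://www.mathcs.carleton.edu")])
cur.executemany("INSERT INTO wordtourl VALUES (?, ?, ?, ?)",
                [(2, 1, 1, "<b>"), (4, 2, 3, "<b,i>"), (2, 1, 4, "<>")])

# Join all three tables back into the "non-relational" view,
# but store each string only once.
rows = cur.execute(
    "SELECT w.word, u.url, wu.pos, wu.tags "
    "FROM wordtourl wu "
    "JOIN word w ON w.wid = wu.wid "
    "JOIN url u ON u.urlid = wu.urlid "
    "ORDER BY wu.pos").fetchall()
for row in rows:
    print(row)
# first row: ('college', 'http://www.carleton.edu', 1, '<b>')
```

The integer joins replace the slow string comparisons the earlier slide warns about: the strings "college" and "http://www.carleton.edu" each live in exactly one row, no matter how often they occur in the crawl.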
Heuristics
o Tools by which we return search results
o Must be accurate
o Must be fast
Problem: In general, the more complex a heuristic is, the slower it runs.
How heuristics work
o Obtain search query from user
o Use query to “pull out” relevant data
o Use data to retrieve all relevant pages
o Use specific heuristic to order pages
o Output ordered pages to user
Basic Heuristics
o Word Occurrence
  o Pages ordered by the number of times the words in the query appear on the page
o Frequency
  o Pages ordered by the number of times words in the query appear over the total number of words on the page
o Proximity
  o Pages ordered by the number of times words in the query appear in the same order on the page
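The three basic orderings can be sketched as scoring functions. The function names and the page-as-list-of-words representation are our assumptions; each body follows the slide's one-line description.

```python
def occurrence(query, page_words):
    # Word Occurrence: times the query's words appear on the page
    qset = set(query)
    return sum(1 for w in page_words if w in qset)

def frequency(query, page_words):
    # Frequency: occurrences over the total number of words on the page
    return occurrence(query, page_words) / len(page_words)

def proximity(query, page_words):
    # Proximity: times the query words appear in the same order, adjacently
    n = len(query)
    return sum(1 for i in range(len(page_words) - n + 1)
               if page_words[i:i + n] == query)

page = "carleton college is a great place".split()
print(occurrence(["great", "place"], page))  # 2
print(frequency(["great", "place"], page))
print(proximity(["great", "place"], page))   # 1
```

Ranking is then just sorting pages by the chosen score, which makes the speed/accuracy trade-off concrete: proximity scans every window of the page, so it costs more per page than a simple occurrence count.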
Meta Heuristics
o Tags
  o Words on a page are weighted depending on their HTML tags
  o Pages are ordered by the sum of the weighted words that appear on the page
Ultimate Heuristic
A combination of data and context
o frequency
o proximity
o tag heuristics
o Rank of pages factored into heuristic
Problem: Using all these factors slows down the search process.
Vector Space Models
A table with relationships between terms and documents:
        doc1  doc2  doc3
Term1     1    17    20
Term2     0     0     5
Term3     7     0     2
Now consider the table to be a matrix.
Then
o The columns can be seen as document vectors
o The terms serve as a basis for the vector space
o We can compare documents using vector functions
Comparing Vectors
Recall:
cos θ = (a · b) / (|a| |b|)
If we set a threshold on cos θ, we find the set of vectors that are within a cone around a.
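A minimal sketch of the comparison, using the document vectors (the columns of the term/document table above):

```python
import math

def cosine(a, b):
    # cos(theta) = (a . b) / (|a| |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc1 = [1, 0, 7]    # (Term1, Term2, Term3) counts for doc1
doc2 = [17, 0, 0]
doc3 = [20, 5, 2]

print(cosine(doc1, doc3))
print(cosine(doc2, doc3))  # larger: doc2 and doc3 share their dominant term
```

Keeping only documents whose cosine against the query vector exceeds a threshold implements the "cone around a" test from the slide.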
Normalizing the Data
Since the length of the document vectors and the values both affect this calculation, we can do some pre-processing to help the heuristic.
Local Term Weighting Schemes:
  Binary: b(fij) = 1 if fij > 0, else 0
  Term frequency: fij
  Logarithmic: log(1 + fij)
  Augmented Normalized: (b(fij) + fij / maxk fkj) / 2
Global Term Weighting Schemes:
  Normal: 1 / sqrt(Σj fij²)
Document Normalization Schemes:
  Cosine: 1 / sqrt(Σi (gi lij)²)
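The weighting schemes above can be written out as small functions. This is a sketch following the slide's notation, where fij is the raw frequency of term i in document j and b() is the binary weight; the function names are ours.

```python
import math

def binary(f):
    # Binary: 1 if the term occurs at all
    return 1 if f > 0 else 0

def log_weight(f):
    # Logarithmic: log(1 + fij)
    return math.log(1 + f)

def augmented_normalized(f, max_f_in_doc):
    # Augmented Normalized: (b(fij) + fij / max_k fkj) / 2
    return (binary(f) + f / max_f_in_doc) / 2

def normal_global(freqs_across_docs):
    # Normal global weight: 1 / sqrt(sum_j fij^2),
    # taking term i's frequencies across all documents
    return 1 / math.sqrt(sum(f * f for f in freqs_across_docs))

print(log_weight(0))                 # 0.0
print(augmented_normalized(5, 10))   # 0.75
print(normal_global([1, 17, 20]))    # global weight for Term1 in the matrix above
```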
Latent Semantic Indexing
Matrix Decomposition
If the matrix A has rank k, we can represent it exactly using k column vectors; keeping fewer than k vectors (a low-rank approximation) has the effect of smooshing together like documents, creating relationships between terms that do not appear on the same page.
Example: if a user searches for “Samuel Clemens”, the terms appear on the same page as “Mark Twain” often enough that documents only containing “Mark Twain” will match.
o Heuristics concerning text
o Heuristics concerning the context of the text
o Heuristics concerning the context of the pages
Page Rank
What makes Page Rank different?
o Link-based
o Independent of search terms
o Fewer database queries during search
o Patented
Ranking a page
What do you need to rank a page?
o Pages that link to your page
o The ranks of those pages
o The links on those pages
o Rank = 0.15 + 0.85 * Σ(Ri/Li)
o Fifty iterations
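The iteration can be sketched as follows. The formula and the fifty iterations come from the slides; the tiny link graph is made up, and we assume every page has at least one outgoing link (otherwise Li would be zero).

```python
def page_rank(links, iterations=50, d=0.85):
    """links maps each page to the list of pages it links to (sketch)."""
    ranks = {p: 1.0 for p in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Rank = 0.15 + 0.85 * sum(Ri / Li) over pages i linking here,
            # where Li is the number of links on page i
            incoming = (ranks[q] / len(links[q])
                        for q in links if page in links[q])
            new_ranks[page] = (1 - d) + d * sum(incoming)
        ranks = new_ranks
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = page_rank(graph)
print(max(ranks, key=ranks.get))  # c
```

Note that the rank depends only on the link graph, never on the query, which is why it can be computed once per crawl and needs few database queries at search time.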
How did we do?
og/le makes your laundry whiter than any other leading brand!
Competitors:
Google (the big boys)
ht://Dig (Carleton’s current search engine)