An overview of how a web search engine is organized is provided. A key component of the AltaVista search engine, its indexing library, is described in more depth. The library manages a set of inverted files and provides mechanisms to construct and optimize complex queries on those inverted files. The design goals were to enable efficient queries on bodies of text up to a few hundred gigabytes in size (e.g. AltaVista) without sacrificing too much generality, and without giving up on small applications (e.g. mail directories).
AltaVista Indexing and Search Engine
- By Mike Burrows
- Recreated by Changshu Liu
[http://xcybercloud.blogspot.com]
Goals
• General purpose
• Good query performance
• Scale to hundreds of gigabytes
• Compact index/query representation
• Queries possible during updates
• Reasonable update performance
Non-Goals
• Scale beyond a terabyte
• Document parsing
• Query parsing
• Ranking for query results
Structure of Inverted Files
• Chose flat inverted files that map words to lists of locations where those words occur
• Words are null-terminated byte strings
• Locations are 64-bit unsigned integers
• Client picks what locations mean. No predefined notion of document, page or word number
Documents
• A document is contiguous in location space
• Documents do not overlap
• Location space is allocated densely. The first document is at location 1
• Word endDoc at last location of document
• All document structure encoded with words
– For example: begintitle, endtitle
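The flat location space described above can be sketched in a few lines of Python. This is only an illustration, not the library's code: the function name `build_index` and the choice to give every word (including the endDoc marker) exactly one location are assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to an ascending list of locations in one flat location space.

    Hypothetical helper: assumes every word occupies one location and that the
    enddoc marker takes the final location of each document.
    """
    index = defaultdict(list)
    loc = 1  # the slides place the first document at location 1
    for words in docs:
        for word in words:
            index[word].append(loc)
            loc += 1
        # The enddoc marker sits at the last location of the document.
        index["enddoc"].append(loc)
        loc += 1
    return dict(index)

docs = [
    ["begintitle", "alta", "vista", "endtitle", "search"],
    ["search", "engine"],
]
index = build_index(docs)
```

Because documents are contiguous and never overlap, the enddoc list alone is enough to tell which document any location belongs to.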
Inverted File Format
• Words ordered lexicographically
• Each word followed by list of locations
• Common word prefixes are compressed
• Locations are encoded as deltas
• Deltas stored in as few bytes as possible
– 2 bytes is common
• Full-text index occupies about 30% text size. Word-in-document (non-positional) index is about 10%
• Obvious format for deltas: continuation bits indicate delta boundaries
• Key operation: Find first location at least X
• Better format for efficient scanning: deltas packed into an aligned 64-bit word; the first byte contains the continuation bits
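A minimal Python sketch of the "obvious" format may help here: each delta is split into 7-bit groups, and the high bit of every byte acts as the continuation bit. The exact bit layout is an assumption, and this is not the packed 64-bit word format, whose layout the slides do not spell out. The function names are hypothetical.

```python
def encode_locations(locations):
    """Encode ascending locations as deltas, 7 bits per byte.

    Assumed layout: low-order group first, high bit set on every byte
    except the last byte of a delta (the continuation bit).
    """
    out = bytearray()
    prev = 0
    for loc in locations:
        delta = loc - prev
        prev = loc
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # more bytes follow
            delta >>= 7
        out.append(delta)  # final byte of this delta
    return bytes(out)

def seek(encoded, x):
    """Key operation: return the first location that is at least x, or None."""
    cur = 0      # running sum of deltas, i.e. the current location
    delta = 0
    shift = 0
    for byte in encoded:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:        # continuation bit set: delta not finished
            shift += 7
            continue
        cur += delta
        delta, shift = 0, 0
        if cur >= x:
            return cur
    return None

encoded = encode_locations([5, 7, 300, 100000])
```

The per-byte branch on the continuation bit is exactly what the packed format tries to avoid: with all continuation bits gathered in one byte, the scanner knows every delta length in the word up front.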
Parsing a Delta
• Observation: Choose instructions to dual-issue well. Fixed word structure allows prefetch. Avoid branch mispredictions.
• 6 instr. to extract+sum+compare a delta:
    extql b, tp, x      ; get next delta
    addq tp, l, tp      ; point to next delta
    mskql x, l, x       ; cut delta to length
    srl l, 3, l         ; get next delta length
    addq cur, x, cur    ; add delta to location
    bge cur, done       ; bail if done
With loop overhead, 35 instr/64-bit word. 10 cycles/64-bit word.
Index Stream Readers (ISRs)
An interface for reading the result of a query as an ascending sequence of locations
Lazily evaluated
ISRs are objects with methods:
Loc() – Return current location
Next() – Advance to next location
Seek(X) – Advance to first location at least X
Subtype ISRP adds:
Prev() – Return previous location
Used for fielded queries (e.g. in title)
No methods move backwards
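The interface above could look roughly like this in Python. It is a sketch over an in-memory sorted list rather than an inverted file; the class names `ListISR` and `ListISRP`, the `END` sentinel, and returning 0 from `prev()` when there is no earlier location are all assumptions.

```python
import bisect

END = float("inf")  # assumed sentinel for "no more locations"

class ListISR:
    """ISR over an in-memory list of locations."""

    def __init__(self, locations):
        self._locs = sorted(locations)
        self._i = 0

    def loc(self):
        """Return the current location, or END when exhausted."""
        return self._locs[self._i] if self._i < len(self._locs) else END

    def next(self):
        """Advance to the next location."""
        if self._i < len(self._locs):
            self._i += 1
        return self.loc()

    def seek(self, x):
        """Advance to the first location at least x; never moves backwards."""
        self._i = max(self._i, bisect.bisect_left(self._locs, x))
        return self.loc()

class ListISRP(ListISR):
    """Subtype that can also report the location just before the current one."""

    def prev(self):
        # Locations start at 1, so 0 is used here to mean "nothing before".
        return self._locs[self._i - 1] if self._i > 0 else 0

reader = ListISRP([6, 9, 20])
```

Note that `prev()` only looks back one step from the current position; it does not move the reader, which is consistent with no method moving backwards.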
ISR Implementations
file – reads inverted files; seek() method is the delta parsing loop
or – merges two or more ISRs
not – returns locations not in argument ISR
and – constraint solver (AND, NEAR etc)
and other, specialized ISRs
and & not cannot support prev()
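As an illustration of how ISRs compose, here is a small Python sketch of an "or" ISR that lazily merges its children. The classes `ListISR` and `OrISR` are hypothetical stand-ins, and the merge strategy (take the minimum of the children's current locations) is an assumption about one reasonable way to do it.

```python
import bisect

END = float("inf")  # assumed sentinel for an exhausted stream

class ListISR:
    """Leaf ISR over an in-memory sorted list (stand-in for the file ISR)."""

    def __init__(self, locations):
        self._locs = sorted(locations)
        self._i = 0

    def loc(self):
        return self._locs[self._i] if self._i < len(self._locs) else END

    def next(self):
        if self._i < len(self._locs):
            self._i += 1
        return self.loc()

    def seek(self, x):
        self._i = max(self._i, bisect.bisect_left(self._locs, x))
        return self.loc()

class OrISR:
    """Merges two or more ISRs into one ascending stream without duplicates."""

    def __init__(self, children):
        self._children = list(children)

    def loc(self):
        return min((c.loc() for c in self._children), default=END)

    def next(self):
        cur = self.loc()
        for c in self._children:
            if c.loc() == cur:   # advance every child sitting on the current location
                c.next()
        return self.loc()

    def seek(self, x):
        for c in self._children:
            c.seek(x)
        return self.loc()

def drain(isr):
    """Collect all remaining locations from an ISR."""
    out = []
    while isr.loc() != END:
        out.append(isr.loc())
        isr.next()
    return out

merged = drain(OrISR([ListISR([1, 5, 9]), ListISR([5, 7])]))
```

Since `OrISR` exposes the same three methods as its children, it can itself be an argument to another ISR, which is what makes the abstraction compose.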
ISR And: constraint solver
Arguments: list of ISRs, list of Constraints
Constraint types (A and B are ISRs):
1. loc(A) ≤ loc(B) + K
2. prev(A) ≤ loc(B) + K
3. loc(A) ≤ prev(B) + K
4. prev(A) ≤ prev(B) + K
If each word takes a location, constraints for two-word phrase “a b” are:
loc(A) < loc(B)
loc(B) ≤ loc(A) + 1
Let E, BT, ET be ISRPs of the words enddoc, begintitle, endtitle
Constraints for conjunction: a and b
prev(E) < loc(A)
loc(A) ≤ loc(E)
prev(E) < loc(B)
loc(B) ≤ loc(E)
Constraints for field query: title: A
prev(BT) < loc(A)
loc(A) ≤ loc(ET)
prev(BT) < loc(ET)
loc(ET) ≤ loc(BT)
Solver Algorithm
While (Unsatisfied Constraints)
    Pick Unsatisfied Constraint()
    Satisfy Constraint()
To satisfy:
loc(A) ≤ loc(B) + K: seek(B, loc(A) − K)
prev(A) ≤ loc(B) + K: seek(B, prev(A) − K)
loc(A) ≤ prev(B) + K: seek(B, loc(A) − K); next(B)
prev(A) ≤ prev(B) + K: seek(B, prev(A) − K); next(B)
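The solver loop and its four satisfy rules can be sketched in Python as follows. This is a simplified illustration under several assumptions: ISRs are in-memory lists, constraints are tuples `(a_kind, A, b_kind, B, K)` meaning `a_kind(A) ≤ b_kind(B) + K`, the solver always picks the first unsatisfied constraint, and a strict `<` is written as `≤` with K = −1 (valid because locations are integers). All names are hypothetical.

```python
import bisect

END = float("inf")  # assumed sentinel for an exhausted ISR

class ISRP:
    def __init__(self, locations):
        self.locs = sorted(locations)
        self.i = 0
    def loc(self):
        return self.locs[self.i] if self.i < len(self.locs) else END
    def prev(self):
        return self.locs[self.i - 1] if self.i > 0 else 0
    def next(self):
        self.i = min(self.i + 1, len(self.locs))
    def seek(self, x):
        self.i = max(self.i, bisect.bisect_left(self.locs, x))

def value(kind, isr):
    return isr.loc() if kind == "loc" else isr.prev()

def holds(c):
    a_kind, a, b_kind, b, k = c
    return value(a_kind, a) <= value(b_kind, b) + k

def solve(constraints):
    """Advance ISRs until every constraint holds; False if no solution remains."""
    while True:
        pending = [c for c in constraints if not holds(c)]
        if not pending:
            break
        c = pending[0]                      # pick an unsatisfied constraint
        a_kind, a, b_kind, b, k = c
        b.seek(value(a_kind, a) - k)        # satisfy it by moving B forward
        if b_kind == "prev":
            b.next()                        # so that prev(B) is the sought location
        if not holds(c):
            return False                    # B ran out before the constraint held
    # Simplification: treat a solution with any ISR at END as "no match".
    return all(isr.loc() != END for c in constraints for isr in (c[1], c[3]))

def phrase_matches(a_locs, b_locs):
    """Locations of A where the two-word phrase "a b" occurs."""
    a, b = ISRP(a_locs), ISRP(b_locs)
    constraints = [
        ("loc", a, "loc", b, -1),  # loc(A) < loc(B)
        ("loc", b, "loc", a, 1),   # loc(B) <= loc(A) + 1
    ]
    matches = []
    while solve(constraints):
        matches.append(a.loc())
        a.next()                   # move past this match and solve again
    return matches

matches = phrase_matches([1, 4, 10], [2, 7, 11])
```

Every satisfy step only moves an ISR forward, so the loop terminates once the lists are exhausted; the conjunction and field-query constraints can be expressed with the same tuples by using the "prev" kind on the enddoc, begintitle, and endtitle ISRPs.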
Some Metrics
(performance based on AltaVista Web index)
20K lines of code
Indexes around 1.5GByte/Hr/600MHz CPU
Queries take about 100 cycles/query/MByte
Queries are CPU bound: 95% in user space, 5% in kernel
Memory bus is currently under-utilized
Breakdown of user CPU time
30% inner loop
15% constraint solver
15% higher level seek code
7% ranking code
0.2% merging results
Miss ratios:
2% I-cache
8% D-cache
8% level-2 cache
40% level-3 cache
Postmortem
• Successes
– ISRs are a good abstraction
– Flat location space
– Representing structure as words
• Regrets
– No ability to run ISRs backwards
– Wish ISR constraint solver were less complex
AltaVista Site Architecture
- By Mike Burrows
- Recreated by Changshu Liu
[http://xcybercloud.blogspot.com]
Structure of the Site
Front-Ends: Alpha Workstations
Back-Ends: 4-10 CPU Alpha Servers
8 GBytes RAM / 150 GBytes Disc
Organized in groups of 4-10 machines
Each machine has 1/Nth of the whole index
[Figure: site network diagram showing Broad Routers 0 and 1, FDDI routers, and Front Ends 0 to N]
Handling Failures
• Disc: RAID controllers with spare discs
• Back-ends: front-ends use other groups
• Front-ends: hot-spare grabs IP address
• FDDI: manual replacement of cold spare
• Site: failover via manual DNS change
RAID
• Reconstructing a disc takes 30 minutes
– Disc performance is crippled
• Expect a few discs to fail a month
– Need daily schedule for checking discs
• GUI annoying when checking 60 controllers
• Once a disc failed with no error reported
– Corrupted index file
• On first day, the only non-RAID device (root disc) failed during demo for press
File System
• Need a Journaling File System
– Write Ahead Log
– FSCK (consistency checker) takes hours
• Software/Memory errors destroy file systems
– Restoring 300GB from tape doesn't work
  • Tape may be in error
  • Too slow
– Important to replicate data on spinning disk
Back-Ends
• Back-Ends were Digital 8400's (TurboLaser)
• Huge cards with large connectors
• Pins are on backplane, not card
• RAID setup took hours on separate machine
• Console interrupt is a boon
Front-Ends
• Biggest Problems:
– Poorly-tested software
– Operator error
• Automatic restart dealt with former
• A trivial IP failover scheme dealt with latter
HTTP Server
• Original NCSA httpd was abysmal
– Forked too often
– Synchronous name resolution
– Logs writes to full disc
– Prone to denial of service attacks
• Fixed with new first http server
– Never forks: aggravates software test issues
– Submit limits: sockets/threads/requests rate
Load Balance
• Front-End
– DNS round robin
• Backend
– Front-Ends will group similar queries to the same specific backend for cache
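One way to send similar queries to the same backend is to hash a normalized form of the query, sketched below in Python. The slides do not say how similarity was judged or how backends were chosen, so the normalization (lower-casing and sorting the distinct words) and the function name `pick_backend` are purely illustrative assumptions.

```python
import hashlib

def pick_backend(query, num_backends):
    """Route a query to a backend index so that similar queries share a cache.

    Hypothetical scheme: queries with the same set of words, in any order or
    letter case, map to the same backend.
    """
    key = " ".join(sorted(set(query.lower().split())))
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer hash.
    return int.from_bytes(digest[:8], "big") % num_backends

backend = pick_backend("Alta Vista search", 4)
```

A stable hash such as SHA-1 is used instead of Python's built-in `hash()` because the latter is randomized per process, which would defeat the goal of every front-end agreeing on the same backend.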
Overload Handling
• Back-ends take short-cuts when overloaded
– Ultimately, they can refuse service
• Front-ends have spare capacity to avoid site appearing completely dead
Reference
The AltaVista Indexing and Search Engine
Mike Burrows, Compaq SRC
Production Date: 01/18/2000
Link: http://uwtv.org/programs/displayevent.aspx?rid=2123