AltaVista Indexing and Search Engine - By Mike Burrows - Recreated by Changshu Liu [http://xcybercloud.blogspot.com]


DESCRIPTION

An overview of how a web search engine is organized is provided. A key component of the AltaVista search engine, its indexing library, is described in more depth. The library manages a set of inverted files, and provides mechanisms to construct and optimize complex queries on those inverted files. The design goals were to enable efficient queries on bodies of text up to a few hundred gigabytes in size (e.g. AltaVista) without sacrificing too much generality, and without giving up on small applications (e.g. mail directories).


Page 1: Alta vista  indexing and search engine

AltaVista Indexing and Search Engine

- By Mike Burrows
- Recreated by Changshu Liu

[http://xcybercloud.blogspot.com]

Page 2: Alta vista  indexing and search engine

Goals

• General purpose
• Good query performance
• Scale to hundreds of gigabytes
• Compact index/query representation
• Queries possible during updates
• Reasonable update performance

Page 3: Alta vista  indexing and search engine

Non-Goals

• Scale beyond a terabyte
• Document parsing
• Query parsing
• Ranking for query results

Page 4: Alta vista  indexing and search engine

Structure of Inverted Files

• Chose flat inverted files that map words to lists of locations where those words occur

• Words are null-terminated byte strings

• Locations are 64-bit unsigned integers

• Client picks what locations mean. No predefined notion of document, page or word number

Page 5: Alta vista  indexing and search engine

Documents

• A document is contiguous in location space

• Documents do not overlap

• Location space is allocated densely. The first document is at location 1

• Word endDoc at last location of document

• All document structure encoded with words
  – For example: begintitle, endtitle
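The layout described on the last two slides can be sketched as follows. This is a minimal illustration, not the library's code: the function name `build_index` and the sample documents are hypothetical, and it assumes every token, including structure words such as begintitle, takes exactly one location.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to an ascending list of locations in one flat space."""
    index = defaultdict(list)
    loc = 1  # the first document starts at location 1
    for words in documents:
        for word in words:
            index[word].append(loc)
            loc += 1
        # the marker word sits at the last location of the document
        index["enddoc"].append(loc)
        loc += 1
    return dict(index)

# Hypothetical sample input: two already-tokenized documents.
docs = [
    ["begintitle", "alta", "vista", "endtitle", "fast", "search"],
    ["search", "engine"],
]
index = build_index(docs)
print(index["search"], index["enddoc"])
```

Because documents are contiguous and do not overlap, the enddoc locations alone are enough to tell which document any other location belongs to.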

Page 6: Alta vista  indexing and search engine

Inverted File Format

• Words ordered lexicographically

• Each word followed by list of locations

• Common word prefixes are compressed

• Locations are encoded as deltas

• Deltas stored in as few bytes as possible
  – 2 bytes is common

• Full-text index occupies about 30% of the text size. Word-in-document (non-positional) index is about 10%
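One common way to compress shared prefixes in a lexicographically ordered word list is front coding, sketched below. The slides do not say which scheme the library uses, so treat this as an illustrative assumption; `front_encode` and `front_decode` are hypothetical names.

```python
def front_encode(sorted_words):
    """Store each word as (length of prefix shared with previous word, suffix)."""
    out = []
    prev = ""
    for word in sorted_words:
        n = 0
        while n < min(len(prev), len(word)) and prev[n] == word[n]:
            n += 1
        out.append((n, word[n:]))
        prev = word
    return out

def front_decode(pairs):
    """Rebuild the word list by reusing the prefix of the previous word."""
    words = []
    prev = ""
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        words.append(prev)
    return words

words = ["search", "searched", "searching", "seas"]
encoded = front_encode(words)
print(encoded)
```

The scheme only pays off because the words are sorted, which keeps neighbouring entries similar.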

Page 7: Alta vista  indexing and search engine

• Obvious format for deltas:

Continuation Bits Indicate Delta Boundaries

• Key operation: Find first location at least X

• Better format for efficient scanning:
  – Deltas packed into aligned 64-bit word
  – First byte contains continuation bits
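To make the key operation concrete, here is a sketch of the simple byte-wise format, where the high bit of each byte is a continuation bit. The 7-payload-bits-per-byte layout is an assumption; the packed 64-bit-word format is not shown because the slides do not give its exact layout. The names `encode_deltas` and `seek` are hypothetical.

```python
def encode_deltas(locations):
    """Encode ascending locations as deltas, 7 payload bits per byte."""
    out = bytearray()
    prev = 0
    for loc in locations:
        delta = loc - prev
        prev = loc
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # continuation bit: more bytes follow
            delta >>= 7
        out.append(delta)  # last byte of this delta, continuation bit clear
    return bytes(out)

def seek(data, x):
    """Return the first location at least x, or None if there is none."""
    cur = 0
    delta = 0
    shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
            continue
        cur += delta  # a delta boundary: add it to the running location
        if cur >= x:
            return cur
        delta = 0
        shift = 0
    return None

data = encode_deltas([3, 130, 131, 20000])
print(len(data), seek(data, 132))
```

Every byte has to be inspected to find delta boundaries here, which is the cost the packed format is meant to avoid.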

Page 8: Alta vista  indexing and search engine

Parsing a Delta

• Observation:

Choose instructions to dual-issue well. Fixed word structure allows prefetch. Avoid branch mispredictions.

• 6 instr. to extract+sum+compare a delta:
    extql b, tp, x    ; get next delta
    addq  tp, l, tp   ; point to next delta
    mskql x, l, x     ; cut delta to length
    srl   l, 3, l     ; get next delta length
    addq  cur, x, cur ; add delta to location
    bge   cur, done   ; bail if done

With loop overhead, 35 instr/64-bit word. 10 cycles/64-bit word.

Page 9: Alta vista  indexing and search engine

Index Stream Readers (ISRs)

An interface for reading the result of a query as an ascending sequence of locations

Lazily evaluated

ISRs are objects with methods:
  Loc() – Return current location
  Next() – Advance to next location
  Seek(X) – Advance to first location at least X

Subtype ISRP adds:
  Prev() – Return previous location

Used for fielded queries (e.g. in title)

No methods move backwards
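The interface above can be sketched over an in-memory list. This is only a stand-in for a real file-backed ISR: the class name `ListISR`, the `END` sentinel for an exhausted stream, and returning 0 from `prev()` when there is no earlier location are all assumptions.

```python
import bisect

class ListISR:
    """ISRP-style reader over an ascending list of locations."""

    END = float("inf")  # assumed sentinel: no more locations

    def __init__(self, locations):
        self._locs = locations
        self._i = 0

    def loc(self):
        """Return the current location."""
        return self._locs[self._i] if self._i < len(self._locs) else self.END

    def next(self):
        """Advance to the next location."""
        if self._i < len(self._locs):
            self._i += 1
        return self.loc()

    def seek(self, x):
        """Advance to the first location at least x; never moves backwards."""
        self._i = bisect.bisect_left(self._locs, x, self._i)
        return self.loc()

    def prev(self):
        """Return the location before the current one (0 if there is none)."""
        return self._locs[self._i - 1] if self._i > 0 else 0

enddoc = ListISR([7, 10, 25])
enddoc.seek(8)
print(enddoc.prev(), enddoc.loc())
```

Note that `prev()` only reports a value and does not move the reader, which is consistent with no method moving backwards.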

Page 10: Alta vista  indexing and search engine

ISR Implementations

file – reads inverted files; seek() method is the delta parsing loop
or – merges two or more ISRs
not – returns locations not in argument ISR
and – constraint solver (AND, NEAR etc.)
and other, specialized ISRs

and & not cannot support prev()
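As an illustration of how ISRs compose, the following sketch builds an "or" reader on top of in-memory child readers. The class names, the `END` sentinel, and the `drain` helper are hypothetical; only the lazy merge behaviour comes from the slide.

```python
import bisect

END = float("inf")  # assumed sentinel: the reader is exhausted

class ListISR:
    """In-memory stand-in for a file ISR."""
    def __init__(self, locations):
        self.locs, self.i = locations, 0
    def loc(self):
        return self.locs[self.i] if self.i < len(self.locs) else END
    def seek(self, x):
        self.i = bisect.bisect_left(self.locs, x, self.i)
        return self.loc()
    def next(self):
        return self.seek(self.loc() + 1)

class OrISR:
    """Merges two or more ISRs into one ascending stream, lazily."""
    def __init__(self, children):
        self.children = children
    def loc(self):
        # the merged stream is positioned at the smallest child location
        return min(c.loc() for c in self.children)
    def seek(self, x):
        for c in self.children:
            c.seek(x)
        return self.loc()
    def next(self):
        # moves every child past the current location, so duplicates collapse
        return self.seek(self.loc() + 1)

def drain(isr):
    """Collect all remaining locations (for demonstration only)."""
    out = []
    while isr.loc() != END:
        out.append(isr.loc())
        isr.next()
    return out

merged = drain(OrISR([ListISR([2, 9]), ListISR([5, 9, 12])]))
print(merged)
```

Since `OrISR` exposes the same methods as its children, it can itself be an argument to another ISR.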

Page 11: Alta vista  indexing and search engine

ISR And—constraint solver

Arguments: list of ISRs, list of Constraints

Constraint types (A and B are ISRs):
  1. loc(A) ≤ loc(B) + K
  2. prev(A) ≤ loc(B) + K
  3. loc(A) ≤ prev(B) + K
  4. prev(A) ≤ prev(B) + K

If each word takes a location, constraints for two-word phrase “a b” are:
  loc(A) < loc(B)
  loc(B) ≤ loc(A) + 1

Page 12: Alta vista  indexing and search engine

Let E, BT, ET be ISRPs of words: enddoc, begintitle, endtitle

Constraints for conjunction: a and b
  prev(E) < loc(A)
  loc(A) ≤ loc(E)
  prev(E) < loc(B)
  loc(B) ≤ loc(E)

Constraints for field query: title: A
  prev(BT) < loc(A)
  loc(A) ≤ loc(ET)
  prev(BT) < loc(ET)
  loc(ET) ≤ loc(BT)

Page 13: Alta vista  indexing and search engine

Solver Algorithm

While (Unsatisfied Constraints)
    Pick Unsatisfied Constraint()
    Satisfy Constraint()

To Satisfy:
  loc(A) ≤ loc(B) + K:   seek(B, loc(A) − K)
  prev(A) ≤ loc(B) + K:  seek(B, prev(A) − K)
  loc(A) ≤ prev(B) + K:  seek(B, loc(A) − K); next(B)
  prev(A) ≤ prev(B) + K: seek(B, prev(A) − K); next(B)
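The loop above can be sketched for the first two constraint types (those with loc(B) on the right-hand side). This is a simplified illustration, not the library's solver: the names `ListISRP`, `solve` and `END`, the tuple encoding of constraints, and the sample location lists are all assumptions. A strict constraint such as loc(A) < loc(B) is written as loc(A) ≤ loc(B) + K with K = −1.

```python
import bisect

END = float("inf")  # assumed sentinel: the reader is exhausted

class ListISRP:
    """In-memory stand-in for a file ISR with prev() support."""
    def __init__(self, locations):
        self.locs, self.i = locations, 0
    def loc(self):
        return self.locs[self.i] if self.i < len(self.locs) else END
    def prev(self):
        # 0 lies below every real location, since location space starts at 1
        return self.locs[self.i - 1] if self.i > 0 else 0
    def seek(self, x):
        self.i = bisect.bisect_left(self.locs, x, self.i)
        return self.loc()

def solve(isrs, constraints):
    """Advance readers until all constraints hold; False if one runs out.

    Each constraint is (kind, a, b, k), meaning kind(a) <= loc(b) + k,
    where kind is "loc" or "prev".
    """
    while True:
        if any(r.loc() == END for r in isrs):
            return False
        for kind, a, b, k in constraints:
            left = a.loc() if kind == "loc" else a.prev()
            if left > b.loc() + k:
                b.seek(left - k)  # satisfy by moving b forward
                break
        else:
            return True  # no unsatisfied constraint remains

# Phrase "a b": loc(A) < loc(B) and loc(B) <= loc(A) + 1
pa, pb = ListISRP([2, 10, 20]), ListISRP([5, 11, 30])
phrase_found = solve([pa, pb], [("loc", pa, pb, -1), ("loc", pb, pa, 1)])

# Conjunction "a and b" within one document, with E reading enddoc
ca, cb, ce = ListISRP([2, 12]), ListISRP([9, 13]), ListISRP([7, 10, 15])
conj_found = solve([ca, cb, ce], [
    ("prev", ce, ca, -1), ("loc", ca, ce, 0),
    ("prev", ce, cb, -1), ("loc", cb, ce, 0),
])
print(phrase_found, pa.loc(), pb.loc(), conj_found, ce.loc())
```

Each seek strictly advances a reader, so the loop terminates on finite lists. To enumerate further matches, a caller would advance one reader past the current match and call `solve` again.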

Page 14: Alta vista  indexing and search engine

Some Metrics

(performance based on AltaVista Web index)

20K lines of code

Indexes around 1.5GByte/Hr/600MHz CPU

Queries take about 100 cycles/query/MByte

Queries are CPU bound:95% in user space, 5% in kernel

Memory bus is currently under-utilized

Page 15: Alta vista  indexing and search engine

Breakdown of user CPU time

30% inner loop
15% constraint solver
15% higher level seek code
7% ranking code
0.2% merging results

Miss Ratios:
2% I-cache
8% D-cache
8% level-2 cache
40% level-3 cache

Page 16: Alta vista  indexing and search engine

Postmortem

• Successes
  – ISRs are a good abstraction
  – Flat location space
  – Representing structure as words

• Regrets
  – No ability to run ISRs backwards
  – Wish ISR constraint solver were less complex

Page 17: Alta vista  indexing and search engine

AltaVista Site Architecture

- By Mike Burrows
- Recreated by Changshu Liu

[http://xcybercloud.blogspot.com]

Page 18: Alta vista  indexing and search engine

Structure of the Site

Front-Ends: Alpha Workstations

Back-Ends: 4-10 CPU Alpha Servers
8 GBytes RAM / 150 GBytes Disc
Organized in Groups of 4-10 Machines
Each machine has 1/Nth of the whole index

[Diagram: Broad Routers 0 and 1; FDDI Routers; Front Ends 0 to N]

Page 19: Alta vista  indexing and search engine

Handling Failures

• Disc: RAID controllers with spare discs

• Back-ends: front-ends use other groups

• Front-ends: hot-spare grabs IP address

• FDDI: manual replacement of cold spare

• Site: failover via manual DNS change

Page 20: Alta vista  indexing and search engine

RAID

• Reconstructing a disc takes 30 minutes.
  – Disc performance is crippled

• Expect a few discs to fail a month
  – Need daily schedule for checking discs.

• GUI annoying when checking 60 controllers

• Once a disc failed with no error reported
  – Corrupted index file

• On first day, the only non-RAID device (root disc) failed during demo for press

Page 21: Alta vista  indexing and search engine

File System

• Need a Journaling File System
  – Write Ahead Log
  – FSCK (consistency checker) takes hours

• Software/Memory errors destroy file systems
  – Restoring 300GB from tape doesn’t work
    • Tape may be in error
    • Too slow
  – Important to replicate data on spinning disk

Page 22: Alta vista  indexing and search engine

Back-Ends

• Back-Ends were Digital 8400’s (TurboLaser)

• Huge cards with large connectors

• Pins are on backplane, not card

• RAID setup took hours on separate machine

• Console interrupt is a boon

Page 23: Alta vista  indexing and search engine

Front-Ends

• Biggest Problems:
  – Poorly-Tested software
  – Operator Error

• Automatic restart dealt with former

• A trivial IP failover scheme dealt with latter

Page 24: Alta vista  indexing and search engine

HTTP Server

• Original NCSA httpd was abysmal
  – Forked too often
  – Synchronous name resolution
  – Logs writes to full disc
  – Prone to denial of service attacks

• Fixed with new first http server
  – Never Forks: aggravates software test issues
  – Submit limits: sockets/threads/requests rate

Page 25: Alta vista  indexing and search engine

Load Balance

• Front-End
  – DNS round robin

• Backend
  – Front-Ends will group similar queries to the same specific backend for cache
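One way a front-end could send similar queries to the same back-end group is to hash a normalized form of the query, sketched below. The slides do not describe the actual grouping rule, so the normalization (lower-casing and sorting the words), the hash choice, and the name `pick_backend_group` are all assumptions.

```python
import hashlib

def pick_backend_group(query, num_groups):
    """Choose a back-end group so that equivalent queries reuse one cache."""
    # Assumed notion of "similar": same words, ignoring case, order and spacing.
    key = " ".join(sorted(query.lower().split()))
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_groups

group_a = pick_backend_group("Alta Vista", 4)
group_b = pick_backend_group("vista   alta", 4)
print(group_a == group_b)
```

A scheme like this trades even load for cache hits, so it would need a fallback to another group when the chosen one is overloaded or down.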

Page 26: Alta vista  indexing and search engine

Overload Handling

• Back-ends take short-cuts when overloaded
  – Ultimately, they can refuse service

• Front-ends have spare capacity to avoid site appearing completely dead

Page 27: Alta vista  indexing and search engine

Reference

The AltaVista Indexing and Search Engine

Mike Burrows, Compaq SRC
Production Date: 01/18/2000
Link: http://uwtv.org/programs/displayevent.aspx?rid=2123