Indexing CSCI 572: Information Retrieval and Search Engines Summer 2010

May-20-10 CS572-Summer2010 CAM-2

Outline

• Building your search corpus

• Differences from RDBMS

• The Document/Field Model

  – The Flattening Process

  – Understanding Field Types

• Challenges


Building an index

• Once you have content in the form of metadata and extracted text, you need to persist that content

  – For querying

  – For retrieval and display

• How should we persist the content?


Some considerations

• Extracted metadata is typically unstructured

  – It’s not something that necessarily maps to a set of Entities (Tables) with rows and consistent columns

  – Documents have different, sometimes non-overlapping, metadata models

    • Dublin Core

    • Word

    • Climate Forecast

• The write/access patterns are a bit different

  – Think crawling strategies…


Databases versus Search Indices

• Databases are optimized for

  – Write often

  – Read often

  – Transactional properties in the face of the above

    • Atomic – operations should occur atomically, or be rolled back

    • Consistent – writes, etc., should be propagated in a consistent fashion

    • Isolated – transactions and modifications are limited to the entities that they modify

    • Durable – expected to be running all the time and thus resilient in the face of catastrophic failure


Databases versus Search Indices

• Search Indices are optimized for

  – Write infrequently

  – Read very frequently

  – Based on a loose, unstructured Document model

  – Limited transactional properties

    • ACID not necessary

    • Onus to produce results quickly

    • Rollback most often not supported

    • Subject to corruption

  – Extremely efficient in terms of query/read times by exploiting the above


The Document Field Model

• A method of dealing with unstructured data and its persistence to an index

• Treat each indexable content item as a “Document”

  – Each Document has 1…N named Fields

  – Each Field has 1…N values

    • Values can be:

      – Text

      – Numerical

      – Hierarchical (made up of other fields)

      – Complex (Geospatial, etc.)

Field1: v…vn

Field2: v…vn
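
A minimal sketch of the Document/Field model in Python. The `Document` class, its `add` and `get` methods, and the sample values are all assumptions made for illustration; they are not the API of any particular search library. Complex values such as geospatial data are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# A field value may be text, a number, or a nested Document
# (the "hierarchical" case). Hypothetical type alias for this sketch.
Value = Union[str, int, float, "Document"]


@dataclass
class Document:
    """One indexable content item: a bag of named, multi-valued fields."""
    fields: Dict[str, List[Value]] = field(default_factory=dict)

    def add(self, name: str, value: Value) -> None:
        # Each field holds 1..N values, so adding to an existing name appends.
        self.fields.setdefault(name, []).append(value)

    def get(self, name: str) -> List[Value]:
        # A field this document does not carry simply yields no values.
        return self.fields.get(name, [])


doc = Document()
doc.add("title", "An example page")   # text value (made-up sample data)
doc.add("length", 2048)               # numerical value
doc.add("keyword", "search")          # one field, several values
doc.add("keyword", "indexing")

print(doc.get("keyword"))   # ['search', 'indexing']
print(doc.get("author"))    # []
```

Note that nothing forces two documents to share the same field names, which is the main difference from a table with fixed columns.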


Example: two web pages

• Document 1

  – Field [title], Value(s): “Chris Mattmann’s Web Page”

    • Type: string (text)

  – Field [length], Value(s): 3026

    • Type: int (assumed to be bytes)

  – Field [author], Value(s): Chris Mattmann

• Document 2

  – Field [title], Value(s): “CS572 Web Page”

    • Type: string (text)

  – Field [length], Value(s): 10000

    • Type: int (assumed to be bytes)

  – Field [author], Value(s): Chris Mattmann, Univ. of Southern California


Example: a word document

• Document 3

  – Field [title], Value(s): “My CS572 Final Project”

    • Type: string (text)

  – Field [length], Value(s): 30012

    • Type: int (assumed to be bytes)

  – Field [wordcount], Value(s): 2912

    • Type: int

  – Field [mswordversion], Value(s): 2008, Mac

    • Type: string (text)


Apples to Oranges

• Whether it’s an HTML page, a Word document, a PDF file, etc.

  – We can still use the Document/Field model to represent the content as it is indexed

• The Document Field model works for Metadata, but also for extracted text

  – Define a custom text field containing all extracted, searchable text
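
The catch-all text field could be sketched as follows. Documents are plain dicts of field name to list of values; the field name `text`, the helper names, and the extracted text strings are assumptions for illustration only. The match is a naive substring test, not how a real index would search.

```python
from typing import Dict, List

Document = Dict[str, List[object]]   # field name -> list of values


def with_text(metadata: Document, extracted_text: str) -> Document:
    """Return a copy of the metadata fields plus a catch-all 'text' field."""
    doc = {name: list(values) for name, values in metadata.items()}
    doc["text"] = [extracted_text]
    return doc


def matches(doc: Document, term: str) -> bool:
    """Naive case-insensitive substring match over the 'text' field only."""
    return any(term.lower() in str(v).lower() for v in doc.get("text", []))


# Different source formats, different metadata, same 'text' field.
html_page = with_text({"title": ["CS572 Web Page"], "length": [10000]},
                      "Lecture notes on indexing and crawling")
word_doc = with_text({"title": ["My CS572 Final Project"], "wordcount": [2912]},
                     "This project builds a small search index")

corpus = [html_page, word_doc]
hits = [d["title"][0] for d in corpus if matches(d, "index")]
print(hits)   # ['CS572 Web Page', 'My CS572 Final Project']
```

Because both documents carry the same `text` field, one query can search across formats even though their metadata fields do not overlap.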


What about structure?

• For example, let’s say we are extracting Person records from an RDBMS to index

• We’ve got 2 tables

  – Person

    • Attribute: id, int PK UNIQUE AUTO INCREMENT

    • Attribute: first_name, VARCHAR(255)

    • Attribute: last_name, VARCHAR(255)

  – PersonAddress

    • Attribute: person_id, FK to Person.id

    • Attribute: address_txt, VARCHAR(255)

    • Attribute: zipcode, int


What about structure?

• Example records

  – Person:

    • id, first_name, last_name

    • 1, Chris, Mattmann

    • 2, Homer, Simpson

  – PersonAddress:

    • person_id, address_txt, zipcode

    • 1, 1234 Joe Lane, 91354

    • 2, 6344 Evergreen Terrace, Springfield, IL, 60999


What about structure?

• How to get the aforementioned rows into the Document Field model?

  – Flatten the structure

• Document 1

  – Field [first_name], Value(s): Chris

    • Type: string (text)

  – Field [last_name], Value(s): Mattmann

    • Type: string (text)

  – Field [id], Value(s): 1

    • Type: int

  – Field [address_txt], Value(s): 1234 Joe Lane

    • Type: string (text)

  – Field [zipcode], Value(s): 91354

    • Type: int

• Document 2

  – Field [first_name], Value(s): Homer

    • Type: string (text)

  – Field [last_name], Value(s): Simpson

    • Type: string (text)

  – Field [id], Value(s): 2

    • Type: int

  – Field [address_txt], Value(s): 6344 Evergreen Terrace, Springfield, IL

    • Type: string (text)

  – Field [zipcode], Value(s): 60999

    • Type: int
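
The flattening step above can be sketched in a few lines of Python. The rows are the example records from the slides, held as in-memory dicts rather than read from a real database. The `flatten` function and the choice to name fields after the source columns are assumptions for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]
Document = Dict[str, List[object]]   # field name -> list of values

persons: List[Row] = [
    {"id": 1, "first_name": "Chris", "last_name": "Mattmann"},
    {"id": 2, "first_name": "Homer", "last_name": "Simpson"},
]
addresses: List[Row] = [
    {"person_id": 1, "address_txt": "1234 Joe Lane", "zipcode": 91354},
    {"person_id": 2, "address_txt": "6344 Evergreen Terrace, Springfield, IL",
     "zipcode": 60999},
]


def flatten(persons: List[Row], addresses: List[Row]) -> List[Document]:
    """Join each Person with its PersonAddress rows into one flat document."""
    # Group address rows by the foreign key so the join is a dict lookup.
    by_person: Dict[object, List[Row]] = {}
    for addr in addresses:
        by_person.setdefault(addr["person_id"], []).append(addr)

    docs: List[Document] = []
    for person in persons:
        doc: Document = {name: [value] for name, value in person.items()}
        # A person with several addresses gets multi-valued fields.
        for addr in by_person.get(person["id"], []):
            doc.setdefault("address_txt", []).append(addr["address_txt"])
            doc.setdefault("zipcode", []).append(addr["zipcode"])
        docs.append(doc)
    return docs


docs = flatten(persons, addresses)
print(docs[0])
# {'id': [1], 'first_name': ['Chris'], 'last_name': ['Mattmann'],
#  'address_txt': ['1234 Joe Lane'], 'zipcode': [91354]}
```

The foreign key disappears in the flattened form. If a person had two addresses, the only thing tying each `address_txt` value to its `zipcode` would be its position in the list.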


Benefits of the Document Field model

• Documents are independent, wholly contained entities

  – Reduces ACID dependencies

  – Increases the ability to become eventually consistent

• Fields can be indexed and stored in different ways

  – Reformatted on entry into the index, and reformatted on the way out

    • Geohash is a great example of this

    • Analyzers – implications on query model

    • Tokenizers – implications on query model
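
A toy illustration of why analyzers and tokenizers affect the query model. Field text is tokenized and lowercased on the way into a small in-memory inverted index, so a query only matches if it goes through the same analysis. The `analyze`, `build_index`, and `search` functions are assumptions for this sketch, not the behavior of any specific search engine.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def analyze(text: str) -> List[str]:
    """Tokenize on runs of letters/digits, then lowercase each token."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each analyzed token to the ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents that contain every analyzed query token."""
    tokens = analyze(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for tok in tokens[1:]:
        result &= index.get(tok, set())
    return result


docs = {1: "Chris Mattmann's Web Page", 2: "CS572 Web Page"}
index = build_index(docs)

print(sorted(search(index, "web page")))   # [1, 2]
print(sorted(search(index, "CS572")))      # [2]
# Looking up the raw, un-analyzed term finds nothing, because the
# index only ever stored the lowercased token "cs572".
print(index.get("CS572", set()))           # set()
```

Whatever transformation is applied at index time has to be mirrored at query time, otherwise terms that look identical to a user will not match.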


Challenges

• Reducing structured data to unstructured, flattened data isn’t exactly as easy as the cooked-up example

  – Imagine having to encode values to preserve ordering in some fashion

    • Requires deep understanding of the data and conventions for naming fields and ordering values

• Loss of ACID properties makes it difficult to leverage index structure for search directly in transactional systems

  – Have to stand up search as a separate service outside of the data management system

• Determining the right tuning parameters for indexing

  – Max Buffer Size, When to Optimize, When to Merge, etc.
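
The first challenge, encoding values so that ordering is preserved, can be illustrated with numbers. The sketch assumes an index that compares terms as strings, in which case raw integers sort incorrectly and one common workaround is zero-padding to a fixed width. The `encode_int` and `decode_int` helpers and the width of 10 are assumptions for illustration; the sample values are the [length] fields from the earlier example documents.

```python
def encode_int(value: int, width: int = 10) -> str:
    """Zero-pad a non-negative integer so string order equals numeric order."""
    if value < 0 or len(str(value)) > width:
        raise ValueError("value out of range for this encoding")
    return str(value).zfill(width)


def decode_int(term: str) -> int:
    """Reverse the encoding on the way out of the index."""
    return int(term)


lengths = [3026, 10000, 30012]

# Sorting the raw string forms puts 3026 last, which is numerically wrong.
naive = sorted(str(n) for n in lengths)
print(naive)                              # ['10000', '30012', '3026']

# Sorting the padded forms matches numeric order.
padded = sorted(encode_int(n) for n in lengths)
print(padded)                             # ['0000003026', '0000010000', '0000030012']
print([decode_int(t) for t in padded])    # [3026, 10000, 30012]
```

The fixed width is exactly the kind of decision that requires knowing the data: choose it too small and large values cannot be encoded, and negative numbers or decimals need a different scheme altogether.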


Wrapup

• Introduction to the Document Field indexing model

• Differences between traditional RDBMS models and Search indices

• Know when and where to use each

• Search is optimized for frequent reads and infrequent writes