
File Processing - Indexing MVNC1 Indexing Jim Skon

Indexing
Consider a library card catalog
Allows quick access to books
Why not just order books by author name?
Actually three indexes:
Provides a shortcut, based on a key value, to the desired record.
Each index is based on the value of a certain key (or keys)
Can have indexes for any key field
File Processing - Indexing MVNC
Consider an index file whose records contain:
Primary Key (Record label + Record ID)
Byte Offset
Search the index file (perhaps using binary search)
Seek in main file to the byte offset specified in index
Read record from main file
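The three steps above can be sketched in a few lines. This is only an illustration under assumptions not in the slides: the data file is an in-memory `io.BytesIO`, records are newline-terminated with `|` between fields, and the sample keys and the names `build_index` and `fetch` are made up.

```python
import bisect
import io

# Hypothetical data file: variable-length records in no particular order.
records = [
    b"LON2312|Romeo and Juliet|Prokofiev\n",
    b"ANG3795|Symphony No. 9|Beethoven\n",
    b"COL38358|Nebraska|Springsteen\n",
]
data_file = io.BytesIO(b"".join(records))

def build_index(f):
    """Scan the data file once; return a sorted list of (key, byte offset)."""
    index = []
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.split(b"|", 1)[0].decode()
        index.append((key, offset))
    index.sort()
    return index

def fetch(f, index, key):
    """Binary-search the index, seek to the stored offset, read one record."""
    i = bisect.bisect_left(index, (key,))
    if i == len(index) or index[i][0] != key:
        return None                      # key not in the index
    f.seek(index[i][1])
    return f.readline().rstrip(b"\n").decode()

index = build_index(data_file)
print(fetch(data_file, index, "ANG3795"))
```

The data file itself is never sorted; only the small list of (key, offset) pairs is.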
Load the index file into memory
Rewrite the index file after index change
Add records to the file and index
Delete records from data file
Update records in data file
Create new files
Load the index file into memory
Simply read the index in sequential order, placing the entries into an array of (key, offset) structures
Since the records are small, could read several records at once
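A possible way to load the index in one pass is shown below. The entry layout (a 12-byte space-padded key followed by an 8-byte offset) and the name `load_index` are assumptions for the sake of the example; the slides do not specify a layout.

```python
import io
import struct

# Assumed fixed-size index entry: 12-byte space-padded key + 8-byte unsigned offset.
ENTRY = struct.Struct("<12sQ")

def load_index(f):
    """Read the whole index file with one read call and unpack every entry."""
    f.seek(0)
    raw = f.read()
    return [(key.rstrip(b" ").decode(), offset)
            for key, offset in ENTRY.iter_unpack(raw)]

# Hypothetical index file holding two entries, already in key order.
index_file = io.BytesIO(
    ENTRY.pack(b"ANG3795".ljust(12), 36) + ENTRY.pack(b"LON2312".ljust(12), 0)
)
index = load_index(index_file)
```

Because each entry is only 20 bytes here, reading the file in a single call is cheap compared with one read per entry.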
Simply iterate through array, writing to index file
Can be done after EVERY change
Could wait until files are ready to be closed
Need to keep track of whether the file version is out of date
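One way to track whether the file copy is stale is a "dirty" flag, sketched below. The class name `MemoryIndex`, the entry layout, and the choice to rewrite only on `flush` are assumptions, not something the slides prescribe.

```python
import io
import struct

ENTRY = struct.Struct("<12sQ")  # assumed layout: 12-byte key, 8-byte offset

class MemoryIndex:
    def __init__(self, index_file):
        self.file = index_file
        self.entries = []    # sorted list of (key, offset) held in memory
        self.dirty = False   # True while the file copy is out of date

    def add(self, key, offset):
        self.entries.append((key, offset))
        self.entries.sort()
        self.dirty = True

    def flush(self):
        """Rewrite the whole index file, but only if it is out of date."""
        if not self.dirty:
            return False
        self.file.seek(0)
        self.file.truncate()
        for key, offset in self.entries:
            self.file.write(ENTRY.pack(key.encode().ljust(12), offset))
        self.dirty = False
        return True
```

Calling `flush` after every change gives the "after EVERY change" policy; calling it once before closing gives the deferred policy.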
Add record to main file
Next free record
Maybe a linked list of “unused” records could be used to keep track of available records.
Record order of main file unimportant
Add record to index
Could put at end, sorting occasionally.
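A rough sketch of adding a record follows. It assumes fixed-size record slots and models both files as Python lists; the free list is a plain list used as a stack rather than a linked list, and the index is kept sorted on every insert rather than sorted occasionally. The class name `IndexedFile` is made up.

```python
import bisect

class IndexedFile:
    """In-memory stand-in for a file of fixed-size record slots plus its index."""

    def __init__(self):
        self.slots = []   # main file: record order is unimportant
        self.free = []    # slot numbers of unused records, reused first
        self.index = []   # sorted list of (key, slot)

    def add(self, key, record):
        if self.free:
            slot = self.free.pop()        # reuse an unused slot
            self.slots[slot] = record
        else:
            slot = len(self.slots)        # otherwise append at the end
            self.slots.append(record)
        bisect.insort(self.index, (key, slot))   # keep the index in key order
        return slot
```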
Delete in main file
Delete in index
Perhaps just mark as deleted
Could still search if the key field is still intact
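Marking an entry as deleted could look like the sketch below. Using `None` as the tombstone and collecting freed offsets in a list are assumptions made for illustration.

```python
import bisect

DELETED = None  # assumed tombstone: an offset of None marks a deleted entry

def delete(index, free_offsets, key):
    """Mark the index entry for `key` as deleted and remember its space for reuse."""
    i = bisect.bisect_left(index, (key,))
    if i == len(index) or index[i][0] != key or index[i][1] is DELETED:
        return False                      # no live entry with this key
    free_offsets.append(index[i][1])
    index[i] = (key, DELETED)             # key stays intact, so order is preserved
    return True
```

Because the key is left in place, binary search over the index keeps working without shifting any entries.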
If the update changes the key field
Will need to move the entry in the index
Can be thought of as a delete followed by an insert
If the update does not change the key field
Case one - record does not move
Just rewrite the record
Case two - record moves
Perhaps the record is variable size, and it grows
Index will have to be changed to reflect the new position
Position of the reference in the index is unchanged
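The two kinds of update can be sketched against an in-memory index of (key, offset) pairs. The helper names `find`, `change_key`, and `record_moved` are hypothetical.

```python
import bisect

def find(index, key):
    """Return the position of `key` in the sorted index, or raise KeyError."""
    i = bisect.bisect_left(index, (key,))
    if i == len(index) or index[i][0] != key:
        raise KeyError(key)
    return i

def change_key(index, old_key, new_key):
    """Key field changed: delete the old entry, then insert it under the new key."""
    _, offset = index.pop(find(index, old_key))
    bisect.insort(index, (new_key, offset))

def record_moved(index, key, new_offset):
    """Key unchanged but record relocated: overwrite the offset in place."""
    index[find(index, key)] = (key, new_offset)
```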
Not much better than searching a sorted complete file
Orders of magnitude more expensive than in-memory index management
In such cases consider
A hash file system
However, a file based index still has benefits
Allows binary searching on unordered file
Allows binary searching on variable length records
Indexes are smaller than main files, so somewhat cheaper to manipulate
Allows file “rearrangement” without moving actual records. (Consider when pinned)
Indexing with multiple keys
Consider an additional index for access to album file by composer
Secondary index: fields
Every time record moved in main file, ALL indexes must change
The indexes pin the records!
Refer to primary key rather than offset to actual record
Now the secondary key index doesn’t reference actual records, so records are not pinned.
Main file can be reorganized without changing secondary index
Search secondary index (binary search?)
If found, use associated primary key to look up record in primary index
Use offset in primary index to lookup actual record
Remember - the secondary key may contain multiple matches (e.g. Beethoven)
A secondary key can be thought of as referring to a subset of records
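The two-step lookup might be sketched as follows, with both indexes held as sorted lists of tuples. The sample keys, offsets, and function names are made up for illustration.

```python
import bisect

# Hypothetical sample data.
primary_index = [("ANG3795", 36), ("DG139201", 71), ("LON2312", 0)]   # (primary key, offset)
composer_index = [("BEETHOVEN", "ANG3795"), ("BEETHOVEN", "DG139201"),
                  ("PROKOFIEV", "LON2312")]                            # (secondary key, primary key)

def primary_keys_for(secondary_index, value):
    """Binary-search to the first match, then collect every adjacent duplicate."""
    i = bisect.bisect_left(secondary_index, (value,))
    keys = []
    while i < len(secondary_index) and secondary_index[i][0] == value:
        keys.append(secondary_index[i][1])
        i += 1
    return keys

def offsets_for(secondary_index, primary_index, value):
    """Resolve each matching primary key to a byte offset via the primary index."""
    offsets = []
    for pk in primary_keys_for(secondary_index, value):
        i = bisect.bisect_left(primary_index, (pk,))
        if i < len(primary_index) and primary_index[i][0] == pk:
            offsets.append(primary_index[i][1])
    return offsets
```

Each returned offset would then be used to seek into the main file and read the record.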
Add record in main file and primary index as before
Add entry in primary index
Add entry in secondary file
As before, shift data as needed.
Index entries with duplicate keys are stored together.
Duplicates should be stored in primary key order
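If the secondary index is kept as a sorted list of (secondary key, primary key) tuples, ordinary tuple ordering gives both properties at once: duplicates sit together, and within a group they are in primary key order. A minimal sketch, with a hypothetical function name and sample keys:

```python
import bisect

def add_secondary(secondary_index, secondary_key, primary_key):
    """Insert in sorted position; later entries shift right as needed."""
    bisect.insort(secondary_index, (secondary_key, primary_key))

composer_index = []
add_secondary(composer_index, "PROKOFIEV", "LON2312")
add_secondary(composer_index, "BEETHOVEN", "DG139201")
add_secondary(composer_index, "BEETHOVEN", "ANG3795")
```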
Costly if there are many secondary indexes
Alternative: simply leave the entries in the secondary indexes
Search in the primary index will fail, indicating the record is not available
Failed searches take longer, but file management is simpler (faster)
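The "leave it in the secondary index" approach could be sketched like this. For brevity the primary index is a `dict` and the secondary index is scanned linearly; both are simplifications, and the names and sample data are made up.

```python
# Hypothetical sample data.
primary_index = {"ANG3795": 36, "DG139201": 71, "LON2312": 0}   # primary key -> offset
composer_index = [("BEETHOVEN", "ANG3795"), ("BEETHOVEN", "DG139201"),
                  ("PROKOFIEV", "LON2312")]

def delete_record(primary_index, primary_key):
    """Remove the record from the primary index only; secondary indexes are untouched."""
    primary_index.pop(primary_key, None)

def live_offsets(secondary_index, primary_index, value):
    """Stale secondary entries are skipped when the primary lookup fails."""
    return [primary_index[pk]
            for sk, pk in secondary_index
            if sk == value and pk in primary_index]
```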
Updating records
The fact that secondary indexes refer to the primary key insulates secondary indexes from most updates
Records can move in main file without affecting secondary index
Change in secondary key
If a secondary key value changes, then we must change the key value in secondary index, requiring secondary index reordering
Other secondary indexes unchanged
Change of primary key value
All secondary indexes must be updated to refer to the new key value
Since the secondary key is unchanged, no reorganization required in secondary indexes - just rewrite index entries in same spot
Usually one index entry needs updating per secondary index.
The main record itself simplifies looking up the associated reference in each secondary index!
Find all records of Beethoven’s work
Find all records of “Violin Concerto”
Each requires only a single index!
Now consider:
Find all records with composer = “Beethoven” and title = “Symphony No. 9”.
Method one:
Search composer index for those matching Beethoven. This yields a list of primary keys.
Next search title index for those matching “Symphony No. 9”. This also yields a list of primary keys.
Now intersect the two primary key lists. The result is a list of primary keys for the records which match the query.
General Strategies
and queries: Intersect primary key lists
or queries: Union primary key lists
Point: Complex queries can be performed accessing only the matching records!
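With Python sets standing in for the primary key lists, the two strategies reduce to `&` and `|`. The index contents and function names below are made up, and the linear scan in `matches` is a simplification of a real index search.

```python
# Hypothetical secondary indexes: (secondary key, primary key).
composer_index = [("BEETHOVEN", "ANG3795"), ("BEETHOVEN", "DG139201"),
                  ("PROKOFIEV", "LON2312")]
title_index = [("ROMEO AND JULIET", "LON2312"), ("SYMPHONY NO. 9", "ANG3795"),
               ("SYMPHONY NO. 9", "WAR23699")]

def matches(secondary_index, value):
    """Set of primary keys whose secondary key equals `value`."""
    return {pk for sk, pk in secondary_index if sk == value}

def and_query(composer, title):
    return matches(composer_index, composer) & matches(title_index, title)   # intersect

def or_query(composer, title):
    return matches(composer_index, composer) | matches(title_index, title)   # union
```

Only the primary keys in the result need to be looked up in the main file; no other records are read.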
Consider problems with this secondary index structure:
We have to rearrange the index file every time a new record is added!
If we add a new version of Beethoven’s Symphony No. 9, we would have to add a new element to both the composer and the title indexes
If there are duplicate secondary keys, the secondary key value is stored in the secondary index once for every record with the secondary key!
Beethoven is stored in secondary index once for every Beethoven record in the main file.
Waste of space!
Inverted lists
Solution one:
Increase secondary index record size to include a list of all primary keys with matching values.
Solves the two problems
Wastes space!
The Bible Index is a type of Inverted List
Works ok since never updated
If updates needed, MANY records would have to be moved
A list of secondary keys (all unique)
Each entry contains a pointer to a list of primary key references
Now each key value stored exactly once
But how do we maintain the lists of primary key references?
Solution - linked lists!
Two data structures
A list of secondary keys, with pointers into a list of references
A list of references, each with a (next) pointer, which refers to another reference in the list, or null
Inverted lists
The secondary key list is no bigger than the number of distinct secondary key values
Can often be stored in RAM
Lookups - binary search
Unused reference records are maintained as a linked list of free records
Records are added by delinking one from the free list and linking it into the appropriate secondary key’s list.
A record can be deleted by removing it from the key’s linked list and linking it into the free list.
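The two structures and the free list can be sketched as below. This is an illustration under assumptions: the secondary key list is a `dict` rather than a sorted, binary-searched array, "pointers" are positions in a Python list with -1 standing for null, and the class and method names are invented.

```python
class InvertedList:
    def __init__(self):
        self.heads = {}       # secondary key -> position of first reference (-1 = none)
        self.refs = []        # reference records: [primary_key, next]
        self.free_head = -1   # first record of the free list (-1 = empty)

    def _alloc(self):
        """Delink a record from the free list, or grow the reference list."""
        if self.free_head != -1:
            node = self.free_head
            self.free_head = self.refs[node][1]
            return node
        self.refs.append([None, -1])
        return len(self.refs) - 1

    def add(self, skey, pkey):
        """Link a reference into skey's list, keeping primary key order."""
        node = self._alloc()
        prev, cur = -1, self.heads.get(skey, -1)
        while cur != -1 and self.refs[cur][0] < pkey:
            prev, cur = cur, self.refs[cur][1]
        self.refs[node] = [pkey, cur]
        if prev == -1:
            self.heads[skey] = node
        else:
            self.refs[prev][1] = node

    def remove(self, skey, pkey):
        """Unlink the reference and push its record onto the free list."""
        prev, cur = -1, self.heads.get(skey, -1)
        while cur != -1 and self.refs[cur][0] != pkey:
            prev, cur = cur, self.refs[cur][1]
        if cur == -1:
            return False
        if prev == -1:
            self.heads[skey] = self.refs[cur][1]
        else:
            self.refs[prev][1] = self.refs[cur][1]
        self.refs[cur] = [None, self.free_head]
        self.free_head = cur
        return True

    def lookup(self, skey):
        """Follow the chain of next pointers, collecting primary keys."""
        out, cur = [], self.heads.get(skey, -1)
        while cur != -1:
            out.append(self.refs[cur][0])
            cur = self.refs[cur][1]
        return out
```

Adding or deleting a reference only rewrites a few pointers; no reference records are shifted.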
Consider a “special” index for Christian music
