2005 fall cs523_lecture_4

IMPLEMENTATION OF INFORMATION RETRIEVAL

SYSTEMS VIA RDBMS

Relational Database: Definitions

Relational database: a set of relationsRelation: made up of 2 parts: Instance : a table, with rows and columns.

#Rows = cardinality, #fields = degree / arity. Schema : specifies name of relation, plus name and

type of each column. E.G. Students(sid: string, name: string, login: string,

age: integer, gpa: real).

Can think of a relation as a set of rows or tuples (i.e., all rows are distinct).

Example Instance of Students Relation

sid name login age gpa

53666 Jones jones@cs 18 3.4

53688 Smith smith@eecs 18 3.2

53650 Smith smith@math 19 3.8

Cardinality = 3, degree = 5, all rows distinct

Relational Query Languages

A major strength of the relational model: supports simple, powerful querying of data. Queries can be written intuitively, and the DBMS is responsible for efficient evaluation.

The SQL Query Language

Developed by IBM (system R) in the 1970sNeed for a standard since it is used by many vendorsStandards: SQL-86 SQL-89 (minor revision) SQL-92 (major revision, current standard) SQL-99 (major extensions)

The SQL Query Language

To find all 18 year old students, we can write:

SELECT *FROM Students SWHERE S.age=18

•To find just names and logins, replace the first line:

SELECT S.name, S.login

sid name login age gpa

53666 Jones jones@cs 18 3.4

53688 Smith smith@ee 18 3.2

Querying Multiple Relations

SELECT S.name, E.cidFROM Students S, Enrolled EWHERE S.sid=E.sid AND E.grade=“A”

S.name E.cid

Smith Topology112

sid cid grade53831 Carnatic101 C53831 Reggae203 B53650 Topology112 A53666 History105 B

Creating Relations in SQL

Creates the Students relation. Observe that the type (domain) of each field is specified, and enforced by the DBMS whenever tuples are added or modified. As another example, the Enrolled table holds information about courses that students take.

CREATE TABLE Students(sid: CHAR(20), name: CHAR(20), login: CHAR(10), age: INTEGER, gpa: REAL)

CREATE TABLE Enrolled(sid: CHAR(20), cid: CHAR(20), grade:

CHAR(2))

Combining Separate Systems

Use an IR and RDBMS systems which are independent.Divide the query into two:

Structured part for the RDBMS Unstructured (text) part for the IR

Combine the results from IR and RDBMSGood for letting each vendor develop its own systemBad for data integrity, recovery, portability, and performance

User Defined Operators

Allow users to modify SQL by adding their own functions Some vendors used this approach (such as IBM DB2 text extender)Lynch and Stonebreaker defined “user defined operators” to implement information retrieval in 1988

//Retrieves documents that contain term1, term2, term3SELECT Doc_IdFROM DocWHERE SEARCH-TERM(Text, Term1, Term 2, Term3)

//Retrieves documents that contain term1, term2, term3// within a window of 5 termsSELECT Doc_IdFROM DocWHERE PROXIMITY(Text,5, Term1, Term 2, Term3)

Non-First Normal Form Approaches

Capture the many-to-many relationships into sets via nested relationsHard to implement ad-hoc queriesNo standard yet

Using RDBMS for IR

Benefits: Recovery Performance Data migration Concurrency Control Access control mechanism Logical and physical data independence

Using RDBMS for IR

Example: A bibliography that includes both structured and unstructured information

DIRECTORY (name, institution) : affiliation of the author AUTHOR(name,DocId) :authorship information INDEX (name, DocId) :terms that are used to index a

document

Using RDBMS for IR

Preprocessing SGML can be used as a starting point which is a standard

for defining parts of documents

<DOC><DOCNO> WSJ834234234 </DOCNO> <HL> How to make students suffer in IR Course </HL><DD> 03/23/87</DD><DATELINE> Sabanci, Turkey </DATELINE><TEXT>Crawler HW, Inverted Index, Querying</TEXT></DOC>

Using RDBMS for IR

Preprocessing SGML can be used as a starting point which is a standard

for defining parts of documents Use a parser together with a hash function to identify

terms Use STOP_TERM table for referencing stop words Produce three output tables

INDEX (DocId, Term, TermFrequency) : Models the inverted index

DOC (DocId, DocName, PubDate, DateLine) : Document metadata

TERM (Term, Idf) : stored the weights of each term//Construct TERM table, N is the total number of documentsINSERT INTO TERMSELECT Term,log(N/Count(*))FROM INDEXGROUP BY Term

Using RDBMS for IR

An offset can be added together with the term to be able to answer proximity queries. For example “Vice President” should occur together in the same document for relevant documents etc.

INDEX_PROX (DocId, Term, OffSet)

//Construct TERM table, N is the total number of documentsINSERT INTO INDEXSELECT DocId, Term, COUNT(*)FROM INDEX_PROXGROUP BY DocId, Term

Using RDBMS for IR

Query can be modeled as a relation as well when it is a long document

QUERY(Term,TermFreq)

Ex: “Find all news documents written on 03/03/2005 about Sabanci University

Data will be extracted from the structured fields Terms will be extracted using the inverted index

SELECT d.DocIdFROM DOC d, INDEX iWHERE i.Term IN (“Sabanci”, “University”) AND d.PubDate = “03/03/2005” AND d.DocId = i.DocId

Using RDBMS for IR

Boolean Queries: Consists of terms with boolean operators (AND, OR, and NOT)For a single inputTerm: retrieve the document texts that contain that term

SELECT d.TextFROM DOC d, WHERE d.DocId IN (SELECT DISTINCT (i.DocId) FROM INDEX i WHERE i.Term = inputTerm)

Note that we can store the text part of a document using BLOB or CLOG (Binary or Character Large Object)

Using RDBMS for IR

Boolean Queries that contain OR

SELECT DISTINCT (i.DocId)FROM INDEX i WHERE i.Term = inputTerm1 OR i.Term = inputTerm2 OR ….. i.Term = inputTermn OR

Using RDBMS for IR

Boolean Queries that contain AND

SELECT DISTINCT (i.DocId)FROM INDEX i WHERE i.Term = inputTerm1 AND i.Term = inputTerm2 AND ….. i.Term = inputTermn AND

??

Using RDBMS for IR

Boolean Queries that contain AND (Previous Answer Was Wrong)

SELECT DISTINCT (i.DocId)FROM INDEX i1, INDEX i2, INDEX i3, …. INDEX inWHERE i1.Term = inputTerm1 AND i2.Term = inputTerm2 AND ….. in.Term = inputTermn AND i1.DocID = i2.DocId AND i2.DocID = i3.DocId AND … in-1 = in.DocID

OR YOU CAN USE INTERSECTION

Using RDBMS for IR

Boolean Queries that contain ANDCommercial DBMSs are not able to process more than a fixed number of joins.Solution

SELECT i.DocIdFROM INDEX i, Query q WHERE i.Term = q.termGROUP BY i.DocIdHAVING COUNT(i.Term) = (SELECT COUNT(*) FROM QUERY)

Works only when the INDEX contains only one occurrence of a given termTogether with its frequency. No Proximity is recorded.

Using RDBMS for IR

Boolean Queries that contain ANDCommercial DBMSs are not able to process more than a fixed number of joins.Solution for terms appearing more than once in the INDEX

SELECT i.DocIdFROM INDEX i, Query q WHERE i.Term = q.termGROUP BY i.DocIdHAVING COUNT(DISTINCT(i.Term)) = (SELECT COUNT(*) FROM QUERY)

This is slower since DISTINC requires a sort for duplicate elimination.

Using RDBMS for IR

Boolean Queries that contain ANDCommercial DBMSs are not able to process more than a fixed number of joins.Implementation of TAND (Threshold AND) is also simple

SELECT i.DocIdFROM INDEX i, Query q WHERE i.Term = q.termGROUP BY i.DocIdHAVING COUNT(DISTINCT(i.Term)) > k

Using RDBMS for IR

Proximity Queries for terms within a specific window width

SELECT a.DocIdFROM INDEX_PROX a, INDEX_PROX b WHERE a.Term IN (SELECT q.Term FROM QUERY q) AND b.Term IN (SELECT q.Term FROM QUERY q) AND a.DocId = b.DocId AND (a.offset –b.offset) BETWEEN 0 AND (width-1)GROUP BY a.DocId, b.DocId, a.Term, a.offsetHAVING COUNT(DISTINCT(b.Term)) = SELECT (COUNT(*) FROM QUERY)

Using RDBMS for IR

Calculating Relevance

SELECT i.DocId, SUM(q.tf*t.idf*t.tf*t.idf)FROM QUERY q, INDEX i, TERM tWHERE q.Term = t.term AND i.Term = t.TermGROUP BY i.DocIdORDER BY 2 DESC

Technology

2005 fall cs523_lecture_4