25
Introduction to Information Retrieval Aj. Khuanlux Mitsophonsiri CS.426 INFORMATION RETRIEVAL

Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL

Embed Size (px)

Citation preview

Introduction toInformation Retrieval

Aj. Khuanlux Mitsophonsiri CS.426 INFORMATION RETRIEVAL

2

What is Information Retrieval?

Google ?Yahoo ?MSN search?What goes on behind search engine???How do they work??????

3

..Many Question

What makes a system like Google search trick?How does it gather information?What trick does it use?Extending beyond the Web

How can those approaches be made better?Natural language understanding?User interactions?

4

..Many Question

How can we do to make things work quickly?Faster computer?Caching?Compression?

How do we decide it works well?For all queries?For special types of queries?On every collection of information?

5

Motivation

IR: representation, storage, organization of, and access to information items

Focus is on the user information needUser information need:

Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.

Emphasis is on the retrieval of information (not data)

6

Motivation

Data retrieval which documents contain a set of keywords? Well defined semantics a single erroneous object implies failure!

Information retrieval information about a subject or topic semantics is frequently loose small errors are tolerated

IR system: interpret contents of information items generate a ranking which reflects relevance notion of relevance is most important

7

Motivation

IR at the center of the stage IR in the last 20 years:

classification and categorization systems and languages user interfaces and visualization

Advent of the Web changed this perception once and for all

universal repository of knowledge free (low cost) universal access no central editorial board many problems though: IR seen as key to finding the

solutions!

8

Information Retrieval

The indexing and retrieval of textual documents.

Concerned firstly with retrieving relevant documents to a query.

Concerned secondly with retrieving from large sets of documents efficiently.

9

Typical IR Task

Given:A corpus of textual natural-language

documents.A user query in the form of a textual string.

Find:A ranked set of documents that are relevant

to the query.

10

Relevance

Relevance is a subjective judgment and may include:Being on the proper subject.Being timely (recent information).Being authoritative (from a trusted source).Satisfying the goals of the user and his/her

intended use of the information (information need).

11

Relevance

Much of IR depends upon idea that Similar vocabulary -> relevant to same queries

Usually look for documents matching query words “Similar” can be measured in many ways

String matching/comparison Same vocabulary used Probability that documents arise from same model Same meaning of text

12

Keyword Search

Simplest notion of relevance is that the query string appears verbatim in the document.

Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

13

Problems with Keywords

May not retrieve relevant documents that include synonymous terms. “restaurant” vs. “café”

May retrieve irrelevant documents that include ambiguous terms. “bat” (baseball vs. mammal) “Apple” (company vs. fruit) “bit” (unit of data vs. act of eating)

14

Intelligent IR

Taking into account the meaning of the words used.

Taking into account the order of words in the query.

Adapting to the user based on direct or indirect feedback.

Taking into account the authority of the source.

15

IR Basic Concepts

The User TaskRetrieval

information or datapurposeful

Browsingglancing aroundF1; cars, Le Mans, France, tourism

Retrieval

Browsing

Database

16

IR Basic Concepts

Logical view of documents

Document representation viewed as a continuum: logical view of documents might shift

Docs

structure

Accentsspacing stopwords

Noungroups stemming

Manual indexing

structure Full text Index terms

17

IR System Architecture

UserInterface

Text Operations

Query Operations Indexing

Searching

Ranking

Index

Text

query

user need

user feedback

ranked docs

retrieved docs

logical viewlogical view

inverted file

DB Manager Module

Text Database

Text

18

IR System Components

Text: Operations forms index words (tokens). Stopword removal Stemming

Indexing: constructs an inverted index of word to document pointers.

Searching: retrieves documents that contain a given query token from the inverted index.

Ranking: scores all retrieved documents according to a relevance metric.

19

IR System Components

User Interface: manages interaction with the user: Query input and document output. Relevance feedback. Visualization of results.

Query Operations: transform the query to improve retrieval: Query expansion using a thesaurus.

Query transformation using relevance feedback.

20

Web Search

Application of IR to HTML documents on the World Wide Web.

Differences: Must assemble document corpus by spidering the

web. Can exploit the structural layout information in

HTML (XML). Documents change uncontrollably. Can exploit the link structure of the web.

21

Web Search System

Query String

IRSystem

RankedDocuments

1. Page12. Page23. Page3 . .

Documentcorpus

Web Spider

22

History of IR

1960-70’s: Initial exploration of text retrieval systems

for “small” corpora of scientific abstracts, and law and business documents.

Development of the basic Boolean and vector-space models of retrieval.

Prof. Salton and his students at Cornell University are the leading researchers in the area.

23

History of IR

1980’s:Large document database systems, many

run by companies:Lexis-NexisDialogMEDLINE

24

History of IR

1990’s:Searching FTP able documents on the

InternetArchieWAIS

Searching the World Wide WebYahooAltavista

25

History of IR

2000’s & continued :Link analysis for Web Search

GoogleMultimedia IR

ImageVideoAudio and music

Document Summarization