
MOTIVATION

IR: representation, storage, organization of, and access to information items

Focus is on the user information need

User information need: Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.

Emphasis is on the retrieval of information (not data)


Motivation
► IR at the center of the stage

IR in the last 20 years:
► classification and categorization
► systems and languages
► user interfaces and visualization

Still, area was seen as of narrow interest

Advent of the Web changed this perception once and for all
► universal repository of knowledge
► free (low cost) universal access
► no central editorial board
► many problems though: IR seen as key to finding the solutions!


Motivation
► Data retrieval
  • which docs contain a set of keywords?
  • well-defined semantics
  • a single erroneous object implies failure!
► Information retrieval
  • information about a subject or topic
  • semantics is frequently loose
  • small errors are tolerated
► IR system:
  • interpret contents of information items
  • generate a ranking which reflects relevance
  • notion of relevance is most important


WHAT IS IR?

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).


WHAT IS IR?

Salton-1989: “Information-retrieval systems process files of records and requests for information, and identify and retrieve from files certain records in response to the information requests. The retrieval depends on similarity between records and queries.”

Kowalski-1997: “An information retrieval system is a system that is capable of storage, retrieval, and maintenance of information.”


WHAT IS IR?

System-centered view: IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information.

User-centered view: IR involves helping users find information that matches their information needs.


IR SYSTEMS

IR systems contain three components:
► System
► People
► Documents (information items)

[Figure: diagram with the labels User, System, Documents]


Basic Concepts
► The User Task

Retrieval
► information or data

Browsing (main objectives are not clearly defined and might change during the interaction)
► task of hypertext

Modern digital libraries and web interfaces try to combine retrieval + browsing
► WWW (retrieval & browsing): either pulling or pushing

[Figure: the user tasks Retrieval and Browsing shown against a Database]


Basic Concepts
► Logical view of the documents
  • Text operations or transformations reduce the complexity of the document representation and allow moving the logical view from that of full text to that of a set of index terms
► Several intermediate logical views (of a document) might be adopted

[Figure: logical view of a document. Labels: Docs, structure, accents and spacing, stopwords, noun groups, stemming, manual indexing; resulting views: structure, full text, index terms]


LOGICAL VIEW OF THE DOCUMENTS

► “Stop words”
  • Certain words are considered irrelevant and are not placed in the bag (e.g., "the", HTML tags like <H1>)
► “Stemming” and other content analysis
  • Stemming: reduce terms to their roots before indexing
  • Use English-specific rules to convert words to their basic form, for example: surfing, surfed → surf
► Identification of noun groups: eliminates adjectives, adverbs, and verbs
► Logical view of docs might shift
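The stop-word and stemming steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the stop-word list `STOP_WORDS`, the suffix rules `SUFFIXES`, and the helper names `naive_stem` and `index_terms` are all hypothetical and chosen for illustration; a real system would likely use a proper stemmer (for example Porter's algorithm) rather than these three rules.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "are", "to"}

# Hypothetical English suffix rules, tried in order (not a real stemmer).
SUFFIXES = ("ing", "ed", "s")

TAG_RE = re.compile(r"<[^>]+>")   # HTML tags such as <H1>
TOKEN_RE = re.compile(r"[a-z]+")  # runs of lowercase letters


def naive_stem(word):
    """Strip one known suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Full text -> index terms: drop tags, tokenize, drop stop words, stem."""
    text = TAG_RE.sub(" ", text).lower()
    tokens = TOKEN_RE.findall(text)
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(index_terms("<H1>Surfing the Web</H1> She surfed and surfs"))
# ['surf', 'web', 'she', 'surf', 'surf']
```

The three surface forms "surfing", "surfed", and "surfs" collapse to the single index term "surf", which is the shift in logical view the slide describes.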


TYPICAL IR SYSTEM

[Figure: block diagram of a typical IR system. Labels: Queries, Document, Input, Processor, Output, Feedback]


The Retrieval Process

[Figure: architecture of the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index (inverted file), DB Manager Module, Text Database. Labeled flows: user need, query, logical view, retrieved docs, ranked docs, user feedback, text]


BASIC CONCEPTS

Text operations: form index words (tokens)
► Tokenization
► Stopword removal
► Stemming

Indexing: constructs an inverted index of word-to-document pointers
► Mapping from keywords to document ids

Searching: retrieves documents that contain a given query token from the inverted index

Ranking: scores all retrieved documents according to a relevance metric
► A ranking is based on fundamental premises regarding the notion of relevance, such as:
  • common sets of index terms
  • sharing of weighted terms
  • likelihood of relevance
► Each set of premises leads to a distinct IR model
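The indexing, searching, and ranking steps can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy collection `DOCS`, the whitespace tokenizer, and the scoring rule, which uses only the "common set of index terms" premise (the number of distinct query terms a document shares). It is not a specific IR model from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical toy collection: document id -> text (assumption for illustration).
DOCS = {
    1: "college tennis team wins ncaa tournament",
    2: "university tennis courts open to students",
    3: "ncaa basketball tournament schedule",
}


def tokenize(text):
    """Text operation: lowercase and split on whitespace (no stemming here)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return index


def search(index, query):
    """Searching: ids of documents containing at least one query token."""
    hits = set()
    for term in tokenize(query):
        hits.update(index.get(term, {}))
    return hits


def rank(index, query, doc_ids):
    """Ranking: score = number of distinct query terms the document shares."""
    terms = set(tokenize(query))
    scores = {d: sum(1 for t in terms if d in index.get(t, {})) for d in doc_ids}
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_inverted_index(DOCS)
query = "ncaa tennis tournament"
ranked = rank(index, query, search(index, query))
print(ranked)  # [(1, 3), (3, 2), (2, 1)]
```

Swapping the scoring rule inside `rank` for a weighted or probabilistic one is where a different set of premises would lead to a different IR model.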


BASIC CONCEPTS

User interface: manages interaction with the user
► Query input and document output
► Visualization of results
► Relevance feedback

Query operations: transform the query to improve retrieval
► Query expansion using a thesaurus (chpt2)
► Query transformation using relevance feedback (chpt2)
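Query expansion using a thesaurus can be sketched as follows. This is a minimal illustration, assuming a hand-made dictionary `THESAURUS` and a hypothetical helper `expand_query`; real systems typically draw synonyms from a curated or automatically built thesaurus and weight the added terms.

```python
# Hypothetical hand-made thesaurus (assumption for illustration only).
THESAURUS = {
    "restaurant": ["cafe", "diner"],
    "car": ["automobile"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("cheap restaurant"))
# ['cheap', 'restaurant', 'cafe', 'diner']
```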


BASIC CONCEPTS

Simplest notion of relevance is that the query string appears verbatim (order is important) in the document

Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words)

May not retrieve relevant documents that include synonymous terms
► Restaurant vs. café

May retrieve irrelevant documents that include ambiguous terms
► Apple (company or fruit)
► Bit (unit of data or act of eating)
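The two notions of relevance can be contrasted in a short Python sketch. The function names `verbatim_match` and `bag_of_words_score` and the sample texts are assumptions for illustration; the verbatim check here is a plain substring test, which is only one possible reading of "appears verbatim".

```python
from collections import Counter


def verbatim_match(query, document):
    """Strict notion: the query string appears as-is, so word order matters."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(document.lower().split())
    return sum(counts[word] for word in set(query.lower().split()))


doc = "the tournament for college tennis teams"
print(verbatim_match("college tennis", doc))      # True
print(verbatim_match("tennis college", doc))      # False: order differs
print(bag_of_words_score("tennis college", doc))  # 2: order is ignored

# The synonym problem: a document about a cafe scores 0 for "restaurant".
print(bag_of_words_score("restaurant", "a cozy cafe downtown"))  # 0
```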


DATA AND INFORMATION

Data
► Strings of symbols associated with objects, people, and events
► Values of an attribute

Data need not have meaning to everyone

Data must be interpreted with associated attributes.


DATA AND INFORMATION

Information
► The meaning of the data interpreted by a person or a system
► Data that changes the state of a person or system that perceives it
► Data that reduces uncertainty
  • If data contain no uncertainty, there is no information in the data.
  • Examples: "It snows in the winter." vs. "It does not snow this winter."
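One common way to quantify "data that reduces uncertainty" is Shannon's self-information, I(x) = log2(1 / p(x)), measured in bits. The slides do not name this measure, so the sketch below is an assumed interpretation, and the two probabilities are invented purely to illustrate the snow examples.

```python
import math


def self_information(probability):
    """Shannon self-information in bits: I(x) = log2(1 / p(x))."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log2(1.0 / probability)


# Assumed probabilities, chosen only to illustrate the two example sentences.
p_snow_in_winter = 1.0        # treated as certain: no uncertainty to reduce
p_no_snow_this_winter = 0.05  # treated as a rare event

print(self_information(p_snow_in_winter))       # 0.0 bits
print(self_information(p_no_snow_this_winter))  # roughly 4.32 bits
```

Under these assumptions the certain statement carries no information, while the surprising one carries several bits.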


INFORMATION AND KNOWLEDGE

Knowledge
► Structured information
  • through structuring, information becomes understandable
► Processed information
  • through processing, information becomes meaningful and useful
► Information shared and agreed upon within a community

[Figure: progression from data to information to knowledge]


TEXT

Strings of ASCII or Unicode symbols
► structured by the author
► indexed by information service providers

Representation of the natural languages people use
► to convey meanings
► to communicate between readers and authors

Data or information?
► If it can be understood, it’s information.
► By whom? A person or a system?


DOCUMENTS

Logical unit of text
► articles, books, links, web pages

Other components that come with the text
► figures, charts, graphics
► multimedia


TEXTUAL DATA

Repository of human intellect
► Rich and diverse resources for all answers
► If it is written, it is there (in text)

Meaningful and understandable (to users)

Simple ASCII representation

Free of pre-formatted structures
► continuous
► separated into documents

Easy to process by the computer
► Machine intensive (not labor intensive)


PROBLEMS WITH TEXT

Massive
► Any IR system needs the capability of large-scale data processing.
► Indexes and various representations are required.

Inconsistent
► It’s a human language
► Syntactic and semantic variance
  • Same information expressed in different ways.
  • Different information expressed in similar ways.

Incomplete
► It uses common knowledge.
► It’s an open system.


RETRIEVAL

What do we retrieve?
► Data
► Information
► Knowledge

We retrieve documents that contain text which carries information.

Information can be anywhere: in the text, in the links, in the processing of the text.


INFORMATION RETRIEVAL

Are they the same?
► Text retrieval
► Document retrieval
► Information retrieval


INFORMATION RETRIEVAL

Conceptually, information retrieval is used to cover all related problems in finding needed information

Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit

Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.


DATA RETRIEVAL VS. INFORMATION RETRIEVAL

                      Data retrieval        Information retrieval
Content               Data                  Information
Data object           Table                 Document
Matching              Exact match           Partial match, best match
Items wanted          Matching              Relevant
Query language        SQL (artificial)      Natural
Query specification   Complete              Incomplete
Model                 Deterministic         Probabilistic
                      Highly structured     Less structured
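The contrast between the two columns can be made concrete with a small Python sketch. It is an illustration under stated assumptions: the table `teams`, its rows, the document texts, and the helper `best_match` are all hypothetical. The data retrieval half uses the standard-library `sqlite3` module for an exact-match SQL query; the information retrieval half ranks documents by the fraction of query words they contain, which is one simple form of partial match.

```python
import sqlite3

# Hypothetical rows and documents (assumptions for illustration).
rows = [("Alpha U", "USA", "tennis"), ("Beta U", "UK", "tennis"), ("Gamma U", "USA", "golf")]
docs = {
    "d1": "usa university tennis team plays in the ncaa tournament",
    "d2": "local tennis club schedule",
    "d3": "university chess club",
}

# Data retrieval: an artificial query language (SQL) with exact-match semantics.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (university TEXT, country TEXT, sport TEXT)")
conn.executemany("INSERT INTO teams VALUES (?, ?, ?)", rows)
exact = conn.execute(
    "SELECT university FROM teams WHERE country = ? AND sport = ? ORDER BY university",
    ("USA", "tennis"),
).fetchall()
conn.close()


def best_match(query, documents):
    """Information retrieval: rank by fraction of query words present."""
    words = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(words & set(text.lower().split()))
        if overlap:
            scored.append((doc_id, overlap / len(words)))
    # Best score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


print(exact)                                          # [('Alpha U',)]
print(best_match("college tennis teams ncaa", docs))  # [('d1', 0.5), ('d2', 0.25)]
```

The SQL query returns only rows that satisfy every condition, while the ranked list includes partially matching documents ordered by how well they match.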


SUMMARY

The goal of IR systems is to help users find information that satisfies their information needs.

The main process of IR systems is to match data abstracted from the real world to queries abstracted from users’ information needs.

Information retrieval is much more difficult than data retrieval.