MOTIVATION
IR: representation, storage, organization of, and access to information items
Focus is on the user information need
User information need: Find all docs containing information on college
tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.
Emphasis is on the retrieval of information (not data)
Motivation►IR at the center of the stage
IR in the last 20 years:
►classification and categorization
►systems and languages
►user interfaces and visualization
Still, the area was seen as of narrow interest
The advent of the Web changed this perception once and for all:
►universal repository of knowledge
►free (low cost) universal access
►no central editorial board
►many problems though: IR is seen as key to finding the solutions!
Motivation►Data retrieval
• which docs contain a set of keywords?
• well-defined semantics
• a single erroneous object implies failure!
►Information retrieval
• information about a subject or topic
• semantics is frequently loose
• small errors are tolerated
►IR system:
• interprets the contents of information items
• generates a ranking which reflects relevance
• the notion of relevance is most important
WHAT IS IR?
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
WHAT IS IR?
Salton (1989): "Information-retrieval systems process files of records and requests for information, and identify and retrieve from files certain records in response to the information requests. The retrieval depends on similarity between records and queries."
Kowalski (1997): "An information retrieval system is a system that is capable of storage, retrieval, and maintenance of information."
WHAT IS IR?
IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information.
IR involves helping users find information that matches their information needs.
(Diagram: the system-centered view and the user-centered view of IR)
IR SYSTEMS
IR systems contain three components:
►System
►People (users)
►Documents (information items)
Basic Concepts►The User Task
►Retrieval: of information or data
►Browsing: the main objectives are not clearly defined and might change during the interaction
• the typical task in hypertext systems
►Modern digital libraries and web interfaces try to combine retrieval and browsing
►WWW (retrieval and browsing): either pulling or pushing
(Diagram: the user's tasks against the database are retrieval and browsing)
Basic Concepts►Logical view of the documents
►Text operations (or transformations) reduce the complexity of the document representation and allow moving the logical view from that of the full text to that of a set of index terms
►Several intermediate logical views (of a document) might be adopted
(Diagram: docs pass through structure recognition, accents and spacing, stopwords, noun groups, stemming, and manual indexing; the logical view moves from structure plus full text, to full text, to index terms)
LOGICAL VIEW OF THE DOCUMENTS
►"Stop words": certain words are considered irrelevant and are not placed in the bag (e.g., "the", HTML tags like <H1>)
►"Stemming" and other content analysis
• Stemming: reduce terms to their roots before indexing
• Use English-specific rules to convert words to their basic form (for example, "surfing" and "surfed" become "surf")
►Identification of noun groups: eliminates adjectives, adverbs, and verbs
►The logical view of the docs might shift
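The text operations above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stop-word list and the three suffix rules are made up for the example and are far cruder than a real stemmer, and the function names are hypothetical.

```python
import re

# Assumed, deliberately tiny stop-word list; real systems use longer lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "are", "to"}

# Assumed toy suffix rules, tried in order; this is not the Porter stemmer.
SUFFIXES = ("ing", "ed", "s")


def strip_markup(text):
    """Drop HTML-style tags such as <H1> so they never reach the bag."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Full text -> index terms (markup, stop words, and suffixes removed)."""
    tokens = tokenize(strip_markup(text))
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


print(index_terms("<H1>Surfing the Web</H1> The user surfed."))
# ['surf', 'web', 'user', 'surf']
```

Each step shrinks the logical view: the tags and stop words disappear, and "surfing" and "surfed" collapse into the single index term "surf".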
TYPICAL IR SYSTEM
(Diagram: input (documents and queries), processor, output, and a feedback loop)
The Retrieval Process
(Diagram: components are the user interface, text operations, query operations, indexing, searching, ranking, the index (inverted file), the DB manager module, and the text database; the labeled flows are user need, logical view, query, retrieved docs, ranked docs, and user feedback)
BASIC CONCEPTS
►Text operations: form the index words (tokens)
• tokenization
• stopword removal
• stemming
►Indexing: constructs an inverted index of word-to-document pointers
• a mapping from keywords to document IDs
►Searching: retrieves documents that contain a given query token from the inverted index
►Ranking: scores all retrieved documents according to a relevance metric
• A ranking is based on fundamental premises regarding the notion of relevance, such as: a common set of index terms, sharing of weighted terms, likelihood of relevance
• Each set of premises leads to a distinct IR model
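The indexing, searching, and ranking steps can be illustrated with a small Python sketch. It rests on assumptions that are not in the slides: the three-document collection is invented, and the relevance metric is simply the summed frequency of the query terms, which is only one of many possible premises.

```python
from collections import Counter, defaultdict

# Hypothetical three-document collection, already lowercased.
DOCS = {
    1: "college tennis teams in the ncaa tournament",
    2: "tennis rackets and tennis balls for tennis clubs",
    3: "college football teams",
}


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.split()).items():
            index[term][doc_id] = freq
    return index


def search(index, query):
    """Return the ids of documents containing at least one query term."""
    hits = set()
    for term in query.split():
        hits.update(index.get(term, {}))
    return hits


def rank(index, query, doc_ids):
    """Order documents by summed query-term frequency, ties broken by doc id."""
    def score(doc_id):
        return sum(index.get(t, {}).get(doc_id, 0) for t in query.split())
    return sorted(doc_ids, key=lambda d: (-score(d), d))


index = build_inverted_index(DOCS)
retrieved = search(index, "college tennis")
print(sorted(retrieved))                          # [1, 2, 3]
print(rank(index, "college tennis", retrieved))   # [2, 1, 3]
```

Document 2 ranks first only because it repeats "tennis" three times, even though document 1 matches both query terms. A different premise about relevance would give a different order, which is why each set of premises leads to a distinct IR model.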
BASIC CONCEPTS
►User interface: manages interaction with the user
• query input and document output
• visualization of results
• relevance feedback
►Query operations: transform the query to improve retrieval
• query expansion using a thesaurus (chapter 2)
• query transformation using relevance feedback (chapter 2)
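Query expansion using a thesaurus can be sketched as follows. The thesaurus here is a hypothetical hand-made dictionary with illustrative entries, and `expand_query` is a name chosen for this example rather than part of any library.

```python
# Hypothetical hand-made thesaurus; the entries are illustrative only.
THESAURUS = {
    "restaurant": ["cafe", "diner"],
    "car": ["automobile"],
}


def expand_query(terms, thesaurus):
    """Append the synonyms of each query term, keeping order and dropping duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["cheap", "restaurant"], THESAURUS))
# ['cheap', 'restaurant', 'cafe', 'diner']
```

The expanded query can then be passed to the search step unchanged, so documents that use a synonym of a query word also become candidates for retrieval.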
BASIC CONCEPTS
►The simplest notion of relevance is that the query string appears verbatim (order is important) in the document
►A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words)
• May not retrieve relevant documents that include synonymous terms: "restaurant" vs. "café"
• May retrieve irrelevant documents that include ambiguous terms: "apple" (company or fruit), "bit" (unit of data or act of eating)
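The two notions of relevance above can be compared directly in code. This is a minimal sketch with hypothetical function names; the verbatim test is a plain substring check, so it also matches inside longer words, and the bag-of-words score is a raw count with no weighting.

```python
from collections import Counter


def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim, word order included."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(document.lower().split())
    return sum(counts[word] for word in query.lower().split())


doc = "The apple pie at this cafe is better than any apple tart"
print(verbatim_match("apple pie", doc))        # True
print(verbatim_match("pie apple", doc))        # False: order matters
print(bag_of_words_score("pie apple", doc))    # 3: order is ignored
print(bag_of_words_score("restaurant", doc))   # 0: the synonym "cafe" is missed
```

The last line shows the synonym problem in miniature: the document is about a café, yet a query for "restaurant" scores zero under both notions.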
DATA AND INFORMATION
Data
►Strings of symbols associated with objects, people, and events
►Values of an attribute
►Data need not have meaning to everyone
►Data must be interpreted with associated attributes
DATA AND INFORMATION
Information
►The meaning of the data as interpreted by a person or a system
►Data that changes the state of a person or system that perceives it
►Data that reduces uncertainty
• If the data resolve no uncertainty, there is no information in the data
• Examples: "It snows in the winter." versus "It does not snow this winter."
INFORMATION AND KNOWLEDGE
Knowledge
►Structured information
• through structuring, information becomes understandable
►Processed information
• through processing, information becomes meaningful and useful
►Information shared and agreed upon within a community
(Diagram: data, information, knowledge)
TEXT
►Strings of ASCII or Unicode symbols
• structured by the author
• indexed by information service providers
►Representation of the natural languages people use
• to convey meanings
• to communicate between readers and authors
►Data or information?
• If it can be understood, it's information
• By whom? A person or a system?
DOCUMENTS
►Logical units of text: articles, books, links, web pages
►Other components that come with the text: figures, charts, graphics, multimedia
TEXTUAL DATA
►Repository of human intellect
• rich and diverse resources for all answers
• if it is written, it is there (in text)
►Meaningful and understandable (to users)
►Simple ASCII representation
►Free of pre-formatted structures
• continuous
• separated into documents
►Easy to process by computer
• machine intensive (not labor intensive)
PROBLEMS WITH TEXT
►Massive
• Any IR system needs the capability of large-scale data processing
• The use of indexes and various representations is required
►Inconsistent
• It's a human language
• Syntactic and semantic variance
• The same information is expressed in different ways
• Different information is expressed in similar ways
►Incomplete
• It uses common knowledge
• It's an open system
RETRIEVAL
►What do we retrieve?
• Data
• Information
• Knowledge
►We retrieve documents that contain text which carries information.
Information can be anywhere in the text, in the links, in the process of text.
INFORMATION RETRIEVAL
Are they the same?
►Text retrieval
►Document retrieval
►Information retrieval
INFORMATION RETRIEVAL
Conceptually, information retrieval is used to cover all related problems in finding needed information
Historically, information retrieval is about document retrieval, emphasizing document as the basic unit
Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.
DATA RETRIEVAL VS. INFORMATION RETRIEVAL
                      Data retrieval       Information retrieval
Content               Data                 Information
Data object           Table                Document
Matching              Exact match          Partial match, best match
Items wanted          Matching             Relevant
Query language        SQL (artificial)     Natural
Query specification   Complete             Incomplete
Model                 Deterministic        Probabilistic
Structure             Highly structured    Less structured
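The "exact match" versus "best match" rows of the comparison can be made concrete with a short sketch. The table rows, documents, and function names below are all hypothetical, and the overlap fraction used for ranking is an assumed stand-in for a real relevance model.

```python
# Hypothetical table rows (data retrieval) and documents (information retrieval).
ROWS = [
    {"team": "Team A", "country": "USA", "ncaa": True},
    {"team": "Team B", "country": "UK", "ncaa": False},
]
DOCS = {
    "d1": "usa university tennis team plays in the ncaa tournament",
    "d2": "university tennis club news",
    "d3": "ncaa basketball scores",
}


def exact_match(rows, **conditions):
    """Data retrieval: keep only rows that satisfy every condition exactly."""
    return [r for r in rows if all(r.get(k) == v for k, v in conditions.items())]


def best_match(docs, query):
    """Information retrieval: rank docs by the fraction of query terms they contain."""
    terms = set(query.split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(terms & set(text.split())) / len(terms)
        if overlap > 0:
            scored.append((doc_id, overlap))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


print(exact_match(ROWS, country="USA", ncaa=True))
print(best_match(DOCS, "college tennis ncaa tournament"))
# [('d1', 0.75), ('d2', 0.25), ('d3', 0.25)]
```

The exact-match query returns a set in which every row is either right or wrong, while the best-match query returns a ranked list where partially matching documents still appear, just lower down.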
SUMMARY
The goal of IR systems is to help users find information that satisfies their information needs.
The main process of IR systems is to match data abstracted from the real world to queries abstracted from users' information needs.
Information retrieval is much more difficult than data retrieval.