Text Retrieval


Multimedia Information Retrieval (CSC 545)

Textual Retrieval

By Dr. Nursuriati Jamil

    The problem of IR

Goal: find documents relevant to an information need from a large document set.

[Diagram: an information need is formulated as a query; the IR system performs retrieval over the document collection and returns an answer list.]


    The retrieval problem

Given: N documents (D0, ..., DN-1) and a query Q from the user.
Problem: return a ranked list of the k documents Dj that best match the query.

[Diagram: retrieval architecture. OFFLINE path: documents are inserted into the database; feature extraction maps terms to term numbers (e.g. kena -> word67) and builds an inverted file, e.g. dera -> HBJ3N129, HBM4N111; budak -> HBJ2N19, HBJ3N129; Malaysia -> HBJ3N129. ONLINE path: the query "Penderaan kanak-kanak di Malaysia" ("child abuse in Malaysia") undergoes query transformation into Q = {dera, kanak-kanak, Malaysia, seksa, pukul, hukum, budak, bayi, remaja}; relevance ranking computes retrieval status values, e.g. RSV(Q, HBJ3N129) = 0.2 and RSV(Q, HBM4N111) = 0.4, and returns the ranked result HBM4N111, HBJ3N129.]
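To make the offline path concrete, here is a minimal Python sketch of building such an inverted file; the whitespace tokenizer and the toy documents (reusing the figure's document IDs) are illustrative assumptions, not the system described in the slides.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """OFFLINE path: map each term to the IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical toy documents with IDs in the style of the figure
docs = {
    "HBJ3N129": "dera budak malaysia",
    "HBM4N111": "dera bayi",
    "HBJ2N19": "budak remaja",
}
index = build_inverted_file(docs)
print(sorted(index["dera"]))  # ['HBJ3N129', 'HBM4N111']
```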


Feature (terms) extraction

A text retrieval system represents documents as sets of terms (e.g., words). Thereby, the originally structured document becomes an unstructured set of terms, potentially annotated with attributes to denote frequency and position in the text. The transformation comprises several steps:

1. Elimination of structure (i.e., formats)

2. Elimination of frequent/infrequent terms (i.e., stop words)

3. Mapping text to terms (without punctuation)

4. Reduction of terms to their stems (stemming, syllable division)

5. Mapping to index terms

(The order of the steps above may vary; often, steps are broken into several steps or several steps are combined into a single pass.)

Types of terms: words, phrases or n-grams (i.e., sequences of n characters)

Overview of feature extraction

[Figure: overview of the feature extraction pipeline, highlighting the stemming step]


Overview of feature extraction

[Figure: pipeline: structure elimination -> frequent/infrequent term removal -> text to terms -> stemming -> index]

Step 1: Structure elimination

HTML contains special markups, so-called tags. They describe meta-information about the document and the layout/presentation of the content. An HTML document is split into two parts, a header section and a body section:

Header: contains meta-information about the document; it also describes embedded elements like images.

Body: encompasses the document, enriched with markups for layout. The structure of the document is not always obvious.
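As a sketch of step 1, Python's standard html.parser module can strip tags and keep only the visible text; a real system would additionally treat the header's meta-information separately, and the sample HTML below is hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

parser = TextExtractor()
parser.feed("<html><head><title>ETH Zurich - Homepage</title></head>"
            "<body><p>Multimedia <b>retrieval</b> lecture</p></body></html>")
print(" ".join(parser.chunks).split())
# ['ETH', 'Zurich', '-', 'Homepage', 'Multimedia', 'retrieval', 'lecture']
```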


Step 1: Structure elimination (cont.)

Meta data: HTML provides several possibilities to define meta-information (the <meta> tag). The most frequent ones are:

URL of page: http://www-dbs.ethz.ch/~mmir/

Title of document: ETH Zurich - Homepage

Meta information in the header section: [HTML example lost in extraction]


Step 1: Structure elimination (cont.)

Embedded links, and how to handle them: [HTML example lost in extraction]

Embedded objects (images, plug-ins): [HTML example lost in extraction]


Distribution of term frequencies

[Figure: Zipf-like distribution of term frequencies with an upper and a lower cut-off line; left of the upper cut-off: stop words and the most frequent words; right of the lower cut-off: seldom-used words]

Insignificant terms

Stop words are terms with little or no semantic meaning, and are thus often not indexed. Examples: English: the, a, is. Bahasa Melayu: ada, iaitu, mana, bersabda, wahai.

Often, the rank of these terms is on the left side of the upper cut-off line. Generally, stop words are responsible for 20% to 30% of the term occurrences in a text. With the elimination of stop words, the memory consumption of the index can be reduced.

Similarly, the most frequent terms in a collection of documents carry little information (rank on the left side of the upper cut-off line): the term "computer" is meaningless for indexing a collection of articles about computer science. In a general collection, however, the term "computer" is important to distinguish, for example, articles about careers in computer science from other articles.

Analogously, one can strip off words that are seldom used (rank on the right side of the lower cut-off). This assumes that users will not use them in their queries, although the additional memory saving is rather small.
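A minimal sketch of this filtering step, assuming a hand-picked stop word list (from the examples above) and illustrative cut-off thresholds; both are tuning choices, not values fixed by the slides.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "ada", "iaitu", "mana", "bersabda", "wahai"}

def filter_terms(terms, lower_cutoff=1, upper_share=0.5):
    """Drop stop words, then terms that fall below the lower cut-off
    (too rare) or above the upper cut-off (too frequent)."""
    terms = [t for t in terms if t.lower() not in STOP_WORDS]
    counts = Counter(terms)
    total = sum(counts.values()) or 1
    return [t for t in terms
            if counts[t] > lower_cutoff and counts[t] / total <= upper_share]

print(filter_terms("the cat sat on a mat the cat slept".split()))
# ['cat', 'cat'] -- only 'cat' occurs often enough and is not a stop word
```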


Overview of feature extraction

[Figure: overview of the feature extraction pipeline, highlighting step 2: remove stop words]

Step 3: Mapping text to terms

To select appropriate features for documents, one typically uses linguistic or statistical approaches to define the features based on words, fragments of words, or phrases.

Most search engines use words or phrases as features. Some engines use stemming, some differentiate between upper and lower case, and some support error correction.

An interesting option is the usage of fragments, i.e., so-called n-grams. Although not directly related to the semantics of the text, they are very useful to support fuzzy retrieval.

Example of word fragments (n-grams):
street -> str, tre, ree, eet
streets -> str, tre, ree, eet, ets
strets -> str, tre, ret, ets

Benefits:
Simple misspellings or bad recognition often result in bad retrievals; fragments significantly improve retrieval quality.
Stemming and syllable division are no longer necessary.
No language-specific processing is necessary; every language is processed equally.
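A short sketch of character n-gram extraction that reproduces the trigram example above:

```python
def char_ngrams(word, n=3):
    """Character n-grams: street -> str, tre, ree, eet."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("street"))   # ['str', 'tre', 'ree', 'eet']
print(char_ngrams("streets"))  # ['str', 'tre', 'ree', 'eet', 'ets']
# Fuzzy matching: the misspelling "strets" still shares n-grams with "street"
print(set(char_ngrams("street")) & set(char_ngrams("strets")))  # {'str', 'tre'}
```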


Locations and frequency of terms

Retrieval algorithms often use the number of term occurrences and the positions of terms within the document to identify and rank results.

Term frequency ("feature frequency"): tf(Ti, Dj) is the number of occurrences of feature Ti in document Dj. Term frequency is important to rank documents.

Term locations (feature locations): loc(Ti, Dj) -> P(N), the set of positions at which Ti occurs in Dj. Term locations frequently influence the ranking, and whether a document appears in the result at all, e.g.:

Condition: Q = "shah NEAR alam" (explicit phrase matching): looking for documents with the terms shah and alam close to each other; see the sketch below.

Ranking: Q = "shah alam" (implicit phrase matching): documents with the term shah next to alam should be at the top of the results.
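A sketch of such a NEAR check over positional postings; the position lists and the window of 3 are hypothetical values for illustration.

```python
def near(positions_a, positions_b, window=3):
    """True if some occurrence of term A lies within `window`
    positions of some occurrence of term B."""
    return any(abs(pa - pb) <= window
               for pa in positions_a
               for pb in positions_b)

# Hypothetical positional postings for one document
loc = {"shah": [4, 17], "alam": [5, 42]}
print(near(loc["shah"], loc["alam"]))  # True: positions 4 and 5 are adjacent
```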

tf = term frequency: the frequency of a term/keyword in a document. The higher the tf, the higher the importance (weight) of the term for the document.

df = document frequency: the number of documents containing the term; it measures the distribution of the term.

idf = inverse document frequency: the unevenness of the term's distribution in the corpus, i.e., the specificity of the term to a document. The more evenly a term is distributed, the less specific it is to any one document.

tf*idf weighting scheme: weight(t, D) = tf(t, D) * idf(t)


Example

Term | #docs | (Dj, tfj) postings
Haji |   3   | (D7, 4), (D26, 10), (D40, 5)
Iman |   5   | (D21, 2), ...

The term "Haji" occurs in three documents: 4 times in document 7, 10 times in document 26, and 5 times in document 40.

Some common tf*idf schemes (N = number of documents in the corpus, n = number of documents containing t):

tf(t, D) = freq(t, D)
tf(t, D) = log[freq(t, D)]
tf(t, D) = log[freq(t, D)] + 1
tf(t, D) = freq(t, D) / max over t' of freq(t', D)

idf(t) = log(N / n)

weight(t, D) = tf(t, D) * idf(t)
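A minimal sketch combining the first tf variant with idf(t) = log(N/n), over a three-document toy corpus whose "haji" counts match the example above (the document texts themselves are fabricated):

```python
import math
from collections import Counter

docs = {  # fabricated so that tf('haji') is 4, 10 and 5, as in the example
    "D7": "haji haji haji haji iman",
    "D26": "haji " * 10,
    "D40": "haji " * 5,
}
N = len(docs)                                    # number of documents
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for counts in tf.values() for t in counts)

def weight(term, doc):
    """weight(t, D) = tf(t, D) * idf(t), with idf(t) = log(N / n)."""
    return tf[doc][term] * math.log(N / df[term])

print(round(weight("iman", "D7"), 3))  # rare term: high idf, weight ~ 1.099
print(weight("haji", "D7"))            # occurs everywhere: idf = log(1) = 0.0
```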


Overview of feature extraction

[Figure: the index after step 3, a term table with positions, document counts and (Dj, tfj) postings, e.g.:
Term  | Pos | #Docs | (Dj, tfj) postings
Abdul |  5  |   2   | (10, 1), (21, 2)
Agong |  4  |   3   | (2, 3), (6, 5), (31, 2)]

Step 4: Stemming

How does word stemming work? Stemming broadens our results to include both word roots and word derivations. It is commonly accepted that removal of word endings (sometimes called suffix stripping) is a good idea; removal of prefixes can be useful in some subject domains.

Why do we need word stemming in the context of free-text searching?
Free-text searching searches exactly what we type into the search box, without mapping it to a thesaurus term.
Morphological variants of words have similar semantic interpretations.
A smaller dictionary size results in a saving of storage space and processing time.


Word stemming (cont.)

Algorithms for word stemming: a stemming algorithm converts a word to a related form. One of the simplest such transformations is the conversion of plurals to singulars. Families of algorithms: affix removal, successor variety, table lookup, n-gram.

In most languages, words have various inflected (or sometimes derived) forms. The different forms should not carry different meanings but should be mapped to a single form.

However, in many languages it is not simple to derive the linguistic stem without a dictionary. At least for English, there exist algorithms that need no dictionary and still produce good results (the Porter algorithm).

Pros and cons:
Word stemmers are used to conflate terms to improve retrieval effectiveness and/or to reduce the size of indexing files; they increase recall at the cost of decreased precision.
Over-stemming and under-stemming also create problems for retrieving the documents.


Porter's Algorithm

The Porter stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980. The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. It is the most effective and most widely used stemmer.

Porter's algorithm works with the number of vowel sequences followed by a consonant sequence in the stem (the measure m), which must be greater than one for certain rules to be applied. A word can have any one of the forms C..C, C..V, V..V, V..C; these can be represented as [C](VC){m}[V].

Porter's Algorithm (cont.)

The rules in the Porter algorithm are separated into five distinct steps numbered from 1 to 5. They are applied to the words in the text starting from step 1 and moving on to step 5.

Step 1 deals with plurals and past participles; the subsequent steps are much more straightforward. Ex. plastered -> plaster, motoring -> motor

Step 2 deals with pattern matching on some common suffixes. Ex. happy -> happi, relational -> relate, callousness -> callous

Step 3 deals with special word endings. Ex. triplicate -> triplic, hopeful -> hope


Porter's Algorithm (cont.)

Step 4 checks the stripped word against more suffixes in case the word is compounded. Ex. revival -> reviv, allowance -> allow, inference -> infer

Step 5 checks if the stripped word ends in a vowel and fixes it appropriately. Ex. probate -> probat, cease -> ceas, controll -> control

The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure m. There is no linguistic basis for this approach.
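For experimentation, a Porter implementation ships with the external NLTK package (an assumption of this example; the slides do not prescribe a library):

```python
# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["plastered", "motoring", "hopeful", "revival", "allowance", "cease"]:
    print(word, "->", stemmer.stem(word))
# e.g. plastered -> plaster, motoring -> motor, allowance -> allow
```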

Dictionary-based stemming

A dictionary significantly improves the quality of stemming (note: the Porter algorithm does not derive a linguistically correct stem). It determines the correct linguistic stem for all words, but at the price of additional lookup costs and maintenance costs for the dictionary.

The EuroWordNet initiative tries to develop a semantic dictionary for the European languages. Next to words, the dictionary shall contain inflected forms and relations between words (see next section). However, the usage of these dictionaries is not free (with the exception of WordNet for English). Names remain a problem of their own...

Examples of such dictionaries / ontologies:
EuroWordNet: http://www.illc.uva.nl/EuroWordNet/
GermaNet: http://www.sfs.uni-tuebingen.de/lsd/
WordNet: http://wordnet.princeton.edu/

We look at dictionary-based stemming with the example of Morphy, the stemmer of WordNet. Morphy combines two approaches for stemming:
a rule-based approach for regular inflections, much like the Porter algorithm but much simpler
an exception list with strong or irregular inflections of terms
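Morphy itself is exposed through NLTK's WordNet interface; a small sketch (requires the external nltk package and its downloaded WordNet data), showing one rule-based and two exception-list lookups:

```python
# pip install nltk; then: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

print(wn.morphy("churches"))      # 'church' (rule-based suffix detachment)
print(wn.morphy("geese"))         # 'goose'  (noun exception list)
print(wn.morphy("ran", wn.VERB))  # 'run'    (verb exception list)
```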


Stemming process

[Flowchart: unstemmed words are first checked against a stop word list ("Is it a stop word?"); the remaining words go to the stemming algorithm (e.g., Porter's algorithm, Fatimah's algorithm, the WordNet dictionary), which applies prefix-suffix, suffix and infix morphological rules (e.g., ber..an, me+, +lah) and checks the candidates against a word dictionary ("Is it in the dictionary?") to produce stemmed words]

Step 5: Mapping to index terms

Term extraction must further deal with homonyms (equal terms but different semantics) and synonyms (different terms but equal semantics). But there are further relations between terms that may be useful to consider. In the following, a list of the most common relationships:

Homonyms (equal terms but different semantics): bank (shore vs. financial institute)
Synonyms (different terms but equal semantics): walk, go, pace, run, sprint
Hypernyms (umbrella terms) / hyponyms (species): animal -> dog, cat, bird, ...
Holonyms (is part of) / meronyms (has parts): door -> lock

The relationships above define a network (often denoted as an ontology) with terms as nodes and relations as edges. An occurrence of a term may be interpreted as an occurrence of nearby terms in this network as well (whereby "nearby" has to be defined appropriately). Example: a document contains the term "dog"; we may also interpret this as an occurrence of the term "animal" (with a smaller weight).


Step 5 (cont.)

Some search engines do not implement steps 4 and 5. Google only recently improved its search capabilities with stemming.

If the collection contains documents in different languages, cross-lingual approaches (automatically) translate or relate terms across languages and make documents retrievable even for queries in languages other than the document's.

Term extraction for queries is similar to term extraction for documents. If term extraction for queries implements step 5, omit step 5 in term extraction for the documents in the collection. Extend the query terms with nearby terms:

Expansion with synonyms: Q = house -> Qnew = house, home, domicile, ...
If a specialized search returns too few answers, exchange keywords with their hypernyms: e.g., Q = mare (female horse) -> Qnew = horse
If a general search term returns too many results, let the user choose a more specialized term (i.e., relevance feedback) to reduce the result list: e.g., Q = horse -> Qnew = mare, pony, chestnut, pacer
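Both expansions can be sketched with NLTK's WordNet interface (an assumption of this example; the slides do not prescribe a library, and the exact lemma sets depend on WordNet's sense inventory):

```python
from nltk.corpus import wordnet as wn  # pip install nltk; nltk.download("wordnet")

def expand_synonyms(term):
    """Q = {house} -> Qnew = {house, home, ...}: union of all synset lemmas."""
    expanded = {term}
    for synset in wn.synsets(term):
        expanded.update(lemma.name() for lemma in synset.lemmas())
    return expanded

def expand_hypernyms(term):
    """Q = {mare} -> Qnew = {horse, ...}: lemmas of the direct hypernyms."""
    expanded = set()
    for synset in wn.synsets(term, wn.NOUN):
        for hypernym in synset.hypernyms():
            expanded.update(lemma.name() for lemma in hypernym.lemmas())
    return expanded

print(expand_hypernyms("mare"))  # includes 'horse'
```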

What is WordNet?

A large lexical database, or electronic dictionary, developed and maintained at Princeton University: http://wordnet.princeton.edu

It includes most English nouns, verbs, adjectives and adverbs.

Its electronic format makes it amenable to automatic manipulation. It is used in many Natural Language Processing applications (information retrieval, text mining, question answering, machine translation, AI/reasoning, ...).

Wordnets are built for many languages.


What's special about WordNet?

Traditional paper dictionaries are organized alphabetically: words that are found together (on the same page) are not related by meaning. WordNet is organized by meaning: words in close proximity are semantically similar.

Human users and computers can browse WordNet and find words that are meaningfully related to their queries (somewhat like in a hyperdimensional thesaurus). Meaning similarity can be measured and quantified to support Natural Language Understanding.

A simple picture:

animal (animate, breathes, has heart, ...)
  |
bird (has feathers, flies, ...)
  |
canary (yellow, sings nicely, ...)


Hypo-/hypernymy relates noun synsets

It creates relationships among more and less general concepts, and thereby hierarchies; hierarchies can have up to 16 levels.

                {vehicle}
               /         \
  {car, automobile}   {bicycle, bike}
     /         \               \
{convertible} {SUV}      {mountain bike}

A car is a kind of vehicle; the class of vehicles includes cars and bikes.

Hyponymy is transitive:
A car is a kind of vehicle.
An SUV is a kind of car.
=> An SUV is a kind of vehicle.
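This transitivity can be traversed programmatically; a sketch using NLTK's WordNet interface (an external package, assumed here), walking every hypernym of car.n.01 up to the root:

```python
from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")
# closure() follows the given relation transitively, level by level
for ancestor in car.closure(lambda s: s.hypernyms()):
    print(ancestor.name())
# motor_vehicle.n.01, self-propelled_vehicle.n.01, ..., entity.n.01
```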


Meronymy/holonymy (part-whole relation)

   {car, automobile}
           |
        {engine}
        /      \
{spark plug} {cylinder}

An engine has spark plugs; spark plugs and cylinders are parts of an engine.

The part-of relation is inherited along the chain:
A finger is part of a hand.
A hand is part of an arm.
An arm is part of a body.
=> A finger is part of a body.


Structure of WordNet (Nouns)

[Figure: fragment of the noun hierarchy. Hypernym chain: {vehicle} <- {conveyance; transport} <- {motor vehicle; automotive vehicle} <- {car; auto; automobile; machine; motorcar}, with hyponyms {cruiser; squad car; patrol car; police car; prowl car} and {cab; taxi; hack; taxicab}. Meronyms of {car}: {bumper}, {car door}, {car window}, {car mirror}; meronyms of {car door}: {hinge; flexible joint}, {doorlock}, {armrest}]

Homework

Select the 5 most frequent noun terms and find homonyms, synonyms, hypernyms and holonyms of the terms. You may use WordNet at http://wordnet.princeton.edu/ (select "Use Wordnet Online"). Create the noun ontology.


    IR models

    Overview

    Boolean Retrieval

    Fuzzy Retrieval

    Vector Space Retrieval

    Probabilistic Retrieval (BIR Model)

    Latent Semantic Indexing

    Boolean search


Boolean model

Historically: documents were stored on tapes or punched cards, so searching allowed only sequential access.

Today: Boolean search is still very frequent but is not state-of-the-art. Google uses it for simplicity but further improved it by additionally sorting/ranking result sets.

Model: a document D is represented by a binary vector d with d_i = 1 if term t_i occurs in D. A query q comes from the query space Q; let t be an arbitrary term, and q1 and q2 be queries from Q; Q is given by queries of the type:

t, q1 ∧ q2, q1 ∨ q2, ¬q1

    Boolean model (cont.)


    Term-document matrix

    Query: Brutus AND Caesar AND NOT Calpurnia

Take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND:

    110100 AND 110111 AND 101111 = 100100
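The same computation in Python, using integers as incidence vectors; the Calpurnia vector 010000 is inferred from its complement 101111 shown above.

```python
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000          # complementing this gives 101111
mask = (1 << 6) - 1           # six documents, one bit each

result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))  # 100100
```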



    Fuzzy retrieval



Vector-space model

Since the Boolean model's binary weights are too limiting, the vector-space model supports partial matching: non-binary weights are assigned to index terms in queries and documents, and these term weights are used to compute the degree of similarity between documents in the database and the user's query.

[Figure: query vector q and document vector d in a three-dimensional term space with term1 = solat, term2 = ibadah, term3 = malam]

Vector-space model (cont.)

The tf metric is considered an indication of how well a term characterizes the content of a document. The idf, in turn, reflects the number of documents in the collection in which the term occurs, irrespective of the number of times it occurs in those documents.
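The similarity between q and d is typically the cosine of the angle between the two weight vectors; a minimal sketch with hypothetical tf*idf weights for the three terms of the figure:

```python
import math

def cosine(q, d):
    """Cosine similarity between sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0

q = {"solat": 1.0, "ibadah": 0.5}                # hypothetical weights
d = {"solat": 0.8, "ibadah": 0.9, "malam": 0.3}
print(round(cosine(q, d), 3))                    # ~0.901
```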


    Inverse document frequency

    Document-Term-Matrix


    Vector-space model (cont.)

    Example

N = # of documents, M = # of terms


[Fragment of the example's term list: a, arrived, gold, silver, truck, ...]

Class exercises

Using the 10 most frequent selected terms in your story, create the term-document matrix for the Boolean model and the vector model.


Remarks

There are many more methods to determine the vector representations and to compute retrieval status values.

Main assumption of vector space retrieval: terms occur independently of each other in documents. This is not true: if one writes about Mercedes, the term "car" is likely to co-occur in the document.

Advantages:
Simple model with efficient evaluation algorithms.
Partial-match queries are possible, i.e., it returns documents that only partly contain the query terms (similar to the OR operator of Boolean retrieval).
Very good retrieval quality, but not state-of-the-art.
Relevance feedback may further improve vector space retrieval.

Disadvantages:
Many heuristics and simplifications; there is no proof of "correctness" of the result set.
HTML/Web: the occurrence of terms is not the most important criterion to rank documents (spamming).