Text Retrieval


Multimedia Information Retrieval (CSC 545)

Textual Retrieval

By Dr. Nursuriati Jamil

    The problem of IR

Goal: find documents relevant to an information need from a large document set.

[Diagram: an information need is formulated as a query; the IR system performs retrieval over the document collection and returns an answer list.]


    The retrieval problem

Given: N documents (D0, ..., DN-1) and a query Q from the user.
Problem: return a ranked list of the k documents Dj that best match the query.

[Diagram: retrieval architecture. OFFLINE path: documents are inserted into the database; feature extraction maps terms to term numbers (e.g. kena -> word67) and builds an inverted file, e.g. dera -> HBJ3N129, HBM4N111; budak -> HBJ2N19, HBJ3N129; Malaysia -> HBJ3N129. ONLINE path: the query "Penderaan kanak-kanak di Malaysia" ("child abuse in Malaysia") undergoes query transformation into Q = {dera, kanak-kanak, Malaysia, seksa, pukul, hukum, budak, bayi, remaja}; relevance ranking computes retrieval status values, e.g. RSV(Q, HBJ3N129) = 0.2 and RSV(Q, HBM4N111) = 0.4, and returns the ranked result HBM4N111, HBJ3N129.]
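To make the offline path concrete, here is a minimal Python sketch of building such an inverted file; the whitespace tokenizer and the toy documents (reusing the figure's document IDs) are illustrative assumptions, not the system described in the slides.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """OFFLINE path: map each term to the IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical toy documents with IDs in the style of the figure
docs = {
    "HBJ3N129": "dera budak malaysia",
    "HBM4N111": "dera bayi",
    "HBJ2N19": "budak remaja",
}
index = build_inverted_file(docs)
print(sorted(index["dera"]))  # ['HBJ3N129', 'HBM4N111']
```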


Feature (terms) extraction

A text retrieval system represents documents as sets of terms (e.g., words). Thereby, the originally structured document becomes an unstructured set of terms, potentially annotated with attributes to denote frequency and position in the text. The transformation comprises several steps:

1. Elimination of structure (i.e., formats)

2. Elimination of frequent/infrequent terms (i.e., stop words)

3. Mapping text to terms (without punctuation)

4. Reduction of terms to their stems (stemming, syllable division)

5. Mapping to index terms

(The order of the steps above may vary; often, steps are broken into several steps or several steps are combined into a single pass.)

Types of terms: words, phrases or n-grams (i.e., sequences of n characters)

Overview of feature extraction

[Figure: overview of the feature extraction pipeline, highlighting the stemming step]


Overview of feature extraction

[Figure: pipeline: structure elimination -> frequent/infrequent term removal -> text to terms -> stemming -> index]

Step 1: Structure elimination

HTML contains special markups, so-called tags. They describe meta-information about the document and the layout/presentation of the content. An HTML document is split into two parts, a header section and a body section:

Header: contains meta-information about the document; it also describes embedded elements like images.

Body: encompasses the document, enriched with markups for layout. The structure of the document is not always obvious.
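As a sketch of step 1, Python's standard html.parser module can strip tags and keep only the visible text; a real system would additionally treat the header's meta-information separately, and the sample HTML below is hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

parser = TextExtractor()
parser.feed("<html><head><title>ETH Zurich - Homepage</title></head>"
            "<body><p>Multimedia <b>retrieval</b> lecture</p></body></html>")
print(" ".join(parser.chunks).split())
# ['ETH', 'Zurich', '-', 'Homepage', 'Multimedia', 'retrieval', 'lecture']
```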


Step 1: Structure elimination (cont.)

Meta data: HTML provides several possibilities to define meta-information (the <meta> tag). The most frequent ones are:

URL of page: http://www-dbs.ethz.ch/~mmir/

Title of document: ETH Zurich - Homepage

Meta information in the header section: [HTML example lost in extraction]


Step 1: Structure elimination (cont.)

Embedded links, and how to handle them: [HTML example lost in extraction]

Embedded objects (images, plug-ins): [HTML example lost in extraction]


Distribution of term frequencies

[Figure: Zipf-like distribution of term frequencies with an upper and a lower cut-off line; left of the upper cut-off: stop words and the most frequent words; right of the lower cut-off: seldom-used words]

Insignificant terms

Stop words are terms with little or no semantic meaning, and are thus often not indexed. Examples: English: the, a, is. Bahasa Melayu: ada, iaitu, mana, bersabda, wahai.

Often, the rank of these terms is on the left side of the upper cut-off line. Generally, stop words are responsible for 20% to 30% of the term occurrences in a text. With the elimination of stop words, the memory consumption of the index can be reduced.

Similarly, the most frequent terms in a collection of documents carry little information (rank on the left side of the upper cut-off line): the term "computer" is meaningless for indexing a collection of articles about computer science. In a general collection, however, the term "computer" is important to distinguish, for example, articles about careers in computer science from other articles.

Analogously, one can strip off words that are seldom used (rank on the right side of the lower cut-off). This assumes that users will not use them in their queries, although the additional memory saving is rather small.
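A minimal sketch of this filtering step, assuming a hand-picked stop word list (from the examples above) and illustrative cut-off thresholds; both are tuning choices, not values fixed by the slides.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "ada", "iaitu", "mana", "bersabda", "wahai"}

def filter_terms(terms, lower_cutoff=1, upper_share=0.5):
    """Drop stop words, then terms that fall below the lower cut-off
    (too rare) or above the upper cut-off (too frequent)."""
    terms = [t for t in terms if t.lower() not in STOP_WORDS]
    counts = Counter(terms)
    total = sum(counts.values()) or 1
    return [t for t in terms
            if counts[t] > lower_cutoff and counts[t] / total <= upper_share]

print(filter_terms("the cat sat on a mat the cat slept".split()))
# ['cat', 'cat'] -- only 'cat' occurs often enough and is not a stop word
```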


Overview of feature extraction

[Figure: overview of the feature extraction pipeline, highlighting step 2: remove stop words]

Step 3: Mapping text to terms

To select appropriate features for documents, one typically uses linguistic or statistical approaches to define the features based on words, fragments of words, or phrases.

Most search engines use words or phrases as features. Some engines use stemming, some differentiate between upper and lower case, and some support error correction.

An interesting option is the usage of fragments, i.e., so-called n-grams. Although not directly related to the semantics of the text, they are very useful to support fuzzy retrieval.

Example of word fragments (n-grams):
street -> str, tre, ree, eet
streets -> str, tre, ree, eet, ets
strets -> str, tre, ret, ets

Benefits:
Simple misspellings or bad recognition often result in bad retrievals; fragments significantly improve retrieval quality.
Stemming and syllable division are no longer necessary.
No language-specific processing is necessary; every language is processed equally.
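A short sketch of character n-gram extraction that reproduces the trigram example above:

```python
def char_ngrams(word, n=3):
    """Character n-grams: street -> str, tre, ree, eet."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("street"))   # ['str', 'tre', 'ree', 'eet']
print(char_ngrams("streets"))  # ['str', 'tre', 'ree', 'eet', 'ets']
# Fuzzy matching: the misspelling "strets" still shares n-grams with "street"
print(set(char_ngrams("street")) & set(char_ngrams("strets")))  # {'str', 'tre'}
```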


Locations and frequency of terms

Retrieval algorithms often use the number of term occurrences and the positions of terms within the document to identify and rank results.

Term frequency ("feature frequency"): tf(Ti, Dj) is the number of occurrences of feature Ti in document Dj. Term frequency is important to rank documents.

Term locations (feature locations): loc(Ti, Dj) -> P(N), the set of positions at which Ti occurs in Dj. Term locations frequently influence the ranking, and whether a document appears in the result at all, e.g.:

Condition: Q = "shah NEAR alam" (explicit phrase matching): looking for documents with the terms shah and alam close to each other; see the sketch below.

Ranking: Q = "shah alam" (implicit phrase matching): documents with the term shah next to alam should be at the top of the results.
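A sketch of such a NEAR check over positional postings; the position lists and the window of 3 are hypothetical values for illustration.

```python
def near(positions_a, positions_b, window=3):
    """True if some occurrence of term A lies within `window`
    positions of some occurrence of term B."""
    return any(abs(pa - pb) <= window
               for pa in positions_a
               for pb in positions_b)

# Hypothetical positional postings for one document
loc = {"shah": [4, 17], "alam": [5, 42]}
print(near(loc["shah"], loc["alam"]))  # True: positions 4 and 5 are adjacent
```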

tf = term frequency: the frequency of a term/keyword in a document. The higher the tf, the higher the importance (weight) of the term for the document.

df = document frequency: the number of documents containing the term; it measures the distribution of the term.

idf = inverse document frequency: the unevenness of the term's distribution in the corpus, i.e., the specificity of the term to a document. The more evenly a term is distributed, the less specific it is to any one document.

tf*idf weighting scheme: weight(t, D) = tf(t, D) * idf(t)


Example

Term | #docs | (Dj, tfj) postings
Haji |   3   | (D7, 4), (D26, 10), (D40, 5)
Iman |   5   | (D21, 2), ...

The term "Haji" occurs in three documents: 4 times in document 7, 10 times in document 26, and 5 times in document 40.

Some common tf*idf schemes (N = number of documents in the corpus, n = number of documents containing t):

tf(t, D) = freq(t, D)
tf(t, D) = log[freq(t, D)]
tf(t, D) = log[freq(t, D)] + 1
tf(t, D) = freq(t, D) / max over t' of freq(t', D)

idf(t) = log(N / n)

weight(t, D) = tf(t, D) * idf(t)
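A minimal sketch combining the first tf variant with idf(t) = log(N/n), over a three-document toy corpus whose "haji" counts match the example above (the document texts themselves are fabricated):

```python
import math
from collections import Counter

docs = {  # fabricated so that tf('haji') is 4, 10 and 5, as in the example
    "D7": "haji haji haji haji iman",
    "D26": "haji " * 10,
    "D40": "haji " * 5,
}
N = len(docs)                                    # number of documents
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for counts in tf.values() for t in counts)

def weight(term, doc):
    """weight(t, D) = tf(t, D) * idf(t), with idf(t) = log(N / n)."""
    return tf[doc][term] * math.log(N / df[term])

print(round(weight("iman", "D7"), 3))  # rare term: high idf, weight ~ 1.099
print(weight("haji", "D7"))            # occurs everywhere: idf = log(1) = 0.0
```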


Overview of feature extraction

[Figure: the index after step 3, a term table with positions, document counts and (Dj, tfj) postings, e.g.:
Term  | Pos | #Docs | (Dj, tfj) postings
Abdul |  5  |   2   | (10, 1), (21, 2)
Agong |  4  |   3   | (2, 3), (6, 5), (31, 2)]

Step 4: Stemming

How does word stemming work? Stemming broadens our results to include both word roots and word derivations. It is commonly accepted that removal of word endings (sometimes called suffix stripping) is a good idea; removal of prefixes can be useful in some subject domains.

Why do we need word stemming in the context of free-text searching?
Free-text searching searches exactly what we type into the search box, without mapping it to a thesaurus term.
Morphological variants of words have similar semantic interpretations.
A smaller dictionary size results in a saving of storage space and processing time.


Word stemming (cont.)

Algorithms for word stemming: a stemming algorithm converts a word to a related form. One of the simplest such transformations is the conversion of plurals to singulars. Families of algorithms: affix removal, successor variety, table lookup, n-gram.

In most languages, words have various inflected (or sometimes derived) forms. The different forms should not carry different meanings but should be mapped to a single form.

However, in many languages it is not simple to derive the linguistic stem without a dictionary. At least for English, there exist algorithms that need no dictionary and still produce good results (the Porter algorithm).

Pros and cons:
Word stemmers are used to conflate terms to improve retrieval effectiveness and/or to reduce the size of indexing files; they increase recall at the cost of decreased precision.
Over-stemming and under-stemming also create problems for retrieving the documents.


Porter's Algorithm

The Porter stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980. The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. It is the most effective and most widely used stemmer.

Porter's algorithm works with the number of vowel sequences followed by a consonant sequence in the stem (the measure m), which must be greater than one for certain rules to be applied. A word can have any one of the forms C..C, C..V, V..V, V..C; these can be represented as [C](VC){m}[V].

Porter's Algorithm (cont.)

The rules in the Porter algorithm are separated into five distinct steps numbered from 1 to 5. They are applied to the words in the text starting from step 1 and moving on to step 5.

Step 1 deals with plurals and past participles; the subsequent steps are much more straightforward. Ex. plastered -> plaster, motoring -> motor

Step 2 deals with pattern matching on some common suffixes. Ex. happy -> happi, relational -> relate, callousness -> callous

Step 3 deals with special word endings. Ex. triplicate -> triplic, hopeful -> hope


Porter's Algorithm (cont.)

Step 4 checks the stripped word against more suffixes in case the word is compounded. Ex. revival -> reviv, allowance -> allow, inference -> infer

Step 5 checks if the stripped word ends in a vowel and fixes it appropriately. Ex. probate -> probat, cease -> ceas, controll -> control

The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure m. There is no linguistic basis for this approach.
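For experimentation, a Porter implementation ships with the external NLTK package (an assumption of this example; the slides do not prescribe a library):

```python
# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["plastered", "motoring", "hopeful", "revival", "allowance", "cease"]:
    print(word, "->", stemmer.stem(word))
# e.g. plastered -> plaster, motoring -> motor, allowance -> allow
```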

Dictionary-based stemming

A dictionary significantly improves the quality of stemming (note: the Porter algorithm does not derive a linguistically correct stem). It determines the correct linguistic stem for all words, but at the price of additional lookup costs and maintenance costs for the dictionary.

The EuroWordNet initiative tries to develop a semantic dictionary for the European languages. Next to words, the dictionary shall contain inflected forms and relations between words (see next section). However, the usage of these dictionaries is not free (with the exception of WordNet for English). Names remain a problem of their own...

Examples of such dictionaries / ontologies:
EuroWordNet: http://www.illc.uva.nl/EuroWordNet/
GermaNet: http://www.sfs.uni-tuebingen.de/lsd/
WordNet: http://wordnet.princeton.edu/

We look at dictionary-based stemming with the example of Morphy, the stemmer of WordNet. Morphy combines two approaches for stemming:
a rule-based approach for regular inflections, much like the Porter algorithm but much simpler
an exception list with strong or irregular inflections of terms
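Morphy itself is exposed through NLTK's WordNet interface; a small sketch (requires the external nltk package and its downloaded WordNet data), showing one rule-based and two exception-list lookups:

```python
# pip install nltk; then: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

print(wn.morphy("churches"))      # 'church' (rule-based suffix detachment)
print(wn.morphy("geese"))         # 'goose'  (noun exception list)
print(wn.morphy("ran", wn.VERB))  # 'run'    (verb exception list)
```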


Stemming process

[Flowchart: unstemmed words are first checked against a stop word list ("Is it a stop word?"); the remaining words go to the stemming algorithm (e.g., Porter's algorithm, Fatimah's algorithm, the WordNet dictionary), which applies prefix-suffix, suffix and infix morphological rules (e.g., ber..an, me+, +lah) and checks the candidates against a word dictionary ("Is it in the dictionary?") to produce stemmed words]

Step 5: Mapping to index terms

Term extraction must further deal with homonyms (equal terms but different semantics) and synonyms (different terms but equal semantics). But there are further relations between terms that may be useful to consider. In the following, a list of the most common relationships:

Homonyms (equal terms but different semantics): bank (shore vs. financial institute)
Synonyms (different terms but equal semantics): walk, go, pace, run, sprint
Hypernyms (umbrella terms) / hyponyms (species): animal -> dog, cat, bird, ...
Holonyms (is part of) / meronyms (has parts): door -> lock

The relationships above define a network (often denoted as an ontology) with terms as nodes and relations as edges. An occurrence of a term may be interpreted as an occurrence of nearby terms in this network as well (whereby "nearby" has to be defined appropriately). Example: a document contains the term "dog"; we may also interpret this as an occurrence of the term "animal" (with a smaller weight).


Step 5 (cont.)

Some search engines do not implement steps 4 and 5. Google only recently improved its search capabilities with stemming.

If the collection contains documents in different languages, cross-lingual approaches (automatically) translate or relate terms across languages and make documents retrievable even for queries in languages other than the document's.

Term extraction for queries is similar to term extraction for documents. If term extraction for queries implements step 5, omit step 5 in term extraction for the documents in the collection. Extend the query terms with nearby terms:

Expansion with synonyms: Q = house -> Qnew = house, home, domicile, ...
If a specialized search returns too few answers, exchange keywords with their hypernyms: e.g., Q = mare (female horse) -> Qnew = horse
If a general search term returns too many results, let the user choose a more specialized term (i.e., relevance feedback) to reduce the result list: e.g., Q = horse -> Qnew = mare, pony, chestnut, pacer
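Both expansions can be sketched with NLTK's WordNet interface (an assumption of this example; the slides do not prescribe a library, and the exact lemma sets depend on WordNet's sense inventory):

```python
from nltk.corpus import wordnet as wn  # pip install nltk; nltk.download("wordnet")

def expand_synonyms(term):
    """Q = {house} -> Qnew = {house, home, ...}: union of all synset lemmas."""
    expanded = {term}
    for synset in wn.synsets(term):
        expanded.update(lemma.name() for lemma in synset.lemmas())
    return expanded

def expand_hypernyms(term):
    """Q = {mare} -> Qnew = {horse, ...}: lemmas of the direct hypernyms."""
    expanded = set()
    for synset in wn.synsets(term, wn.NOUN):
        for hypernym in synset.hypernyms():
            expanded.update(lemma.name() for lemma in hypernym.lemmas())
    return expanded

print(expand_hypernyms("mare"))  # includes 'horse'
```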

What is WordNet?

A large lexical database, or electronic dictionary, developed and maintained at Princeton University: http://wordnet.princeton.edu

It includes most English nouns, verbs, adjectives and adverbs.

Its electronic format makes it amenable to automatic manipulation. It is used in many Natural Language Processing applications (information retrieval, text mining, question answering, machine translation, AI/reasoning, ...).

Wordnets are built for many languages.


What's special about WordNet?

Traditional paper dictionaries are organized alphabetically: words that are found together (on the same page) are not related by meaning. WordNet is organized by meaning: words in close proximity are semantically similar.

Human users and computers can browse WordNet and find words that are meaningfully related to their queries (somewhat like in a hyperdimensional thesaurus). Meaning similarity can be measured and quantified to support Natural Language Understanding.

A simple picture:

animal (animate, breathes, has heart, ...)
  |
bird (has feathers, flies, ...)
  |
canary (yellow, sings nicely, ...)


Hypo-/hypernymy relates noun synsets

It creates relationships among more and less general concepts, and thereby hierarchies; hierarchies can have up to 16 levels.

                {vehicle}
               /         \
  {car, automobile}   {bicycle, bike}
     /         \               \
{convertible} {SUV}      {mountain bike}

A car is a kind of vehicle; the class of vehicles includes cars and bikes.

Hyponymy is transitive:
A car is a kind of vehicle.
An SUV is a kind of car.
=> An SUV is a kind of vehicle.
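This transitivity can be traversed programmatically; a sketch using NLTK's WordNet interface (an external package, assumed here), walking every hypernym of car.n.01 up to the root:

```python
from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")
# closure() follows the given relation transitively, level by level
for ancestor in car.closure(lambda s: s.hypernyms()):
    print(ancestor.name())
# motor_vehicle.n.01, self-propelled_vehicle.n.01, ..., entity.n.01
```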


Meronymy/holonymy (part-whole relation)

   {car, automobile}
           |
        {engine}
        /      \
{spark plug} {cylinder}

An engine has spark plugs; spark plugs and cylinders are parts of an engine.

The part-of relation is inherited along the chain:
A finger is part of a hand.
A hand is part of an arm.
An arm is part of a body.
=> A finger is part of a body.


Structure of WordNet (Nouns)

[Figure: fragment of the noun hierarchy. Hypernym chain: {vehicle} <- {conveyance; transport} <- {motor vehicle; automotive vehicle} <- {car; auto; automobile; machine; motorcar}, with hyponyms {cruiser; squad car; patrol car; police car; prowl car} and {cab; taxi; hack; taxicab}. Meronyms of {car}: {bumper}, {car door}, {car window}, {car mirror}; meronyms of {car door}: {hinge; flexible joint}, {doorlock}, {armrest}]

Homework

Select the 5 most frequent noun terms and find homonyms, synonyms, hypernyms and holonyms of the terms. You may use WordNet at http://wordnet.princeton.edu/ (select "Use Wordnet Online"). Create the noun ontology.


    IR models

    Overview

    Boolean Retrieval

    Fuzzy Retrieval

    Vector Space Retrieval

    Probabilistic Retrieval (BIR Model)

    Latent Semantic Indexing

    Boolean search


Boolean model

Historically: documents were stored on tapes or punched cards, so searching allowed only sequential access.

Today: Boolean search is still very frequent but is not state-of-the-art. Google uses it for simplicity but further improved it by additionally sorting/ranking result sets.

Model: a document D is represented by a binary vector d with d_i = 1 if term t_i occurs in D. A query q comes from the query space Q; let t be an arbitrary term, and q1 and q2 be queries from Q; Q is given by queries of the type:

t, q1 ∧ q2, q1 ∨ q2, ¬q1

    Boolean model (cont.)


    Term-document matrix

    Query: Brutus AND Caesar AND NOT Calpurnia

Take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND:

    110100 AND 110111 AND 101111 = 100100
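The same computation in Python, using integers as incidence vectors; the Calpurnia vector 010000 is inferred from its complement 101111 shown above.

```python
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000          # complementing this gives 101111
mask = (1 << 6) - 1           # six documents, one bit each

result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))  # 100100
```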



    Fuzzy retrieval



Vector-space model

Since the Boolean model's binary weights are too limiting, the vector-space model supports partial matching: non-binary weights are assigned to index terms in queries and documents, and these term weights are used to compute the degree of similarity between documents in the database and the user's query.

[Figure: query vector q and document vector d in a three-dimensional term space with term1 = solat, term2 = ibadah, term3 = malam]

Vector-space model (cont.)

The tf metric is considered an indication of how well a term characterizes the content of a document. The idf, in turn, reflects the number of documents in the collection in which the term occurs, irrespective of the number of times it occurs in those documents.
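The similarity between q and d is typically the cosine of the angle between the two weight vectors; a minimal sketch with hypothetical tf*idf weights for the three terms of the figure:

```python
import math

def cosine(q, d):
    """Cosine similarity between sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm = (math.sqrt(sum(w * w for w in q.values()))
            * math.sqrt(sum(w * w for w in d.values())))
    return dot / norm if norm else 0.0

q = {"solat": 1.0, "ibadah": 0.5}                # hypothetical weights
d = {"solat": 0.8, "ibadah": 0.9, "malam": 0.3}
print(round(cosine(q, d), 3))                    # ~0.901
```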


    Inverse document frequency

    Document-Term-Matrix


    Vector-space model (cont.)

    Example

N = # of documents, M = # of terms


[Fragment of the example's term list: a, arrived, gold, silver, truck, ...]

Class exercises

Using the 10 most frequent selected terms in your story, create the term-document matrix for the Boolean model and the vector model.


Remarks

There are many more methods to determine the vector representations and to compute retrieval status values.

Main assumption of vector space retrieval: terms occur independently of each other in documents. This is not true: if one writes about Mercedes, the term "car" is likely to co-occur in the document.

Advantages:
Simple model with efficient evaluation algorithms.
Partial-match queries are possible, i.e., it returns documents that only partly contain the query terms (similar to the OR operator of Boolean retrieval).
Very good retrieval quality, but not state-of-the-art.
Relevance feedback may further improve vector space retrieval.

Disadvantages:
Many heuristics and simplifications; there is no proof of "correctness" of the result set.
HTML/Web: the occurrence of terms is not the most important criterion to rank documents (spamming).