Multimedia IR - Ppts 50

  • 8/12/2019 Multimedia IR - Ppts 50

    E.G.M. Petrakis, Multimedia Information Retrieval

    Information Retrieval (IR)

    Deals with the representation, storage and retrieval of unstructured data
    Topics of interest: systems, languages, retrieval, user interfaces, data visualization, distributed data sets
    Classical IR deals mainly with text
    The evolution of multimedia databases and of the web has given new interest to IR

    Multimedia IR

    A multimedia information system that can store and retrieve
        attributes
        text
        2D grey-scale and color images
        1D time series
        digitized voice or music
        video

    Applications

    Financial, marketing: stock prices, sales etc.
        find companies whose stock prices move similarly
    Scientific databases: sensor data
        weather, geological, environmental data
    Office automation, electronic encyclopedias, electronic books
    Medical databases
        X-rays, CT, MRI scans
    Criminal investigation
        suspects, fingerprints
    Personal archives, text and color images

    Queries

    Formulation of the user's information need
        Free-text query
        By example document (e.g., text, image)
        e.g., in a collection of color photos find those showing a tree close to a house
    Retrieval is based on the understanding of the content of documents and of their components

    Goals of Retrieval

    Accuracy: retrieve documents that the user expects in the answer
        With as few incorrect answers as possible
        All relevant answers are retrieved
    Speed: retrieval has to be fast
        The system responds in real time

    Accuracy of Retrieval

    Depends on query criteria, query complexity and specificity
        Attributes / Content criteria?
    Depends on what is matched with the query
        How document content is represented
        Typically feature vectors
    Depends also on the matching function
        Euclidean distance

    Speed of Retrieval

    The query is compared (matched) with all stored documents
    The definition of similarity criteria (similarity or distance function between documents) is an important issue
    Matching has to be fast: document matching has to be computationally efficient and sequential searching must be avoided
    Indexing: search only the documents that are likely to match the query

    [Figure: a database holding the documents (Doc1, Doc2, ..., DocN) and their descriptions (one feature vector per document), together with an index and a query]

    Main Idea

    Descriptions are extracted and stored
        Same or separate storage with documents
    Queries address the stored descriptions rather than the documents themselves
    Images & video: the majority of stored data
    Solution: retrieval by text
        Text retrieval is a well researched area
        Most systems work with text, attributes

    Problem

    Text descriptions for images and video contained in documents are not available
        e.g., the text in a web site is not always descriptive of every particular image contained in the web site
    Two approaches
        human annotations
        feature extraction

    Human Annotations

    Humans insert attributes, captions for images and videos
        inconsistent or subjective over time and users (different users do not give the same descriptions)
        retrievals will fail if queries are formulated using different keywords or descriptions
        time consuming process, expensive or impossible for large databases

    Feature Extraction

    Features are extracted from audio, image and video
        cheaper and replicable approach
        consistent descriptions but
            inexact
            low level features (patterns, colors etc.)
            difficult to extract meaning
        different techniques for different data

    Architecture of an IR System for Images

    [Figure: Furht et al. 96]

    Multimedia IR in Practice

    Relies on text descriptions
    Non-text IR
        there is nothing similar to text-based retrieval systems
        automatic IR is sometimes impossible
        requires human intervention at some level
        significant progress has been made
        domain specific interpretations

    Ranking of IR Methods

    Ranking in terms of complexity and accuracy
        attributes
        text
        audio
        image
        video
    Combined IR based on two or more data types for more accurate retrievals
    Complex data types (e.g., video) are rich sources of information
    [Figure labels: accuracy, complexity]

    Similarity Searching

    IR is not an exact process
        two documents are never identical!
        they can be similar
        searching must be approximate
    The effectiveness of IR depends on the
        types and correctness of descriptions used
        types of queries allowed
        user uncertainty as to what he is looking for
        efficiency of search techniques

    Approximate IR

    A query is specified and all documents up to a pre-specified degree of similarity are retrieved and presented to the user ordered by similarity
    Two common types of similarity queries:
        range queries: retrieve all documents up to a distance threshold T
        nearest-neighbor queries: retrieve the k best matches
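
    The two query types can be illustrated with a minimal sketch that scans every stored feature vector sequentially. The Euclidean distance, the dictionary layout of the database and the function names are assumptions made for this example only.

```python
import heapq
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def range_query(db, query, t):
    """Return (distance, doc_id) for every document within distance t, closest first."""
    hits = ((euclidean(vec, query), doc_id) for doc_id, vec in db.items())
    return sorted(hit for hit in hits if hit[0] <= t)

def knn_query(db, query, k):
    """Return the k best matches as (distance, doc_id), closest first."""
    hits = ((euclidean(vec, query), doc_id) for doc_id, vec in db.items())
    return heapq.nsmallest(k, hits)

# Hypothetical database: document id -> 2-dimensional feature vector.
db = {"doc1": (0.0, 0.0), "doc2": (3.0, 4.0), "doc3": (1.0, 0.0)}
print(range_query(db, (0.0, 0.0), 1.0))  # doc1 and doc3 qualify
print(knn_query(db, (0.0, 0.0), 1))      # only doc1
```

    Both functions compare the query with every document, so they show what the answer should contain rather than how to compute it quickly.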

    Distance - Similarity

    Decide whether two documents are similar
        distance: the lower the distance, the more similar the documents are
        similarity: the higher the similarity, the more similar the documents are
    Key issue for successful retrieval: the more accurate the descriptions are, the more accurate the retrieval is

    Retrieval Quality Criteria

    Retrieval with as few errors as possible
    Two types of errors:
        false dismissals or misses: qualifying but not retrieved documents
        false positives or false drops: retrieved but not qualifying documents
    A good method minimizes both
    Ranking quality: retrieve qualifying documents before non-qualifying ones

    Evaluation of Retrieval

    Given a collection of documents, a set of queries and human experts' responses to the above queries, the ideal system will retrieve exactly what the human dictated
    The deviations from the above are measured

    $$\text{precision} = \frac{\text{number of relevant retrieved in answer}}{\text{number of retrieved}}$$

    $$\text{recall} = \frac{\text{number of relevant retrieved in answer}}{\text{total number of relevant in the collection}}$$
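
    As a small illustration of the two measures, the sketch below computes them from a retrieved answer set and an expert's set of relevant documents. The function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: 4 documents retrieved, 5 relevant in the collection, 2 in common.
print(precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10]))  # (0.5, 0.4)
```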

    Precision-Recall Diagram

    Measure precision-recall for 1, 2, ..., N answers
    High precision means few false alarms
    High recall means few false dismissals
    [Figure: precision-recall diagram, both axes from 0 to 1, with the ideal method marked]

    Harmonic Mean

    A single measure combining precision and recall

    $$F = \frac{2}{\frac{1}{r} + \frac{1}{p}}$$

    r: recall, p: precision
    F takes values in [0,1]
    F → 1 as more retrieved documents are relevant
    F → 0 as few retrieved documents are relevant
    F is high when both precision and recall are high
    F expresses a compromise between precision and recall
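
    As a worked example, assume r = 0.4 and p = 0.5 (values chosen only for illustration). The harmonic mean then falls between the two values, closer to the lower one:

```latex
F = \frac{2}{\frac{1}{r} + \frac{1}{p}}
  = \frac{2}{\frac{1}{0.4} + \frac{1}{0.5}}
  = \frac{2}{2.5 + 2}
  = \frac{2}{4.5}
  \approx 0.44
```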

    Ranking Quality

    The higher the Rnorm, the better the ability of a method to retrieve correct answers before incorrect ones
    A human expert evaluates the answer set
    Then take answers in pairs: (relevant, irrelevant)
        S+: pairs ranked correctly (the relevant entry has been retrieved before the irrelevant one)
        S-: pairs ranked incorrectly
        S+max: total ranked pairs

    $$R_{norm} = \begin{cases} \frac{1}{2}\left(1 + \frac{S^{+} - S^{-}}{S^{+}_{max}}\right) & \text{if } S^{+}_{max} > 0 \\ 1 & \text{otherwise} \end{cases}$$
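
    A minimal sketch of this measure follows. It assumes that the answer set is given as a list of expert judgments in retrieval order, that there are no ties, and that the total number of ranked pairs equals the number of (relevant, irrelevant) pairs in the answer set. The function name is hypothetical.

```python
def r_norm(ranked_relevance):
    """ranked_relevance: booleans in retrieval order (True = judged relevant)."""
    s_plus = s_minus = 0
    relevant_seen = irrelevant_seen = 0
    for is_relevant in ranked_relevance:
        if is_relevant:
            # Every irrelevant answer already retrieved precedes this relevant one.
            s_minus += irrelevant_seen
            relevant_seen += 1
        else:
            # Every relevant answer already retrieved precedes this irrelevant one.
            s_plus += relevant_seen
            irrelevant_seen += 1
    s_max = relevant_seen * irrelevant_seen  # assumed: all (relevant, irrelevant) pairs
    if s_max == 0:
        return 1.0
    return 0.5 * (1 + (s_plus - s_minus) / s_max)

print(r_norm([True, True, False, False]))  # 1.0, all relevant answers come first
print(r_norm([True, False, True, False]))  # 0.75
print(r_norm([False, False, True, True]))  # 0.0, all relevant answers come last
```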

    Searching Method

    Sequential scanning: the query is matched with all stored documents
        slow if the database is large or if matching is slow
    The documents must be indexed
        hashing, inverted files, B-trees, R-trees
        different data types are indexed and searched separately

    Goals of IR

    Summarizing, there are two general goals common to all IR systems
        effectiveness: IR must be accurate (retrieves what the user expects to see in the answer)
        efficiency: IR must be fast (faster than sequential scanning)

    Query Formulation

    Must be flexible and convenient

    SQL queries: constraints on all attributes and data types may become very complex
    Queries by example, e.g., by providing an example document or image
    Browsing: display headers, summaries, miniatures etc., or for refining the retrieved results

    Access Methods for Text

    Access methods for text are interesting for at least 3 reasons
        multimedia documents contain text, e.g., images often have captions
        text retrieval has several applications in itself (library automation, web search etc.)
        text retrieval research has led to useful ideas like the vector space model, information filtering, relevance feedback etc.

    Keyword Matching

    The whole collection is searched
        no preprocessing
        no space overhead
        updates are easy
    The user specifies a string (regular expression) and the text is parsed using a finite state automaton
        KMP, BMH algorithms
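
    As an illustration of scanning text without any index, here is a minimal sketch of the Horspool simplification of Boyer-Moore (the BMH algorithm) for a plain string, not a regular expression. The function name is hypothetical, and the window is compared with a slice rather than character by character from right to left.

```python
def horspool_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table: distance from the last occurrence of each character
    # (ignoring the final pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Slide the window according to the text character under its last position;
        # characters that do not occur in the pattern allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches

print(horspool_search("abracadabra", "abra"))  # [0, 7]
```

    The only preprocessing is the small shift table built from the query string, which is consistent with the "no space overhead" property above.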

    Error Tolerant Methods

    Methods that can tolerate errors

    Scan text one character at a time by keeping track of matched characters and retrieve strings within a desired editing distance from the query
        Wu, Manber 92; Baeza-Yates, Gonnet 92
    Extension: regular expressions built up by strings and operators: pro(blem|tein) (s| ) | (0| 1| 2)

    Error Counting Methods

    Retrieve similar words or phrases, e.g., misspelled words or words with different pronunciation
        editing distance: a numerical estimate of the similarity between 2 strings
        phonetic coding: search words with similar pronunciation (Soundex, Phonix)
        N-grams: count common N-length substrings
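
    The N-gram idea can be sketched in a few lines. This example counts the distinct N-length substrings that two words share; using sets (so repeated substrings count once) and N = 2 as the default are assumptions of the sketch, and the function names are hypothetical.

```python
def ngrams(word, n=2):
    """Set of all substrings of length n in word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def common_ngrams(a, b, n=2):
    """Number of distinct n-grams shared by the two words."""
    return len(ngrams(a, n) & ngrams(b, n))

print(sorted(ngrams("cordis")))           # ['co', 'di', 'is', 'or', 'rd']
print(common_ngrams("cordis", "codris"))  # 2 ('co' and 'is')
print(common_ngrams("cordis", "cordis"))  # 5
```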

    Editing Distance

    Minimum number of edit operations that are needed to transform the first string to the second
        edit operation: insert, delete, substitute and (sometimes) transposition of characters
    d(si,ti): distance between characters
        usually d(si,ti) = 1 if si <> ti and 0 otherwise
    edit(cordis, codis) = 1: r is deleted
    edit(cordis, codris) = 1: transposition of rd

    i         0  1  2  3  4  5  6
    j            a  b  b  c  d  d
    0         0  1  2  3  4  5  6
    1    a    1  0  1  2  3  4  5
    2    a    2  1  1  2  3  4  5
    3    a    3  2  2  2  3  4  5
    4    b    4  3  2  2  3  4  5
    5    c    5  4  3  3  2  3  4
    6    c    6  5  4  4  3  3  4
    7    c    7  6  5  5  4  4  4
    8    d    8  7  6  6  5  4  4

    initialization cost: row 0 and column 0
    total cost: the bottom-right entry, 4

    Matching Algorithm

    Compute D(A,B)
        #A, #B: lengths of A, B
        0: null symbol
        R: cost of an edit operation

    D(0,0) = 0
    for i = 1 to #A: D(i,0) = D(i-1,0) + R(A[i]→0);
    for j = 1 to #B: D(0,j) = D(0,j-1) + R(0→B[j]);
    for i = 1 to #A
        for j = 1 to #B {
            1. m1 = D(i,j-1) + R(0→B[j]);
            2. m2 = D(i-1,j) + R(A[i]→0);
            3. m3 = D(i-1,j-1) + R(A[i]→B[j]);
            4. D(i,j) = min{m1, m2, m3};
        }
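
    The algorithm above can be sketched in Python as follows. The sketch assumes unit costs (every insertion, deletion and substitution of different characters costs 1) and does not treat transposition as a single operation. The function name is hypothetical.

```python
def edit_distance(a, b, cost=1):
    """Dynamic-programming edit distance with insert, delete and substitute."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + cost          # delete a[i-1]
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + cost          # insert b[j-1]
    for i in range(1, rows):
        for j in range(1, cols):
            m1 = d[i][j - 1] + cost           # insert b[j-1]
            m2 = d[i - 1][j] + cost           # delete a[i-1]
            # Substitution is free when the two characters are equal.
            m3 = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else cost)
            d[i][j] = min(m1, m2, m3)
    return d[-1][-1]

print(edit_distance("cordis", "codis"))     # 1, r is deleted
print(edit_distance("cordis", "codris"))    # 2, no transposition operation here
print(edit_distance("aaabcccd", "abbcdd"))  # 4
```

    The matrix needs (#A + 1) x (#B + 1) cells, so both time and space grow with the product of the two string lengths.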

    Classical IR Models

    Boolean model
        simple, based on set theory
        queries as Boolean expressions
    Vector space model
        queries and documents as vectors in term space
    Probabilistic model
        a probabilistic approach

    Text Indexing

    A document is represented by a set of index terms that summarize document contents
        adjectives, adverbs, connectives are less useful
        mainly nouns (lexicon look-up)
    requires text pre-processing (off-line)

    Indexing

    [Figure: index and data repository]

    Text Preprocessing

    Extract index terms from text
    Processing stages
        word separation, sentence splitting
        change terms to a standard form (e.g., lowercase)
        eliminate stop-words (e.g., and, is, the, ...)
        reduce terms to their base form (e.g., eliminate prefixes, suffixes)
        construct mapping between terms and documents (indexing)
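
    The first four stages can be sketched as a small pipeline. The stop-word list and the suffix list below are tiny, hypothetical stand-ins (a real system would use a full stop list and a proper stemmer), and the function name is an assumption of the example.

```python
import re

STOP_WORDS = {"and", "is", "the", "a", "of", "in", "to"}  # assumed, far from complete
SUFFIXES = ("ing", "es", "s")                             # crude suffix stripping only

def preprocess(text):
    """Turn raw text into a list of index terms."""
    # Word separation plus conversion to a standard (lowercase) form.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    for tok in tokens:
        if tok in STOP_WORDS:       # eliminate stop-words
            continue
        for suffix in SUFFIXES:     # reduce the term towards its base form
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[:-len(suffix)]
                break
        terms.append(tok)
    return terms

print(preprocess("The system retrieves images and stores documents"))
# ['system', 'retriev', 'imag', 'stor', 'document']
```

    The resulting terms are what the last stage, indexing, maps to documents.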

    Text Preprocessing Chart

    [Figure: from Baeza-Yates & Ribeiro-Neto, 1999]

    Inverted Files

    Dictionary: terms stored alphabetically
    Posting list: pointers to documents containing the term
    Posting info: document id, frequency of occurrence, location in document, etc.
        dictionary indexed by B-trees, tries or binary searched
    Pros: fast and easy to implement
    Cons: large space overhead (up to 300%)
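
    A minimal in-memory sketch of such a structure is shown below. It keeps the dictionary in alphabetical order and stores (document id, frequency) in each posting; term locations are omitted, and the function name and sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping document id -> list of index terms."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term in terms:
            # Posting info: frequency of the term in this document.
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    # Dictionary in alphabetical order; each posting list sorted by document id.
    return {term: sorted(index[term].items()) for term in sorted(index)}

docs = {1: ["data", "base", "data"], 2: ["data"], 3: ["base", "index"]}
print(build_inverted_index(docs))
# {'base': [(1, 1), (3, 1)], 'data': [(1, 2), (2, 1)], 'index': [(3, 1)]}
```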

    Inverted Index

    [Figure: index entries pointing to posting lists such as (1,2)(3,4), (4,3)(7,5) and (10,3), which in turn refer to documents 1 to 11]

    Query Processing

    1. Parse query and extract query terms
    2. Dictionary lookup
        binary search
    3. Get postings from inverted file
        one term at a time
    4. Accumulate postings
        record matching document ids, frequencies, etc.
        compute weights
    5. Rank answers by weights (e.g., frequencies)
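
    The five steps can be sketched against a tiny hand-made index. The sorted term list, the posting lists of (document id, frequency) pairs and the use of summed frequencies as weights are all assumptions of this example.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical index: sorted dictionary and one posting list per term.
TERMS = ["base", "data", "index"]
POSTINGS = [[(1, 1), (3, 1)], [(1, 2), (2, 1)], [(3, 1)]]

def process_query(query):
    """Return (doc_id, weight) pairs, best match first."""
    weights = defaultdict(int)
    for term in query.lower().split():           # 1. parse query, extract terms
        pos = bisect_left(TERMS, term)            # 2. dictionary lookup (binary search)
        if pos == len(TERMS) or TERMS[pos] != term:
            continue                              # term not in the dictionary
        for doc_id, freq in POSTINGS[pos]:        # 3. get postings, one term at a time
            weights[doc_id] += freq               # 4. accumulate postings into weights
    # 5. rank answers by weight, ties broken by document id
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))

print(process_query("data base"))  # [(1, 3), (2, 1), (3, 1)]
```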

    Signature Files

    A filter for eliminating most of the non-qualifying documents
        signature: short hash-coded representation of document or query
        the signatures are stored and searched sequentially
        search returns all qualifying documents plus some false alarms
        the answer is searched to eliminate false alarms

    Superimposed Coding

    Each word yields a bit pattern of size F with m bits set to 1 and the rest left as 0
        size of F affects the false drop rate
        bit patterns are OR-ed to form the document signature
        find documents having 1 in the same bits as the query

        word                  signature
        data                  001 000 110 010
        base                  000 010 101 001
        document signature    001 010 111 011
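
    A minimal sketch of superimposed coding is given below, with F = 12 and m = 4 as in the table. Deriving the bit positions from a SHA-256 hash of the word is an assumption of this sketch, so the generated patterns differ from the ones in the table; the function names are hypothetical.

```python
import hashlib

F = 12  # signature size in bits
M = 4   # bits set to 1 per word

def word_signature(word, f=F, m=M):
    """Bit pattern of size f with exactly m bits set, derived from a hash of the word."""
    sig = 0
    counter = 0
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % f)
        counter += 1
    return sig

def doc_signature(words):
    """OR the word patterns together to form the document signature."""
    sig = 0
    for word in words:
        sig |= word_signature(word)
    return sig

def may_contain(doc_sig, word):
    """True if the document has 1 in every bit set by the query word (may be a false drop)."""
    query_sig = word_signature(word)
    return (doc_sig & query_sig) == query_sig

# The OR step applied to the two patterns of the table:
print(format(0b001000110010 | 0b000010101001, "012b"))  # 001010111011

sig = doc_signature(["data", "base"])
print(may_contain(sig, "data"))  # True, the word is in the document
```

    A True answer only means the document qualifies as a candidate; the document itself still has to be checked to rule out a false drop.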

    Bitmaps

    For every term, a bit vector is stored
        each bit corresponds to a document
        set to 1 if term appears in document, 0 otherwise
        e.g., word 10100: the word appears in documents 1 and 3 in a collection of 5 documents
    Very efficient for Boolean queries
        retrieve bit-vectors for each query term
        combine bit vectors with Boolean operators
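
    Bit vectors map directly onto integers, so a Boolean query becomes a few bitwise operations. In this sketch the vector 10100 is the one from the example above, while the second term and its vector 00110 are hypothetical.

```python
N_DOCS = 5  # collection of 5 documents; the leftmost bit is document 1

def to_docs(vector, n_docs=N_DOCS):
    """List the document numbers whose bit is set in the vector."""
    return [d for d in range(1, n_docs + 1) if (vector >> (n_docs - d)) & 1]

data = int("10100", 2)  # the term appears in documents 1 and 3
base = int("00110", 2)  # hypothetical second term, in documents 3 and 4

mask = (1 << N_DOCS) - 1          # keeps NOT within the 5 document bits
print(to_docs(data))              # [1, 3]
print(to_docs(data & base))       # data AND base -> [3]
print(to_docs(data | base))       # data OR base  -> [1, 3, 4]
print(to_docs(data & ~base & mask))  # data AND NOT base -> [1]
```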

    Comparison of Text Indexing Methods

    Bitmaps:
        pros: intuitive, easy to implement and to process
        cons: space overhead
    Signature files:
        pros: easy to implement, compact index, no lexicon
        cons: many unnecessary accesses to documents
    Inverted files:
        pros: used by most search engines
        cons: large space overhead (the index can be compressed)

    References

    Searching Multimedia Databases by Content, C. Faloutsos, Kluwer Academic Publishers, 1996
    Modern Information Retrieval, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley, 1999
    Automatic Text Processing, Gerard Salton, Addison Wesley, 1989
    Information Retrieval Links: http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html