Multilingual Information Retrieval

  • 7/31/2019 Multilingual Information Retrieval

    Multilingual Information Retrieval

    by T.Mehbub Basha

    Overview

    Introduction

    Document Preprocessing

    Monolingual Information Retrieval

    Introduction

    Information retrieval (IR) is concerned with satisfying the information needs of users, for example by retrieving relevant documents.

    The large number of World Wide Web (WWW) websites requires efficient approaches to retrieve relevant subsets for specific information needs.

    Because the number of information items is constantly increasing, the retrieval techniques applied to Web search must be adapted to these new scenarios.

    Why do we need multilingual retrieval?

    Websites, social networks, and personal emails are written in many different languages (English speakers account for 27.3% of Internet users; statistics last accessed November 16, 2010).

    People from different nations and languages are connected in social networks.

    Internet usage statistics, as presented in Figure 1.1, show that only about one fourth of Internet users are native English speakers.

    Figure 1.1: Statistics of the number of Internet users by language

    Cont..

    Many information retrieval approaches are based on Machine Translation (MT) systems. However, these systems still have high error rates (for example, errors in grammar and meaning).

    This motivates the development of multilingual retrieval methods that do not depend on MT, or that are at least able to compensate for errors introduced by the translation systems.

    DEFINITION OF INFORMATION RETRIEVAL:

    Given a collection D containing information items d_i and a keyword query q representing an information need, IR is defined as the task of retrieving a ranked list of information items d_1, d_2, ... sorted by their relevance with respect to the specified information need.

    The overall search process is visualized in Figure II.1. This process consists of two parts:

    1. Indexing part

    2. Search part

    The indexing part processes the entire document collection to build index structures. Each document is thereby preprocessed and mapped to a vector representation.

    The search part is based on the same preprocessing step, which is also applied to the query. Using the vector representation of the query, the matching algorithm determines relevant documents, which are then returned as ranked results.
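
    The two parts can be illustrated with a minimal sketch. The slides do not specify the weighting scheme or the matching algorithm, so TF-IDF weights and cosine similarity are assumed here as one common choice. The function names (preprocess, build_index, search) and the whitespace-based preprocessing are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def preprocess(text):
    # Assumed preprocessing: lowercase, then split on whitespace.
    return text.lower().split()


def build_index(documents):
    """Indexing part: map every document to a sparse TF-IDF vector."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return idf, vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, idf, vectors):
    """Search part: preprocess the query the same way, then rank documents."""
    counts = Counter(preprocess(query))
    # Query terms that never occurred in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    # Return (document index, score) pairs, best match first.
    return [(i, score) for score, i in ranked if score > 0.0]
```

    Calling build_index once over the collection and then search for each query mirrors the split between the indexing part and the search part. A production system would typically use an inverted index rather than comparing the query against every document vector.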

    In the monolingual case, the content of the information items d_i and the keyword query q are written in the same language.

    In cross-lingual and multilingual IR, the information need and the corresponding query of the user may be formulated in languages other than the one in which the documents are written.

    Introduction

    Document Preprocessing

    Monolingual Information Retrieval

    Preprocessing takes a set of raw documents as input and produces a set of tokens as output.

    Depending on language, script, and other factors, the process for identifying terms can differ substantially.

    For Western European languages, the terms used in IR systems are often defined by the words of these languages. In Chinese, however, words are not separated by whitespace. Using character sequences as terms instead avoids the problem of detecting word borders.
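
    The contrast between the two term definitions can be sketched as follows. The slides do not state the length of the character sequences, so overlapping bigrams (n = 2) are an assumption, and both function names are hypothetical.

```python
def word_tokens(text):
    # Whitespace tokenization, suitable when words are separated by spaces.
    return text.split()


def char_ngrams(text, n=2):
    # Overlapping character n-grams; no word border detection is needed.
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]
```

    For example, word_tokens("information retrieval systems") yields three word terms, while char_ngrams("信息检索") yields the overlapping bigrams 信息, 息检, and 检索 without any knowledge of where Chinese words begin or end.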

    Common techniques used for document preprocessing:

    document syntax, encoding, tokenization, and normalization of tokens
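
    A minimal sketch of how three of these steps might be chained is shown below. It assumes that the encoding of each raw document is already known, that tokens are runs of word characters, and that normalization means Unicode NFKC normalization plus case folding. None of these choices come from the slides, the function names are hypothetical, and handling of document syntax (for example, stripping markup) is left out.

```python
import re
import unicodedata


def decode(raw, encoding="utf-8"):
    # Encoding step: turn raw bytes into text, replacing undecodable bytes.
    return raw.decode(encoding, errors="replace")


def normalize_token(token):
    # Normalization step: unify compatibility characters, then fold case.
    return unicodedata.normalize("NFKC", token).casefold()


def preprocess(raw, encoding="utf-8"):
    text = decode(raw, encoding)
    # Tokenization step: runs of Unicode word characters.
    tokens = re.findall(r"\w+", text)
    return [normalize_token(t) for t in tokens]
```

    With this sketch, "Café" and "CAFÉ" map to the same token, so a query in either spelling matches documents that use the other.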
