Information Retrieval - IR


  • Information Retrieval - IR

    Karin Friberg Heppin Språkdata

    Institutionen för svenska språket

    Göteborgs Universitet

  • Example

    Google

    Web

  • Retrieving documents

    • Goal = find documents in a large document collection that are relevant to an information need

    [Diagram: an information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list]

  • What is Information Retrieval?

    General definition

    • Retrieval of unstructured data

    Usually it is

    • Retrieval of text documents in big collections

    • Web search

    But it can also be

    • Image retrieval

    • Video retrieval

    • Music retrieval

    …

  • IR research ranges from…

    … computer science: researching the framework, the modeling of relations between documents and queries …

    … over linguistics: optimizing the language of document representation and queries …

    … to information science: investigating the search behavior and preferences of the users

  • Information retrieval

    Information retrieval is about finding stored and indexed documents that are relevant to an information need.

    This is done by posing a query to a search engine, which matches the terms used as search keys to the terms used to store the documents in the index.

  • Information retrieval

    Documents are stored in an index

    Formulate queries: formal representations of information needs

    Words in the queries are matched against words in the documents (the index)

    Output: a list of documents

    The better the match between the query and a document, the higher the document is ranked
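
The steps on this slide (index the documents, match query words against the index, return a ranked list) can be sketched in a few lines. This is a minimal illustration, not how any particular search engine works: the toy documents, the function names `build_index` and `search`, and the scoring rule (count of distinct matching query words) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy collection; the texts are illustrative only.
DOCUMENTS = {
    1: "hippos live in rivers and lakes",
    2: "the zoo keeps two hippos",
    3: "a zoo in the city",
}

def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Rank documents by the number of distinct query words they contain."""
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Best match first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

index = build_index(DOCUMENTS)
print(search(index, "hippos zoo"))  # [2, 1, 3]
```

Document 2 contains both query words and is ranked first; documents 1 and 3 each match one word. Real systems typically weight terms rather than simply counting matches.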

  • The goal of Information retrieval

    •Separate relevant from non-relevant documents

    •Rank relevant documents above non-relevant ones

    •Rank highly relevant documents above less relevant documents

  • Relevance

    The relevance of a document is a measure of how well the document satisfies the user's information need

    - A relevant document contains the information that a person was looking for when they submitted a query to the search engine

    - Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style

    - Topical relevance (same topic) vs. user relevance (what is useful for the user)

    - Relevance may be binary: relevant/irrelevant or have a multilevel scale: e.g. 0-3

  • What IR is usually not about

    Not about structured data

    Retrieval from databases is usually not considered

    • Queries in databases assume that the data is in a standardized format

    Database queries have a right answer

    - How much money did you make last year?

    IR problems usually don’t

    - Find all documents relevant to “hippos in a zoo”

  • Databases vs. IR

    What we’re retrieving

    - Databases: Structured data. Clear semantics based on a formal model.

    - IR: Mostly unstructured. Free text with some metadata.

    Queries we’re posing

    - Databases: Formally (mathematically) defined queries. Unambiguous.

    - IR: Vague, imprecise information needs (often expressed in natural language).

    Results we get

    - Databases: Exact. Always correct in a formal sense.

    - IR: Sometimes relevant, often not.

  • What to study in IR

    • Document and query indexing

    • How to best represent their contents?

    • Query evaluation (or retrieval process)

    • To what extent is a document relevant to a query?

    • System evaluation

    • How good is a system?

    • Are the retrieved documents relevant? (precision)

    • Are all the relevant documents retrieved? (recall)
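
The two evaluation questions above correspond to the standard set-based measures: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, assuming binary relevance judgments; the function name `precision_recall` and the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids returned by the system; relevant: ids judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the collection.
precision, recall = precision_recall([1, 2, 3, 4], [2, 3, 4, 7, 8, 9])
print(precision, recall)  # 0.75 0.5
```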

  • How do we find the documents?

    Possible approaches:

    1. String matching (linear search in documents)

    - Slow

    - Difficult to improve

    - Like browsing through a whole book

    2. Indexing

    - Fast

    - Flexible to further improvement

    - Like consulting the index at the back of the book
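
The contrast between the two approaches can be illustrated as follows. This is a sketch on a made-up three-document collection; `linear_search`, `build_index` and `index_search` are hypothetical names. Both return the same answer, but the linear scan reads every document for every query, while the index is built once and each query then becomes a single lookup.

```python
from collections import defaultdict

# Hypothetical toy collection.
DOCUMENTS = {
    1: "the crane is a large bird",
    2: "a crane lifted the steel beam",
    3: "the bank of the river",
}

def linear_search(documents, term):
    """Approach 1: read every document in full for every query."""
    return sorted(doc_id for doc_id, text in documents.items()
                  if term in text.split())

def build_index(documents):
    """Approach 2, done once in advance: map each word to its documents."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def index_search(index, term):
    """Answer a query with a single dictionary lookup."""
    return sorted(index.get(term, ()))

index = build_index(DOCUMENTS)
print(linear_search(DOCUMENTS, "crane"))  # [1, 2]
print(index_search(index, "crane"))       # [1, 2]
```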

  • Word-Level Issues

    • Morphological variation

    = different forms of the same concept

    – Inflectional morphology: same part of speech

    break, broke, broken; sing, sang, sung; etc.

    – Derivational morphology: different parts of speech

    destroy, destruction; invent, invention, reinvention; etc.

    • Synonymy

    = different words, same meaning

    {dog, canine, doggy, puppy, etc.} → concept of dog

    • Polysemy

    = same word, different meanings

    Bank: financial institution or side of a river?

    Crane: bird or construction equipment?

    Is: depends on what the meaning of “is” is!

  • Text operations

    Tokenization: convert a sequence of characters into a sequence of words

    Remove punctuation and digits

    Stop word list: remove common function words

    Find multi-word phrases that should be treated as one word

    Stemming/lemmatization: bring together different inflected forms and derivations under one stem or base form
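
The operations listed on this slide can be chained into a small preprocessing pipeline. The sketch below is only an illustration under several assumptions. The stop word list, the phrase table and the suffix list are tiny and invented. The pattern `[a-z]+` handles plain English letters only, so it would split Swedish words at å, ä and ö. Phrases are joined before stop words are removed, so that a phrase containing a function word would survive.

```python
import re

# Hypothetical, deliberately tiny resources for illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to"}
PHRASES = {("new", "york"): "new_york"}
SUFFIXES = ("ing", "es", "s")  # crude suffix stripping, not a real stemmer

def tokenize(text):
    """Lowercase and keep runs of letters; punctuation and digits are dropped."""
    return re.findall(r"[a-z]+", text.lower())

def join_phrases(tokens):
    """Merge known two-word phrases into a single token."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, join phrases, drop stop words, then stem what is left."""
    tokens = join_phrases(tokenize(text))
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The 2 hippos are swimming in New York."))
# ['hippo', 'swimm', 'new_york']
```

Note that the stem `swimm` is not a real word, which is typical of purely algorithmic suffix stripping.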

  • Tokenization

    Tokenization: separating text into terms

    Word separators: Space - most common separator

    But it is not so easy

  • What is a word / token?

    I'll send you Luca's book

    IBM360, IBM-360, ibm 360

    Richard Brown

    brown paint

    flowerpot

    flower-pot

    flower pot
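
To see why the examples above are hard, compare two simple tokenizers on them. Both functions are illustrative assumptions, not a recommended design: one splits on whitespace only, the other lowercases and treats every character that is not a letter or digit as a separator.

```python
import re

def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached."""
    return text.split()

def alnum_tokens(text):
    """Lowercase, then keep runs of letters and digits; all else separates."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(whitespace_tokens("I'll send you Luca's book"))
# ["I'll", 'send', 'you', "Luca's", 'book']
print(alnum_tokens("I'll send you Luca's book"))
# ['i', 'll', 'send', 'you', 'luca', 's', 'book']

for variant in ["IBM360", "IBM-360", "ibm 360"]:
    print(variant, "->", alnum_tokens(variant))
# IBM360 -> ['ibm360']
# IBM-360 -> ['ibm', '360']
# ibm 360 -> ['ibm', '360']
```

Neither choice is clearly right: the second tokenizer matches `IBM-360` with `ibm 360` but not with `IBM360`, and it breaks `I'll` and `Luca's` into fragments.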

  • Tokenizing Problems

    Capitalized words can have a different meaning from lower-case words

    Bush, Apple

    But they can also represent the same word

    Horse, horse, HORSE

    Apostrophes can be a part of a word, a part of a possessive, or just a mistake

    rosie o'donnell, can't, don't, 80's, 1890's

    men's straw hats, master's degree

    england's ten largest cities, shriner's

  • Tokenizing Problems

    Numbers are ambiguous

    1992: year or number?

    Numbers can be important, including decimals:

    nokia3250, top 10 courses, united 93, B-52, 7-eleven

    Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations

    I.B.M., Ph.D., cs.umass.edu, F.E.A.R.

  • Tokenizing Problems

    Small words can be important in some queries, usually in combinations

    xp, ma, pm, el paso, j lo, world war II

    Both hyphenated and non-hyphenated forms of many words are common

    Sometimes the hyphen is not needed

    e-bay, wal-mart, active-x, cd-rom, t-shirts

    At other times, hyphens should be considered either as part of the word or as a word separator

    winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking

  • Stemming/lemmatization

    Many morphological variations of words

    In most cases, these have the same or very similar meanings

    Stemmers attempt to reduce morphological variations of words to a common stem

    Usually involves removing suffixes

    The alternative is to use several variations of the words in the queries

  • Stemming vs lemmatization

    Lemmatization:

    Dictionary-based: uses lists of related words

    Produces morphologically valid units

    Stemming:

    Algorithmic: uses a program to determine related words

    The products are not necessarily morphologically valid units
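
The difference between the two approaches can be made concrete with a toy example. Everything here is an assumption for illustration: the lemma dictionary holds five entries, and the suffix rules are invented rather than taken from a published stemmer.

```python
# Hypothetical lemma dictionary; a real lemmatizer uses a full lexicon.
LEMMAS = {"broke": "break", "broken": "break", "sang": "sing",
          "sung": "sing", "ponies": "pony"}

# Hypothetical suffix rules, tried in order; not a published stemmer.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def lemmatize(word):
    """Dictionary lookup: returns a valid base form, or the word unchanged."""
    return LEMMAS.get(word, word)

def stem(word):
    """Rule-based suffix rewriting: the result need not be a real word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

for word in ["ponies", "broken", "walking"]:
    print(word, lemmatize(word), stem(word))
# ponies pony poni
# broken break broken
# walking walking walk
```

The output shows the trade-off: the dictionary yields valid forms (`pony`, `break`) but does nothing for words it does not list (`walking`), while the rules handle unseen words but produce non-words (`poni`) and miss irregular forms (`broken`).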

  • Over-stemming/under-stemming

    Over-stemming (errors of commission):

    Incorrectly lumps unrelated terms together

    doe/doing

    execute/executive

    ignore/ignorant

    Under-stemming (errors of omission):

    Fails to lump related terms together

    create/creation

    europe/european

    cylinder/cylindrical

  • Effect of stemming/lemmatization

    Generally a small but significant effectiveness improvement

    Crucial for some languages:

    5-10% improvement for English, up to 50% in Arabic

    Example of mean relative improvement due to stemming:

    +4% English

    +4% Dutch

    +7% Spanish

    +9% French

    +15% Italian

    +19% German

    +29% Swedish

    +34% Bulgarian

    +40% Finnish

    +44% Czech

  • Stemmer Comparison

  • Undesirable tokens can be eliminated:

    Non-content bearing tokens

    Special characters

    Numbers: dates, amounts…

    Very short or very long tokens, ...

    Full text index or not

    Not as important anymore as storage space is cheap

  • Removed:

    - Function words, determiners, prepositions, which have little meaning on their own

    - High occurrence frequencies

    Reason:

    - To reduce index space, improve response time, improve effectiveness
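
Since the words removed are characterized by high occurrence frequencies, one simple way to obtain a candidate stop list is to count words across the collection and take the most frequent ones. This is a sketch on an invented three-document collection; the function names and the cutoff `top_n=3` are assumptions, and in practice stop lists are often curated by hand.

```python
from collections import Counter

# Hypothetical toy collection.
DOCUMENTS = [
    "the hippo is in the zoo",
    "the crane is a bird",
    "a bank of the river",
]

def frequent_words(documents, top_n):
    """Return the top_n most frequent words as a candidate stop list."""
    counts = Counter(word for text in documents for word in text.split())
    return {word for word, _ in counts.most_common(top_n)}

def remove_stop_words(text, stop_words):
    """Keep only the words that are not on the stop list."""
    return [word for word in text.split() if word not in stop_words]

stop_words = frequent_words(DOCUMENTS, top_n=3)
print(sorted(stop_words))  # ['a', 'is', 'the']
print(remove_stop_words("the hippo is in the zoo", stop_words))
# ['hippo', 'in', 'zoo']
```

The function word `in` survives because it occurs only once in this tiny collection, which shows why a frequency cutoff alone is a rough heuristic.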