Information Retrieval and Evaluation
Knowledge Discovery and Data Mining 2 (VU) (707.004)
Heimo Gursch, Roman Kern
ISDS, TU Graz
2019-04-04
Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 1 / 66
Outline
1 Basics
    Foundations
    Inverted Index
    Text Preprocessing
2 Advanced Concepts
    Relevance Feedback
    Query Processing
    Learning-To-Rank
    Probabilistic IR
    Latent Semantic Indexing
3 Evaluation
4 Applications & Solutions
    Distributed & Federated Search
    Content-based Recommender Systems
    Open Source Solutions
    Commercial Solutions
Basics
Inverted Index & Text Preprocessing
Foundations
Definition
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

The term "term" is very common in Information Retrieval and typically refers to single words (also common: bi-grams, phrases, ...)

Basic assumptions of Information Retrieval
Collection: Fixed set of documents
Goal: Retrieve documents with information that is relevant to the user's information need and help the user complete a task.
Structured, Semi-Structured Data
Structured data
Format and type of information is known
IR task: normalize representations between different sources
Example: SQL databases

Semi-structured data
Combination of structured and unstructured data
IR task: extract information from the unstructured part and combine it with the structured information
Example: email, documents with meta-data
Unstructured data
Unstructured data
Format of information is not known; the information is hidden in (human-readable) text
IR task: extract any information at all
Example: blog posts, Wikipedia articles
This lecture focuses on unstructured data.
Overview Indexed Searching
Split process
Indexing: Convert all input documents into a data structure suitable for fast searching
Searching: Take an input query and map it onto the pre-processed data structure to retrieve the matching documents

Inverted index
Reverses the relationship between documents and contained terms (words)
Central data structure of a search engine
Information Stored in an Inverted Index
Dictionary
Sorted list of all terms found in all documents

Postings
Information about the occurrence of a term, consisting of:
Document (DocID): Document identifier, i.e., in which document the term was found
Position(s) (Pos): Location(s) of the term in the original document
Term Frequency (TF): How often the term occurs in the document*

* The term frequencies could also be calculated by counting the position entries. As computing time is more precious than memory, the term frequencies are stored as well.
Example
Document 1: Sepp was stopping off in Hawaii. Sepp is a co-worker of Heidi.
Document 2: Heidi stopped going to Hawaii. Heidi is a co-worker of Sepp.

Dictionary → Postings
co-worker → (Doc:1; TF:1; Pos:10); (Doc:2; TF:1; Pos:9)
going → (Doc:2; TF:1; Pos:3)
Hawaii → (Doc:1; TF:1; Pos:6); (Doc:2; TF:1; Pos:5)
Heidi → (Doc:1; TF:1; Pos:12); (Doc:2; TF:2; Pos:1,6)
Sepp → (Doc:1; TF:2; Pos:1,7); (Doc:2; TF:1; Pos:11)
stopped → (Doc:2; TF:1; Pos:2)
stopping off → (Doc:1; TF:1; Pos:3)
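The construction above can be sketched in a few lines of Python. This is a minimal in-memory version under simplifying assumptions: plain whitespace tokenization only, so multi-word terms such as "stopping off" and punctuation handling are left out.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: mapping DocID -> text; positions are 1-based as in the example
    occurrences = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            occurrences[term][doc_id].append(pos)
    # dictionary entry -> postings list of (DocID, TF, positions)
    return {term: [(doc_id, len(positions), positions)
                   for doc_id, positions in sorted(postings.items())]
            for term, postings in occurrences.items()}

docs = {1: "Sepp was stopping off in Hawaii",
        2: "Heidi stopped going to Hawaii"}
index = build_inverted_index(docs)
```

Storing the TF redundantly next to the positions mirrors the footnote above: memory is traded for lookup speed.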
Properties
Properties
Search for terms
Formulate boolean queries
Search for phrases (i.e., terms in a particular order)
Information on where the term occurred in the document
Possible creation of snippets
Documents can be ranked by the number of terms they contain

Keep in mind that an inverted index
... takes up additional space, as not only the original documents but also the inverted index must be stored
... must be kept up to date
Text Preprocessing
To make a text indexable, a couple of steps are necessary:
1 Detect the language of the text
2 Detect sentence boundaries
3 Detect term (word) boundaries
4 Stemming and normalization
5 Remove stop words
Language Detection
Letter Frequencies
Percentage of occurrence
Letter  English  German
A        8.17     6.51
E       12.70    17.40
I        6.97     7.55
O        7.51     2.51
U        2.76     4.35

N-Gram Frequencies
Better, more elaborate methods use statistics over more than one letter, e.g. statistics over two, three or even more consecutive letters.
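A letter-frequency detector can be sketched by comparing the observed frequencies against per-language reference profiles. As an assumption for illustration, the profiles below contain only the five vowels from the table above; a real detector would profile all letters, or character n-grams as noted.

```python
from collections import Counter

# vowel frequencies (percent) from the table above; a real detector
# would use all letters or character n-grams
PROFILES = {
    "en": {"a": 8.17, "e": 12.70, "i": 6.97, "o": 7.51, "u": 2.76},
    "de": {"a": 6.51, "e": 17.40, "i": 7.55, "o": 2.51, "u": 4.35},
}

def guess_language(text):
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return "unknown"
    counts = Counter(letters)
    freq = {c: 100.0 * n / len(letters) for c, n in counts.items()}
    # pick the profile with the smallest squared deviation
    def error(profile):
        return sum((freq.get(c, 0.0) - p) ** 2 for c, p in profile.items())
    return min(PROFILES, key=lambda lang: error(PROFILES[lang]))
```

On such short profiles the guess is unreliable for short inputs; n-gram statistics over larger samples are far more robust.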
Sentence Detection
First approach
Every period marks the end of a sentence

Problem
Periods also mark abbreviations, decimal points, email addresses, etc.

Utilize other features
Is the next letter capitalized?
Is the term in front of the period a known abbreviation?
Is the period surrounded by digits?
...
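The features above can be combined into a simple rule-based splitter. This is a sketch only; the abbreviation list is a toy assumption, and real sentence detectors are usually trained models.

```python
import re

ABBREVIATIONS = {"Dr", "Prof", "etc", "e.g", "i.e"}  # toy list, an assumption

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"\.", text):
        i = match.start()
        words = text[start:i].split()
        before = words[-1] if words else ""
        after = text[i + 1:].lstrip()
        # decimal point: digits on both sides of the period
        if i > 0 and text[i - 1].isdigit() and after[:1].isdigit():
            continue
        # known abbreviation in front of the period
        if before.rstrip(".") in ABBREVIATIONS:
            continue
        # next letter capitalised (or end of text) -> treat as a boundary
        if not after or after[0].isupper():
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```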
Tokenization – Problem
Goal
Split sentences into tokens
Throw away punctuation

Possible Pitfalls
White spaces are no safe delimiter for word boundaries (e.g. New York)
Non-alphanumerical characters can separate words, but need not (e.g. co-worker)
Different forms (e.g. white space vs. whitespace)
German compound nouns (e.g. Donaudampfschiffahrtsgesellschaft, meaning Danube Steamboat Shipping Company)
Tokenization – Solutions
Supervised Learning
Train a model with annotated training data, use the trained model to tokenize unknown text
Hidden Markov models and conditional random fields are commonly used

Dictionary Approach
Build a dictionary (i.e. list) of tokens
Go over the sentence and always take the longest fitting token (greedy algorithm!)
Remark: Works well for Asian languages without white spaces and with short words. Problematic for European languages
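The greedy longest-match idea can be sketched as follows; the dictionary is hypothetical, and single characters act as a fallback when nothing matches.

```python
def greedy_tokenize(text, dictionary):
    # longest-match-first segmentation; single characters are the fallback
    text = text.replace(" ", "")  # pretend there are no reliable word separators
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

vocab = {"in", "new", "york", "newyork"}  # hypothetical token dictionary
```

The output also illustrates the greedy pitfall mentioned above: with this vocabulary, "in new york" segments to ["in", "newyork"], preferring the longest token even when "new" + "york" might be the intended reading.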
Normalization
Some definitions
Token: Sequence of characters representing a useful semantic unit, instance of a type
Type: Common concept for tokens, element of the vocabulary
Term: Type that is stored in the dictionary of an index

Task
1 Map all possible tokens to the corresponding type
2 Store all representing terms of one type in the dictionary

Solutions
Provide a list of different representations
Provide rules for mapping representations to the main form
Map different spelling versions by using a phonetic algorithm
Example
Stemming & Lemmatization
Definition
Goal of both: Reduce different grammatical forms of a word to their common infinitive*
Stemming: Usage of heuristics to chop off or replace the last part of words (Example: Porter's algorithm).
Lemmatization: Usage of proper grammatical rules to recreate the infinitive.

* The common infinitive is the form in which a word is written in a dictionary. This form is called lemma, hence the name lemmatization.

Examples
Map do, doing, done to the common infinitive do
Map going, went, gone to go
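A crude suffix-stripping stemmer in the spirit of (but much simpler than) Porter's algorithm can be sketched as below; the suffix rules are illustrative assumptions, while the real algorithm applies several rule passes guarded by stem-measure conditions.

```python
# ordered suffix rules; longer suffixes must be tried first
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # keep at least three characters of stem to avoid mangling short words
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word
```

Note the limits of such heuristics: this sketch reduces "stopping" to "stopp", where Porter's additional rules would also undouble the consonant.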
Stop Words
Definition: Extremely common words that appear in nearly every text
Problem: As stop words are so common, their occurrence does not characterise a text
Solution: Just drop them, i.e., do not put them in the index

Common stop words
Write a list of stop words (the list might be topic-specific!)
Usual suspects: articles (a, an, the), conjunctions (and, or, but, ...), pre- and postpositions (in, for, from), etc.

Solution
Stop word list: Ignore words that are given on a list (black list)
Problem: Special names and phrases (The Who, Let It Be, ...)
Solution: Make another list... (white list)
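A black list plus a phrase white list might be sketched as below; both lists are toy assumptions, and the exact semantics of the white list (here: a whole token sequence matching a protected phrase survives intact) is one possible design among several.

```python
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "in", "for", "from"}
PHRASE_WHITELIST = {"the who", "let it be"}  # phrases kept despite stop words

def remove_stop_words(tokens):
    # whitelisted phrases survive as a whole; otherwise drop blacklisted terms
    if " ".join(tokens).lower() in PHRASE_WHITELIST:
        return tokens
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```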
Advanced Concepts
Relevance Feedback, Query Processing, ...
Relevance Feedback
Integrate feedback from the user
Given a search result for a user's query
... allow the user to rate the retrieved results (good vs. not-so-good match)
This information can then be used to re-rank the results to match the user's preference
Often automatically inferred from the user's behaviour using the search query logs
... and ultimately improve the search experience
The query logs (optimally) contain the queries plus the items the user has clicked on (the user’s session)
Relevance Feedback
Pseudo relevance feedback
No user interaction is needed for pseudo relevance feedback
... the first n results are simply assumed to be a good match
As if the user had clicked on the first results and marked them as good matches
→ often improves the quality of the search results
Pseudo relevance feedback is sometimes also called blind relevance feedback
Query Reformulation
Modify the query during or after the user interaction
Query suggestion & completion
(Automatic) query correction
(Automatic) query expansion
Query Suggestion
The user starts typing a query
... and automatically gets suggestions
→ to complete the currently typed word
→ to complete a whole phrase (e.g. the next word)
→ to suggest related queries
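Prefix completion over a log of past queries can be sketched with a sorted list and binary search; the query log here is hypothetical, and production systems would rank completions by popularity rather than return them alphabetically.

```python
import bisect

def complete(prefix, past_queries):
    # return all logged queries that start with the typed prefix
    queries = sorted(past_queries)
    lo = bisect.bisect_left(queries, prefix)
    hi = bisect.bisect_left(queries, prefix + "\uffff")  # just past the prefix range
    return queries[lo:hi]
```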
Query Correction
The user entered a query, which may contain spelling errors
... automatically correct or suggest possible rectified queries
... also known as Query Reformulation
→ use other users' behaviour (harvest query logs)
→ use the frequency of terms found in the indexed documents (and the similarity with the entered words)

Notebook with various distances (Levenshtein being the best known): http://nbviewer.ipython.org/gist/MajorGressingham/7691723
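The Levenshtein distance mentioned above counts the minimum number of single-character edits between two strings; a standard dynamic-programming implementation:

```python
def levenshtein(a, b):
    # dynamic programming over insertions, deletions and substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]
```

A corrector would then suggest indexed or logged terms within a small distance of the entered word.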
Using Google as Spell Checker
Query Expansion
The user has entered a query
... automatically add (related) words to the query
→ Global query expansion: only use the query plus corpus or background knowledge
→ Local query expansion: conduct the search and analyse the search results
Global Query Expansion
Uses only the query and background knowledge
Example: Use of a thesaurus
  Query: [buy car]
  Use synonyms from the thesaurus
  Expanded query: [(buy OR purchase) (car OR auto)]
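The thesaurus example can be sketched directly; the tiny thesaurus is an assumption for illustration.

```python
THESAURUS = {"buy": ["purchase"], "car": ["auto"]}  # toy thesaurus

def expand_query(terms):
    parts = []
    for term in terms:
        synonyms = THESAURUS.get(term, [])
        if synonyms:
            parts.append("(" + " OR ".join([term] + synonyms) + ")")
        else:
            parts.append(term)
    return " ".join(parts)
```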
Local query expansion
Uses the query plus the search results
... conceptually similar to pseudo relevance feedback
Look for terms in the first n search results and add them to the query
Re-run the query with the added terms
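A minimal sketch of this idea, assuming the top results are given as token lists and that "most frequent new term" is a good enough selection criterion (real systems weight candidate terms more carefully):

```python
from collections import Counter

def local_expansion(query_terms, top_results, extra=2):
    # assume the first n results are relevant; add their most frequent new terms
    counts = Counter(term for doc in top_results
                     for term in doc if term not in query_terms)
    return list(query_terms) + [term for term, _ in counts.most_common(extra)]
```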
More Like This
A common feature of search engines is the more like this search
Given a search result
... the user wants items similar to one of the presented results
This can be seen as taking a document as the query
→ which is a common implementation (but restricting the query to the most discriminative terms)
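Selecting the most discriminative terms is often done with a TF-IDF score; a sketch, assuming the corpus is given as a list of per-document term sets:

```python
import math
from collections import Counter

def top_terms(doc_tokens, corpus, n=3):
    # keep the n most discriminative terms of the document, scored by TF-IDF
    tf = Counter(doc_tokens)
    N = len(corpus)
    def idf(term):
        df = sum(1 for doc in corpus if term in doc)   # document frequency
        return math.log((N + 1) / (df + 1))            # smoothed IDF
    ranked = sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)
    return ranked[:n]
```

The returned terms would then form the (weighted) query sent back to the index.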
Learning-To-Rank
Motivation from Web search
Relevance might be due to a lot of different reasons
  Keywords in the text or the meta-data
  Anchor texts
  PageRank
  ... many others
Basic idea:
  Use evidence (from users)
  ... to learn which source of information
  ... is the most beneficial
  ... by the use of Machine Learning
Learning-To-Rank
Supervised Learning - Setting
Number of queries: Q = {q_1, q_2, ..., q_m}
Set of documents: D = {d_1, d_2, ..., d_N}
  Each query has labelled relevant documents
  With labels Y = {1, 2, ..., l}
  being ranked, i.e. l ≻ l−1 ≻ ... ≻ 1
  Relevant documents for query q_i: D_i = {d_i,1, d_i,2, ..., d_i,n_i}
  Labels for query q_i: y_i = {y_i,1, y_i,2, ..., y_i,n_i}
Training data set: S = {((q_i, D_i), y_i)}, i = 1, ..., m
  where (q_i, D_i) is replaced by representative features
H. Li, “A Short Introduction to Learning to Rank,” IEICE Trans. Inf. Syst., vol. E94-D, no. 1, pp.1–2, 2011.
Learning-To-Rank
Supervised Learning - Data Labelling
Where to take the labelling information from?
  Directly from human judgements
    e.g., perfect, excellent, good, fair, bad
  Indirectly via user behaviour
    e.g., via Web log analysis

Note: Ordinal classification (ordinal regression) vs. ranking, i.e., ranking is about recreating the expected ordering
Learning-To-Rank
Supervised Learning - Main Approaches
Pointwise approach
  Transform the problem into a classification, regression, or ordinal classification problem
  e.g., SVM for ordinal classification, i.e., find parallel hyperplanes
Pairwise approach
  Transform the problem into pairwise classification or pairwise regression
  e.g., Ranking SVM, i.e., classify the order of pairs
Listwise approach
  Directly learn/optimise the ranking
  e.g., SVM-MAP, i.e., optimise a scoring function (goodness of ranking)
Information flow
The classic search model
Probabilistic Information Retrieval

Every transformation can entail information loss:
1 Task to query
2 Query to document

Information loss leads to uncertainty
Probability theory can deal with uncertainty

Basic idea behind probabilistic IR
Model the uncertainty, which originates in the imperfect transformations
Use this uncertainty as a measure of how relevant a document is, given a certain query
Probabilistic Information Retrieval

Basic model
Given a query q
  ... and an (unknown/uncertain) set of relevant documents D
  ... and irrelevant documents ¬D
Representation of a document: d_i
P(D|d_i) is the probability that d_i is relevant, and P(¬D|d_i) that it is irrelevant
Rank documents in decreasing order of the probability of relevance P(D|q, d_i)
Probabilistic Information Retrieval Models

Binary Independence Model
Binary: documents and queries represented as binary term incidence vectors
Independence: terms are assumed to be independent of each other

Note: Computing the "true" probabilities would not be feasible
Language Models
Basic Idea
Train a probabilistic model M_d for document d, i.e., as many trained models as there are documents
M_d ranks documents by how likely it is that the user's query q was created by the document d
Results are document models (or documents, respectively) which have a high probability of generating the user's query
Language Models
Alternative point of view
Intuitively, each document defines a "language"
What is the probability of generating the given query, given a language model?
What is the probability of generating the given document, given a language model?
Examples for Language Models
Successive Probability of query terms
P(t_1 t_2 ... t_n) = P(t_1) P(t_2|t_1) P(t_3|t_1 t_2) ... P(t_n|t_1 ... t_n−1)

Unigram Language Model
P_uni(t_1 t_2 ... t_n) = P(t_1) P(t_2) ... P(t_n)

Bigram Language Model
P(t_1 t_2 ... t_n) = P(t_1) P(t_2|t_1) P(t_3|t_2) ... P(t_n|t_n−1)

Probability P(t_n): Probabilities are estimated from the term occurrences in the document in question

And many more... A lot more, and more complex, models are available

Note: Smoothing plays an important role in language models
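A unigram query-likelihood model with simple add-alpha (Laplace) smoothing might look like this; the smoothing scheme and the "+1 unseen slot" are illustrative assumptions, and real systems often interpolate with a collection-wide model instead.

```python
from collections import Counter

def unigram_model(document, alpha=1.0):
    # maximum-likelihood unigram estimate with add-alpha (Laplace) smoothing
    counts = Counter(document.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # reserve one slot of mass for unseen terms
    def prob(term):
        return (counts[term.lower()] + alpha) / (total + alpha * vocab_size)
    return prob

def query_likelihood(query, prob):
    # P_uni(t_1 ... t_n) = P(t_1) * ... * P(t_n)
    likelihood = 1.0
    for term in query.split():
        likelihood *= prob(term)
    return likelihood
```

Ranking documents then means sorting them by the likelihood their model assigns to the query.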
Index Construction
Going beyond simple indexing
Term-document matrices are very large
But the number of topics that people talk about is small (in some sense)
  Clothes, movies, politics, ...
Can we represent the term-document space by a lower-dimensional latent space?
Index Construction
Latent Semantic Indexing (LSI)
is based on Latent Semantic Analysis (see KDDM1), which
is based on Singular Value Decomposition (SVD), which
creates a new space (a new orthonormal basis), which
allows mapping to lower dimensions, which
might improve the search performance
Singular Value Decomposition
SVD overview
For an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition, SVD) as follows:
  A = U Σ V^T
  ... U is an m × m unitary matrix
  ... Σ is an m × n diagonal matrix
  ... V^T is an n × n unitary matrix
The columns of U are orthogonal eigenvectors of A A^T
The columns of V are orthogonal eigenvectors of A^T A
The eigenvalues λ_1 ... λ_r of A A^T are also the eigenvalues of A^T A
Singular Value Decomposition
Matrix Decomposition
SVD enables lossy compression of a term-document matrix
  Reduces the dimensionality or the rank
  Arbitrarily reduce the dimensionality by putting zeros in the bottom right of Σ
  This is a mathematically optimal way of reducing dimensions
Singular Value Decomposition
Reduced SVD
If we retain only k singular values and set the rest to 0
Then Σ is k × k, U is m × k, V^T is k × n, and A_k is m × n
This is referred to as the reduced SVD
It is the convenient (space-saving) and usual form for computational applications

Approximation Error
How good (bad) is this approximation?
It is the best possible, measured by the Frobenius norm of the error
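The truncation and its error can be demonstrated on a toy term-document matrix (the matrix values are an illustrative assumption); the Frobenius error of the rank-k approximation equals the root-sum-square of the dropped singular values.

```python
import numpy as np

# toy term-document matrix: one row per term, one column per document
A = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius norm of the error = root-sum-square of the dropped singular values
err = np.linalg.norm(A - A_k, "fro")
```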
Latent Semantic Indexing
Application of SVD - LSI
From the term-document matrix A, we compute the approximation A_k.
There is a row for each term and a column for each document in A_k
Thus documents live in a space of k << r dimensions (these dimensions are not the original axes)
Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
Claim: this is not only the mapping with the "best" approximation to A, but it in fact improves retrieval.
Latent Semantic Indexing
Application of SVD - LSI
A query q is also mapped into this space
... within this space the query q is compared to all the documents
... the documents closest to the query are returned as the search result

LSI - Results
Similar terms map to similar locations in the low-dimensional space
Noise reduction by dimension reduction
Evaluation
Assess the quality of search
Overview: Evaluation
Two basic questions / goals
Are retrieved documents relevant for the user's information need?
Have all relevant documents been retrieved?

In practice, these two goals contradict each other.
Precision & Recall
Evaluation Scenario
Fixed document collection
Number of queries
Set of documents marked as relevant for each query
Main Evaluation Metrics
Precision = #(relevant items retrieved) / #(retrieved items)
Recall = #(relevant items retrieved) / #(relevant items)
F1 = 2 · Precision · Recall / (Precision + Recall)
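The three metrics can be computed directly from the retrieved and relevant sets:

```python
def precision_recall_f1(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```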
Precision & Recall
Advanced Evaluation Measures
Mean Average Precision - MAP
... mean of the average precisions over multiple queries
MAP = (1/|Q|) Σ_{q∈Q} AP_q
AP_q = (1/n) Σ_{k=1}^{n} Precision(k)

(Normalised) discounted cumulative gain (nDCG)
... used where there is a reference ranking of the relevant results
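MAP can be sketched as below. Note this implements the common variant that averages Precision@k only at the ranks where a relevant document appears, dividing by the total number of relevant documents; it is a concrete instance of the AP_q formula above, not the only possible reading.

```python
def average_precision(ranking, relevant):
    # Precision@k at each rank k that holds a relevant document,
    # averaged over the total number of relevant documents
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: list of (ranking, set-of-relevant-documents) pairs, one per query
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```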
Applications & Solutions
Federated Search, Recommender Systems, Open-source & Commercial IR Solutions
Distributed Search - Architectures
Term Based Index Split
Index Creation
The index is split based on dictionary terms
Called Global Inverted Files or Global Index Organization
Example:
1 Terms beginning with A to B
2 Terms beginning with C to G
3 ...

Properties
A single index is split over multiple machines
Each index returns part of its postings to the broker
The broker merges the postings to generate the result
⇒ High network traffic, high load at the broker
Document Based Index Split
Index Creation
The index is split based on document categories
Called local inverted files
Example:
1 Documents from computer science
2 Documents from physics
3 ...

Properties
Each partition is a full index
The broker unifies the result lists
⇒ Re-ranking problem at the broker
Federated Search
Definition
Simultaneous search in multiple search engines
Single user interface or API to access multiple search engines
Examples:
1 Google Scholar
2 Expedia
3 ...

Properties
Unified access to multiple search engines
Each search engine is a full search engine on its own, not only an index
⇒ Re-ranking problem at the broker
⇒ Access control
Recommender systems - Overview
Recommender systems should provide usable suggestions to users
Recommender systems should show new, unknown, but similar documents
Recommender systems can utilize different information sources

Content-based Recommender
Finds similar documents on the basis of document content

Collaborative Filtering
Employs user ratings to find new and useful documents

Knowledge-based Recommender
Makes decisions based on pre-programmed rules

Hybrid Recommender
Combination of two or three of the other approaches
Types of Content-based Recommender
Document-to-Document Recommender
Document-to-User Recommender
Building a Recommender by Using an Index
Document-to-Document Recommender
1 Take the terms from a document
2 Construct a (weighted) query with them
3 Query the index
4 The results are possible candidates for recommendation

Document-to-User Recommender
1 Create a user model from
    Terms typed by the user
    Documents viewed by the user
    ⇒ The simplest user model is just a set of terms
2 Construct a (weighted) query with them
3 Query the index
4 The results are possible candidates for recommendation
Open Source Search Engine
Apache Lucene
Very popular, high quality
Backbone of many other search engines, e.g. Apache Solr (fully), Elasticsearch
  Lucene is the core search engine library
  Solr provides a search solution (service & configuration infrastructure, distribution, indexing, etc.)
  Elasticsearch provides a search solution and document database with a focus on distribution
Implemented in Java, with bindings for many other programming languages
Other Open-Source Search Engines
Xapian
Written in C++, support for many (programming) languages
Used by many open-source projects

Terrier
Many algorithms
Academic background

Sphinx
Written in C++
Also offers features comparable to a database

Whoosh
For Pythonistas

Lemur & Indri
Language models
Academic background
What customers want
Indexing pipeline
And also: user interface, access control mechanisms, logging, etc.
Commercial Search Engines
Dassault Exalead
Dassault acquired the Exalead project
Offers specialised features for CAD/CAE/CAM files

Sinequa
Expert search
Creates new views, i.e., does not only show a result list, but combines results into new units

Attivio
Use of SQL statements in queries

IntraFind
Elasticsearch-based commercial search solution

And many, many more...
The End
Next: Pattern Mining