Information Retrieval and Evaluation
Knowledge Discovery and Data Mining 2 (VU) (707.004)
Heimo Gursch, Roman Kern
ISDS, TU Graz
2019-04-04
Heimo Gursch, Roman Kern (ISDS, TU Graz) Information Retrieval and Evaluation 2019-04-04 1 / 66
Outline
1 Basics
    Foundations
    Inverted Index
    Text Preprocessing
2 Advanced Concepts
    Relevance Feedback
    Query Processing
    Learning-To-Rank
    Probabilistic IR
    Latent Semantic Indexing
3 Evaluation
4 Applications & Solutions
    Distributed & Federated Search
    Content-based Recommender Systems
    Open Source Solutions
    Commercial Solutions
Basics
Inverted Index & Text Preprocessing
Foundations
Definition
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

The term "term" is very common in Information Retrieval and typically refers to single words (also common: bi-grams, phrases, ...)

Basic assumptions of Information Retrieval
Collection: Fixed set of documents
Goal: Retrieve documents with information that is relevant to the user's information need and help the user complete a task.
Structured, Semi-Structured Data
Structured data
Format and type of information is known
IR task: normalize representations between different sources
Example: SQL databases

Semi-structured data
Combination of structured and unstructured data
IR task: extract information from the unstructured part and combine it with the structured information
Example: email, documents with meta-data
Unstructured data
Unstructured data
Format of information is not known; the information is hidden in (human-readable) text
IR task: extract any information at all
Example: blog posts, Wikipedia articles
This lecture focuses on unstructured data.
Overview Indexed Searching
Split process
Indexing: Convert all input documents into a data structure suitable for fast searching
Searching: Take an input query and map it onto the pre-processed data structure to retrieve the matching documents

Inverted index
Reverses the relationship between documents and contained terms (words)
Central data structure of a search engine
Information Stored in an Inverted Index
Dictionary
Sorted list of all terms found in all documents

Postings
Information about the occurrence of a term, consisting of:
Document (DocID): Document identifier, i.e., in which document the term was found
Position(s) (Pos): Location(s) of the term in the original document
Term Frequency (TF): How often the term occurs in the document*

* The term frequencies could also be calculated by counting the position entries. As computing time is more precious than memory, the term frequencies are stored as well.
Example
Document 1: Sepp was stopping off in Hawaii. Sepp is a co-worker of Heidi.
Document 2: Heidi stopped going to Hawaii. Heidi is a co-worker of Sepp.

Dictionary → Postings
co-worker → (Doc:1; TF:1; Pos:10); (Doc:2; TF:1; Pos:9)
going → (Doc:2; TF:1; Pos:3)
Hawaii → (Doc:1; TF:1; Pos:6); (Doc:2; TF:1; Pos:5)
Heidi → (Doc:1; TF:1; Pos:12); (Doc:2; TF:2; Pos:1,6)
Sepp → (Doc:1; TF:2; Pos:1,7); (Doc:2; TF:1; Pos:11)
stopped → (Doc:2; TF:1; Pos:2)
stopping off → (Doc:1; TF:1; Pos:3)
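The construction above can be sketched in a few lines of Python. This is a minimal in-memory version under simplifying assumptions: plain whitespace tokenization only, so multi-word terms such as "stopping off" and punctuation handling are left out.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: mapping DocID -> text; positions are 1-based as in the example
    occurrences = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            occurrences[term][doc_id].append(pos)
    # dictionary entry -> postings list of (DocID, TF, positions)
    return {term: [(doc_id, len(positions), positions)
                   for doc_id, positions in sorted(postings.items())]
            for term, postings in occurrences.items()}

docs = {1: "Sepp was stopping off in Hawaii",
        2: "Heidi stopped going to Hawaii"}
index = build_inverted_index(docs)
```

Storing the TF redundantly next to the positions mirrors the footnote above: memory is traded for lookup speed.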
Properties
Properties
Search for terms
Formulate boolean queries
Search for phrases (i.e., terms in a particular order)
Information on where the term occurred in the document
Possible creation of snippets
Documents can be ranked by the number of terms they contain

Keep in mind that an inverted index
... takes up additional space, as not only the original documents but also the inverted index must be stored
... must be kept up to date
Text Preprocessing
To make a text indexable, a couple of steps are necessary:
1 Detect the language of the text
2 Detect sentence boundaries
3 Detect term (word) boundaries
4 Stemming and normalization
5 Remove stop words
Language Detection
Letter Frequencies
Percentage of occurrence
Letter  English  German
A        8.17     6.51
E       12.70    17.40
I        6.97     7.55
O        7.51     2.51
U        2.76     4.35

N-Gram Frequencies
Better, more elaborate methods use statistics over more than one letter, e.g. statistics over two, three or even more consecutive letters.
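A letter-frequency detector can be sketched by comparing the observed frequencies against per-language reference profiles. As an assumption for illustration, the profiles below contain only the five vowels from the table above; a real detector would profile all letters, or character n-grams as noted.

```python
from collections import Counter

# vowel frequencies (percent) from the table above; a real detector
# would use all letters or character n-grams
PROFILES = {
    "en": {"a": 8.17, "e": 12.70, "i": 6.97, "o": 7.51, "u": 2.76},
    "de": {"a": 6.51, "e": 17.40, "i": 7.55, "o": 2.51, "u": 4.35},
}

def guess_language(text):
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return "unknown"
    counts = Counter(letters)
    freq = {c: 100.0 * n / len(letters) for c, n in counts.items()}
    # pick the profile with the smallest squared deviation
    def error(profile):
        return sum((freq.get(c, 0.0) - p) ** 2 for c, p in profile.items())
    return min(PROFILES, key=lambda lang: error(PROFILES[lang]))
```

On such short profiles the guess is unreliable for short inputs; n-gram statistics over larger samples are far more robust.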
Sentence Detection
First approach
Every period marks the end of a sentence

Problem
Periods also mark abbreviations, decimal points, email addresses, etc.

Utilize other features
Is the next letter capitalized?
Is the term in front of the period a known abbreviation?
Is the period surrounded by digits?
...
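The features above can be combined into a simple rule-based splitter. This is a sketch only; the abbreviation list is a toy assumption, and real sentence detectors are usually trained models.

```python
import re

ABBREVIATIONS = {"Dr", "Prof", "etc", "e.g", "i.e"}  # toy list, an assumption

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"\.", text):
        i = match.start()
        words = text[start:i].split()
        before = words[-1] if words else ""
        after = text[i + 1:].lstrip()
        # decimal point: digits on both sides of the period
        if i > 0 and text[i - 1].isdigit() and after[:1].isdigit():
            continue
        # known abbreviation in front of the period
        if before.rstrip(".") in ABBREVIATIONS:
            continue
        # next letter capitalised (or end of text) -> treat as a boundary
        if not after or after[0].isupper():
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```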
Tokenization – Problem
Goal
Split sentences into tokens
Throw away punctuation

Possible Pitfalls
White spaces are no safe delimiter for word boundaries (e.g. New York)
Non-alphanumerical characters can separate words, but need not (e.g. co-worker)
Different forms (e.g. white space vs. whitespace)
German compound nouns (e.g. Donaudampfschiffahrtsgesellschaft, meaning Danube Steamboat Shipping Company)
Tokenization – Solutions
Supervised Learning
Train a model with annotated training data, use the trained model to tokenize unknown text
Hidden Markov models and conditional random fields are commonly used

Dictionary Approach
Build a dictionary (i.e. list) of tokens
Go over the sentence and always take the longest fitting token (greedy algorithm!)
Remark: Works well for Asian languages without white spaces and with short words. Problematic for European languages
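The greedy longest-match idea can be sketched as follows; the dictionary is hypothetical, and single characters act as a fallback when nothing matches.

```python
def greedy_tokenize(text, dictionary):
    # longest-match-first segmentation; single characters are the fallback
    text = text.replace(" ", "")  # pretend there are no reliable word separators
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

vocab = {"in", "new", "york", "newyork"}  # hypothetical token dictionary
```

The output also illustrates the greedy pitfall mentioned above: with this vocabulary, "in new york" segments to ["in", "newyork"], preferring the longest token even when "new" + "york" might be the intended reading.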
Normalization
Some definitions
Token: Sequence of characters representing a useful semantic unit, instance of a type
Type: Common concept for tokens, element of the vocabulary
Term: Type that is stored in the dictionary of an index

Task
1 Map all possible tokens to the corresponding type
2 Store all representing terms of one type in the dictionary

Solutions
Provide a list of different representations
Provide rules for mapping representations to the main form
Map different spelling versions by using a phonetic algorithm
Example
Stemming & Lemmatization
Definition
Goal of both: Reduce different grammatical forms of a word to their common infinitive*
Stemming: Usage of heuristics to chop off or replace the last part of words (Example: Porter's algorithm).
Lemmatization: Usage of proper grammatical rules to recreate the infinitive.

* The common infinitive is the form in which a word is written in a dictionary. This form is called lemma, hence the name lemmatization.

Examples
Map do, doing, done to the common infinitive do
Map going, went, gone to go
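A crude suffix-stripping stemmer in the spirit of (but much simpler than) Porter's algorithm can be sketched as below; the suffix rules are illustrative assumptions, while the real algorithm applies several rule passes guarded by stem-measure conditions.

```python
# ordered suffix rules; longer suffixes must be tried first
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # keep at least three characters of stem to avoid mangling short words
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word
```

Note the limits of such heuristics: this sketch reduces "stopping" to "stopp", where Porter's additional rules would also undouble the consonant.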
Stop Words
Definition: Extremely common words that appear in nearly every text
Problem: As stop words are so common, their occurrence does not characterise a text
Solution: Just drop them, i.e., do not put them in the index

Common stop words
Write a list of stop words (the list might be topic-specific!)
Usual suspects: articles (a, an, the), conjunctions (and, or, but, ...), pre- and postpositions (in, for, from), etc.

Solution
Stop word list: Ignore words that are given on a list (black list)
Problem: Special names and phrases (The Who, Let It Be, ...)
Solution: Make another list... (white list)
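A black list plus a phrase white list might be sketched as below; both lists are toy assumptions, and the exact semantics of the white list (here: a whole token sequence matching a protected phrase survives intact) is one possible design among several.

```python
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "in", "for", "from"}
PHRASE_WHITELIST = {"the who", "let it be"}  # phrases kept despite stop words

def remove_stop_words(tokens):
    # whitelisted phrases survive as a whole; otherwise drop blacklisted terms
    if " ".join(tokens).lower() in PHRASE_WHITELIST:
        return tokens
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```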
Advanced Concepts
Relevance Feedback, Query Processing, ...
Relevance Feedback
Integrate feedback from the user
Given a search result for a user's query
... allow the user to rate the retrieved results (good vs. not-so-good match)
This information can then be used to re-rank the results to match the user's preference
Often automatically inferred from the user's behaviour using the search query logs
... and ultimately improve the search experience
The query logs (optimally) contain the queries plus the items the user has clicked on (the user’s session)
Relevance Feedback
Pseudo relevance feedback
No user interaction is needed for pseudo relevance feedback
... the first n results are simply assumed to be a good match
As if the user had clicked on the first results and marked them as good matches
→ often improves the quality of the search results
Pseudo relevance feedback is sometimes also called blind relevance feedback
Query Reformulation
Modify the query during or after the user interaction
Query suggestion & completion
(Automatic) query correction
(Automatic) query expansion
Query Suggestion
The user starts typing a query
... and automatically gets suggestions
→ to complete the currently typed word
→ to complete a whole phrase (e.g. the next word)
→ to suggest related queries
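Prefix completion over a log of past queries can be sketched with a sorted list and binary search; the query log here is hypothetical, and production systems would rank completions by popularity rather than return them alphabetically.

```python
import bisect

def complete(prefix, past_queries):
    # return all logged queries that start with the typed prefix
    queries = sorted(past_queries)
    lo = bisect.bisect_left(queries, prefix)
    hi = bisect.bisect_left(queries, prefix + "\uffff")  # just past the prefix range
    return queries[lo:hi]
```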
Query Correction
The user entered a query, which may contain spelling errors
... automatically correct or suggest possible rectified queries
... also known as Query Reformulation
→ use other users' behaviour (harvest query logs)
→ use the frequency of terms found in the indexed documents (and the similarity with the entered words)

Notebook with various distances (Levenshtein being the best known): http://nbviewer.ipython.org/gist/MajorGressingham/7691723
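The Levenshtein distance mentioned above counts the minimum number of single-character edits between two strings; a standard dynamic-programming implementation:

```python
def levenshtein(a, b):
    # dynamic programming over insertions, deletions and substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]
```

A corrector would then suggest indexed or logged terms within a small distance of the entered word.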
Using Google as Spell Checker
Query Expansion
The user has entered a query
... automatically add (related) words to the query
→ Global query expansion: only use the query plus corpus or background knowledge
→ Local query expansion: conduct the search and analyse the search results
Global Query Expansion
Uses only the query and background knowledge
Example: Use of a thesaurus
  Query: [buy car]
  Use synonyms from the thesaurus
  Expanded query: [(buy OR purchase) (car OR auto)]
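The thesaurus example can be sketched directly; the tiny thesaurus is an assumption for illustration.

```python
THESAURUS = {"buy": ["purchase"], "car": ["auto"]}  # toy thesaurus

def expand_query(terms):
    parts = []
    for term in terms:
        synonyms = THESAURUS.get(term, [])
        if synonyms:
            parts.append("(" + " OR ".join([term] + synonyms) + ")")
        else:
            parts.append(term)
    return " ".join(parts)
```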
Local query expansion
Uses the query plus the search results
... conceptually similar to pseudo relevance feedback
Look for terms in the first n search results and add them to the query
Re-run the query with the added terms
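A minimal sketch of this idea, assuming the top results are given as token lists and that "most frequent new term" is a good enough selection criterion (real systems weight candidate terms more carefully):

```python
from collections import Counter

def local_expansion(query_terms, top_results, extra=2):
    # assume the first n results are relevant; add their most frequent new terms
    counts = Counter(term for doc in top_results
                     for term in doc if term not in query_terms)
    return list(query_terms) + [term for term, _ in counts.most_common(extra)]
```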
More Like This
A common feature of search engines is the more like this search
Given a search result
... the user wants items similar to one of the presented results
This can be seen as taking a document as the query
→ which is a common implementation (but restricting the query to the most discriminative terms)
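Selecting the most discriminative terms is often done with a TF-IDF score; a sketch, assuming the corpus is given as a list of per-document term sets:

```python
import math
from collections import Counter

def top_terms(doc_tokens, corpus, n=3):
    # keep the n most discriminative terms of the document, scored by TF-IDF
    tf = Counter(doc_tokens)
    N = len(corpus)
    def idf(term):
        df = sum(1 for doc in corpus if term in doc)   # document frequency
        return math.log((N + 1) / (df + 1))            # smoothed IDF
    ranked = sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)
    return ranked[:n]
```

The returned terms would then form the (weighted) query sent back to the index.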
Learning-To-Rank
Motivation from Web search
Relevance might be due to a lot of different reasons
  Keywords in the text or the meta-data
  Anchor texts
  PageRank
  ... many others
Basic idea:
  Use evidence (from users)
  ... to learn which source of information
  ... is the most beneficial
  ... by the use of Machine Learning
Learning-To-Rank
Supervised Learning - Setting
Number of queries: Q = {q_1, q_2, ..., q_m}
Set of documents: D = {d_1, d_2, ..., d_N}
  Each query has labelled relevant documents
  With labels Y = {1, 2, ..., l}
  being ranked, i.e. l ≻ l−1 ≻ ... ≻ 1
  Relevant documents for query q_i: D_i = {d_i,1, d_i,2, ..., d_i,n_i}
  Labels for query q_i: y_i = {y_i,1, y_i,2, ..., y_i,n_i}
Training data set: S = {((q_i, D_i), y_i)}, i = 1, ..., m
  where (q_i, D_i) is replaced by representative features
H. Li, “A Short Introduction to Learning to Rank,” IEICE Trans. Inf. Syst., vol. E94-D, no. 1, pp.1–2, 2011.
Learning-To-Rank
Supervised Learning - Data Labelling
Where to take the labelling information from?
  Directly from human judgements
    e.g., perfect, excellent, good, fair, bad
  Indirectly via user behaviour
    e.g., via Web log analysis

Note: Ordinal classification (ordinal regression) vs. ranking, i.e., ranking is about recreating the expected ordering
Learning-To-Rank
Supervised Learning - Main Approaches
Pointwise approach
  Transform the problem into a classification, regression, or ordinal classification problem
  e.g., SVM for ordinal classification, i.e., find parallel hyperplanes
Pairwise approach
  Transform the problem into pairwise classification or pairwise regression
  e.g., Ranking SVM, i.e., classify the order of pairs
Listwise approach
  Directly learn/optimise the ranking
  e.g., SVM-MAP, i.e., optimise a scoring function (goodness of ranking)
Information flow
The classic search model
Probabilistic Information Retrieval

Every transformation can entail information loss:
1 Task to query
2 Query to document

Information loss leads to uncertainty
Probability theory can deal with uncertainty

Basic idea behind probabilistic IR
Model the uncertainty, which originates in the imperfect transformations
Use this uncertainty as a measure of how relevant a document is, given a certain query
Probabilistic Information Retrieval

Basic model
Given a query q
  ... and an (unknown/uncertain) set of relevant documents D
  ... and irrelevant documents ¬D
Representation of a document: d_i
P(D|d_i) is the probability that d_i is relevant, and P(¬D|d_i) that it is irrelevant
Rank documents in decreasing order of the probability of relevance P(D|q, d_i)
Probabilistic Information Retrieval Models

Binary Independence Model
Binary: documents and queries represented as binary term incidence vectors
Independence: terms are assumed to be independent of each other

Note: Computing the "true" probabilities would not be feasible
Language Models
Basic Idea
Train a probabilistic model M_d for document d, i.e., as many trained models as there are documents
M_d ranks documents by how likely it is that the user's query q was created by the document d
Results are document models (or documents, respectively) which have a high probability of generating the user's query
Language Models
Alternative point of view
Intuitively, each document defines a "language"
What is the probability of generating the given query, given a language model?
What is the probability of generating the given document, given a language model?
Examples for Language Models
Successive Probability of query terms
P(t_1 t_2 ... t_n) = P(t_1) P(t_2|t_1) P(t_3|t_1 t_2) ... P(t_n|t_1 ... t_n−1)

Unigram Language Model
P_uni(t_1 t_2 ... t_n) = P(t_1) P(t_2) ... P(t_n)

Bigram Language Model
P(t_1 t_2 ... t_n) = P(t_1) P(t_2|t_1) P(t_3|t_2) ... P(t_n|t_n−1)

Probability P(t_n): Probabilities are estimated from the term occurrences in the document in question

And many more... A lot more, and more complex, models are available

Note: Smoothing plays an important role in language models
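A unigram query-likelihood model with simple add-alpha (Laplace) smoothing might look like this; the smoothing scheme and the "+1 unseen slot" are illustrative assumptions, and real systems often interpolate with a collection-wide model instead.

```python
from collections import Counter

def unigram_model(document, alpha=1.0):
    # maximum-likelihood unigram estimate with add-alpha (Laplace) smoothing
    counts = Counter(document.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # reserve one slot of mass for unseen terms
    def prob(term):
        return (counts[term.lower()] + alpha) / (total + alpha * vocab_size)
    return prob

def query_likelihood(query, prob):
    # P_uni(t_1 ... t_n) = P(t_1) * ... * P(t_n)
    likelihood = 1.0
    for term in query.split():
        likelihood *= prob(term)
    return likelihood
```

Ranking documents then means sorting them by the likelihood their model assigns to the query.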
Index Construction
Going beyond simple indexing
Term-document matrices are very large
But the number of topics that people talk about is small (in some sense)
  Clothes, movies, politics, ...
Can we represent the term-document space by a lower-dimensional latent space?
Index Construction
Latent Semantic Indexing (LSI)
is based on Latent Semantic Analysis (see KDDM1), which
is based on Singular Value Decomposition (SVD), which
creates a new space (a new orthonormal basis), which
allows mapping to lower dimensions, which
might improve the search performance
Singular Value Decomposition
SVD overview
For an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition, SVD) as follows:
  A = U Σ V^T
  ... U is an m × m unitary matrix
  ... Σ is an m × n diagonal matrix
  ... V^T is an n × n unitary matrix
The columns of U are orthogonal eigenvectors of A A^T
The columns of V are orthogonal eigenvectors of A^T A
The eigenvalues λ_1 ... λ_r of A A^T are also the eigenvalues of A^T A
Singular Value Decomposition
Matrix Decomposition
SVD enables lossy compression of a term-document matrix
  Reduces the dimensionality or the rank
  Arbitrarily reduce the dimensionality by putting zeros in the bottom right of Σ
  This is a mathematically optimal way of reducing dimensions
Singular Value Decomposition
Reduced SVD
If we retain only k singular values and set the rest to 0
Then Σ is k × k, U is m × k, V^T is k × n, and A_k is m × n
This is referred to as the reduced SVD
It is the convenient (space-saving) and usual form for computational applications

Approximation Error
How good (bad) is this approximation?
It is the best possible, measured by the Frobenius norm of the error
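The truncation and its error can be demonstrated on a toy term-document matrix (the matrix values are an illustrative assumption); the Frobenius error of the rank-k approximation equals the root-sum-square of the dropped singular values.

```python
import numpy as np

# toy term-document matrix: one row per term, one column per document
A = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius norm of the error = root-sum-square of the dropped singular values
err = np.linalg.norm(A - A_k, "fro")
```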
Latent Semantic Indexing
Application of SVD - LSI
From the term-document matrix A, we compute the approximation A_k.
There is a row for each term and a column for each document in A_k
Thus documents live in a space of k << r dimensions (these dimensions are not the original axes)
Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
Claim: this is not only the mapping with the "best" approximation to A, but it in fact improves retrieval.
Latent Semantic Indexing
Application of SVD - LSI
A query q is also mapped into this space
... within this space the query q is compared to all the documents
... the documents closest to the query are returned as the search result

LSI - Results
Similar terms map to similar locations in the low-dimensional space
Noise reduction by dimension reduction
Evaluation
Assess the quality of search
Overview: Evaluation
Two basic questions / goals
Are retrieved documents relevant for the user's information need?
Have all relevant documents been retrieved?

In practice, these two goals contradict each other.
Precision & Recall
Evaluation Scenario
Fixed document collection
Number of queries
Set of documents marked as relevant for each query
Main Evaluation Metrics
Precision = #(relevant items retrieved) / #(retrieved items)
Recall = #(relevant items retrieved) / #(relevant items)
F1 = 2 · Precision · Recall / (Precision + Recall)
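The three metrics can be computed directly from the retrieved and relevant sets:

```python
def precision_recall_f1(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```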
Precision & Recall
Advanced Evaluation Measures
Mean Average Precision - MAP
... mean of the average precisions over multiple queries
MAP = (1/|Q|) Σ_{q∈Q} AP_q
AP_q = (1/n) Σ_{k=1}^{n} Precision(k)

(Normalised) discounted cumulative gain (nDCG)
... used where there is a reference ranking of the relevant results
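MAP can be sketched as below. Note this implements the common variant that averages Precision@k only at the ranks where a relevant document appears, dividing by the total number of relevant documents; it is a concrete instance of the AP_q formula above, not the only possible reading.

```python
def average_precision(ranking, relevant):
    # Precision@k at each rank k that holds a relevant document,
    # averaged over the total number of relevant documents
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: list of (ranking, set-of-relevant-documents) pairs, one per query
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```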
Applications & Solutions
Federated Search, Recommender Systems, Open-source & Commercial IR Solutions
Distributed Search - Architectures
Term Based Index Split
Index Creation
The index is split based on dictionary terms
Called Global Inverted Files or Global Index Organization
Example:
1 Terms beginning with A to B
2 Terms beginning with C to G
3 ...

Properties
A single index is split over multiple machines
Each index returns part of its postings to the broker
The broker merges the postings to generate the result
⇒ High network traffic, high load at the broker
Document Based Index Split
Index Creation
The index is split based on document categories
Called local inverted files
Example:
1 Documents from computer science
2 Documents from physics
3 ...

Properties
Each partition is a full index
The broker unifies the result lists
⇒ Re-ranking problem at the broker
Federated Search
Definition
Simultaneous search in multiple search engines
Single user interface or API to access multiple search engines
Examples:
1 Google Scholar
2 Expedia
3 ...

Properties
Unified access to multiple search engines
Each search engine is a full search engine on its own, not only an index
⇒ Re-ranking problem at the broker
⇒ Access control
Recommender systems - Overview
Recommender systems should provide usable suggestions to users
Recommender systems should show new, unknown, but similar documents
Recommender systems can utilize different information sources

Content-based Recommender
Finds similar documents on the basis of document content

Collaborative Filtering
Employs user ratings to find new and useful documents

Knowledge-based Recommender
Makes decisions based on pre-programmed rules

Hybrid Recommender
Combination of two or three of the other approaches
Types of Content-based Recommender
Document-to-Document Recommender
Document-to-User Recommender
Building a Recommender by Using an Index
Document-to-Document Recommender
1 Take the terms from a document
2 Construct a (weighted) query with them
3 Query the index
4 The results are possible candidates for recommendation

Document-to-User Recommender
1 Create a user model from
    Terms typed by the user
    Documents viewed by the user
    ⇒ The simplest user model is just a set of terms
2 Construct a (weighted) query with them
3 Query the index
4 The results are possible candidates for recommendation
Open Source Search Engine
Apache Lucene
Very popular, high quality
Backbone of many other search engines, e.g. Apache Solr (fully), Elasticsearch
  Lucene is the core search engine library
  Solr provides a search solution (service & configuration infrastructure, distribution, indexing, etc.)
  Elasticsearch provides a search solution and document database with a focus on distribution
Implemented in Java, with bindings for many other programming languages
Other Open-Source Search Engines
Xapian
Written in C++, support for many (programming) languages
Used by many open-source projects

Terrier
Many algorithms
Academic background

Sphinx
Written in C++
Also offers features comparable to a database

Whoosh
For Pythonistas

Lemur & Indri
Language models
Academic background
What customers want
Indexing pipeline
And also: user interface, access control mechanisms, logging, etc.
Commercial Search Engines
Dassault Exalead
Dassault acquired the Exalead project
Offers specialised features for CAD/CAE/CAM files

Sinequa
Expert search
Creates new views, i.e., does not only show a result list, but combines results into new units

Attivio
Use of SQL statements in queries

IntraFind
Elasticsearch-based commercial search solution

And many, many more...
The End
Next: Pattern Mining