Overview of an Information Retrieval System: Terrier
Ashish
Overview
• Structural view
  – Indexing
  – Retrieval
• Extend
• Setup
• Run
IR Systems
• Terrier
  – Academic/research
  – Open source
• Lucene-Nutch
  – Commercial/research
  – Open source
Terrier
• Developed at the University of Glasgow
• Open source
• OS independent: written in Java
• Easy to learn
• Easy to extend: modular design
Subfolders (1)
• etc/
  – Configuration files
• bin/
  – Scripts to compile and run Terrier
• lib/
  – Java libraries: jar files containing the Terrier system
Subfolders (2)
• src/
  – Java source files, user-written plugins
• doc/
  – Javadocs for Terrier and for extended components
• var/
  – index/
    • Index files
  – results/
    • Results and evaluation
• share/
  – Shared resources such as stopwords, lexicon, etc.
Indexing
Tokenization
• Identifying words
  – Based on spaces
  – Handling special characters such as -, $, digits, etc.
  – Sometimes a space is not the word separator
    • German, Chinese
  – Agglutinative languages
    • Marathi
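The space-based case above can be sketched in a few lines. This is only an illustration, not Terrier's tokeniser: the `tokenize` helper is hypothetical, and it treats every non-word character (including - and $) as a separator. It is suitable only for space-delimited text in scripts such as Latin; Chinese, or Indic scripts such as Marathi, would need a different approach.

```python
import re

# Hypothetical pattern: a token is a run of letters, digits or underscores.
# Everything else (spaces, '-', '$', '.') acts as a separator.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Return the lowercased word tokens found in text."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


if __name__ == "__main__":
    print(tokenize("State-of-the-art retrieval costs $5 in 2007."))
```

Whether digits and hyphenated compounds should be kept, split or dropped is a design decision that depends on the collection; the sketch simply keeps digits and splits on hyphens.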
Term Pipelining
• Stemming / finding the root
  – ate -> eat
• Stopword removal
  – is, was, I, in, etc.
• Abbreviations
  – Dr -> Doctor
• Normalisation
  – color vs. colour
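A term pipeline can be pictured as a chain of small stages, each of which either rewrites a term or drops it. The sketch below uses tiny lookup tables built from the examples above; all names (`run_pipeline`, the tables, the stage functions) are hypothetical and do not correspond to Terrier classes, and a real stemmer would use rules rather than a dictionary.

```python
# Toy lookup tables taken from the examples in the slide (assumptions, not real resources).
STOPWORDS = {"is", "was", "i", "in"}
ABBREVIATIONS = {"dr": "doctor"}
SPELLING_VARIANTS = {"colour": "color"}
ROOT_FORMS = {"ate": "eat"}


def remove_stopword(term):
    """Return None to signal that the term should be dropped."""
    return None if term in STOPWORDS else term


def expand_abbreviation(term):
    return ABBREVIATIONS.get(term, term)


def normalise(term):
    return SPELLING_VARIANTS.get(term, term)


def to_root(term):
    return ROOT_FORMS.get(term, term)


PIPELINE = [remove_stopword, expand_abbreviation, normalise, to_root]


def run_pipeline(terms):
    """Pass each term through every stage in order; dropped terms are skipped."""
    output = []
    for term in terms:
        term = term.lower()
        for stage in PIPELINE:
            term = stage(term)
            if term is None:
                break
        if term is not None:
            output.append(term)
    return output


if __name__ == "__main__":
    print(run_pipeline(["Dr", "Smith", "was", "in", "colour", "ate"]))
```

Keeping each stage as a separate function means stages can be reordered or replaced independently, which is the point of a pipeline design.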
Index – data structures
• Direct Index – stores the identifiers of the terms that appear in each document and the corresponding frequencies.
• Document Index – stores information about each document, for example the document length and identifier.
• Inverted Index – stores the posting lists, i.e. the identifiers of the documents and their corresponding term frequencies.
• Lexicon – stores the collection vocabulary and the corresponding document and term frequencies.
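To make the four structures concrete, the following sketch builds in-memory versions of them from a toy collection. It only mirrors the descriptions above; the function name `build_index` and the dictionary layouts are assumptions for illustration and say nothing about how Terrier stores its index files on disk.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """docs maps a document number (string) to its list of terms."""
    term_ids = {}                       # term -> integer term identifier
    document_index = []                 # docid -> (docno, document length)
    direct_index = []                   # docid -> {termid: frequency in that document}
    inverted_index = defaultdict(dict)  # termid -> {docid: frequency} (posting list)

    for docid, (docno, terms) in enumerate(docs.items()):
        document_index.append((docno, len(terms)))
        direct_index.append({})
        for term, tf in Counter(terms).items():
            termid = term_ids.setdefault(term, len(term_ids))
            direct_index[docid][termid] = tf
            inverted_index[termid][docid] = tf

    # Lexicon: vocabulary with document frequency and collection term frequency.
    lexicon = {
        term: {
            "termid": termid,
            "document_frequency": len(inverted_index[termid]),
            "term_frequency": sum(inverted_index[termid].values()),
        }
        for term, termid in term_ids.items()
    }
    return document_index, direct_index, dict(inverted_index), lexicon


if __name__ == "__main__":
    sample = {"d1": ["president", "election", "president"], "d2": ["election", "result"]}
    for structure in build_index(sample):
        print(structure)
```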
Extending the indexing process
• Tokenisation
  – uk.ac.gla.terrier.indexing.*Document
• Term pipelines
  – uk.ac.gla.terrier.terms.*
Retrieval
[Figure: retrieval diagram with the labels "query" and "Index"]
Scoring and Ranking
• Score: S(d_i, q_j)
• Documents are ranked (sorted) according to their score
• Presented to the user in decreasing order of S(d_i, q_j)
  – Scoring model
    • e.g. TF-IDF
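As an illustration of scoring and ranking, the sketch below computes S(d_i, q_j) with one simple TF-IDF variant (raw term frequency times log(N / document frequency)) and sorts the documents by decreasing score. This particular formula and the function name `tf_idf_rank` are assumptions chosen for brevity; they are not necessarily the TF-IDF formulation that Terrier implements.

```python
import math
from collections import Counter


def tf_idf_rank(query_terms, docs):
    """docs maps docno -> list of terms. Returns (docno, score) pairs, best first."""
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    scores = {}
    for docno, terms in docs.items():
        tf = Counter(terms)
        scores[docno] = sum(
            tf[term] * math.log(n_docs / doc_freq[term])
            for term in query_terms
            if doc_freq[term] > 0
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    collection = {
        "d1": ["president", "election", "president"],
        "d2": ["election", "result"],
        "d3": ["weather", "report"],
    }
    for docno, score in tf_idf_rank(["president", "election"], collection):
        print(docno, round(score, 3))
```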
Matching Process
• Input
  – Query and weighting model
• Output
  – Ranked result set
• Weighting model
  – Hiemstra_LM
• Uses
  – Term Score Modifiers
    • uk.ac.gla.terrier.matching.tsms
  – Document Score Modifiers
    • uk.ac.gla.terrier.matching.dsms
• Extend
  – uk.ac.gla.terrier.matching.models
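The matching process described above (query and weighting model in, ranked result set out, with score modifiers applied along the way) can be sketched as a function that takes the weighting model and the modifiers as parameters. Everything here is hypothetical: `match`, `raw_tf_model` and `boost_docs` are illustrative names, the weighting model is a trivial stand-in, and term score modifiers are omitted for brevity. It is not Terrier's Java API.

```python
def match(query_terms, postings, weighting_model, doc_score_modifiers=()):
    """postings maps term -> {docno: tf}. Returns (docno, score) pairs, best first."""
    scores = {}
    for term in query_terms:
        posting_list = postings.get(term, {})
        for docno, tf in posting_list.items():
            # The weighting model scores one term occurrence record at a time.
            scores[docno] = scores.get(docno, 0.0) + weighting_model(tf, len(posting_list))
    # Document score modifiers adjust the accumulated scores before ranking.
    for modifier in doc_score_modifiers:
        scores = modifier(scores)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def raw_tf_model(tf, document_frequency):
    """Stand-in weighting model: the score is just the term frequency."""
    return float(tf)


def boost_docs(prefix, factor):
    """Build a modifier that multiplies the score of documents whose id starts with prefix."""
    def modifier(scores):
        return {d: (s * factor if d.startswith(prefix) else s) for d, s in scores.items()}
    return modifier


if __name__ == "__main__":
    toy_postings = {"election": {"d1": 1, "d2": 2}, "president": {"d1": 2}}
    print(match(["president", "election"], toy_postings, raw_tf_model))
    print(match(["president", "election"], toy_postings, raw_tf_model, [boost_docs("d2", 2.0)]))
```

Passing the model and modifiers in as arguments is one way to get the kind of extensibility the slide describes: a new model or modifier can be swapped in without touching the matching loop.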
Input
• Corpus
  – Very large set of documents
• Topics
  – Queries representing the user's information need
• Relevance results
  – Set of judgments per query per document
Document format (sample from a Marathi corpus, translated to English)
<doc>
<docno>Mumbai85B7FB3BB9.htm.txt</docno>
<text>
Governor meets the President and the Vice-President
Mumbai, 21st: Governor S. M. Krishna today met President Pratibha Patil and Vice-President Dr. Hamid Ansari in Delhi. The Governor called on them and congratulated them following their election as President and Vice-President. This afternoon, after meeting Pratibha Patil at Rashtrapati Bhavan, he went to Haryana Bhavan and met the Vice-President.
</text>
</doc>
Topic format (sample translated from Marathi)
<top>
<num>5
<title>Indian presidential election 2007
<desc>Issues and events related to India's presidential election.
<narr>Relevant documents should contain information about the presidential election, the dirty political mudslinging against the candidates, and Pratibha Patil's election as India's first woman President after defeating her nearest rival.
</top>
Relevance judgement format
...
13 Q0 1100019.cms.txt 0
13 Q0 1102914.cms.txt 0
13 Q0 1104294.cms.txt 0
13 Q0 1104312.cms.txt 1
13 Q0 1110418.cms.txt 0
13 Q0 1123377.cms.txt 0
13 Q0 1124813.cms.txt 1
13 Q0 1126006.cms.txt 1
...
• Fields on each line: query id, Q0, document id, relevance judgement (0 or 1)
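Relevance judgement lines of this kind (query id, Q0, document id, then 0 or 1) are easy to load and use for a quick check of a ranking. The sketch below is a hypothetical helper, not Terrier's evaluation code; the three judgement lines are taken from the sample above, while the ranked list passed to `precision_at_k` is invented purely for illustration.

```python
def parse_qrels(lines):
    """Return {query_id: {docno: relevance}} from 'query-id Q0 docno relevance' lines."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, docno, relevance = parts
        qrels.setdefault(query_id, {})[docno] = int(relevance)
    return qrels


def precision_at_k(ranked_docnos, judged, k):
    """Fraction of the top k documents judged relevant; unjudged documents count as 0."""
    top = ranked_docnos[:k]
    return sum(1 for docno in top if judged.get(docno, 0) > 0) / k


if __name__ == "__main__":
    sample_lines = [
        "13 Q0 1100019.cms.txt 0",
        "13 Q0 1104312.cms.txt 1",
        "13 Q0 1124813.cms.txt 1",
    ]
    judgements = parse_qrels(sample_lines)
    # Hypothetical system output for query 13, best document first.
    ranking = ["1104312.cms.txt", "1100019.cms.txt"]
    print(precision_at_k(ranking, judgements["13"], 2))
```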
Configuration files
• etc/terrier.properties
  – UTF-8 settings, stemmer, index name, etc.
• etc/trec.topics.list
  – Set topics/queries
• etc/trec.models
  – Set matching/retrieval model
• etc/trec.qrels
  – Set relevance judgement file path
Running Terrier
• Already compiled
• To recompile
  – bin/compile.sh
• Set up corpus
  – bin/trec_setup.sh "<corpus folder path>"
• Index
  – bin/trec_terrier.sh -i
• Retrieval
  – bin/trec_terrier.sh -r
• Evaluate
  – bin/trec_terrier.sh -e "<result file>"
Reference
• http://ir.dcs.gla.ac.uk/terrier/doc/
• http://ir.dcs.gla.ac.uk/wiki/Terrier