Overview of an Information Retrieval System: Terrier


Ashish


Overview

• Structural view
  – Indexing
  – Retrieval

• Extend

• Setup

• Run


IR Systems

• Terrier
  – Academic / research
  – Open source

• Lucene-Nutch
  – Commercial / research
  – Open source


Terrier

• Developed at the University of Glasgow

• Open source

• OS independent: written in Java

• Easy to learn

• Easy to extend
  – Modular


Subfolders (1)

• etc/
  – Configuration files

• bin/
  – Scripts to compile and run Terrier

• lib/
  – Java libraries: the jar files containing the Terrier system


Subfolders (2)

• src/
  – The Java source files and user-written plugins

• doc/
  – Javadocs for Terrier and for extended components

• var/
  – index/
    • Index files
  – results/
    • Results and evaluation

• share/
  – Shared resources such as stopwords, lexicon, etc.


Indexing


Tokenization

• Identifying words
  – Based on spaces
  – Handling special characters such as -, $, digits, etc.
  – Sometimes a space is not a word separator
    • German, Chinese
  – Agglutinative languages
    • Marathi
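The space-and-punctuation approach can be sketched in a few lines of Python. This is only an illustration, not Terrier's tokeniser: the `tokenize` helper and its regular expression are assumptions made for this example.

```python
import re

# Hypothetical helper, not part of Terrier.
# \w+ matches runs of letters, digits and underscores, so spaces and
# punctuation such as '-' and '$' act as token separators.
TOKEN_RE = re.compile(r"\w+")

def tokenize(text):
    """Split text into lowercase word tokens."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]

tokens = tokenize("State-of-the-art IR costs $20, right?")
print(tokens)
# ['state', 'of', 'the', 'art', 'ir', 'costs', '20', 'right']
```

A pattern this simple does not cover the harder cases listed above: Chinese text has no spaces to split on, and scripts that use combining vowel signs, such as the Devanagari used for Marathi, may be cut in the middle of a word by `\w+`.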


Term Pipelining

• Stemming / finding the root
  – ate -> eat

• Stopword removal
  – is, was, I, in, etc.

• Abbreviations
  – Dr -> Doctor

• Normalisation
  – color vs. colour
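The stages above can be chained so that each term passes through them in turn and is either rewritten or dropped. The sketch below uses toy lookup tables in place of a real stemmer and stopword list; all names (`STOPWORDS`, `run_pipeline`, and so on) are hypothetical and do not correspond to classes in uk.ac.gla.terrier.terms.

```python
# Toy resources, assumed for illustration only.
STOPWORDS = {"is", "was", "i", "in", "the"}
ABBREVIATIONS = {"dr": "doctor"}
NORMALISE = {"colour": "color"}
ROOTS = {"ate": "eat"}  # a lookup table standing in for a real stemmer

def stopword_stage(term):
    """Return None to drop a stopword, otherwise pass the term on."""
    return None if term in STOPWORDS else term

def lookup_stage(table):
    """Build a stage that rewrites a term if it appears in the table."""
    return lambda term: table.get(term, term)

PIPELINE = [
    stopword_stage,
    lookup_stage(ABBREVIATIONS),
    lookup_stage(NORMALISE),
    lookup_stage(ROOTS),
]

def run_pipeline(terms, stages=PIPELINE):
    """Push every term through all stages; dropped terms are skipped."""
    output = []
    for term in terms:
        for stage in stages:
            term = stage(term)
            if term is None:
                break
        if term is not None:
            output.append(term)
    return output

print(run_pipeline(["dr", "smith", "ate", "in", "the", "colour", "lab"]))
# ['doctor', 'smith', 'eat', 'color', 'lab']
```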


Index – data structures

• Direct Index – stores the identifiers of the terms that appear in each document and the corresponding frequencies.

• Document Index – stores information about each document, for example the document length and identifier.

• Inverted Index – stores the posting lists, i.e. the identifiers of the documents and their corresponding term frequencies.

• Lexicon – stores the collection vocabulary and the corresponding document and term frequencies.
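To make the relationship between the four structures concrete, here is a minimal in-memory sketch that builds all of them from already tokenised documents. It assumes plain Python dictionaries; Terrier's on-disk formats differ, and the `build_index` function and its field names are made up for this example.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """docs maps a document name to its list of terms."""
    term_ids = {}                 # term -> integer term id
    direct = {}                   # doc id -> {term id: frequency}
    document_index = {}           # doc id -> name and length
    inverted = defaultdict(list)  # term id -> [(doc id, frequency), ...]
    lexicon = {}                  # term -> id, document and term frequency

    for doc_id, (name, terms) in enumerate(docs.items()):
        counts = Counter(terms)
        document_index[doc_id] = {"name": name, "length": len(terms)}
        direct[doc_id] = {}
        for term, tf in counts.items():
            tid = term_ids.setdefault(term, len(term_ids))
            direct[doc_id][tid] = tf
            inverted[tid].append((doc_id, tf))
            entry = lexicon.setdefault(term, {"id": tid, "df": 0, "tf": 0})
            entry["df"] += 1   # one more document contains the term
            entry["tf"] += tf  # occurrences across the whole collection
    return direct, document_index, dict(inverted), lexicon

docs = {"d1": ["terrier", "index", "index"], "d2": ["terrier", "query"]}
direct, document_index, inverted, lexicon = build_index(docs)
print(lexicon["index"])    # {'id': 1, 'df': 1, 'tf': 2}
print(inverted[0])         # [(0, 1), (1, 1)]
```

The direct index answers "which terms are in this document?", while the inverted index answers "which documents contain this term?", which is the lookup retrieval needs.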


Extending the indexing process

• Tokenisation
  – uk.ac.gla.terrier.indexing.*Document

• Term pipelines
  – uk.ac.gla.terrier.terms.*


Retrieval

[Figure: retrieval diagram; labels: "query", "Index"]


Scoring and Ranking

• Score S(di, qj): how well document di matches query qj

• Documents are ranked (sorted) according to the score

• Presented to the user in decreasing order of S(di, qj)
  – Scoring model
    • e.g. TF-IDF
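As an illustration of scoring and ranking, the sketch below uses a textbook TF-IDF variant, S(d, q) = sum over query terms t of tf(t, d) * log(N / df(t)). This formula is an assumption chosen for simplicity and is not necessarily the TF-IDF formulation that Terrier implements; the `tf_idf_rank` function is hypothetical.

```python
import math
from collections import Counter

def tf_idf_rank(docs, query):
    """Return (document, score) pairs sorted by decreasing score.

    docs maps a document name to its list of terms; query is a list of terms.
    """
    n_docs = len(docs)
    counts = {name: Counter(terms) for name, terms in docs.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts.values() for term in c)

    scores = {}
    for name, c in counts.items():
        score = sum(c[t] * math.log(n_docs / df[t]) for t in query if t in c)
        if score > 0:
            scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "d1": ["president", "election", "election"],
    "d2": ["president", "visit"],
    "d3": ["cricket", "match"],
}
for name, score in tf_idf_rank(docs, ["president", "election"]):
    print(name, round(score, 3))
# d1 2.603
# d2 0.405
```

Document d3 shares no term with the query, so it receives no score and does not appear in the ranking.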


Matching Process

• Input
  – Query and weighting model

• Output
  – Ranked result set

• Weighting model
  – Hiemstra-LM

• Uses
  – Term Score Modifiers
    • uk.ac.gla.terrier.matching.tsms
  – Document Score Modifiers
    • uk.ac.gla.terrier.matching.dsms

• Extend
  – uk.ac.gla.terrier.matching.models


Input

• Corpus
  – Very large set of documents

• Topics
  – Queries representing the user's need

• Relevance results
  – Set of judgments per query per document


Document format (sample from the Marathi corpus, shown in English translation)

<doc>
<docno>Mumbai85B7FB3BB9.htm.txt</docno>
<text>
Governor meets the President and Vice President

Mumbai, 21st - Governor S. M. Krishna today met President Pratibha Patil and Vice President Dr. Hamid Ansari in Delhi. The Governor called on them and congratulated them following their election to the offices of President and Vice President. After meeting Mrs. Pratibha Patil at Rashtrapati Bhavan this afternoon, he went to Haryana Bhavan and met the Vice President.
</text>
</doc>


Topic format (Marathi topic, shown in English translation)

<top>
<num>5
<title>Indian presidential election 2007
<desc>Issues and events related to India's presidential election.
<narr>Relevant documents should contain information about the presidential election, the dirty political mudslinging against the candidates, and the election of Pratibha Patil as India's first woman President after defeating her nearest rival.
</top>
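Topic fields such as <num> and <title> have no closing tags, so each field runs until the next field tag or the closing </top>. A small sketch of reading such a topic follows; it is not Terrier's own topic reader, and the `parse_topic` function and the sample text are assumptions for illustration.

```python
import re

# Each field starts at its tag and ends just before the next field tag
# or the closing </top> tag.
FIELD_RE = re.compile(
    r"<(num|title|desc|narr)>(.*?)(?=<(?:num|title|desc|narr)>|</top>)",
    re.S,
)

def parse_topic(block):
    """Return a dict of field name -> text with whitespace collapsed."""
    return {tag: " ".join(text.split()) for tag, text in FIELD_RE.findall(block)}

sample = """<top>
<num>5
<title>Indian presidential election 2007
<desc>Issues and events related to the election.
<narr>Relevant documents
mention the election.
</top>"""

topic = parse_topic(sample)
print(topic["num"], "|", topic["title"])
# 5 | Indian presidential election 2007
```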


Relevance Judgement

.
.
.
13 Q0 1100019.cms.txt 0
13 Q0 1102914.cms.txt 0
13 Q0 1104294.cms.txt 0
13 Q0 1104312.cms.txt 1
13 Q0 1110418.cms.txt 0
13 Q0 1123377.cms.txt 0
13 Q0 1124813.cms.txt 1
13 Q0 1126006.cms.txt 1
.
.
.

Fields, left to right: query id, iteration (Q0), document id, relevance judgement (0 or 1)
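A relevance judgement file of this shape is easy to load into a lookup table. The sketch below assumes four whitespace-separated fields per line (query id, an iteration field such as Q0, document id, judgement) and skips anything else; the `parse_qrels` function is hypothetical and not part of Terrier.

```python
from collections import defaultdict

def parse_qrels(lines):
    """Return {query id: {document id: judgement}} from qrels lines."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, judgement = parts
        qrels[query_id][doc_id] = int(judgement)
    return dict(qrels)

sample = [
    "13 Q0 1100019.cms.txt 0",
    "13 Q0 1104312.cms.txt 1",
    "13 Q0 1124813.cms.txt 1",
]
qrels = parse_qrels(sample)
relevant = sorted(doc for doc, rel in qrels["13"].items() if rel > 0)
print(relevant)
# ['1104312.cms.txt', '1124813.cms.txt']
```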


Configuration files

• etc/terrier.properties
  – UTF-8 settings, stemmer, index name, etc.

• etc/trec.topic.list
  – Set the topics/queries

• etc/trec.models
  – Set the matching/retrieval model

• etc/trec.qrels
  – Set the relevance judgement file path


Running Terrier

• Already compiled

• To recompile
  – bin/compile.sh

• Set up corpus
  – bin/trec_setup.sh "<corpus folder path>"

• Index
  – bin/trec_terrier.sh -i

• Retrieval
  – bin/trec_terrier.sh -r

• Evaluate
  – bin/trec_terrier.sh -e "<result file>"


Reference

• http://ir.dcs.gla.ac.uk/terrier/doc/

• http://ir.dcs.gla.ac.uk/wiki/Terrier