score based ranking of documents

Score-based ranking of the documents

Submitted by: Kriti Khanna (9910103499)

F4, CSE, 4th year

OUTLINE

• Introduction

• Literature Survey

• Objective

• Flowchart

• Implementation

• Tools and techniques

• References

INTRODUCTION

• Information Retrieval

• Ranking

• Weight

• Score

Information Retrieval

• Information retrieval (IR) means obtaining the information resources relevant to an information need from a collection of information resources.

• It is used to reduce information overload.

• Typical applications : web search engines; public libraries also use IR systems to provide access to books, journals and other documents.

Abstract Model of IR

Brief working of IR system

• The user enters the query in their own language.

• The query development function converts the user query into a formal query in order to harmonize it with the system's vocabulary of retrieval commands. It is one of the important intermediary steps that take place inside the database.

• The retrieved data may be complete or incomplete; it is later sorted to generate the final result set.

Ranking

• Ranking orders the matching documents according to their relevance to a given search query.

• This is done by assigning a numerical score to each document using a ranking function, which incorporates features of the document, the query, and the overall document collection.

Some simple ranking functions

• Constant ranking function : the same score is assigned to all documents.

• Term frequency ranking function : counting the number of times that each query term occurs in the document, then summing these.

• The tf-idf ranking function : computing the product of the term frequency and inverse document frequency for each query term, then summing these.

• Okapi BM25 : multiplying the idf of each query term by a saturating, length-normalized function of its term frequency in the document, then summing these.

• Machine-learned ranking formulas, obtained automatically from training data by machine learning methods.

Score Calculation

• The score of each document is calculated by multiplying, for every query term, the weight of the term in the document by its weight in the query, then summing these products.
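The score calculation above can be sketched in Java as a sum of products over the query terms. This is a minimal illustration, not the project's actual code: the class name `ScoreSketch`, the method name `score`, and the example weights are all assumptions made for the sketch.

```java
import java.util.Map;

class ScoreSketch {
    // Sums documentWeight * queryWeight over the terms of the query.
    // Terms that do not occur in the document contribute 0.
    static double score(Map<String, Double> docWeights, Map<String, Double> queryWeights) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : queryWeights.entrySet()) {
            sum += docWeights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical weights, chosen only to make the arithmetic easy to follow.
        Map<String, Double> doc = Map.of("ranking", 0.5, "score", 0.25);
        Map<String, Double> query = Map.of("ranking", 2.0, "retrieval", 1.0);
        System.out.println(score(doc, query)); // 0.5 * 2.0 + 0 * 1.0 = 1.0
    }
}
```

Only terms shared by the query and the document add to the score, so iterating over the query terms is enough.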

Literature Survey

List of sources

• Paper 1 : Document Similarity Search Based on Manifold-Ranking of TextTiles.

• Paper 2 : TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages.

List of sources

• Paper 3 : Comparison of Rank-Based vs. Score-Based Aggregation for Ensemble Gene Selection.

• Paper 4 : Several methods of ranking retrieval systems with partial relevance judgment.

Document Similarity Search Based on Manifold-Ranking of TextTiles

• In this paper ranking of documents is done by using the tiling concept.

• Conclusion : the approach improves retrieval performance for different retrieval functions.

• Authors : Xiaojun Wan, Jianwu Yang, and Jianguo Xiao.

• Place : Institute of Computer Science and Technology, Peking University, Beijing 100871, China.

TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages

• In this paper, TextTiling is used to divide each document into subtopics.

• Conclusion : this technique has been useful for many text analysis tasks, including information retrieval and summarization.

• Author : Marti A. Hearst.

Comparison of Rank-Based vs. Score-Based Aggregation for Ensemble Gene Selection

• This paper compares rank-based and score-based aggregation using different techniques (RF, MI, Dev, GM, ROC, PRC, S2N), applying them to different datasets and subsets.

• Conclusion : these two aggregation approaches work differently on different rankers.

• Authors : David J. Dittman, Taghi M. Khoshgoftaar, Randall Wald, and Amri Napolitano.

Several methods of ranking retrieval systems with partial relevance judgment.

• This paper demonstrates that precision and recall suffer from certain shortcomings when ranking is done with partial relevance judgment.

• Conclusion : with partial relevance judgment, the evaluated results can be significantly different from the results with complete relevance judgment.

• Authors : Shengli Wu and Sally McClean.

Objective

• The project aims to find documents similar to a query document in a text corpus and return a ranked list of similar documents.

• Ranking is done by calculating the query-document score.

Problem statement

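• Documents are ranked based on the standard score calculation, i.e., using the tf-idf concept.

• Formula for weighted tf : $w_{t,d} = 1 + \log_{10}(\mathrm{tf}_{t,d})$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise.

• Formula for idf : $\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$, where $N$ is the number of documents in the collection and $\mathrm{df}_t$ is the number of documents that contain term $t$.

• Another way of ranking the documents, TextTiling, is also being studied. Further, a precision-recall graph will be plotted.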
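The two formulas can be written directly in Java. The sketch below is illustrative only: the class and method names (`TfIdfSketch`, `logTf`, `idf`) are hypothetical, and returning 0 for a term with zero document frequency is an assumption made to avoid dividing by zero.

```java
class TfIdfSketch {
    // Logarithmic term-frequency weight: 1 + log10(tf) if tf > 0, otherwise 0.
    static double logTf(int tf) {
        return tf > 0 ? 1.0 + Math.log10(tf) : 0.0;
    }

    // Inverse document frequency: log10(N / df).
    static double idf(int totalDocs, int docFreq) {
        if (docFreq <= 0) {
            return 0.0; // assumption: a term found in no document gets no weight
        }
        return Math.log10((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        System.out.println(logTf(10));                 // 1 + log10(10) = 2.0
        System.out.println(idf(1000, 10));             // log10(100) = 2.0
        System.out.println(logTf(10) * idf(1000, 10)); // tf-idf weight = 4.0
    }
}
```

A term that occurs in every document has idf 0, so it contributes nothing to the score no matter how often it appears.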

Steps involved

• Collection of files

• Determining term frequency

• Determining document frequency

• (query, document) set

• Score calculation based on 4 different techniques.

Design till now

Control flow graph

Description of functions

• Main : It calls all other functions by making objects of the subclasses.

• remWord : It is used to check whether the program is reading the files.

• deleteWords : It is used to delete the list of stop words from all the files and store the unique words of all files in a separate file.
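A stop-word removal step like the one deleteWords performs could look as follows. This is a minimal in-memory sketch rather than the project's code: it works on a string instead of files, and the class name `StopWordSketch`, the method `uniqueTerms`, and the short stop-word list are all assumptions.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class StopWordSketch {
    // Hypothetical stop-word list; the project's actual list is not shown in the slides.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "of", "to", "and");

    // Lowercases the text, splits it on non-letters, drops stop words,
    // and returns the remaining unique terms in first-seen order.
    static Set<String> uniqueTerms(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(uniqueTerms("The score of a document is the sum of the weights"));
        // prints [score, document, sum, weights]
    }
}
```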

Description of flowchart functions

• countWords : It reads the unique terms from the file and stores them in a map along with their frequency.

• documentFreqVector : It makes a document vector. Corresponding to each term and document it sets 1s or 0s.
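The two functions above can be sketched in a few lines of Java. The method name `countWords` comes from the slides, but its signature here is assumed; `incidenceVector` and `documentFrequency` are hypothetical helpers that show how a 0/1 vector per term leads to the document frequency.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TermStatsSketch {
    // Maps each term to the number of times it occurs in one document.
    static Map<String, Integer> countWords(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // For one term, builds a vector with one entry per document:
    // 1 if the document contains the term, 0 otherwise.
    static int[] incidenceVector(String term, List<List<String>> docs) {
        int[] v = new int[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            v[i] = docs.get(i).contains(term) ? 1 : 0;
        }
        return v;
    }

    // Document frequency is the number of 1s in the incidence vector.
    static int documentFrequency(int[] incidence) {
        int df = 0;
        for (int bit : incidence) {
            df += bit;
        }
        return df;
    }

    public static void main(String[] args) {
        // Three tiny, already tokenized documents (hypothetical data).
        List<List<String>> docs = List.of(
                List.of("score", "rank", "score"),
                List.of("rank", "query"),
                List.of("query"));
        System.out.println(countWords(docs.get(0))); // {rank=1, score=2}
        int[] v = incidenceVector("rank", docs);
        System.out.println(Arrays.toString(v));      // [1, 1, 0]
        System.out.println(documentFrequency(v));    // 2
    }
}
```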

Weight Calculation

• It differs in documents and queries.

• We use ddd.qqq notation to depict this calculation.

• Example: lnc.ltn

document: logarithmic tf, no df weighting, cosine normalization

query: logarithmic tf, idf, no normalization
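The lnc.ltn example can be sketched end to end in Java. This is one possible reading of the scheme under stated assumptions, not the project's implementation: the class and method names are hypothetical, term frequencies and document frequencies are passed in as plain maps, and a query term with unknown document frequency is given idf 0.

```java
import java.util.HashMap;
import java.util.Map;

class LncLtnSketch {
    static double logTf(int tf) {
        return tf > 0 ? 1.0 + Math.log10(tf) : 0.0;
    }

    // Document side (lnc): logarithmic tf, no idf, cosine normalization.
    static Map<String, Double> lncWeights(Map<String, Integer> docTf) {
        Map<String, Double> w = new HashMap<>();
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : docTf.entrySet()) {
            double x = logTf(e.getValue());
            w.put(e.getKey(), x);
            sumSquares += x * x;
        }
        final double norm = Math.sqrt(sumSquares);
        if (norm > 0) {
            w.replaceAll((term, x) -> x / norm); // divide by the vector length
        }
        return w;
    }

    // Query side (ltn): logarithmic tf times idf, no normalization.
    static Map<String, Double> ltnWeights(Map<String, Integer> queryTf,
                                          Map<String, Integer> docFreq, int totalDocs) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : queryTf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            // Assumption: a term with unknown document frequency gets idf 0.
            double idf = df > 0 ? Math.log10((double) totalDocs / df) : 0.0;
            w.put(e.getKey(), logTf(e.getValue()) * idf);
        }
        return w;
    }

    // Query-document score: sum of products of matching term weights.
    static double score(Map<String, Double> doc, Map<String, Double> query) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            sum += doc.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical counts: a 2-term document, a 1-term query, 1000 documents.
        Map<String, Double> d = lncWeights(Map.of("score", 1, "ranking", 1));
        Map<String, Double> q = ltnWeights(Map.of("ranking", 1), Map.of("ranking", 10), 1000);
        System.out.println(score(d, q)); // about 1.414, i.e. (1 / sqrt(2)) * 2
    }
}
```

Cosine normalization is applied only on the document side, so long documents do not get higher scores just because they contain more terms.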

Weight Variants

Approaches

anc.btn and anc.ltn approaches

Approaches

nnc.btn and nnc.ltn approaches
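The slides do not spell out what the first letter of each code means. Assuming it follows the common SMART convention for the term-frequency component (n = natural, l = logarithmic, a = augmented, b = boolean), the four variants could be sketched as below. The class and method names are hypothetical, and giving weight 0 to an absent term under the augmented variant is an assumption of this sketch.

```java
class TfVariantSketch {
    // Term-frequency component for one term.
    // tf    : raw count of the term in the document or query
    // maxTf : largest raw count of any term in the same document or query
    static double tfWeight(char variant, int tf, int maxTf) {
        switch (variant) {
            case 'n': // natural: the raw count
                return tf;
            case 'l': // logarithmic: 1 + log10(tf)
                return tf > 0 ? 1.0 + Math.log10(tf) : 0.0;
            case 'a': // augmented: 0.5 + 0.5 * tf / maxTf
                // Assumption: absent terms get 0 instead of the 0.5 floor.
                return (tf > 0 && maxTf > 0) ? 0.5 + 0.5 * tf / maxTf : 0.0;
            case 'b': // boolean: 1 if the term occurs, otherwise 0
                return tf > 0 ? 1.0 : 0.0;
            default:
                throw new IllegalArgumentException("unknown tf variant: " + variant);
        }
    }

    public static void main(String[] args) {
        // A term that occurs 2 times where the most frequent term occurs 4 times.
        System.out.println(tfWeight('n', 2, 4)); // 2.0
        System.out.println(tfWeight('a', 2, 4)); // 0.75
        System.out.println(tfWeight('b', 2, 4)); // 1.0
        System.out.println(tfWeight('l', 2, 4)); // 1 + log10(2), about 1.301
    }
}
```

Under this reading, anc.btn weights documents with augmented tf, no idf, and cosine normalization, and weights queries with boolean tf and idf.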

Tools and techniques

• NetBeans : it is an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C/C++, and HTML5. It is also an application platform framework for Java desktop applications and others. The NetBeans IDE is written in Java and can run on Windows, OS X, Linux, Solaris and other platforms supporting a compatible JVM. The NetBeans Platform allows applications to be developed from a set of modular software components called modules.

• Java : it is a computer programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA), meaning that code that runs on one platform does not need to be recompiled to run on another. Java applications are typically compiled to bytecode (class file) that can run on any Java virtual machine (JVM) regardless of computer architecture. Java is, as of 2014, one of the most popular programming languages in use, particularly for client-server web applications.

References

• Wan, X., Yang, J., Xiao, J. (2001). Document Similarity Search Based on Manifold-Ranking of TextTiles. Institute of Computer Science and Technology, Peking University, Beijing 100871, China.

• Hearst, M.A. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Xerox PARC, California, USA.

• Dittman, D.J., Khoshgoftaar, T.M., Wald, R., Napolitano, A. (2013). Comparison of Rank-Based vs. Score-Based Aggregation for Ensemble Gene Selection. Florida Atlantic University, Boca Raton, FL 33431.

• Wu, S., McClean, S. Several Methods of Ranking Retrieval Systems with Partial Relevance Judgment. School of Computing and Mathematics, University of Ulster, UK.
