Score-based ranking of the documents

Submitted By: Kriti Khanna(9910103499)

F4, CSE, 4th year

OUTLINE

• Introduction
• Literature Survey
• Objective
• Flowchart
• Implementation
• Tools and techniques
• References

INTRODUCTION

• Information Retrieval
• Ranking
• Weight
• Score

Information Retrieval

• Information retrieval (IR) obtains the information resources relevant to an information need from a collection of information resources.

• It is used to reduce information overload.

• Best applications : web search engines; public libraries also use IR systems to provide access to books, journals and other documents.

Abstract Model of IR

Brief working of IR system

• The user enters the query in their own language.

• The query development function converts the user query into a formal query in order to harmonize it with the system's vocabulary of retrieval commands. It is one of the important intermediate steps that take place inside the database.

• The retrieved data may be complete or incomplete; it is later sorted to generate the final result set.

Ranking

• Ranking orders the matching documents according to their relevance to a given search query.

• This is done by assigning a numerical score to each document using a ranking function, which incorporates features of the document, the query, and the overall document collection.

Some simple ranking functions

• Constant ranking function : the same score is assigned to all documents.

• Term frequency ranking function : count the number of times that each query term occurs in the document, then sum these counts.

• The tf-idf ranking function : compute the product of the term frequency and the inverse document frequency for each query term, then sum these products.

• Okapi BM25 : for each query term, multiply its idf by a saturating term-frequency factor that is normalized by document length, then sum these products.

• Machine-learned ranking formulas : obtained automatically from training data by machine learning methods.
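
The term frequency ranking function can be sketched in a few lines of Java. This is only an illustration: the class name `TermFrequencyRanking`, the method `score`, and the assumption that the document is already split into lower-case tokens are not taken from the project code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class TermFrequencyRanking {

    // Counts how often each query term occurs in the document
    // and sums these counts into a single score.
    static int score(List<String> queryTerms, List<String> documentTokens) {
        int total = 0;
        for (String term : queryTerms) {
            for (String token : documentTokens) {
                if (token.equals(term)) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> document = Arrays.asList("ranking", "of", "documents", "by", "ranking", "score");
        List<String> query = Arrays.asList("ranking", "score");
        // "ranking" occurs twice and "score" once, so this prints 3.
        System.out.println(score(query, document));
    }
}
```

A document that repeats a query term many times scores higher here regardless of its length, which is one reason the weighted variants further on are used instead.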

Score Calculation

• The score of each document is calculated by multiplying, for each term, the weight of the term in the document by its weight in the query, then summing these products.
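
As a rough sketch, this score is a dot product between the document weights and the query weights. The class `DotProductScore` and the example weights below are made up for illustration; terms missing from the document are assumed to contribute nothing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class DotProductScore {

    // Sums documentWeight * queryWeight over all query terms.
    // A query term that is absent from the document adds 0 (assumption).
    static double score(Map<String, Double> documentWeights, Map<String, Double> queryWeights) {
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : queryWeights.entrySet()) {
            Double documentWeight = documentWeights.get(entry.getKey());
            if (documentWeight != null) {
                sum += documentWeight * entry.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> document = new HashMap<>();
        document.put("ranking", 0.5);
        document.put("score", 0.25);

        Map<String, Double> query = new HashMap<>();
        query.put("ranking", 2.0);
        query.put("retrieval", 1.0); // not in the document, contributes 0

        // 0.5 * 2.0 = 1.0
        System.out.println(score(document, query));
    }
}
```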

Literature Survey

List of sources

• Paper 1 : Document similarity search based on manifold ranking of TextTiles.

• Paper 2 : TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages.

• Paper 3 : Comparison of rank-based vs score based aggregation for ensemble gene selection.

• Paper 4 : Several methods of ranking retrieval systems with partial relevance judgment.

Document similarity search based on manifold ranking of TextTiles

• In this paper documents are ranked using the text-tiling concept.

• Conclusion : it improves retrieval performance with different retrieval functions.

• Authors : Xiaojun Wan, Jianwu Yang, and Jianguo Xiao.

• Place : Institute of Computer Science and Technology, Peking University, Beijing 100871, China.

TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages

• In this paper TextTiling is used to divide each document into subtopic passages.

• Conclusion : this technique has been useful for many text analysis tasks, including information retrieval and summarization.

• Authors : Marti A. Hearst

Comparison of rank-based vs score based aggregation for ensemble gene selection

• This paper compares rank-based and score-based aggregation by applying different techniques (RF, MI, Dev, GM, ROC, PRC, S2N) to different datasets and subsets.

• Conclusion : the two aggregation approaches work differently on different rankers.

• Authors : David J. Dittman, Taghi M. Khoshgoftaar, Randall Wald, and Amri Napolitano

Several methods of ranking retrieval systems with partial relevance judgment

• This paper demonstrates that precision and recall have certain shortcomings when ranking is done with partial relevance judgment.

• Conclusion : with partial relevance judgment, the evaluated results can be significantly different from the results with complete relevance judgment.

• Authors : Shengli Wu and Sally McClean.

Objective

• The aim is to find documents similar to a query document in a text corpus and return a ranked list of similar documents.

• Ranking is done by calculating the query-document score.

Problem statement

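• Documents are ranked using the standard score calculation, i.e. the tf-idf concept.

• Formula for weighted tf : $w_{t,d} = 1 + \log_{10}(\mathrm{tf}_{t,d})$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise.

• Formula for idf : $\mathrm{idf}_t = \log_{10}(N / \mathrm{df}_t)$, where $N$ is the number of documents in the collection and $\mathrm{df}_t$ is the number of documents that contain term $t$.

• Another way of ranking the documents, TextTiling, is also being studied. Further, a precision-recall graph will be plotted.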
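
The two formulas above translate almost directly into Java. The sketch below is illustrative: the class `TfIdfWeights` and its method names are hypothetical, and rejecting a document frequency of zero is an added assumption, since the formula is undefined in that case.

```java
// Hypothetical helper, for illustration only.
class TfIdfWeights {

    // Weighted tf: 1 + log10(tf) if tf > 0, otherwise 0.
    static double weightedTf(int tf) {
        return tf > 0 ? 1.0 + Math.log10(tf) : 0.0;
    }

    // idf: log10(N / df), with N documents in the collection
    // and df documents containing the term.
    static double idf(int totalDocuments, int documentFrequency) {
        if (documentFrequency <= 0) {
            // Assumption: the caller only asks for terms that occur somewhere.
            throw new IllegalArgumentException("document frequency must be positive");
        }
        return Math.log10((double) totalDocuments / documentFrequency);
    }

    public static void main(String[] args) {
        System.out.println(weightedTf(0));    // term absent: weight 0
        System.out.println(weightedTf(100));  // 1 + log10(100), about 3
        System.out.println(idf(1000, 10));    // log10(100), about 2
    }
}
```

A term that occurs in every document gets an idf of 0, so it does not influence the ranking at all.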

Steps involved

• Collection of files
• Determining term frequency
• Determining document frequency
• (query, document) set
• Score calculation based on 4 different techniques.

Design till now

Control flow graph

Description of functions

• Main : It calls all other functions by making objects of the subclasses.

• remWord : It is used to check whether the program is reading the files.

• deleteWords : It is used to delete the list of stop words from all the files and store the unique words of all files in a separate file.
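
A minimal in-memory sketch of what deleteWords describes is given below. It is not the project's implementation: the class `StopWordFilter`, the tiny stop-word list, the tokenization rule, and the use of strings instead of files are all assumptions made to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
class StopWordFilter {

    // Removes stop words from every document and returns the
    // unique remaining words of all documents, sorted alphabetically.
    static Set<String> uniqueTerms(List<String> documents, Set<String> stopWords) {
        Set<String> unique = new TreeSet<>();
        for (String text : documents) {
            // Assumption: lower-case the text and split on anything
            // that is not a letter or digit.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty() && !stopWords.contains(token)) {
                    unique.add(token);
                }
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of", "a", "for"));
        List<String> documents = Arrays.asList(
                "The ranking of documents",
                "A score for the documents");
        // Prints [documents, ranking, score]
        System.out.println(uniqueTerms(documents, stopWords));
    }
}
```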

Description of flowchart functions

• countWords : It reads the unique terms from the file and stores them in a map along with their frequency.

• documentFreqVector : It makes a document vector. Corresponding to each term and document it sets a 1 or a 0.
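
The two functions above might look roughly as follows. The class `TermStatistics` and the method `incidenceVector` are hypothetical names, the input is assumed to be tokenized already, and reading "1 or 0" as "term present or absent in the document" is an interpretation rather than something the slides state.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
class TermStatistics {

    // Stores each term in a map along with its frequency.
    static Map<String, Integer> countWords(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Builds a 0/1 vector over the vocabulary: 1 if the term occurs
    // in the document, 0 otherwise (assumed meaning of the 1s and 0s).
    static int[] incidenceVector(List<String> vocabulary, Collection<String> documentTokens) {
        Set<String> present = new HashSet<>(documentTokens);
        int[] vector = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vector[i] = present.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("score", "ranking", "score");
        System.out.println(countWords(tokens)); // {ranking=1, score=2}

        List<String> vocabulary = Arrays.asList("ranking", "score", "tile");
        System.out.println(Arrays.toString(incidenceVector(vocabulary, tokens))); // [1, 1, 0]
    }
}
```

Summing the incidence vectors of all documents position by position gives the document frequency of each term.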

Weight Calculation

• It differs between documents and queries.

• We use the ddd.qqq notation to denote this calculation.

• Example : lnc.ltn

document : logarithmic tf, no df weighting, cosine normalization

query : logarithmic tf, idf, no normalization
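
The lnc.ltn example can be sketched as below. The class `LncLtnScore`, its method names and the sample counts are hypothetical; base-10 logarithms are assumed, and query terms with no known document frequency are simply skipped.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class LncLtnScore {

    static double logTf(int tf) {
        return tf > 0 ? 1.0 + Math.log10(tf) : 0.0;
    }

    // Document side (lnc): logarithmic tf, no idf, cosine normalization.
    static Map<String, Double> lncDocumentWeights(Map<String, Integer> documentTf) {
        Map<String, Double> weights = new HashMap<>();
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> entry : documentTf.entrySet()) {
            double weight = logTf(entry.getValue());
            weights.put(entry.getKey(), weight);
            sumOfSquares += weight * weight;
        }
        final double length = Math.sqrt(sumOfSquares);
        if (length > 0.0) {
            weights.replaceAll((term, weight) -> weight / length);
        }
        return weights;
    }

    // Query side (ltn): logarithmic tf times idf, no normalization.
    static Map<String, Double> ltnQueryWeights(Map<String, Integer> queryTf,
                                               Map<String, Integer> documentFrequency,
                                               int totalDocuments) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> entry : queryTf.entrySet()) {
            Integer df = documentFrequency.get(entry.getKey());
            if (df == null || df <= 0) {
                continue; // assumption: unknown terms contribute nothing
            }
            double idf = Math.log10((double) totalDocuments / df);
            weights.put(entry.getKey(), logTf(entry.getValue()) * idf);
        }
        return weights;
    }

    // Query-document score: sum of the products of matching weights.
    static double score(Map<String, Double> documentWeights, Map<String, Double> queryWeights) {
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : queryWeights.entrySet()) {
            sum += documentWeights.getOrDefault(entry.getKey(), 0.0) * entry.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> documentTf = new HashMap<>();
        documentTf.put("ranking", 10);
        documentTf.put("score", 1);

        Map<String, Integer> queryTf = new HashMap<>();
        queryTf.put("ranking", 1);

        Map<String, Integer> df = new HashMap<>();
        df.put("ranking", 10);

        double result = score(lncDocumentWeights(documentTf), ltnQueryWeights(queryTf, df, 1000));
        System.out.println(result); // 4 / sqrt(5), about 1.79
    }
}
```

Normalizing only the document side keeps long documents from being favoured, while the idf on the query side is enough to down-weight common terms in the final product.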

Weight Variants

Approaches

anc.btn and anc.ltn approaches

Approaches

nnc.btn and nnc.ltn approaches
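
The four approaches differ only in the first letter of each triple, i.e. in the term-frequency component. The sketch below assumes the usual SMART reading of these letters (n = natural, l = logarithmic, a = augmented, b = boolean); the slides themselves do not spell the letters out, and the class `TfVariants` is a hypothetical name.

```java
// Hypothetical helper, for illustration only.
class TfVariants {

    // n (natural): the raw term frequency.
    static double natural(int tf) {
        return tf;
    }

    // l (logarithmic): 1 + log10(tf) if tf > 0, otherwise 0.
    static double logarithmic(int tf) {
        return tf > 0 ? 1.0 + Math.log10(tf) : 0.0;
    }

    // a (augmented): 0.5 + 0.5 * tf / maxTf, where maxTf is the largest
    // term frequency in the same document. Intended for terms that occur
    // in the document.
    static double augmented(int tf, int maxTf) {
        if (maxTf <= 0) {
            throw new IllegalArgumentException("maxTf must be positive");
        }
        return 0.5 + 0.5 * tf / maxTf;
    }

    // b (boolean): 1 if the term occurs, otherwise 0.
    static double booleanTf(int tf) {
        return tf > 0 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        int tf = 2;
        int maxTf = 4;
        System.out.println(natural(tf));          // 2.0
        System.out.println(augmented(tf, maxTf)); // 0.75
        System.out.println(booleanTf(tf));        // 1.0
    }
}
```

Under this reading, anc and nnc describe the document side (augmented or natural tf, no idf, cosine normalization) and btn and ltn describe the query side (boolean or logarithmic tf, idf, no normalization).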

Tools and techniques

• NetBeans : it is an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C/C++, and HTML5. It is also an application platform framework for Java desktop applications and others. The NetBeans IDE is written in Java and can run on Windows, OS X, Linux, Solaris and other platforms supporting a compatible JVM. The NetBeans Platform allows applications to be developed from a set of modular software components called modules.

• Java : it is a computer programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA), meaning that code that runs on one platform does not need to be recompiled to run on another. Java applications are typically compiled to bytecode (class file) that can run on any Java virtual machine (JVM) regardless of computer architecture. Java is, as of 2014, one of the most popular programming languages in use, particularly for client-server web applications.

References

• Wan, X., Yang, J., Xiao, J. (2001). Document Similarity Search Based on Manifold-Ranking of TextTiles. Institute of Computer Science and Technology, Peking University, Beijing 100871, China.

• Hearst, M. A. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Xerox PARC, California, USA.

• Dittman, D. J., Khoshgoftaar, T. M., Wald, R., Napolitano, A. (2013). Comparison of Rank-Based vs. Score-Based Aggregation for Ensemble Gene Selection. Florida Atlantic University, Boca Raton, FL 33431.

• Wu, S., McClean, S. Several Methods of Ranking Retrieval Systems with Partial Relevance Judgment. School of Computing and Mathematics, University of Ulster, UK.