CSE 5243 INTRO. TO DATA MINING Slides adapted from Prof. Srinivasan Parthasarathy @OSU Graph Data & Introduction to Information Retrieval Huan Sun, CSE@The Ohio State University 11/21/2017



GRAPH BASICS AND A GENTLE INTRODUCTION TO PAGERANK

Slides adapted from Prof. Srinivasan Parthasarathy @OSU

Chapter 4 Graph Data: http://www.dataminingbook.info/pmwiki.php/Main/BookPathUploads?action=downloadman&upname=book-20160121.pdf , http://www.dataminingbook.info/pmwiki.php

Background

Besides keywords, what other evidence can one use to rate the importance of a webpage?

Solution: use the hyperlink structure.

E.g., a webpage linked to by many webpages is probably important. But simply counting links is not global (comprehensive): it ignores how important the linking pages themselves are.

PageRank was developed by Larry Page in 1998.

Idea

A graph represents the WWW
  Node: webpage
  Directed edge: hyperlink

A user surfs the WWW by randomly clicking hyperlinks. The probability that the user stops at a particular webpage is that page's PageRank value.

A node that is linked to by many nodes with high PageRank values receives a high rank itself. If there are no links to a node, then there is no support for that page.


Formal Formulation


Iterative Computation


Example 1

PageRank Calculation: first iteration

=the transpose of A (adjacency matrix)


Example 1

PageRank Calculation: second iteration


Example 1

Convergence after some iterations

A simple version

u: a webpage
B_u: the set of u's backlinks
N_v: the number of forward links of page v

Initially, R(u) is 1/N for every webpage, where N is the total number of webpages. Iteratively update each webpage's PR value until convergence.

$$R(u) = \sum_{v \in B_u} \frac{R(v)}{N_v}$$
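The update rule can be sketched in a few lines of Python. This is a minimal illustration rather than the slides' own code: the three-page graph is hypothetical, the fixed iteration count stands in for a real convergence test, and every page is assumed to have at least one forward link.

```python
def simple_pagerank(out_links, iterations=50):
    """Iterate R(u) = sum over backlinks v of R(v) / N_v, starting from 1/N."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in nodes}
        for v, targets in out_links.items():
            # Page v splits its current rank evenly among its forward links.
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simple_pagerank(graph)
```

For this graph the values settle near R(a) = R(c) = 0.4 and R(b) = 0.2, which satisfy the equation exactly.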

A little more advanced version

Add a damping factor d: imagine that a surfer stops clicking hyperlinks with probability 1-d.

R(u) is at least (1-d)/(N-1), where N is the total number of nodes.

$$R(u) = \frac{1-d}{N-1} + d \sum_{v \in B_u} \frac{R(v)}{N_v}$$

Other applications

Social network (Facebook, Twitter, etc.)
  Node: Person; Edge: Follower / Followee / Friend
  Higher PR value: Celebrity

Citation network
  Node: Paper; Edge: Citation
  Higher PR values: Important papers

Protein-protein interaction network
  Node: Protein; Edge: Two proteins bind together
  Higher PR values: Essential proteins


SEARCH ENGINES

INFORMATION RETRIEVAL IN PRACTICE

BOOK: HTTP://CIIR.CS.UMASS.EDU/DOWNLOADS/SEIRIP.PDF

SLIDES: HTTP://WWW.SEARCH-ENGINES-BOOK.COM/SLIDES/

All slides ©Addison Wesley, 2008

Slides adapted from Prof. W. Bruce Croft @UMASS

Search Engines and Information Retrieval

Information Retrieval in Practice

Search and Information Retrieval

Search on the Web is a daily activity for many people throughout the world

Search and communication are the most popular uses of the computer
Applications involving search are everywhere
The field of computer science that is most involved with R&D for search is information retrieval (IR)


Information Retrieval

“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)

General definition that can be applied to many types of information and search applications

Primary focus of IR since the 50s has been on text and documents

What is a Document?

Examples: web pages, email, books, news stories, scholarly papers, text messages, Word™, Powerpoint™, PDF, forum postings, patents, IM sessions, etc.

Common properties
  Significant text content
  Some structure (e.g., title, author, date for papers; subject, sender, destination for email)

Documents vs. Database Records

Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
  e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.

Easy to compare fields with well-defined semantics to queries in order to find matches

Text is more difficult

Documents vs. Records

Example bank database query
  Find records with balance > $50,000 in branches located in Amherst, MA.
  Matches easily found by comparison with field values of records

Example search engine query
  bank scandals in western mass
  This text must be compared to the text of entire news stories

Comparing Text

Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval

Exact matching of words is not enough
  Many different ways to write the same thing in a “natural language” like English
  e.g., does a news story containing the text “bank director in Amherst steals funds” match the query?
  Some stories will be better matches than others

Dimensions of IR

IR is more than just text, and more than just web search, although these are central

People doing IR work with different media, different types of search applications, and different tasks

Other Media

New applications increasingly involve new media
  e.g., video, photos, music, speech

Like text, content is difficult to describe and compare
  text may be used to represent them (e.g., tags)

IR approaches to search and evaluation are appropriate

Dimensions of IR

Content        Applications        Tasks
Text           Web search          Ad hoc search
Images         Vertical search     Filtering
Video          Enterprise search   Classification
Scanned docs   Desktop search      Question answering
Audio          Forum search
Music          P2P search
               Literature search

IR Tasks

Ad-hoc search
  Find relevant documents for an arbitrary text query

Filtering
  Identify relevant user profiles for a new document

Classification
  Identify relevant labels for documents

Question answering
  Give a specific answer to a question

Big Issues in IR

Relevance
  What is it?
  Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine
  Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style
  Topical relevance (same topic) vs. user relevance (everything else)

Big Issues in IR

Relevance
  Retrieval models define a view of relevance
  Ranking algorithms used in search engines are based on retrieval models
  Most models describe statistical properties of text rather than linguistic
    i.e., counting simple text features such as words instead of parsing and analyzing the sentences
    Statistical approach to text processing started with Luhn in the 50s
    Linguistic features can be part of a statistical model

Big Issues in IR

Evaluation
  Experimental procedures and measures for comparing system output with user expectations
    Originated in Cranfield experiments in the 60s
  IR evaluation methods now used in many fields
  Typically use test collection of documents, queries, and relevance judgments
    Most commonly used are TREC collections
  Recall and precision are two examples of effectiveness measures
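As a concrete reading of the two measures: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, with hypothetical document IDs and judgments:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating both inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 3 judged relevant, 2 in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here precision is 2/4 and recall is 2/3; returning more documents can raise recall while lowering precision.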

Big Issues in IR

Users and Information Needs
  Search evaluation is user-centered
  Keyword queries are often poor descriptions of actual information needs
  Interaction and context are important for understanding user intent
  Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking

IR and Search Engines

A search engine is the practical application of information retrieval techniques to large scale text collections

Web search engines are best-known examples, but many others
  Open source search engines are important for research and development
    e.g., Lucene, Lemur/Indri, Galago

Big issues include main IR issues but also some others

IR and Search Engines

Information Retrieval
  Relevance: Effective ranking
  Evaluation: Testing and measuring
  Information needs: User interaction

Search Engines
  Performance: Efficient search and indexing
  Incorporating new data: Coverage and freshness
  Scalability: Growing with data and users
  Adaptability: Tuning for applications
  Specific problems: e.g., spam

Search Engine Issues

Performance
  Measuring and improving the efficiency of search
    e.g., reducing response time, increasing query throughput, increasing indexing speed
  Indexes are data structures designed to improve search efficiency
    designing and implementing them are major issues for search engines

Search Engine Issues

Dynamic data
  The “collection” for most real applications is constantly changing in terms of updates, additions, deletions
    e.g., web pages
  Acquiring or “crawling” the documents is a major task
    Typical measures are coverage (how much has been indexed) and freshness (how recently was it indexed)
  Updating the indexes while processing queries is also a design issue

Search Engine Issues

Scalability
  Making everything work with millions of users every day, and many terabytes of documents
  Distributed processing is essential

Adaptability
  Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applications

Architecture of a Search Engine

Information Retrieval in Practice

Search Engine Architecture

A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
  describes a system at a particular level of abstraction

Architecture of a search engine determined by 2 requirements
  effectiveness (quality of results) and efficiency (response time and throughput)


Indexing Process

Text acquisition
  identifies and stores documents for indexing

Text transformation
  transforms documents into index terms or features

Index creation
  takes index terms and creates data structures (indexes) to support fast searching


Query Process

User interaction
  supports creation and refinement of query, display of results

Ranking
  uses query and indexes to generate ranked list of documents

Evaluation
  monitors and measures effectiveness and efficiency (primarily offline)

Details: Text Acquisition

Crawler
  Identifies and acquires documents for search engine
  Many types: web, enterprise, desktop
  Web crawlers follow links to find documents
    Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
    Single site crawlers for site search
    Topical or focused crawlers for vertical search
  Document crawlers for enterprise and desktop search
    Follow links and scan directories
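The link-following behavior of a web crawler can be sketched as a breadth-first traversal. This is an illustration only: `fetch_links` is a hypothetical callback standing in for downloading a page and extracting its links, and `toy_web` is an in-memory stand-in for the Web. A real crawler would also handle politeness, robots.txt, and re-crawling for freshness.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl: visit pages in the order they are discovered."""
    frontier = deque([seed])  # pages discovered but not yet visited
    seen = {seed}             # avoids queueing the same page twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link structure used in place of real HTTP requests.
toy_web = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["story1", "about"],
    "story1": [],
}
order = crawl("home", lambda url: toy_web.get(url, []))
```

The `max_pages` limit is one simple way to bound coverage; focused crawlers would instead filter which links enter the frontier.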

Text Acquisition

Feeds
  Real-time streams of documents
    e.g., web feeds for news, blogs, video, radio, tv
  RSS is common standard
    RSS “reader” can provide new XML documents to search engine

Conversion
  Convert variety of documents into a consistent text plus metadata format
    e.g., HTML, XML, Word, PDF, etc. → XML
  Convert text encoding for different languages
    Using a Unicode standard like UTF-8

Text Acquisition

Document data store
  Stores text, metadata, and other related content for documents
    Metadata is information about document such as type and creation date
    Other content includes links, anchor text
  Provides fast access to document contents for search engine components
    e.g., result list generation
  Could use relational database system
    More typically, a simpler, more efficient storage system is used due to huge numbers of documents

Text Transformation

Parser
  Processing the sequence of text tokens in the document to recognize structural elements
    e.g., titles, links, headings, etc.
  Tokenizer recognizes “words” in the text
    must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
  Markup languages such as HTML, XML often used to specify structure
    Tags used to specify document elements
      E.g., <h2> Overview </h2>
    Document parser uses syntax of markup language (or other formatting) to identify structure
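A small sketch of both steps, using only the Python standard library. The class name `StructureParser`, the function `tokenize`, and the tokenization rules (lowercase everything, split on hyphens, keep a trailing apostrophe suffix) are illustrative assumptions; real search engines make these choices deliberately and differently.

```python
import re
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects (tag, text) pairs so text can be tied to its document element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.elements = []  # (innermost tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.elements.append((self.stack[-1], text))


def tokenize(text):
    """Lowercase the text and keep runs of letters/digits, plus forms like "director's"."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


parser = StructureParser()
parser.feed("<h2> Overview </h2><p>Search engines index text.</p>")
elements = parser.elements
tokens = tokenize("Bank director's e-mail: STEALS funds!")
```

Note how the hyphen in "e-mail" produces two tokens here; whether that is desirable is exactly the kind of issue the slide lists.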

Text Transformation

Stopping
  Remove common words
    e.g., “and”, “or”, “the”, “in”
  Some impact on efficiency and effectiveness
  Can be a problem for some queries

Stemming
  Group words derived from a common stem
    e.g., “computer”, “computers”, “computing”, “compute”
  Usually effective, but not for all queries
  Benefits vary for different languages
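To make the two steps concrete, here is a deliberately naive sketch. The stop list is just the four words from the slide, and `naive_stem` is a toy suffix stripper invented for this example; it is not the Porter stemmer or any other published algorithm.

```python
STOP_WORDS = {"and", "or", "the", "in"}            # the slide's examples only
SUFFIXES = ("ers", "ing", "er", "es", "s", "e")    # checked in this order


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(tokens):
    """Apply stopping, then stemming, to a list of lowercase tokens."""
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


terms = index_terms(
    ["the", "computer", "and", "computers", "in", "computing", "or", "compute"]
)
```

All four content words collapse to the stem "comput", so a query using any one form can match documents using the others.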

Text Transformation

Link Analysis
  Makes use of links and anchor text in web pages
  Link analysis identifies popularity and community information
    e.g., PageRank
  Anchor text can significantly enhance the representation of pages pointed to by links
  Significant impact on web search
    Less importance in other applications

Text Transformation

Information Extraction
  Identify classes of index terms that are important for some applications
  e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.

Classifier
  Identifies class-related metadata for documents
    i.e., assigns labels to documents
    e.g., topics, reading levels, sentiment, genre
  Use depends on application

Index Creation

Document Statistics
  Gathers counts and positions of words and other features
  Used in ranking algorithm

Weighting
  Computes weights for index terms
  Used in ranking algorithm
  e.g., tf.idf weight
    Combination of term frequency in document and inverse document frequency in the collection

Index Creation

Inversion
  Core of indexing process
  Converts document-term information to term-document for indexing
    Difficult for very large numbers of documents
  Format of inverted file is designed for fast query processing
    Must also handle updates
    Compression used for efficiency
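Inversion can be illustrated with an in-memory dictionary. This sketch ignores the hard parts the slide mentions (scale, updates, compression); the document IDs and tokens are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Turn {doc_id: [tokens]} into {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: ["tropical", "fish", "tank"],
    2: ["fish", "food", "fish"],
}
index = build_inverted_index(docs)
```

Looking up `index["fish"]` now returns the documents and word positions directly, without scanning every document.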

Index Creation

Index Distribution
  Distributes indexes across multiple computers and/or multiple sites
  Essential for fast query processing with large numbers of documents
  Many variations
    Document distribution, term distribution, replication
  P2P and distributed IR involve search across multiple sites

User Interaction

Query input
  Provides interface and parser for query language
  Most web queries are very simple, other applications may use forms
  Query language used to describe more complex queries and results of query transformation
    e.g., Boolean queries, Indri and Galago query languages
    similar to SQL language used in database applications
    IR query languages also allow content and structure specifications, but focus on content

User Interaction

Query transformation
  Improves initial query, both before and after initial search
  Includes text transformation techniques used for documents
  Spell checking and query suggestion provide alternatives to original query
  Query expansion and relevance feedback modify the original query with additional terms

User Interaction

Results output
  Constructs the display of ranked documents for a query
  Generates snippets to show how queries match documents
  Highlights important words and passages
  Retrieves appropriate advertising in many applications
  May provide clustering and other visualization tools

Ranking

Scoring
  Calculates scores for documents using a ranking algorithm
  Core component of search engine
  Basic form of score is ∑ q_i d_i
    q_i and d_i are query and document term weights for term i
  Many variations of ranking algorithms and retrieval models
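The basic score ∑ q_i d_i can be computed by walking each query term's postings and adding its contribution to a per-document accumulator. The weights and document IDs below are made up for illustration, and processing one term at a time is only one of several possible strategies.

```python
from collections import defaultdict


def score_documents(query_weights, postings):
    """Accumulate sum_i q_i * d_i per document, one query term at a time."""
    accumulators = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in postings.get(term, {}).items():
            accumulators[doc_id] += q_weight * d_weight
    # Highest score first.
    return sorted(accumulators.items(), key=lambda item: item[1], reverse=True)


# Hypothetical document term weights, stored per term as in an inverted index.
postings = {
    "bank": {"d1": 0.5, "d2": 0.2},
    "scandal": {"d1": 0.1, "d3": 0.7},
}
ranking = score_documents({"bank": 1.0, "scandal": 2.0}, postings)
```

Documents that contain none of the query terms never receive an accumulator, so they are never scored at all.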

Ranking

Performance optimization
  Designing ranking algorithms for efficient processing
    Term-at-a-time vs. document-at-a-time processing
    Safe vs. unsafe optimizations

Distribution
  Processing queries in a distributed environment
  Query broker distributes queries and assembles results
  Caching is a form of distributed searching

Evaluation

Logging
  Logging user queries and interaction is crucial for improving search effectiveness and efficiency
  Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other components

Ranking analysis
  Measuring and tuning ranking effectiveness

Performance analysis
  Measuring and tuning system efficiency

How Does It Really Work?

The course* explains these components of a search engine in more detail

Often many possible approaches and techniques for a given component
  Focus is on the most important alternatives
    i.e., explain a small number of approaches in detail rather than many approaches
    “Importance” based on research results and use in actual search engines
    Alternatives described in references

* http://www.search-engines-book.com/slides/

RETRIEVAL MODELS

Information Retrieval in Practice

Retrieval Models

Provide a mathematical framework for defining the search process
  includes explanation of assumptions
  basis of many ranking algorithms
  can be implicit

Progress in retrieval models has corresponded with improvements in effectiveness

Theories about relevance

Relevance

Complex concept that has been studied for some time
  Many factors to consider
  People often disagree when making relevance judgments

Retrieval models make various assumptions about relevance to simplify problem
  e.g., topical vs. user relevance
  e.g., binary vs. multi-valued relevance

Retrieval Model Overview

Older models
  Boolean retrieval
  Vector Space model

Probabilistic Models
  BM25
  Language models

Combining evidence
  Inference networks
  Learning to Rank

Boolean Retrieval

Two possible outcomes for query processing
  TRUE and FALSE
  “exact-match” retrieval
  simplest form of ranking

Query usually specified using Boolean operators
  AND, OR, NOT
  proximity operators also used

Boolean Retrieval

Advantages
  Results are predictable, relatively easy to explain
  Many different features can be incorporated
  Efficient processing since many documents can be eliminated from search

Disadvantages
  Effectiveness depends entirely on user
  Simple queries usually don’t work well
  Complex queries are difficult

Searching by Numbers

Sequence of queries driven by number of retrieved documents
  e.g., “lincoln” search of news articles
  president AND lincoln
  president AND lincoln AND NOT (automobile OR car)
  president AND lincoln AND biography AND life AND birthplace AND gettysburg AND NOT (automobile OR car)
  president AND lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)
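The query sequence above maps directly onto set operations over postings: AND is intersection, OR is union, and AND NOT is set difference. The document IDs in this toy index are invented to show how the result size swings as the query is refined.

```python
# Hypothetical postings: term -> set of IDs of documents containing it.
index = {
    "president": {1, 2, 3, 4, 6},
    "lincoln": {1, 2, 3, 5, 6},
    "automobile": {3},
    "car": {3, 5},
    "biography": {1},
    "gettysburg": {2},
}


def docs_with(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())


cars = docs_with("automobile") | docs_with("car")
topics = ["biography", "life", "birthplace", "gettysburg"]

q1 = docs_with("president") & docs_with("lincoln")
q2 = q1 - cars
q3 = q2.intersection(*(docs_with(t) for t in topics))  # every topic term required
q4 = q2 & set().union(*(docs_with(t) for t in topics))  # any topic term suffices
```

With this data the four queries return 4, 3, 0, and 2 documents: requiring every topic term eliminates everything, while OR-ing them gives a usable result.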

Vector Space Model

Documents and query represented by a vector of term weights
Collection represented by a matrix of term weights


Vector Space Model

3-d pictures useful, but can be misleading for high-dimensional space

Vector Space Model

Documents ranked by distance between points representing query and documents
  Similarity measure more common than a distance or dissimilarity measure
  e.g., Cosine correlation

Similarity Calculation

Consider two documents D1, D2 and a query Q
• D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
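The similarity of each document to the query can be worked out with the cosine measure. The sketch below assumes the standard cosine definition (dot product divided by the product of the vector lengths); the function name is illustrative.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q = (1.5, 1.0, 0.0)

print(round(cosine(D1, Q), 2))  # 0.87
print(round(cosine(D2, Q), 2))  # 0.97
```

D2 is ranked above D1 because its weights point in a direction closer to the query vector.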

Term Weights

tf.idf weight
• Term frequency weight measures importance in document
• Inverse document frequency measures importance in collection
• Some heuristic modifications
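A minimal sketch of one common tf.idf formulation, assuming tf is the term count divided by the document length and idf is log(N / n_k), where n_k is the number of documents containing term k. This particular form and the toy corpus are assumptions, not taken from the slide.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
docs = [
    ["president", "lincoln", "gettysburg", "lincoln"],
    ["president", "car", "automobile"],
    ["president", "biography"],
]

N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(doc)
    total = len(doc)
    return {t: (c / total) * math.log(N / df[t]) for t, c in counts.items()}

weights = tf_idf(docs[0])
for term, w in sorted(weights.items()):
    print(f"{term}: {w:.3f}")
```

A term that occurs in every document ("president" here) gets weight zero, while a frequent but rare term ("lincoln") gets the largest weight.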

Relevance Feedback

Rocchio algorithm

Optimal query
• Maximizes the difference between the average vector representing the relevant documents and the average vector representing the non-relevant documents

Modifies the query using parameters α, β, and γ
• Typical values 8, 16, 4
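A minimal sketch of the Rocchio update, assuming the standard form q' = α·q + β·mean(relevant) − γ·mean(non-relevant). That form, the assignment of 8, 16, 4 to α, β, γ in that order, and the toy vectors are assumptions.

```python
def mean_vector(vectors, dim):
    """Component-wise average; zero vector if the set is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]

def rocchio(query, relevant, non_relevant, alpha=8.0, beta=16.0, gamma=4.0):
    """Return the modified query vector."""
    dim = len(query)
    rel = mean_vector(relevant, dim)
    non = mean_vector(non_relevant, dim)
    return [alpha * query[j] + beta * rel[j] - gamma * non[j] for j in range(dim)]

# Hypothetical 3-term vocabulary and judged documents.
q = [1.0, 0.0, 0.0]
rel_docs = [[0.5, 0.5, 0.0], [0.5, 1.0, 0.0]]
non_docs = [[0.0, 0.0, 1.0]]

new_q = rocchio(q, rel_docs, non_docs)
print(new_q)  # [16.0, 12.0, -4.0]
```

The second term gains weight because it is common in the relevant documents, and the third term is pushed negative because it only appears in the non-relevant one.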

Vector Space Model

Advantages
• Simple computational framework for ranking
• Any similarity measure or term weighting scheme could be used

Disadvantages
• Assumption of term independence
• No predictions about techniques for effective ranking

Probability Ranking Principle

Robertson (1977)

“If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”

IR as Classification

Bayes Classifier

Bayes Decision Rule
• A document D is relevant if P(R|D) > P(NR|D)

Estimating probabilities
• use Bayes’ Rule: P(R|D) = P(D|R) P(R) / P(D)
• classify a document as relevant if P(D|R) / P(D|NR) > P(NR) / P(R)
• the left-hand side is the likelihood ratio

Estimating P(D|R)

Assume independence

Binary independence model
• document represented by a vector of binary features indicating term occurrence (or non-occurrence)
• pi is probability that term i occurs (i.e., has value 1) in relevant document, si is probability of occurrence in non-relevant document

Binary Independence Model

Binary Independence Model

Scoring function is

Query provides information about relevant documents

If we assume pi constant, si approximated by entire collection, get idf-like weight
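A short sketch of how the idf-like weight arises, assuming the usual binary independence term weight log[pi(1 − si) / (si(1 − pi))] with pi held at 0.5 and si approximated by ni / N (ni documents contain term i, N documents in total). The counts are only illustrative.

```python
import math

def bim_weight(p, s):
    """Binary independence model term weight: log of the odds ratio."""
    return math.log((p * (1 - s)) / (s * (1 - p)))

def idf_like(n_i, N, p=0.5):
    """Weight when p is held constant and s is estimated as n_i / N."""
    return bim_weight(p, n_i / N)

N = 500_000
for term, n_i in [("president", 40_000), ("lincoln", 300)]:
    w = idf_like(n_i, N)
    # With p = 0.5 the weight reduces to log((N - n_i) / n_i).
    print(term, round(w, 2), round(math.log((N - n_i) / n_i), 2))
```

With p = 0.5 the p terms cancel, leaving a weight that grows as the term becomes rarer in the collection, which is why it behaves like idf.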

Contingency Table

Gives scoring function:

BM25

Popular and effective ranking algorithm based on binary independence model
• adds document and query term weights
• k1, k2 and K are parameters whose values are set empirically
• dl is doc length
• Typical TREC value for k1 is 1.2, k2 varies from 0 to 1000, b = 0.75

BM25 Example

• Query with two terms, “president lincoln”, (qf = 1)
• No relevance information (r and R are zero)
• N = 500,000 documents
• “president” occurs in 40,000 documents (n1 = 40,000)
• “lincoln” occurs in 300 documents (n2 = 300)
• “president” occurs 15 times in doc (f1 = 15)
• “lincoln” occurs 25 times (f2 = 25)
• document length is 90% of the average length (dl/avdl = .9)
• k1 = 1.2, b = 0.75, and k2 = 100
• K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
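These example values can be plugged into the commonly cited BM25 scoring function. The sketch assumes the form with r/R relevance counts and +0.5 corrections, natural logarithms, and K = k1·((1 − b) + b·dl/avdl); the function and argument names are illustrative.

```python
import math

def bm25_term(n, f, qf, N, K, k1=1.2, k2=100.0, r=0, R=0):
    """Score contribution of one query term."""
    idf_part = math.log(
        ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )
    tf_part = ((k1 + 1) * f) / (K + f)
    qtf_part = ((k2 + 1) * qf) / (k2 + qf)
    return idf_part * tf_part * qtf_part

N = 500_000
k1, b = 1.2, 0.75
K = k1 * ((1 - b) + b * 0.9)  # dl/avdl = 0.9, so K is about 1.11

president = bm25_term(n=40_000, f=15, qf=1, N=N, K=K)
lincoln = bm25_term(n=300, f=25, qf=1, N=N, K=K)
score = president + lincoln
print(round(president, 2), round(lincoln, 2), round(score, 2))
```

The rarer term “lincoln” contributes roughly three times as much to the score as “president”, even though their within-document counts are similar.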

BM25 Example

BM25 Example

Effect of term frequencies

Language Model

Unigram language model
• probability distribution over the words in a language
• generation of text consists of pulling words out of a “bucket” according to the probability distribution and replacing them

N-gram language model
• some applications use bigram and trigram language models where probabilities depend on previous words

Language Model

A topic in a document or query can be represented as a language model
• i.e., words that tend to occur often when discussing a topic will have high probabilities in the corresponding language model

Multinomial distribution over words
• text is modeled as a finite sequence of words, where there are t possible words at each point in the sequence
• commonly used, but not only possibility
• doesn’t model burstiness

LMs for Retrieval

3 possibilities:
• probability of generating the query text from a document language model
• probability of generating the document text from a query language model
• comparing the language models representing the query and document topics

Models of topical relevance

Query-Likelihood Model

Rank documents by the probability that the query could be generated by the document model (i.e. same topic)

Given query, start with P(D|Q)
• Using Bayes’ Rule
• Assuming prior is uniform, unigram model

Estimating Probabilities

Obvious estimate for unigram probabilities is

Maximum likelihood estimate
• makes the observed value of fqi,D most likely

If query words are missing from document, score will be zero
• Missing 1 out of 4 query words same as missing 3 out of 4
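A minimal sketch of unsmoothed query likelihood, assuming the maximum likelihood estimate P(qi|D) = fqi,D / |D| and a product over query terms. The toy document and queries are hypothetical.

```python
from collections import Counter

def query_likelihood_mle(query, doc):
    """Product of unsmoothed unigram probabilities of the query terms."""
    counts = Counter(doc)
    score = 1.0
    for term in query:
        score *= counts[term] / len(doc)
    return score

doc = ["president", "lincoln", "gave", "the", "gettysburg", "address"]
full = query_likelihood_mle(["president", "lincoln"], doc)
miss_one = query_likelihood_mle(["president", "lincoln", "gettysburg", "car"], doc)
miss_three = query_likelihood_mle(["president", "car", "truck", "bus"], doc)
print(full, miss_one, miss_three)
```

A single missing query word zeroes the whole product, so the document that matches three of four words scores the same as the one that matches only one.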

Smoothing

Document texts are a sample from the language model
• Missing words should not have zero probability of occurring

Smoothing is a technique for estimating probabilities for missing (or unseen) words
• lower (or discount) the probability estimates for words that are seen in the document text
• assign that “left-over” probability to the estimates for the words that are not seen in the text

Estimating Probabilities

Estimate for unseen words is αD P(qi|C)
• P(qi|C) is the probability for query word i in the collection language model for collection C (background probability)
• αD is a parameter

Estimate for words that occur is (1 − αD) P(qi|D) + αD P(qi|C)

Different forms of estimation come from different αD

Jelinek-Mercer Smoothing

αD is a constant, λ

Gives estimate of (1 − λ) P(qi|D) + λ P(qi|C)

Ranking score is the product of these estimates over the query terms

Use logs for convenience
• accuracy problems multiplying small numbers
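A sketch of a Jelinek-Mercer ranking score in log space, assuming the estimate (1 − λ)·fqi,D/|D| + λ·cqi/|C|, where cqi is the collection count of the term and |C| the total collection length. The value λ = 0.1 and the toy counts are arbitrary illustrative choices.

```python
import math
from collections import Counter

def jm_score(query, doc, coll_counts, coll_len, lam=0.1):
    """Log query likelihood with Jelinek-Mercer smoothing.

    Assumes every query term occurs at least once in the collection,
    otherwise the logarithm of zero would be taken.
    """
    counts = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = counts[term] / len(doc)
        p_coll = coll_counts[term] / coll_len
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score

# Hypothetical collection statistics.
coll_counts = Counter({"president": 160, "lincoln": 24, "car": 50})
coll_len = 10_000

d1 = ["president", "lincoln", "lincoln", "speech"]
d2 = ["president", "president", "car", "speech"]
query = ["president", "lincoln"]
s1 = jm_score(query, d1, coll_counts, coll_len)
s2 = jm_score(query, d2, coll_counts, coll_len)
print(s1, s2)
```

The document missing “lincoln” still gets a finite score, only a much lower one, because the collection probability fills in for the unseen word.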

Where is tf.idf Weight?

- proportional to the term frequency, inversely proportional to the collection frequency

Dirichlet Smoothing

αD depends on document length

Gives probability estimation of

and document score

Query Likelihood Example

For the term “president” fqi,D = 15, cqi = 160,000

For the term “lincoln” fqi,D = 25, cqi = 2,400

number of word occurrences in the document |d| is assumed to be 1,800

number of word occurrences in the collection is 10^9

500,000 documents times an average of 2,000 words

μ = 2,000
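These example values can be checked with a short sketch, assuming the usual Dirichlet-smoothed estimate (fqi,D + μ·cqi/|C|) / (|D| + μ) and natural logarithms. Function and variable names are illustrative.

```python
import math

def dirichlet_log_prob(f, c, doc_len, coll_len, mu):
    """Log of the Dirichlet-smoothed probability of one query term."""
    return math.log((f + mu * c / coll_len) / (doc_len + mu))

doc_len = 1_800
coll_len = 10**9   # 500,000 documents times 2,000 words on average
mu = 2_000

president = dirichlet_log_prob(15, 160_000, doc_len, coll_len, mu)
lincoln = dirichlet_log_prob(25, 2_400, doc_len, coll_len, mu)
score = president + lincoln
print(round(president, 2), round(lincoln, 2), round(score, 2))
```

Each term contributes the log of a probability below one, so the document score is a sum of negative numbers.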

Query Likelihood Example

• Negative number because summing logs of small numbers

Query Likelihood Example

Relevance Models

Relevance model – language model representing information need
• query and relevant documents are samples from this model

P(D|R) - probability of generating the text in a document given a relevance model
• document likelihood model
• less effective than query likelihood due to difficulties comparing across documents of different lengths

Pseudo-Relevance Feedback

Estimate relevance model from query and top-ranked documents

Rank documents by similarity of document model to relevance model

Kullback-Leibler divergence (KL-divergence) is a well-known measure of the difference between two probability distributions

KL-Divergence

Given the true probability distribution P and another distribution Q that is an approximation to P,

Use negative KL-divergence for ranking, and assume relevance model R is the true distribution (not symmetric),
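A minimal sketch of KL-divergence between two discrete distributions, using the standard definition KL(P || Q) = Σ P(x) log(P(x)/Q(x)). The two toy distributions are hypothetical, and Q is assumed to be nonzero wherever P is.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for distributions given as {word: probability} dicts."""
    return sum(px * math.log(px / q[w]) for w, px in p.items() if px > 0)

relevance_model = {"president": 0.5, "lincoln": 0.4, "car": 0.1}
doc_model = {"president": 0.4, "lincoln": 0.4, "car": 0.2}

forward = kl_divergence(relevance_model, doc_model)
backward = kl_divergence(doc_model, relevance_model)
print(forward, backward)  # the two directions differ: KL is not symmetric
print(-forward)           # negative KL, usable as a ranking score
```

Taking the relevance model as the first argument corresponds to treating it as the true distribution; a document model closer to it gets a negative KL score closer to zero and therefore ranks higher.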

KL-Divergence

Given a simple maximum likelihood estimate for P(w|R), based on the frequency in the query text, ranking score is

rank-equivalent to query likelihood score

Query likelihood model is a special case of retrieval based on relevance model

Estimating the Relevance Model

Probability of pulling a word w out of the “bucket” representing the relevance model depends on the n query words we have just pulled out

By definition

Estimating the Relevance Model

Joint probability is

Assume

Gives

Estimating the Relevance Model

P(D) usually assumed to be uniform

P(w, q1 . . . qn) is simply a weighted average of the language model probabilities for w in a set of documents, where the weights are the query likelihood scores for those documents

Formal model for pseudo-relevance feedback
• query expansion technique

Pseudo-Feedback Algorithm

Example from Top 10 Docs

Example from Top 50 Docs

Combining Evidence

Effective retrieval requires the combination of many pieces of evidence about a document’s potential relevance
• have focused so far on simple word-based evidence
• many other types of evidence: structure, PageRank, metadata, even scores from different models

Inference network model is one approach to combining evidence
• uses Bayesian network formalism

Inference Network

Inference Network

Document node (D) corresponds to the event that a document is observed

Representation nodes (ri) are document features (evidence)
• Probabilities associated with those features are based on language models θ estimated using the parameters μ
• one language model for each significant document structure
• ri nodes can represent proximity features, or other types of evidence (e.g. date)

Inference Network

Query nodes (qi) are used to combine evidence from representation nodes and other query nodes
• represent the occurrence of more complex evidence and document features
• a number of combination operators are available

Information need node (I) is a special query node that combines all of the evidence from the other query nodes
• network computes P(I|D, μ)

Example: AND Combination

a and b are parent nodes for q

Example: AND Combination

Combination must consider all possible states of parents

Some combinations can be computed efficiently
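A sketch of the AND combination for a query node q with parents a and b, assuming the parents are independent and that q is true only when both parents are true (the usual conditional probability table for AND). The first function enumerates all parent states; the second is the closed form, the product of the parent beliefs. The probabilities are hypothetical.

```python
from itertools import product

def bel_and_enumerate(parent_probs):
    """Sum over all true/false parent states, weighted by P(q=true | state)."""
    total = 0.0
    for state in product([True, False], repeat=len(parent_probs)):
        p_state = 1.0
        for is_true, p in zip(state, parent_probs):
            p_state *= p if is_true else (1 - p)
        p_q_given_state = 1.0 if all(state) else 0.0  # AND table
        total += p_q_given_state * p_state
    return total

def bel_and_fast(parent_probs):
    """Closed form: product of the parent probabilities."""
    result = 1.0
    for p in parent_probs:
        result *= p
    return result

p_a, p_b = 0.8, 0.5  # hypothetical beliefs for parents a and b
print(bel_and_enumerate([p_a, p_b]), bel_and_fast([p_a, p_b]))
```

Enumeration costs 2^n states for n parents, whereas the product needs only n multiplications, which is why the closed form matters for efficiency.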

Inference Network Operators

Backup slides

Galago Query Language

A document is viewed as a sequence of text that may contain arbitrary tags

A single context is generated for each unique tag name

An extent is a sequence of text that appears within a single begin/end tag pair of the same type as the context

Galago Query Language

Galago Query Language

Galago Query Language

Galago Query Language

Galago Query Language

Galago Query Language

Galago Query Language

Galago Query Language

Web Search

Most important, but not only, search application

Major differences to TREC news
• Size of collection
• Connections between documents
• Range of document types
• Importance of spam
• Volume of queries
• Range of query types

Search Taxonomy

Informational
• Finding information about some topic which may be on one or more web pages
• Topical search

Navigational
• finding a particular web page that the user has either seen before or is assumed to exist

Transactional
• finding a site where a task such as shopping or downloading music can be performed

Web Search

For effective navigational and transactional search, need to combine features that reflect user relevance

Commercial web search engines combine evidence from hundreds of features to generate a ranking score for a web page
• page content, page metadata, anchor text, links (e.g., PageRank), and user behavior (click logs)
• page metadata – e.g., “age”, how often it is updated, the URL of the page, the domain name of its site, and the amount of text content

Search Engine Optimization

SEO: understanding the relative importance of features used in search and how they can be manipulated to obtain better search rankings for a web page
• e.g., improve the text used in the title tag, improve the text in heading tags, make sure that the domain name and URL contain important keywords, and try to improve the anchor text and link structure

Some of these techniques are regarded as not appropriate by search engine companies

Web Search

In TREC evaluations, most effective features for navigational search are:
• text in the title, body, and heading (h1, h2, h3, and h4) parts of the document, the anchor text of all links pointing to the document, the PageRank number, and the inlink count

Given size of Web, many pages will contain all query terms
• Ranking algorithm focuses on discriminating between these pages
• Word proximity is important

Term Proximity

Many models have been developed

• N-grams are commonly used in commercial web search

Dependence model based on inference net has been effective in TREC - e.g.

Example Web Query

Machine Learning and IR

Considerable interaction between these fields
• Rocchio algorithm (60s) is a simple learning approach
• 80s, 90s: learning ranking algorithms based on user feedback
• 2000s: text categorization

Limited by amount of training data

Web query logs have generated new wave of research
• e.g., “Learning to Rank”

Generative vs. Discriminative

All of the probabilistic retrieval models presented so far fall into the category of generative models
• A generative model assumes that documents were generated from some underlying model (in this case, usually a multinomial distribution) and uses training data to estimate the parameters of the model
• probability of belonging to a class (i.e. the relevant documents for a query) is then estimated using Bayes’ Rule and the document model

Generative vs. Discriminative

A discriminative model estimates the probability of belonging to a class directly from the observed features of the document based on the training data

Generative models perform well with low numbers of training examples

Discriminative models usually have the advantage given enough training data
• Can also easily incorporate many features

Discriminative Models for IR

Discriminative models can be trained using explicit relevance judgments or click data in query logs
• Click data is much cheaper, more noisy
• e.g. Ranking Support Vector Machine (SVM) takes as input partial rank information for queries
  - partial information about which documents should be ranked higher than others

Ranking SVM

Training data is

r is partial rank information
• if document da should be ranked higher than db, then (da, db) ∈ ri
• partial rank information comes from relevance judgments (allows multiple levels of relevance) or click data
  - e.g., d1, d2 and d3 are the documents in the first, second and third rank of the search output, only d3 clicked on → (d3, d1) and (d3, d2) will be in desired ranking for this query
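A sketch of one way to turn click data into partial rank information, assuming the rule implied by the example: a clicked document is preferred over every unclicked document ranked above it. The function name and document IDs are illustrative.

```python
def preference_pairs(ranking, clicked):
    """Return (preferred, less_preferred) pairs from one ranked result list."""
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            # Unclicked documents shown above a clicked one are less preferred.
            for higher in ranking[:pos]:
                if higher not in clicked:
                    pairs.append((doc, higher))
    return pairs

pairs = preference_pairs(["d1", "d2", "d3"], clicked={"d3"})
print(pairs)  # [('d3', 'd1'), ('d3', 'd2')]
```

A click on the top-ranked result yields no pairs under this rule, since nothing was skipped to reach it.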

Ranking SVM

Learning a linear ranking function
• w is a weight vector that is adjusted by learning
• da is the vector representation of the features of the document
• non-linear functions also possible

Weights represent importance of features
• learned using training data

Ranking SVM

Learn w that satisfies as many of the following conditions as possible:

Can be formulated as an optimization problem

Ranking SVM

ξ, known as a slack variable, allows for misclassification of difficult or noisy training examples, and C is a parameter that is used to prevent overfitting

Ranking SVM

Software available to do optimization

Each pair of documents in our training data can be represented by the vector:

Score for this pair is:

SVM classifier will find a w that makes the smallest score as large as possible
• make the differences in scores as large as possible for the pairs of documents that are hardest to rank
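A sketch of scoring document pairs with a linear ranking function, assuming the pair vector is the feature difference (di − dj) and the pair score is w·(di − dj). The weight vector and feature values are hypothetical, and no actual SVM training is performed here.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def pair_score(w, d_i, d_j):
    """Score of the pair (d_i, d_j): positive if d_i is ranked above d_j."""
    diff = [a - b for a, b in zip(d_i, d_j)]
    return dot(w, diff)

# Hypothetical weights and 3-feature document vectors.
w = [2.0, 1.0, 2.0]
docs = {"d1": [2.0, 4.0, 1.0], "d2": [1.0, 1.0, 1.0], "d3": [0.0, 2.0, 0.0]}
desired = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]

scores = {pair: pair_score(w, docs[pair[0]], docs[pair[1]]) for pair in desired}
margin = min(scores.values())  # smallest pair score, which training tries to enlarge
print(scores, margin)
```

All three desired pairs get a positive score with this w, and the smallest of them (the hardest pair to separate) is the quantity the optimization pushes upward.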

Topic Models

Improved representations of documents
• can also be viewed as improved smoothing techniques
• improve estimates for words that are related to the topic(s) of the document instead of just using background probabilities

Approaches
• Latent Semantic Indexing (LSI)
• Probabilistic Latent Semantic Indexing (pLSI)
• Latent Dirichlet Allocation (LDA)

LDA

Model document as being generated from a mixture of topics

LDA

Gives language model probabilities

Used to smooth the document representation by mixing them with the query likelihood probability as follows:

LDA

If the LDA probabilities are used directly as the document representation, the effectiveness will be significantly reduced because the features are too smoothed
• e.g., in typical TREC experiment, only 400 topics used for the entire collection
• generating LDA topics is expensive

When used for smoothing, effectiveness is improved

LDA Example

Top words from 4 LDA topics from TREC news

Summary

Best retrieval model depends on application and data available

Evaluation corpus (or test collection), training data, and user data are all critical resources

Open source search engines can be used to find effective ranking algorithms
• Galago query language makes this particularly easy

Language resources (e.g., thesaurus) can make a big difference