Comp-8380: Information Retrieval

Jianguo Lu

January 10, 2021

Outline

1 what is IR

2 course schedule

3 grading scheme

IR not so long ago

now IR is mostly about search engines

there are many search engines ...

IR is more than web search

These days we frequently think first of web search, but there are many other cases:
  digital library search
  e-mail search, searching your desktop and laptop computers
  corporate knowledge bases, local business search, expert search
  legal information retrieval, patent search
  news search
  image and video search
  (micro-)blog search
  product search, federated search
  social search, community Q&A, question-answering
  recommender systems
  opinion mining

definition of information retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

From the IIR book: Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press.
Book website: https://nlp.stanford.edu/IR-book/

Structured vs. unstructured data

[Figure: structured vs. unstructured data, in the 90’s and today]

Information retrieval is finding material of an unstructured nature that satisfies an information need from within large collections.

other definitions

Jaime Arguello:
Information retrieval (IR) is the science and practice of designing, developing, and evaluating systems that match information seekers with the information they seek.

Gerard Salton, 1968:
Information retrieval is a field concerned with the structure, analysis, organization, storage, and retrieval of information.

The search task

Given a query and a corpus, find relevant items
  query: user’s expression of their information need
  corpus: a repository of retrievable items
  relevance: satisfaction of the user’s information need

Corpus: definition from Webster
  a: all the writings or works of a particular kind or on a particular subject; especially: the complete works of an author
  b: a collection or body of knowledge or evidence; especially: a collection of recorded utterances used as a basis for the descriptive analysis of a language

Why is IR fascinating?

Information retrieval is an uncertain process

Query
  users don’t know what they want
  users don’t know how to convey what they want
  computers can’t elicit information like a librarian
  computers can’t understand natural language text

Relevance
  the search engine can only guess what is relevant
  the search engine can only guess if a user is satisfied
  over time, we can only guess how users adjust their short- and long-term behavior

classic search model

A query is an impoverished description of the user’s information need
  Highly ambiguous to anyone other than the user

Retrieval Model
  A formal method that predicts the degree of relevance of a document to a query

taxonomy of IR models

Document property
  text
  links
  multimedia

IR models
  Boolean
  vector
  probabilistic

Set theoretic
  fuzzy
  extended Boolean
  set-based

Algebraic
  generalized vector
  LSI
  NN

Probabilistic
  BM25
  language models
  Bayesian networks

Semistructured text
  proximal nodes
  XML-based

Web
  PageRank
  hubs and authorities (HITS)

Multimedia
  image retrieval
  audio
  video

Boolean Retrieval Model

The user describes their information need using Boolean constraints (e.g., AND, OR, and AND NOT)
The burden is on the user to formulate a good Boolean query

Example

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One choice: use the grep command in Unix.
  grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia

Why is that not the answer?
  Slow (for large corpora)
  NOT Calpurnia is non-trivial
  Other operations (e.g., find the word Romans near countrymen) not feasible
  Ranked retrieval (best documents to return) is not supported

so we need to index the text
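Once each term is mapped to the set of plays that contain it, the query reduces to set operations. The sketch below assumes a small hand-written term table (illustrative data, not a real index of the plays), and the helper name `boolean_and_not` is hypothetical.

```python
# Toy term -> set-of-plays table (illustrative data, not a real index).
plays_containing = {
    "brutus": {"Antony and Cleopatra", "Julius Caesar", "Hamlet"},
    "caesar": {"Antony and Cleopatra", "Julius Caesar", "Hamlet", "Othello", "Macbeth"},
    "calpurnia": {"Julius Caesar"},
}


def boolean_and_not(index, must_a, must_b, must_not):
    """Return documents containing must_a AND must_b but NOT must_not."""
    both = index.get(must_a, set()) & index.get(must_b, set())
    return both - index.get(must_not, set())


result = boolean_and_not(plays_containing, "brutus", "caesar", "calpurnia")
print(sorted(result))
```

Unlike grep, the lookup cost here depends on the number of matching documents rather than on the total amount of text.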

what is an index

index construction process

Initial stages of text processing

Tokenization
  Cut character sequence into word tokens

Normalization
  Map text and query term to same form
    You want U.S.A. and USA to match

Stemming
  We may wish different forms of a root to match
    authorize, authorization

Stop words
  We may want to omit very common words (modern methods may not)
    the, a, to, of
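The four stages above can be sketched as a single function. This is a minimal illustration under stated assumptions: `crude_stem` is a hypothetical toy suffix stripper (not the Porter stemmer), and normalization is reduced to lower-casing and dropping periods.

```python
import re

STOP_WORDS = {"the", "a", "to", "of"}  # the stop words listed on the slide


def crude_stem(token):
    """Hypothetical toy stemmer: strips a few suffixes; not the Porter algorithm."""
    for suffix in ("ization", "ize", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    # Normalization: lower-case and drop periods so "U.S.A." matches "USA".
    text = text.lower().replace(".", "")
    # Tokenization: cut the character sequence into alphanumeric word tokens.
    tokens = re.findall(r"[a-z0-9]+", text)
    # Stop-word removal, then stemming of the remaining tokens.
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The U.S.A. authorization"))
print(preprocess("USA to authorize"))
```

The same function has to be applied to both documents and queries, otherwise the normalized index terms and the query terms will not match.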

postings

Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc. frequency information is added.
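A minimal sketch of these three steps, assuming documents arrive already tokenized. The two toy documents and the function name `build_inverted_index` are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: doc_id -> list of tokens. Returns (dictionary, postings)."""
    doc_sets = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            # A set merges multiple entries of a term in the same document.
            doc_sets[term].add(doc_id)
    # Postings: term -> sorted list of doc ids.
    postings = {term: sorted(ids) for term, ids in doc_sets.items()}
    # Dictionary: term -> document frequency (length of its postings list).
    dictionary = {term: len(ids) for term, ids in postings.items()}
    return dictionary, postings


# Hypothetical toy collection.
docs = {
    1: ["i", "did", "enact", "julius", "caesar", "i", "was", "killed"],
    2: ["so", "let", "it", "be", "with", "caesar", "the", "noble", "brutus"],
}
dictionary, postings = build_inverted_index(docs)
print(dictionary["caesar"], postings["caesar"])
```

Keeping postings sorted by document id is what makes the linear-time merge in query processing possible.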

query processing

Consider processing the query: Brutus AND Caesar
  Locate Brutus in the Dictionary; retrieve its postings.
  Locate Caesar in the Dictionary; retrieve its postings.
  “Merge” the two postings (intersect the document sets):

brutus  1 2 4 11 31 45 173 174
caesar  1 2 4 5 6 16 57 132
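The merge can be sketched as a two-pointer walk over the two sorted postings lists, using the doc ids shown above. The function name `intersect` is an assumption for this example.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by doc id."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # doc contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller doc id
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # [1, 2, 4]
```

Each pointer only moves forward, so the merge takes time linear in the combined length of the two lists.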

tentative schedule

boolean model
text transformation
build a search engine using Lucene
vector space model
representation learning
evaluation methods in information retrieval
link analysis and PageRank
document classification
document clustering
web crawling; data cleaning (e.g., near-duplicate detection)

Text Book

[IIR] Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008.

Other reference books

[SE] Search Engines: Information Retrieval in Practice, by Bruce Croft, Donald Metzler, and Trevor Strohman.

[MIR] Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto. 2nd edition, 2010.

[MMD] Mining of Massive Datasets, by Anand Rajaraman and Jeff Ullman, 2013.

IIR 02: The term vocabulary and postings lists

Phrase queries: “Stanford University”
Proximity queries: Gates near Microsoft
We need an index that captures position information for phrase queries and proximity queries.
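A positional index stores, for each term and document, the positions at which the term occurs. The sketch below is one simple way to answer both query types with it. The toy documents and the names `build_positional_index` and `within_k` are assumptions for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """docs: doc_id -> list of tokens. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def within_k(index, first, second, k=1, ordered=True):
    """Docs where `second` occurs within k positions of `first`.

    k=1 with ordered=True is a two-word phrase query;
    ordered=False gives a symmetric proximity ("near") query.
    """
    hits = []
    common = index.get(first, {}).keys() & index.get(second, {}).keys()
    for doc_id in sorted(common):
        seconds = index[second][doc_id]
        for p in index[first][doc_id]:
            if any((0 < q - p <= k) if ordered else (0 < abs(q - p) <= k)
                   for q in seconds):
                hits.append(doc_id)
                break
    return hits


# Hypothetical toy collection.
docs = {
    1: ["stanford", "university", "is", "in", "california"],
    2: ["the", "university", "near", "stanford"],
    3: ["gates", "founded", "microsoft"],
}
idx = build_positional_index(docs)
print(within_k(idx, "stanford", "university"))                  # phrase query
print(within_k(idx, "gates", "microsoft", k=3, ordered=False))  # proximity query
```

Document 2 contains both words of the phrase but in the wrong order, which a non-positional index cannot distinguish from document 1.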

IIR 04: Index construction

[Figure: distributed index construction. A master assigns splits to parsers (map phase); the parsers write segment files partitioned by term range (a-f, g-p, q-z); inverters (reduce phase) each collect one term range and write the postings.]

statistical properties of text

[Figure: log10 cf plotted against log10 rank]

Zipf’s law, Heaps’ law, power law.
The mechanism: Yule process, preferential attachment
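Zipf's law says that the collection frequency of the i-th most common term is roughly proportional to 1/i, so the log-log plot of frequency against rank is close to a straight line with slope -1. The sketch below checks this on synthetic tokens that are constructed to follow the law exactly; the function names are assumptions, and real text will only fit approximately.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return [(rank, collection frequency)] sorted by decreasing frequency."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(points):
    """Least-squares slope of log10(cf) against log10(rank)."""
    xs = [math.log10(r) for r, _ in points]
    ys = [math.log10(f) for _, f in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Synthetic corpus: the word of rank r occurs 120 // r times (exactly Zipfian).
tokens = [f"w{r}" for r in range(1, 7) for _ in range(120 // r)]
points = rank_frequency(tokens)
slope = loglog_slope(points)
print(points)
print(round(slope, 3))
```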

IIR 06: Scoring, term weighting and the vector space model

Ranking search results
  Boolean queries only give inclusion or exclusion of documents.
  For ranked retrieval, we measure the proximity between the query and each document.
  One formalism for doing this: the vector space model

Key challenge in ranked retrieval: evidence accumulation for a term in a document
  1 vs. 0 occurrences of a query term in the document
  3 vs. 2 occurrences of a query term in the document
  Usually: more is better
  But by how much?
  Need a scoring function that translates frequency into score or weight
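One common answer to "by how much?" is log-frequency weighting, where the weight grows with term frequency but with diminishing returns, combined with inverse document frequency. The sketch below is a simplified tf-idf overlap score under that assumption, not a full cosine-normalized vector space model; the toy documents and function names are made up for illustration.

```python
import math
from collections import Counter


def log_tf(tf):
    """Log-frequency weight: 0 for absent terms, 1 + log10(tf) otherwise."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def rank_documents(query_terms, docs):
    """docs: doc_id -> tokens. Returns doc ids ordered by tf-idf overlap score."""
    n = len(docs)
    counts = {d: Counter(tokens) for d, tokens in docs.items()}
    scores = {}
    for d, tf in counts.items():
        score = 0.0
        for t in query_terms:
            df = sum(1 for c in counts.values() if t in c)  # document frequency
            if df and tf[t]:
                score += log_tf(tf[t]) * math.log10(n / df)  # tf weight * idf
        scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy collection.
docs = {
    "d1": ["brutus", "brutus", "caesar"],
    "d2": ["caesar", "rome"],
    "d3": ["rome", "rome"],
}
print(log_tf(0), log_tf(1), round(log_tf(2), 3), round(log_tf(3), 3))
print(rank_documents(["brutus"], docs))
```

With this weighting, going from 0 to 1 occurrence adds a full point, while going from 2 to 3 occurrences adds much less.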

Language models

Assign a probability to a sequence of n words by means of a probability distribution.
How to compute this joint probability:

P(\text{its}, \text{water}, \text{is}, \text{so}, \text{transparent}, \text{that})    (1)
P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i) ?    (2)
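The question mark in (2) is warranted: the product of individual word probabilities is only an approximation. The exact factorization is the chain rule, and the product in (2) is what remains after assuming that each word is independent of the words before it (the unigram assumption):

```latex
P(w_1 w_2 \ldots w_n)
  = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})
  \approx \prod_{i=1}^{n} P(w_i)
```

Bigram and higher-order n-gram models sit in between, conditioning each word on a short window of preceding words instead of the full history.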

Text classification & Naive Bayes

Text classification = assigning documents automatically to predefined classes
Examples:
  CS vs. non-CS papers
  Papers in Software Engineering vs. Database
  positive/negative reviews
  Spam

Naive Bayes (multinomial and Bernoulli models), support vector machines, feature selection, representation learning, neural networks.
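A minimal sketch of the multinomial Naive Bayes model with add-one smoothing, applied to the spam example. The training data is a made-up toy set and the function names are assumptions; the sketch ignores query terms outside the training vocabulary.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns (log priors, log likelihoods, vocab)."""
    class_docs = Counter(label for _, label in labeled_docs)
    term_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in labeled_docs for t in tokens}
    log_prior = {c: math.log(n / len(labeled_docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        # Add-one (Laplace) smoothing avoids zero probabilities for unseen terms.
        log_like[c] = {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return log_prior, log_like, vocab


def classify(tokens, model):
    """Return the class with the highest log posterior score."""
    log_prior, log_like, vocab = model
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)


# Hypothetical toy training set.
training = [
    (["cheap", "pills", "cheap"], "spam"),
    (["win", "cheap", "prize"], "spam"),
    (["meeting", "schedule", "project"], "ham"),
    (["project", "report", "meeting"], "ham"),
]
model = train_multinomial_nb(training)
print(classify(["cheap", "prize"], model))
print(classify(["project", "meeting"], model))
```

Working in log space turns the product of many small probabilities into a sum, which avoids floating-point underflow on long documents.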

Neural network based representation learning

Answer analogical questions, e.g.,

Man : Woman = King : ?

The answer will be Queen.
An application of deep learning
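With word vectors, the analogy is usually solved by vector arithmetic: find the word closest to vec(King) - vec(Man) + vec(Woman). The sketch below uses hand-crafted two-dimensional toy vectors (roughly "royalty" and "gender" axes) purely for illustration; real embeddings are learned from text and have hundreds of dimensions.

```python
import math

# Hand-crafted toy vectors (assumption for illustration, not learned embeddings).
TOY_VECTORS = {
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "prince": (0.8, 0.9),
    "apple": (-1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c, vectors):
    """Solve a : b = c : ? via the word closest to vec(c) - vec(a) + vec(b)."""
    target = tuple(vc - va + vb
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    # The three query words are excluded from the candidates.
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("man", "woman", "king", TOY_VECTORS))
```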

clustering

Flat clustering
Hierarchical agglomerative clustering (HAC)
Single-link and complete-link clustering
Centroid and group-average agglomerative clustering (GAAC)
Bisecting K-means
How to label clusters automatically
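The difference between single-link and complete-link clustering can be sketched with a naive HAC loop in which the linkage is a parameter: `min` over member pairs gives single-link, `max` gives complete-link. The one-dimensional points and the function name `hac` are assumptions for illustration, and the cubic-time loop is only suitable for tiny inputs.

```python
def hac(points, num_clusters, linkage=min):
    """Naive agglomerative clustering of 1-D points.

    linkage=min is single-link (closest pair of members);
    linkage=max is complete-link (farthest pair of members).
    """
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters under the chosen linkage.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


points = [1.0, 1.5, 5.0, 5.2, 9.0]
print(hac(points, 2, linkage=min))  # single-link
print(hac(points, 2, linkage=max))  # complete-link
```

On this input the two linkages agree on three clusters but disagree on the final merge, which shows how single-link tends to chain nearby groups together.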

HAC

Latent Semantic Indexing

how to find semantically related documents?
matrix decomposition
SVD
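In outline, LSI applies the singular value decomposition to the term-document matrix and keeps only the k largest singular values. The symbols below are assumptions for this sketch: C is the term-document matrix, and U_k, \Sigma_k, V_k are the truncated factors.

```latex
C = U \Sigma V^{T}, \qquad C_k = U_k \Sigma_k V_k^{T}
```

Documents and queries are then compared in the k-dimensional space, where terms that co-occur in similar contexts end up close together even when they never appear in the same document.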

Crawling

Link analysis / PageRank

which web page is more important?
who are in a community?
PageRank algorithm
graph analysis and mining. Modularity maximization algorithms.
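A minimal power-iteration sketch of PageRank on a small link graph. The damping factor 0.85, the fixed iteration count, the uniform handling of dangling nodes, and the toy graph are all assumptions for this example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: node -> list of nodes it links to. Returns node -> score (sums to 1)."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


# Hypothetical toy web graph: three pages link to "c".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
print(max(rank, key=rank.get))
```

Page "c" ends up with the highest score because it collects rank from three in-links, and "a" benefits in turn from being the only page "c" links to.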

marking scheme

exam 50%
project 50%

project

text analysis
graph analysis
build a search engine
enhance the search engine by adding one or more features, such as:
  semantic search
  classification
  clustering (returned results (papers) are clustered into several areas)
  ranking (ranked by the PageRank algorithm)
  personalization
  recommendation (recommend most similar papers)
  ...

The project

The tentative plan for the project is:

10%: Phase one. Implement one of the project topics discussed in class.
  Workable in Jupyter Notebook.
  Have good explanation in Notebook Markdown.
  Presentations finish before reading week.
  Earlier presenters choose the topic they want.
  Later presenters need to implement and present different features.
  Example topics: text statistics, smoothing, Naive Bayes classification, word embedding, graph embedding.

25%: Phase two. Add one more topic and improve your first topic.
  Rank documents using the PageRank algorithm on citation data
  Return results by categories (by running clustering algorithms)
  Search for similar papers (e.g., running doc2vec)
  Finish by

15%: Phase three. Implement the search engine in Lucene, and possibly integrate the results from phases one and two (e.g., for similar documents, suggest search queries).

open source search engines

Lucene
  Java-based
  relatively simple IR techniques

Galago
  Java-based
  used by the book [SE] Search Engines: Information Retrieval in Practice, by Bruce Croft et al.
