193
Ghislain Fourny Information Retrieval Spring 2019 3. Term vocabulary

Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

Ghislain Fourny

Information Retrieval Spring 20193 Term vocabulary

2

What we have seen so far

3

Warm up

a

b

c

d

e

f

g

1 2 3 5 6 8

3 4 7 8 9

1 2 4 5 7

1 3 5 8 9

2 3 4 7

1 2 4 5 8 9

3 5 7 8

6

5

5

5

4

6

4

4

Boolean retrieval

InputSet of documents

5

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

query

6

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

OutputSubset of documents

query

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 2: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

2

What we have seen so far

3

Warm up

a

b

c

d

e

f

g

1 2 3 5 6 8

3 4 7 8 9

1 2 4 5 7

1 3 5 8 9

2 3 4 7

1 2 4 5 8 9

3 5 7 8

6

5

5

5

4

6

4

4

Boolean retrieval

InputSet of documents

5

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

query

6

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

OutputSubset of documents

query

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 3: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

3

Warm up

a

b

c

d

e

f

g

1 2 3 5 6 8

3 4 7 8 9

1 2 4 5 7

1 3 5 8 9

2 3 4 7

1 2 4 5 8 9

3 5 7 8

6

5

5

5

4

6

4

4

Boolean retrieval

InputSet of documents

5

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

query

6

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

OutputSubset of documents

query

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 4: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

4

Boolean retrieval

InputSet of documents

5

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

query

6

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

OutputSubset of documents

query

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 5: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

5

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

query

6

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

OutputSubset of documents

query

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 6: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

6

Boolean retrieval

lawyer ANDPenang AND NOT silver

InputSet of documents

OutputSubset of documents

query

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 7: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

7

Simple boolean language (EBNF)

PrimaryExpr = Term | ( Expr )

NotExpr = NOT PrimaryExpr

AndExpr = NotExpr (AND NotExpr)

Expr = NotExpr (OR AndExpr)

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 8: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

8

Model and abstraction

Document as a list of words(with duplicates)

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 9: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

9

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 10: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

10

Model and abstraction

Document as a list of words(with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0) Linearization

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 11: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

11

Term-document model

Term-documentbipartite graph

Documents as lists of words(with duplicates)

Simplification

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 12: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

12

Term-document model

Term-documentbipartite graph

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 13: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

13

Term-document model

Term-documentbipartite graph

Adjacency matrix

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 14: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

14

Term-document model

Term-documentbipartite graph

Adjacency matrix

Postings

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 15: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

15

Document

Term-documentbipartite graph

Adjacency matrix

Postings

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 16: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

16

Term

Term-documentbipartite graph

Adjacency matrix

Postings

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 17: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

17

Posting

Term-documentbipartite graph

Adjacency matrix

Postings

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 18: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

18

Index construction

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 19: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

19

Last time

Plenty of simplifying assumptions+

white magic

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 20: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

20

Index construction in reality

Collect documents

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 21: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

21

Index construction in reality

Collect documents

Tokenizing

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 22: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

22

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 23: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

23

Index construction in reality

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 24: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

24

Documents

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 25: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

25

Collecting documents

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 26: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

26

Collecting documents first challenge

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 27: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

27

Collecting documents first challenge

Possess it merely That it should come to this

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 28: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

28

Collecting documents sequence of characters

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 29: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

29

Character set

P o s s e s s i t m e r e l y T h a t i t s h o u

d c o m e t o t h i s

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 30: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

30

Collecting documents encoding

Possess it merely That it should come to this

P o s s e s s i t m e r e l y T h a t i t s h o u

0 1 0 1 0 0 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1

Here ASCII

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 31: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

31

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 32: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

32

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex)

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 33: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

33

ASCII Table

0 1 2 3 4 5 6 7 8 9 A B C D E F

0 Control characters

1 Control characters

2 SP $ amp ( ) + -

3 0 1 2 3 4 5 6 7 8 9 lt = gt

4 A B C D E F G H I J K L M N O

5 P Q R S T U V W X Y Z [ ] ^ _

6 ` a b c d e f g h i j k l m n o

7 p q r s t u v w x y z | ~ DEL

50 (hex) 0 1 0 1 0 0 0 0

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 34: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

34

UTF-8

Character Codepoint Codepoint in binary UTF-8 (variable length)

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 35: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

35

UTF-8

P U+0050 1010 000

Character Codepoint Codepoint in binary UTF-8 (variable length)

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 36: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

36

UTF-8

P U+0050 1010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 37: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

37

UTF-8

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 38: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

38

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 39: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

39

UTF-8

π U+03C0 11 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 40: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

40

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 41: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

41

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 42: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

42

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 10 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 43: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

43

UTF-8

π U+03C0 11001111 1000000011 1100 0000

P U+0050 010100001010 000

euro U+20AC 11100010 10000010 1010110010 0000 1010 1100

Less than 7 bits

Less than 11 bits

Less than 16 bits

Character Codepoint Codepoint in binary UTF-8 (variable length)

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 44: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

44

Collecting documents first challenge

ASCII

UTF-8

ISO-Latin-1

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 45: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

45

Collecting documents first challenge

User-defined

Annotated in document

Machine learning

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 46: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

46

Collecting documents first challenge

Software-specific encoding

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 47: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

47

Collecting documents first challenge

Data-specific encoding

ampamp

amplt

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 48: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

48

Collecting documents first challenge

Binary encoding

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 49: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

49

Collecting documents first challenge

Classification(eg ML)

Language

Encoding

Type

UTF-8

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 50: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

50

Collecting documents second challenge

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 51: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

51

Collecting documents second challenge

Fine-grained documents

(Example E-Mail archive)

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 52: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

52

Collecting documents second challenge

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 53: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

53

Collecting documents second challenge

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 54: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

54

Collecting documents second challenge

Grouping to a single document (Example LaTeX source)

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 55: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

55

Term Vocabulary

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 56: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

56

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 57: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

57

Building an inverted index

Document 1You come most carefully upon your hour

Document 2Take thy fair hour Laertes time be thine

Document 4Possess it merely That it should come to this

Document 3My hour is almost come

Throw away punctuation

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 58: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

58

Building an inverted index

This is not trivial

Throw away punctuation

nameexamplecom

isnt

Jake ONeill

the learn-it-all-by-heart methodology

website vs web site

Kanton Basel Stadt

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 59: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

59

Corner cases

Hewlett-PackardState-of-the-artco-educationthe hold-him-back-and-drag-him-away maneuver data base San FranciscoLos Angeles-based companycheap San Francisco-Los Angeles fares York University vs New York University

Credits H Schuumltze LMU

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 60: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

60

Corner cases numbers

3209120391Mar 20 1991B-52100286144(800) 234-23338002342333

Credits H Schuumltze LMU

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 61: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

61

English

I would like a coffeeYou would like a coffeeHe would like a coffeeWe would like a coffeeYou would like a coffeeThey would like a coffee

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 62: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

62

English

I want a coffeeYou want a coffeeHe wants a coffeeWe want a coffeeYou want a coffeeThey want a coffee

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 63: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

63

Swedish

Jag skulle vilja ha en kaffeDu skulle vilja ha en kaffeHan skulle vilja ha en kaffeVi skulle vilja ha en kaffeNi skulle vilja ha en kaffeDe skulle vilja ha en kaffe

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 64: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

64

German

Ich moumlchte einen KaffeeDu moumlchtest einen KaffeeEr moumlchte einen KaffeeWir moumlchten einen KaffeeIhr moumlchtet einen KaffeeSie moumlchten einen Kaffee

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 65: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

65

German

Donaudampfschiffahrtselektrizitaumltenhauptbetriebswerkbauunterbeamtengesellschaft

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 66: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

66

Swiss German

I ha taumlnkt du heggish en kaffee welle trinke

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 67: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

67

Chinese

amp0$13[12]gt=139606 lt[8 9]=12lt[8 10]=124=122[8 11][13]+(23[8 12]554-92)7

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 68: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

68

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम अltछा पढ़ रहा हम इस टम अltछा पढ़ रह) ह

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 69: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

69

Hindi

मझ इफ़मशन )र+ीवोल बहत पसद ह

म इस टम9 अछा पढ़ रहा हI

To me

Male speaker

Female speaker

3rd person

1st person

म इस टम9 अछा पढ़ रह हraha

raheeOblique case

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 70: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

70

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Source Polysynthetic Language Central Siberian YupikW J De Reuse

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 71: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

71

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 72: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

72

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 73: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

73

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 74: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

74

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

Past

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 75: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

75

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 76: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

76

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 77: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

77

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 78: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

78

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

eat go to want

PastltFrustrationgt

inferentialevidential(turns out)

3rd3rd

also

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 79: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

79

Polysynthetic LanguagesSiberian Yupik

neghyaghtughyugumayaghpetaallu

Also it turns out shehe wanted to go eat it but

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 80: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

80

Tokenize

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 81: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

81

Token

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Token=grouped sequence of characters

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 82: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

82

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 83: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

83

Type

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

Type=equivalence class (same sequences)

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 84: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

84

Stop words

You come most carefully upon your hour

thinebetimeLaerteshourthyfairTake

My hour is almost come

Possess it merely That it should come to this

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 85: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

85

Reuterss list

aanandareasatbebyforfromhashein

isititsofonthatthetowaswerewillwith

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 86: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

86

TrendLa

rge

lsit

(200

-300

)

No

list a

t all

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 87: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

87

Equivalence classes of terms (types)

to-day

USA

USAtoday

window

windows

Windows

Zurich

ZuumlrichZuerich

Zuumlri

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 88: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

88

Normalization

USA

to-day

USA

today

Rules that remove characters are the easy part

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 89: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

89

Expansion (rather than equivalence classes)

Windows

windows

window

Windows

windows Windows window

window windows

Introduces asymmetry

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 90: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

90

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

6

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 91: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

91

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 92: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

92

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 93: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

93

When to expandUpon indexing

Lift

Elevator

1 5

41

Lift |

Upon querying

Lift |

Expansion

Lift OR Elevator

Lift

Elevator 41

6

5 6

41 5 6

Expansion

Lift

Elevator

1 5

41 6

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 94: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

94

Normalization accents and diacritics

clicheacute

Zuumlrich

cliche

Zuerich

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 95: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

95

Normalization lowercasing

CAT

Schweiz

cat

schweiz

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 96: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

96

Normalization truecasing

The company Apple has launched a new iProduct

Apple

Apples fall from trees

applesMachine learning inside

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 97: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

97

Stemming

analysisfeatureareeasilyvisiblevariationsindividual

analysifeaturareasilivisiblvariatindividu

Chop letters of the word

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 98: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

98

Porter Stemmer

httpstartarusorgmartinPorterStemmer

(mgt0) ENCI -gt ENCE valenci -gt valence(mgt0) ANCI -gt ANCE hesitanci -gt hesitance(mgt0) IZER -gt IZE digitizer -gt digitize(mgt0) ABLI -gt ABLE conformabli -gt conformable (mgt0) ALLI -gt AL radicalli -gt radical (mgt0) ENTLI -gt ENT differentli -gt different(mgt0) ELI -gt E vileli - gt vile (mgt0) OUSLI -gt OUS analogousli -gt analogous(mgt0) IZATION -gt IZE vietnamization -gt vietnamize(mgt0) ATION -gt ATE predication -gt predicate (mgt0) ATOR -gt ATE operator -gt operate(mgt0) ALISM -gt AL feudalism -gt feudal(mgt0) IVENESS -gt IVE decisiveness -gt decisive(mgt0) FULNESS -gt FUL hopefulness -gt hopeful(mgt0) OUSNESS -gt OUS callousness -gt callous

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 99: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

99

Porter Stemmer

Source the textbook (Introduction to Information Retrieval)

Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Portersuch an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Lovinssuch an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Paicesuch an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 100: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

100

Lemmatization

Building equivalence classes or expanding with

Natural Language Processing

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 101: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

101

The Vauquois TriangleInterlingua

Semantic Structure

Syntactic Structure

Words

Semantic Structure

Syntactic Structure

WordsDirect

TransferSyntactic

Semantic

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 102: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

102

Lemmatization

Full morphological analysis

computercomputecomputescomputedcomputationcomputing

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 103: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

103

Lemmatization or Stemming

Lemmatization does not help

and can even degrade performance

for English documents

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 104: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

104

Lemmatization or Stemming

Lemmatization can help with language

that have more morphology

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 105: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

105

Skip lists

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 106: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

106

Reminder Postings (Standard Inverted Index)

you

comemostcarefully

upon

your

thinebe

time

Laertes

thy

fair

take

my

houris

almost

possess

merely

thatit

should

to

this11

1

1 1

1

1

2

2

2

2

2

2

23

3

3

4

4

4

4

4

4

4

1

1

1

3

1

3

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

2 3

3 4

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 107: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

107

Reminder Intersection algorithm

1 2 4 5 8 9 10 12

1 3 4 6 7 8 11 12

List A

List BEnd

Intersection of A and B 1 4 8 12

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 108: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

108

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 109: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

109

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 110: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

110

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 111: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

111

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 112: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

112

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 113: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

113

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 114: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

114

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 115: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

115

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 116: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

116

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 117: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

117

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 118: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

118

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 119: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

119

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 120: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

120

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 121: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

121

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 122: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

122

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3 10 12

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 123: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

123

Another example

How could we make this

faster

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 124: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

124

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 125: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

125

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 126: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

126

Another example

1 2 3 4 5 6

1 3 10 11 12

List A

List B

Intersection of A and B

7 8 9 10 12

14

1 3

10 Magical pointer

10

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 127: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

127

In practice

1 2 3 4 5 6 7 8 9 10 12

4 7 10

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 128: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

128

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skips

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 129: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

129

In practice

1 2 3 4 5 6 7 8 9 10 12

Two short skipsmany comparisonswaste of space

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 130: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

130

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skipsfew comparisonsnot many real opportunities to skip

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 131: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

131

In practice

1 2 3 4 5 6 7 8 9 10 12

Two long skips

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 132: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

132

In practice

1 2 3 4 5 6 7 8 9 10 12

In practicep

Number of postings

13 1411 15 16

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 133: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

133

Phrase search

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 134: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

134

Phrase queries

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 135: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

135

Phrase queries

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 136: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

136

With a posting list

ETH ZuumlrichQuotes = phrase search

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 137: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

137

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 138: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

138

With a posting list

1 2 4 5 8 9 10

1 3 4 6 7 8 11

ETH

Zurich

We have no information on the proximity of the terms

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 139: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

139

Phrase search approaches

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 140: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

140

Phrase search approaches

Biword indices

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 141: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

141

Phrase search approaches

Biword indices Positional indices

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 142: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

142

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 143: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

143

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 144: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

144

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 145: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

145

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 146: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

146

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 147: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

147

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 148: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

148

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 149: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

149

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 150: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

150

Bi-word indices

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 151: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

151

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 152: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

152

Bi-word indices

ETH Zurich|

Query

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 153: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

153

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 154: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

154

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 155: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

155

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 156: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

156

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 157: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

157

Inconvenient

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 158: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

158

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 159: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

159

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 160: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

160

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 161: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

161

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 162: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

162

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 163: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

163

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 164: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

164

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 165: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

165

Inconvenient

Help ETH Zurich to flexibly react|

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

False positive

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 166: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

166

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 167: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

167

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly react

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 168: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

168

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 169: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

169

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 170: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

170

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 171: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

171

Part-of-speech tagging

Help ETH Zurich to introduce techniques so as to flexibly reactN N N N N N NX X X X

NXN

Help ETHETH ZurichZurich introduceintroduce techniquestechniques flexiblyflexibly react

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 172: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

172

Bi-word indices

Help ETH Zurich to flexibly react|

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to

Help ETH AND ETH Zurich AND ETH to AND to flexibly AND flexibly react|

Inte

rsec

t

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 173: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

173

Phrase indices (3)

Help ETH Zurich to flexibly react|

Help ETH Zurich

ETH Zurich to

Zurich to flexibly

to flexibly react

flexibly react to

Help ETH Zurich AND ETH Zurich to AND ETH to flexibly AND to flexibly react

Inte

rsec

t

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 174: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

174

Phrase indices (4)

Help ETH Zurich to flexibly react|

Help ETH Zurich to

ETH Zurich to flexibly

Zurich to flexibly react

to flexibly react to

Help ETH Zurich to AND ETH Zurich to flexibly AND Zurich to flexibly react

Inte

rsec

t

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 175: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

175

Improves on false positive issue

Help ETH Zurich to flexibly react|

Help ETH Zurich to introduce techniques to flexibly react

True negative

Help ETH Zurich AND ETH Zurich to AND Zurich to flexibly AND to flexibly react

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 176: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

176

Inconvenient

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 177: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

177

Inconvenient

Size of the vocabulary increases

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 178: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

178

Inconvenient

Size of the vocabulary increases

exponentially

( Terms)n

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 179: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

179

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the futureC

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 180: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

180

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 181: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

181

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

document ID(as before)

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 182: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

182

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

term frequency

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 183: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

183

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

positions

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 184: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

184

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

C

Positional posting

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 185: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

185

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

C

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 186: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

186

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

C

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 187: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

187

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

C

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 188: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

188

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

C

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 189: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

189

Positional index

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future

Index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

C

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 190: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

190

Positional index

Help C1 1

ETH C1 2

Zurich C1 3

to C3 4 7 11

flexibly C1 5

react C1 6

ETH Zurich|

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 191: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

191

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 192: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

192

Positional index

ETH C1 2

Zurich C1 3

ETH Zurich|

+1

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists

Page 193: Ghislain Fourny Information Retrieval Spring 2019 · 2019-03-07 · 6 Boolean retrieval lawyer AND Penang AND NOT silver Input Set of documents Output Subset of documents query

193

This weeks reading

Chapter 2

The Term Vocabularyand

Postings Lists